Text Clustering Using Statistical and Semantic Data

Cited by: 0

Authors:

Benghabrit, Asmaa [1]

Ouhbi, Brahim [1]

Behja, Hicham [1]

Frikh, Bouchra [2]

Affiliations:

[1] Moulay Ismail Univ, Lab LM2I, ENSAM, Marjane 2, BP 4024, Meknes, Morocco

[2] Moulay Abdellah Univ, EST Fes, LTTI Lab, Atlas Fes, Morocco

Source:

WORLD CONGRESS ON COMPUTER & INFORMATION TECHNOLOGY (WCCIT 2013) | 2013

Keywords:

Text mining; clustering; feature selection methods; performance analysis; FEATURE-SELECTION

DOI:

Not available

Chinese Library Classification (CLC) number:

TP301 [Theory, methods]

Discipline classification code:

081202

Abstract:

The explosive growth of information stored in unstructured texts has created a great demand for new and powerful tools, such as text mining, to acquire useful information. Document clustering is one of the most powerful text mining methods, by which document retrieval, organization and summarization can be achieved. However, it becomes a challenge when dealing with large volumes of data, owing to the high dimensionality of the feature space and to the semantic correlation between features. In this paper, we propose a new sequential document clustering algorithm that uses statistical and semantic feature selection methods. The semantic process is proposed to complement the frequency-based mechanism with the semantic relations found in the text documents. The proposed algorithm iteratively selects relevant features and performs clustering until convergence. To evaluate its performance, experiments on two corpora have been conducted. The results show that our algorithm outperforms the existing algorithms.

Pages: 6
