Hybrid dimension reduction by integrating feature selection with feature extraction method for text clustering

Cited by: 132
Authors
Bharti, Kusum Kumari [1]
Singh, Pramod Kumar [1]
Affiliations
[1] ABV Indian Inst Informat Technol & Management Gwa, Computat Intelligence & DataMin Res Lab, Gwalior, Madhya Pradesh, India
Keywords
Text clustering; Feature selection; Feature extraction; Term variance; Document frequency; Principal component analysis; COMPONENT ANALYSIS; GENETIC ALGORITHM; CATEGORIZATION; OPTIMIZATION; INFORMATION
DOI
10.1016/j.eswa.2014.11.038
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
High dimensionality of the feature space is a major concern in text clustering because it increases computational complexity and can reduce accuracy. Various dimension reduction methods have therefore been introduced in the literature to select an informative subset (or sublist) of features. Because each dimension reduction method uses a different strategy (aspect) to select a subset of features, different methods yield different feature sublists for the same dataset. Hence, hybrid approaches, which combine different aspects of feature relevance for feature subset selection, have received considerable attention. Traditionally, union or intersection is used to merge the feature sublists selected by different methods. The union approach selects all features from the considered sublists, which increases the total number of features; the intersection approach selects only the features common to them, which loses some important features. Therefore, to exploit the advantages of one method and lessen the drawbacks of the other, a novel integration approach, named modified union, is proposed. This approach applies union to the selected top-ranked features and intersection to the remaining feature sublists. It thus ensures selection of top-ranked as well as common features without greatly increasing the dimensionality of the feature space. In this study, the feature selection methods term variance (TV) and document frequency (DF) are used to compute feature relevance scores. Next, a feature extraction method, principal component analysis (PCA), is applied to further reduce the dimensionality of the feature space without losing much information. The effectiveness of the proposed method is tested on three benchmark datasets, namely Reuters-21578, Classic4, and WebKB. The obtained results are compared with TV, DF, and variants of the proposed hybrid dimension reduction method. The experimental studies demonstrate that the proposed method improves clustering accuracy compared to the competing methods. (C) 2014 Elsevier Ltd. All rights reserved.
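The pipeline described in the abstract can be illustrated with a minimal Python/NumPy sketch. The abstract does not state how many top-ranked features are merged by union, how long each ranked sublist is, or how many principal components are retained, so the parameters top_k, sublist_size, and n_components, along with every function name below, are illustrative assumptions and not the authors' implementation.

```python
import numpy as np

def term_variance(X):
    """Score each term by the variance of its frequency across documents.
    X is a (documents x terms) matrix of term weights."""
    return X.var(axis=0)

def document_frequency(X):
    """Score each term by the number of documents it occurs in."""
    return (X > 0).sum(axis=0)

def modified_union(scores_a, scores_b, top_k, sublist_size):
    """Merge two ranked feature sublists (assumed interpretation):
    union of the top_k features of each sublist, plus the
    intersection of what remains in each sublist."""
    # Rank features by descending score and keep the best sublist_size.
    rank_a = np.argsort(-scores_a, kind="stable")[:sublist_size]
    rank_b = np.argsort(-scores_b, kind="stable")[:sublist_size]
    # Union on the top-ranked features of both sublists.
    top = set(rank_a[:top_k]) | set(rank_b[:top_k])
    # Intersection on the remaining features of both sublists.
    rest = set(rank_a[top_k:]) & set(rank_b[top_k:])
    return np.array(sorted(top | rest))

def pca_project(X, n_components):
    """Project X onto its first n_components principal components,
    computed from the SVD of the mean-centered data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def hybrid_reduce(X, top_k, sublist_size, n_components):
    """Feature selection (TV and DF merged by modified union),
    followed by feature extraction (PCA)."""
    selected = modified_union(term_variance(X), document_frequency(X),
                              top_k, sublist_size)
    return pca_project(X[:, selected], n_components)
```

With two sublists of four features each and top_k = 1, a plain union of the sublists may keep up to eight features and a plain intersection may discard a feature that one method ranks first; the merge above always keeps both first-ranked features and otherwise only those that both methods agree on.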
Pages: 3105-3114
Number of pages: 10