Hybrid dimension reduction by integrating feature selection with feature extraction method for text clustering

Cited by: 132
Authors
Bharti, Kusum Kumari [1]
Singh, Pramod Kumar [1]
Affiliations
[1] ABV Indian Inst Informat Technol & Management Gwa, Computat Intelligence & DataMin Res Lab, Gwalior, Madhya Pradesh, India
Keywords
Text clustering; Feature selection; Feature extraction; Term variance; Document frequency; Principal component analysis; COMPONENT ANALYSIS; GENETIC ALGORITHM; CATEGORIZATION; OPTIMIZATION; INFORMATION
DOI
10.1016/j.eswa.2014.11.038
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
High dimensionality of the feature space is a major concern in text clustering because it raises computational complexity and can reduce accuracy. Various dimension reduction methods have therefore been introduced in the literature to select an informative subset (or sublist) of features. Because each dimension reduction method uses a different strategy (aspect) to select features, different methods produce different feature sublists for the same dataset. Hence, hybrid approaches, which combine different aspects of feature relevance for feature subset selection, have received considerable attention. Traditionally, union or intersection is used to merge feature sublists selected with different methods. The union approach selects all features from the considered sublists, which increases the total number of features; the intersection approach selects only the common features, which loses some important features. Therefore, to take advantage of one method and lessen the drawbacks of the other, a novel integration approach, namely modified union, is proposed. This approach applies union to the selected top-ranked features and intersection to the remaining feature sublists. Hence, it ensures selection of top-ranked as well as common features without greatly increasing the dimensionality of the feature space. In this study, the feature selection methods term variance (TV) and document frequency (DF) are used to compute feature relevance scores. Next, a feature extraction method, principal component analysis (PCA), is applied to further reduce the dimensionality of the feature space without losing much information. The effectiveness of the proposed method is tested on three benchmark datasets, namely Reuters-21578, Classic4, and WebKB. The obtained results are compared with TV, DF, and variants of the proposed hybrid dimension reduction method. The experimental studies clearly demonstrate that the proposed method improves clustering accuracy compared to the competing methods. (C) 2014 Elsevier Ltd. All rights reserved.
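The merging step described in the abstract can be illustrated with a minimal Python sketch. It is one possible reading of the abstract, not the authors' implementation: the function names, the `top_k` and `sublist_size` cut-offs, and the unnormalized definition of term variance (sum of squared deviations of a term's frequency from its mean across documents) are all assumptions made for illustration. The subsequent PCA step is omitted.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set


def document_frequency(docs: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Count, for each term, the documents in which it occurs at least once."""
    df: Dict[str, int] = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return df


def term_variance(docs: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Assumed definition: sum over documents of (frequency - mean frequency)^2."""
    n = len(docs)
    counts = [Counter(doc) for doc in docs]
    vocab = {term for doc in docs for term in doc}
    tv: Dict[str, float] = {}
    for term in vocab:
        freqs = [c[term] for c in counts]  # Counter returns 0 for absent terms
        mean = sum(freqs) / n
        tv[term] = sum((f - mean) ** 2 for f in freqs)
    return tv


def rank(scores: Dict[str, float]) -> List[str]:
    """Order terms by descending score; ties are broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))


def modified_union(rank_a: List[str], rank_b: List[str],
                   top_k: int, sublist_size: int) -> Set[str]:
    """Union of the top_k terms of each ranking, plus the intersection of
    the remaining terms of each sublist (hypothetical cut-off parameters)."""
    top = set(rank_a[:top_k]) | set(rank_b[:top_k])
    rest = set(rank_a[top_k:sublist_size]) & set(rank_b[top_k:sublist_size])
    return top | rest


if __name__ == "__main__":
    docs = [["x", "y", "x"], ["y"], ["y", "z"]]
    print(rank(term_variance(docs)))
    print(rank(document_frequency(docs)))

    by_tv = ["a", "b", "c", "d", "e"]
    by_df = ["b", "f", "d", "c", "g"]
    print(sorted(modified_union(by_tv, by_df, top_k=2, sublist_size=5)))
```

With the two hand-made rankings in the usage example, plain union would keep seven terms and plain intersection three, while the sketched modified union keeps five: every top-ranked term from either list plus the lower-ranked terms on which both lists agree.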
Pages: 3105-3114
Number of pages: 10