A three-stage unsupervised dimension reduction method for text clustering

Cited by: 33
Authors
Bharti, Kusum Kumari [1 ]
Singh, P. K. [1 ]
Affiliations
[1] ABV Indian Institute of Information Technology and Management Gwalior, Computational Intelligence & Data Mining Research Lab, Gwalior, MP, India
Keywords
Feature selection; Feature extraction; Dimension reduction; Sparsity; Three-stage model; Text clustering; Mutual information; Algorithm
DOI
10.1016/j.jocs.2013.11.007
Chinese Library Classification (CLC)
TP39 [Computer applications];
Subject classification codes
081203; 0835;
Abstract
Dimension reduction is a well-known pre-processing step in text clustering, used to remove irrelevant, redundant, and noisy features without sacrificing the performance of the underlying algorithm. Dimension reduction methods are primarily classified into feature selection (FS) methods and feature extraction (FE) methods. Although FS methods are robust against irrelevant features, they occasionally fail to retain important information present in the original feature space. Conversely, although FE methods reduce dimensionality without losing much information, they are significantly affected by irrelevant features. The one-stage models (FS or FE methods alone) and the two-stage models (combinations of FS and FE methods) proposed in the literature are not sufficient to fulfil all of the above-mentioned requirements of dimension reduction. We therefore propose three-stage dimension reduction models that remove irrelevant, redundant, and noisy features from the original feature space without losing much valuable information. These models combine the advantages of FS and FE methods to create a low-dimensional feature subspace. Experiments on three well-known benchmark text datasets with different characteristics show that the proposed three-stage models significantly improve the performance of the clustering algorithm as measured by micro F-score, macro F-score, and total execution time. (C) 2013 Elsevier B.V. All rights reserved.
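The record does not specify which concrete methods fill each of the three stages. A minimal sketch of such an FS + FS + FE pipeline in Python, assuming document-frequency filtering and term-variance ranking as the two selection stages and truncated SVD as the extraction stage (all illustrative choices, not necessarily the paper's), could look as follows:

```python
# Illustrative three-stage dimension reduction for text clustering:
# two unsupervised feature-selection stages followed by feature extraction.
# The specific criteria below are assumptions for illustration only.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# A small benchmark corpus as a stand-in for the datasets used in the paper.
docs = fetch_20newsgroups(
    subset="train",
    categories=["sci.space", "rec.autos", "talk.politics.misc"],
    remove=("headers", "footers", "quotes"),
).data

# Stage 1 (FS): drop very rare and very common terms via document frequency.
X = TfidfVectorizer(stop_words="english", min_df=5, max_df=0.5).fit_transform(docs)

# Stage 2 (FS): keep the top-k terms ranked by term variance across documents,
# a simple unsupervised relevance criterion.
mean = np.asarray(X.mean(axis=0)).ravel()
mean_sq = np.asarray(X.multiply(X).mean(axis=0)).ravel()
variances = mean_sq - mean ** 2
top_k = np.argsort(variances)[::-1][:2000]
X_fs = X[:, top_k]

# Stage 3 (FE): project the selected terms into a low-dimensional subspace.
X_low = TruncatedSVD(n_components=100, random_state=0).fit_transform(X_fs)

# Cluster in the reduced space.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_low)
print(labels[:20])
```

Each selection stage narrows the term space before the extraction step, so the SVD projection is computed only over terms that survived the filters, which is the general idea behind combining FS and FE that the abstract describes.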
Pages: 156-169
Page count: 14