Optimizing text classification through efficient feature selection based on quality metric

Cited by: 23
Authors
Lamirel, Jean-Charles [1]
Cuxac, Pascal [2]
Chivukula, Aneesh Sreevallabh [3]
Hajlaoui, Kafil [3]
Affiliations
[1] LORIA, INRIA Nancy Grand Est, SYNALP Team, Vandoeuvre Les Nancy, France
[2] INIST CNRS, Vandoeuvre Les Nancy, France
[3] Int Inst Informat Technol, Ctr Data Engn, Gachibowli Hyderabad, Andhra Pradesh, India
Keywords
Feature maximization; Clustering quality index; Feature selection; Supervised learning; Unbalanced data; Text
DOI
10.1007/s10844-014-0317-4
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Feature maximization is a cluster quality metric that favors clusters with maximum feature representation with respect to their associated data. In this paper we show that a simple adaptation of this metric can provide a highly efficient feature selection and feature contrasting model in the context of supervised classification. The method is evaluated on different types of textual datasets. The paper shows that the proposed method provides a very significant performance increase compared with state-of-the-art methods in all the studied cases, even when a single bag-of-words model is used for data description. Interestingly, the most significant performance gain is obtained when classifying highly unbalanced, highly multidimensional and noisy data with a high degree of similarity between the classes.
Pages: 379-396
Page count: 18