An enhanced feature selection method for text classification

Cited by: 0
Authors
Kang, Jinbeom [1]
Lee, Eunshil [1]
Hong, Kwanghee [1]
Park, Jeahyun [1]
Kim, Taehwan [1]
Park, Juyoung [1]
Choi, Joongmin [1]
Yang, Jaeyoung [1]
Affiliation
[1] Hanyang Univ, Dept Comp Sci & Engn, Ansan, Kunngi Do, South Korea
Keywords
feature selection; impurity of words; unbalanced distribution; machine learning; text classification
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Feature selection in machine learning is the task of identifying a set of representative terms, or features, from a document collection; these features are then used mainly in text classification. Existing feature selection methods, including information gain and the χ² test, focus on features that are useful for all topics, and consequently lack the power to select features that truly represent a particular topic (or class). These methods also assume that the distribution of documents across classes is balanced. This assumption negatively affects classification accuracy, because real-world document collections rarely have a balanced distribution and it is difficult to prepare a training set with an equal number of documents for each class. To resolve this problem, we propose a new feature selection method for text classification that focuses on the purity of a word, which emphasizes its representativeness for a particular class. Our method also assumes an unbalanced distribution of documents over multiple classes, and combines feature values with weight factors that reflect the number of training documents in each class. In summary, we obtain feature candidates using word purity and then select features while accounting for the unbalanced distribution of documents. Through experiments, we demonstrate that our method outperforms existing methods.
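The abstract does not give the purity formula or the weight factors, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes (hypothetically) that the weight for a class is the inverse of its training-set size, so that a word's document frequency in each class becomes a rate, and that the purity of a word for a class is that class's share of the summed rates. The function names `purity_scores` and `select_features` are illustrative.

```python
from collections import Counter, defaultdict


def purity_scores(docs):
    """Return {word: {class: purity}} for docs given as (label, tokens) pairs.

    Assumption (not from the paper): document frequency is divided by the
    class size before the per-class shares are computed, so a large class
    does not dominate merely because it has more training documents.
    """
    class_sizes = Counter(label for label, _ in docs)
    doc_freq = defaultdict(Counter)  # word -> {class: docs containing word}
    for label, tokens in docs:
        for word in set(tokens):
            doc_freq[word][label] += 1

    scores = {}
    for word, per_class in doc_freq.items():
        # Fraction of each class's documents that contain the word.
        rates = {c: per_class[c] / class_sizes[c] for c in class_sizes}
        total = sum(rates.values())  # > 0: the word occurs in some document
        scores[word] = {c: r / total for c, r in rates.items()}
    return scores


def select_features(docs, k):
    """Assign each word to its purest class and keep the top k per class."""
    candidates = defaultdict(list)
    for word, per_class in purity_scores(docs).items():
        best = max(per_class, key=per_class.get)
        candidates[best].append((per_class[best], word))
    return {
        c: [w for _, w in sorted(items, key=lambda t: (-t[0], t[1]))[:k]]
        for c, items in candidates.items()
    }


if __name__ == "__main__":
    docs = [
        ("sports", ["goal", "team", "the"]),
        ("sports", ["team", "the"]),
        ("sports", ["goal", "the"]),
        ("sports", ["match", "the"]),
        ("tech", ["cpu", "the"]),
    ]
    print(purity_scores(docs)["the"])
    print(select_features(docs, 2))
```

In this toy collection the word "the" occurs in four sports documents and one tech document, yet after normalizing by class size it is equally frequent in both classes, so it receives a low purity and ranks below class-specific words such as "cpu".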
Pages: 36-41
Page count: 6