An enhanced feature selection method for text classification

Cited: 0
Authors
Kang, Jinbeom [1 ]
Lee, Eunshil [1 ]
Hong, Kwanghee [1 ]
Park, Jeahyun [1 ]
Kim, Taehwan [1 ]
Park, Juyoung [1 ]
Choi, Joongmin [1 ]
Yang, Jaeyoung [1 ]
Affiliations
[1] Hanyang Univ, Dept Comp Sci & Engn, Ansan, Kyunggi Do, South Korea
Keywords
feature selection; impurity of words; unbalanced distribution; machine learning; text classification
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Feature selection in machine learning is the task of identifying a set of representative terms, or features, from a document collection, mainly for use in text classification. Existing feature selection methods, including information gain and the chi-square (χ²) test, focus on features that are useful across all topics, and consequently lack the power to select features that truly represent a particular topic (or class). These methods also assume that the distribution of documents over classes is balanced. This assumption harms classification accuracy, because real-world document collections rarely have a balanced distribution, and it is difficult to prepare a training set with an equal number of documents for each class. To resolve this problem, we propose a new feature selection method for text classification based on the purity of a word, which emphasizes the word's representativeness for a particular class. Our method also assumes an unbalanced distribution of documents over multiple classes, and combines feature values with weight factors that reflect the number of training documents in each class. In summary, we obtain feature candidates using word purity and then select features under the unbalanced document distribution. Experiments demonstrate that our method outperforms existing methods.
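The abstract does not give the paper's exact formulas, but the two ideas it describes — scoring a word by how concentrated it is in one class (purity), and weighting counts to offset unbalanced class sizes — can be sketched as follows. This is a minimal illustrative sketch under assumed definitions (inverse-class-frequency weights, purity as the dominant class's weighted share), not the authors' actual method; the function name `purity_scores` is hypothetical.

```python
from collections import Counter, defaultdict

def purity_scores(docs, labels):
    """Illustrative sketch: score each term by class purity, with
    per-class weights that offset unbalanced class sizes.
    (Assumed definitions, not the paper's exact formula.)"""
    class_sizes = Counter(labels)
    total = len(docs)
    # Assumed weight factor: inverse class frequency, so terms from
    # small classes are not drowned out by the majority class.
    weight = {c: total / (len(class_sizes) * n) for c, n in class_sizes.items()}
    # Per term, count how many documents of each class contain it.
    df = defaultdict(Counter)  # term -> class -> document frequency
    for doc, c in zip(docs, labels):
        for term in set(doc.split()):
            df[term][c] += 1
    scores = {}
    for term, per_class in df.items():
        weighted = {c: n * weight[c] for c, n in per_class.items()}
        # Purity: fraction of the term's weighted occurrences that
        # fall in its dominant class (1.0 = appears in one class only).
        scores[term] = max(weighted.values()) / sum(weighted.values())
    return scores
```

A term appearing only in one class gets purity 1.0 regardless of class size, while a term spread across classes scores lower; ranking terms by this score and keeping the top-k would yield the "feature candidates" the abstract mentions.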
Pages: 36-41
Page count: 6