Generalized Term Similarity for Feature Selection in Text Classification Using Quadratic Programming

Cited by: 4
Authors
Lim, Hyunki [1]
Kim, Dae-Won [2]
Affiliations
[1] Korea Inst Sci & Technol, Image & Media Res Ctr, 5 Hwarang Ro 14 Gil, Seoul 02792, South Korea
[2] Chung Ang Univ, Sch Comp Sci & Engn, 221 Heukseok Dong, Seoul 06974, South Korea
Funding
National Research Foundation of Singapore;
Keywords
text categorization; information gain; mutual information; chi-square statistic; quadratic programming;
DOI
10.3390/e22040395
Chinese Library Classification number
O4 [Physics];
Discipline classification code
0702;
Abstract
The rapid growth of Internet technologies has led to an enormous increase in the number of electronic documents used worldwide. Text categorization has been employed in recent decades to organize and manage large collections of unstructured documents effectively and efficiently. For text categorization tasks, documents are usually represented with the bag-of-words model because of its simplicity. In this representation, every term in the vocabulary becomes a feature, so the feature space is enormous and feature selection becomes essential. In this paper, we propose a new feature selection method that considers term similarity to avoid selecting redundant terms. Term similarity is measured with a general measure such as mutual information and serves as a second criterion in feature selection, in addition to term ranking. To balance term ranking against term similarity, we use a numerical optimization approach based on quadratic programming. Experimental results demonstrate that considering term similarity is effective and yields higher accuracy than conventional methods.
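The trade-off described in the abstract can be illustrated with a small sketch. It is not the authors' formulation, which the abstract does not give. It assumes a common quadratic-programming form: minimize (1 - alpha)/2 · wᵀSw - alpha · rᵀw over weights w that are non-negative and sum to one, where S holds pairwise term similarity and r holds term relevance, both measured here with mutual information. The function names, the toy data, the `alpha` parameter, and the projected-gradient solver are all assumptions chosen to keep the example dependency-free.

```python
import math


def mutual_information(x, y):
    """Mutual information (in nats) between two equal-length binary sequences."""
    n = len(x)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = sum(1 for xi, yi in zip(x, y) if xi == a and yi == b) / n
            p_a = sum(1 for xi in x if xi == a) / n
            p_b = sum(1 for yi in y if yi == b) / n
            if p_ab > 0:
                mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t
    return [max(v_i - theta, 0.0) for v_i in v]


def select_weights(relevance, similarity, alpha=0.5, steps=2000):
    """Projected gradient descent on the assumed quadratic objective.

    Minimizes (1 - alpha)/2 * w'Sw - alpha * r'w over the probability simplex.
    A larger alpha favors relevance; a smaller alpha penalizes redundancy more.
    """
    n = len(relevance)
    w = [1.0 / n] * n
    # Step size from an upper bound on the gradient's Lipschitz constant.
    bound = (1 - alpha) * max(sum(abs(s) for s in row) for row in similarity)
    lr = 1.0 / max(bound, 1e-12)
    for _ in range(steps):
        grad = [
            (1 - alpha) * sum(similarity[i][j] * w[j] for j in range(n))
            - alpha * relevance[i]
            for i in range(n)
        ]
        w = project_to_simplex([w_i - lr * g for w_i, g in zip(w, grad)])
    return w


# Hypothetical toy corpus: 8 documents, binary term presence, binary labels.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
terms = [
    [1, 1, 1, 0, 0, 0, 0, 0],  # term 0: relevant
    [1, 1, 1, 0, 0, 0, 0, 0],  # term 1: exact duplicate of term 0 (redundant)
    [1, 1, 0, 1, 0, 1, 0, 0],  # term 2: less relevant, but not a duplicate
]

relevance = [mutual_information(t, labels) for t in terms]
similarity = [[mutual_information(a, b) for b in terms] for a in terms]
weights = select_weights(relevance, similarity, alpha=0.5)
print([round(w, 3) for w in weights])
```

Terms would then be ranked by weight and the top-scoring ones kept. A mutual-information similarity matrix is not guaranteed to be positive semidefinite, so in general the problem may be non-convex and this simple solver would only find a stationary point; a dedicated QP solver would be the more robust choice in practice.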
Pages: 12