On entropy-based term weighting schemes for text categorization

Citations: 5
Authors
Wang, Tao [1 ]
Cai, Yi [2 ,3 ]
Leung, Ho-fung [4 ]
Lau, Raymond Y. K. [5 ]
Xie, Haoran [6 ]
Li, Qing [7 ]
Affiliations
[1] Kings Coll London, Dept Biostat & Hlth Informat, London, England
[2] South China Univ Technol, Sch Software Engn, Guangzhou, Peoples R China
[3] South China Univ Technol, Key Lab Big Data & Intelligent Robot, Minist Educ, Guangzhou, Peoples R China
[4] Chinese Univ Hong Kong, Dept Comp Sci & Engn, Hong Kong, Peoples R China
[5] City Univ Hong Kong, Dept Informat Syst, Hong Kong, Peoples R China
[6] Lingnan Univ, Dept Comp & Decis Sci, Hong Kong, Peoples R China
[7] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Entropy; Normalization; Smoothing; Term weighting; Text categorization; FEATURE-SELECTION; INFORMATION; CLASSIFICATION; RELEVANCE;
DOI
10.1007/s10115-021-01581-5
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Code
081104; 0812; 0835; 1405;
Abstract
In text categorization, the Vector Space Model (VSM) is widely used to represent documents, where each document is represented as a vector of terms. Since different terms contribute to a document's semantics to varying degrees, a number of term weighting schemes have been proposed for VSM to improve text categorization performance. Much evidence shows that the performance of a term weighting scheme often varies across text categorization tasks, yet the mechanism underlying this variability remains unclear. Moreover, existing schemes often weight a term with respect to a single category locally, without considering the global distribution of the term's occurrences across all categories in a corpus. In this paper, we first systematically examine the pros and cons of existing term weighting schemes for text categorization and explore why some schemes with sound theoretical bases, such as the chi-square test and information gain, perform poorly in empirical evaluations. By measuring how concentrated a term's distribution is across all categories in a corpus, we then propose a series of entropy-based term weighting schemes that measure a term's distinguishing power in text categorization. In extensive experiments on five datasets, the proposed schemes consistently outperform state-of-the-art schemes. Our findings also shed new light on how to choose and develop an effective term weighting scheme for a specific text categorization task.
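The abstract's core idea, weighting a term by how concentrated its occurrences are across categories, can be sketched as follows. This is an illustrative entropy-based weight, not the paper's exact scheme: the function name and the 1 − H/H_max normalization are assumptions for the sketch.

```python
import math

def entropy_weight(term_category_counts):
    """Illustrative entropy-based distinguishing power of a term.

    term_category_counts: occurrence counts of one term in each category.
    Returns a value in [0, 1]: 1 means the term appears in a single
    category (low entropy, highly discriminative); 0 means it is spread
    uniformly across all categories (high entropy, uninformative).
    Hypothetical sketch, not the authors' published formula.
    """
    total = sum(term_category_counts)
    if total == 0:
        return 0.0
    # Distribution of the term's occurrences over categories.
    probs = [c / total for c in term_category_counts if c > 0]
    # Shannon entropy of that distribution, in bits.
    h = -sum(p * math.log2(p) for p in probs)
    k = len(term_category_counts)
    h_max = math.log2(k) if k > 1 else 1.0  # entropy of the uniform case
    return 1.0 - h / h_max
```

For example, a term occurring only in one of four categories gets weight 1.0, while a term spread evenly over all four gets weight 0.0; in a full scheme such a global factor would typically be combined with a local factor like term frequency.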
Pages: 2313-2346 (34 pages)