Improving Term Weighting Schemes for Short Text Classification in Vector Space Model

Cited by: 26
Authors
Samant, Surender Singh [1]
Murthy, N. L. Bhanu [1]
Malapati, Aruna [1]
Affiliations
[1] Birla Inst Technol & Sci, Dept Comp Sci & Informat Syst, Pilani Hyderabad Campus, Hyderabad 500078, India
Keywords
Text classification; text categorization; term weighting; Twitter; feature selection
DOI
10.1109/ACCESS.2019.2953918
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Short text is one of the predominant forms of communication, with distinctive characteristics such as short length, high sparsity, and a lack of shared context and word co-occurrence. These characteristics distinguish short text from general text and make short text classification a challenging task. Term weighting is an important pre-processing step for text classification in the vector space model. In this paper, we propose three modifications to existing state-of-the-art term weighting schemes (ifn-tp-icf, RFR and modOR) and a new term weighting scheme, ifn-modRF. We compare the proposed schemes with ten existing unsupervised and supervised schemes using three datasets of informally written short text: a self-labelled dataset of real-world events from Twitter, a Yahoo! questions dataset and a dataset of product reviews. Based on the experimental results using three popular classifiers, we observe that the proposed scheme ifn-modRF achieves the best F1-scores on the Twitter dataset, while the proposed modification modOR is a consistent performer with the best scores in most of the experiments. The proposed modification ifn-tp-icf also outperforms the original scheme in most experiments.
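As a rough illustration of what term weighting in the vector space model involves, the sketch below contrasts one unsupervised scheme (tf-idf) with one supervised scheme (tf times relevance frequency, with rf = log2(2 + a / max(1, c))). These are generic textbook schemes, not the ifn-tp-icf, RFR, modOR or ifn-modRF schemes proposed in the paper, whose formulas are not given in this record. The toy corpus, the labels, and the function names tf_idf and tf_rf are assumptions made purely for the example.

```python
import math
from collections import Counter

# Toy corpus of short texts with hypothetical labels (illustrative only).
docs = [
    ("flood warning issued downtown", "event"),
    ("flood closes main road", "event"),
    ("love this phone battery", "review"),
    ("battery died after flood", "review"),
]

def tf_idf(docs):
    """Unsupervised weighting: raw term frequency times log(N / df)."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for text, _ in docs for t in set(text.split()))
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(text.split()).items()}
        for text, _ in docs
    ]

def tf_rf(docs, positive):
    """Supervised weighting: tf times relevance frequency for one category.

    rf = log2(2 + a / max(1, c)), where a and c count the documents that
    contain the term inside and outside the positive category.
    """
    a, c = Counter(), Counter()
    for text, label in docs:
        (a if label == positive else c).update(set(text.split()))
    return [
        {t: tf * math.log2(2 + a[t] / max(1, c[t]))
         for t, tf in Counter(text.split()).items()}
        for text, _ in docs
    ]

if __name__ == "__main__":
    print(tf_idf(docs)[0])
    print(tf_rf(docs, "event")[0])
```

Unlike tf-idf, the supervised weight depends on the category labels: a term concentrated in the positive category receives a larger weight than one spread evenly across categories, which is the kind of class information supervised schemes exploit.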
Pages: 166578-166592
Page count: 15