Term weighting scheme for short-text classification: Twitter corpuses

Cited by: 41
Authors
Alsmadi, Issa [1]
Hoon, Gan Keng [1]
Affiliations
[1] Univ Sains Malaysia, Sch Comp Sci, Gelugor 11800, Pulau Pinang, Malaysia
Keywords
Short text; Classification; Term weighting; Social networks; Twitter; Machine learning; Feature selection; Categorization
DOI
10.1007/s00521-017-3298-8
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Term weighting is a well-known preprocessing step in text classification that assigns an appropriate weight to each term in every document to improve classification performance. Most methods proposed in the literature use traditional approaches that emphasize term frequency. These methods perform reasonably well on traditional documents, but they are unsuitable for social network data, where texts are short, sparse, and noisy. This study introduces a simple supervised term weighting approach, SW, which accounts for the special nature of short texts through term strength and term distribution. Its effect in a high-dimensional vector space is assessed against term weighting schemes that serve as baselines in traditional text classification. Two datasets are used with support vector machine, decision tree, k-nearest neighbor, and logistic regression algorithms. The first, the Sanders dataset, is a benchmark dataset of approximately 5000 tweets in four categories. The second, a self-collected dataset, contains roughly 1500 tweets in six classes gathered through the Twitter API. The evaluation applies tenfold cross-validation on the labeled data to compare the proposed approach with state-of-the-art methods. The experimental results indicate that the supervised approaches vary in performance but predominantly outperform the unsupervised approaches. The proposed approach, SW, outperforms the others in terms of accuracy. SW can deal with the limitations of short texts and mitigate the limitations of traditional approaches in the literature, thus improving performance to 80.83 and 90.64 (F-measure) on the Sanders dataset and the self-collected dataset, respectively.
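The abstract does not give the SW formula, so the sketch below does not attempt it. As a rough illustration only, it shows the kind of frequency-based baseline the abstract contrasts SW with (plain TF-IDF) together with a tenfold index split. The function names `tfidf_weights` and `kfold_indices` and the toy tweets are hypothetical and do not come from the paper.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document (plain TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted

def kfold_indices(n_items, k=10):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Hypothetical toy tweets, already tokenized (not from either dataset).
tweets = [
    "great phone love it".split(),
    "phone battery died again".split(),
    "love the new update".split(),
]
weights = tfidf_weights(tweets)
splits = list(kfold_indices(20, k=10))
```

In tweets like these, nearly every term occurs once per document, so the term-frequency factor is almost always 1 and the weight reduces to the IDF factor alone. That is one way to read the abstract's point that frequency-centered schemes carry little signal on short text.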
Pages: 3819-3831
Number of pages: 13