An improved term weighting scheme for text classification

Cited by: 13
Authors
Tang, Zhong [1]
Li, Wenqiang [1]
Li, Yan [1]
Affiliations
[1] Sichuan Univ, Sch Mech Engn, Sichuan Prov Key Lab Innovat Methodol & Creat Des, Chengdu, Sichuan, Peoples R China
Source
Funding
National Natural Science Foundation of China;
Keywords
feature selection; term weighting; text classification; text representation; TF-IEF; FEATURE-SELECTION METHOD; REPRESENTATION; IMPACT;
DOI
10.1002/cpe.5604
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification code
081202; 0835;
Abstract
Text representation is a necessary first step in text classification (TC), and achieving high TC performance requires an information-rich term weighting scheme. Term frequency-inverse document frequency (TF-IDF) is the most widely used term weighting scheme to date, but it suffers from two deficiencies. First, the global weighting factor IDF approaches infinity if a term does not occur in any text. Second, IDF equals zero if a term appears in every text. To offset these drawbacks, we first conduct an in-depth analysis of current term weighting schemes and then propose an improved scheme called term frequency-inverse exponential frequency (TF-IEF), together with several variants. The proposed method replaces the log-like IDF with a new global weighting factor, IEF, to characterize a term's global weight in the corpus, which can greatly reduce the influence of features (terms) with a high local weighting factor TF. As a result, more representative features can be generated. We carried out a series of experiments on two commonly used data sets (corpora) using Naive Bayes and support vector machine classifiers to validate the proposed schemes. The experimental results show that the proposed term weighting schemes outperform the compared schemes.
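The two IDF edge cases named in the abstract can be checked numerically. The sketch below assumes the textbook form IDF = log(N / df), where N is the number of texts and df is the number of texts containing the term. The function `exp_factor` is a hypothetical exponential stand-in, exp(-df / N), included only to show how a bounded global factor behaves at the same edge cases; the abstract does not give the IEF formula, so it should not be read as the authors' definition.

```python
import math


def idf(df, n_docs):
    """Textbook IDF, log(N / df). Returns infinity when df == 0."""
    if df == 0:
        return math.inf  # log(N / 0) diverges: term occurs in no text
    return math.log(n_docs / df)


def exp_factor(df, n_docs):
    """ASSUMPTION: illustrative bounded factor exp(-df / N).

    Not taken from the abstract; used only to contrast with IDF.
    """
    return math.exp(-df / n_docs)


if __name__ == "__main__":
    n_docs = 4
    # df = 0: term in no text; df = n_docs: term in every text
    for df in (0, 1, n_docs):
        print(df, idf(df, n_docs), exp_factor(df, n_docs))
```

With N = 4, `idf` is infinite at df = 0 and exactly zero at df = 4, while the assumed exponential factor stays within (0, 1] in both cases.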
Pages: 19