An improved term weighting scheme for text classification

Cited by: 13
Authors
Tang, Zhong [1]
Li, Wenqiang [1]
Li, Yan [1]
Affiliation
[1] Sichuan Univ, Sch Mech Engn, Sichuan Prov Key Lab Innovat Methodol & Creat Des, Chengdu, Sichuan, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
feature selection; term weighting; text classification; text representation; TF-IEF; FEATURE-SELECTION METHOD; REPRESENTATION; IMPACT;
DOI
10.1002/cpe.5604
Chinese Library Classification
TP31 [Computer software];
Discipline classification code
081202 ; 0835 ;
Abstract
Text representation is a necessary first step in text classification (TC), and achieving high TC performance requires an information-rich term weighting scheme. Term frequency-inverse document frequency (TF-IDF) is the most widely used term weighting scheme so far, but it suffers from two deficiencies. First, the global weighting factor IDF in TF-IDF approaches infinity if a term does not occur in any text. Second, the IDF equals zero if a term appears in every text. To offset these drawbacks, we first conduct an in-depth analysis of current term weighting schemes and then propose an improved term weighting scheme called term frequency-inverse exponential frequency (TF-IEF), together with several variants. The proposed method replaces the log-like IDF with a new global weighting factor, IEF, to characterize a term's global weight in the corpus, which can greatly reduce the effect of features (terms) with a high local weighting factor TF in term weighting. As a result, more representative features can be generated. We carried out a series of experiments on two commonly used data sets (corpora) with Naive Bayes and support vector machine classifiers to validate the performance of the proposed schemes. The experimental results show that the proposed term weighting schemes perform better than the compared schemes.
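The two IDF edge cases named in the abstract can be illustrated with a minimal sketch. It assumes the textbook definition IDF = log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name `idf` and the toy corpus size are hypothetical. The IEF formula is not stated in this record, so it is not reproduced here.

```python
import math


def idf(n_docs: int, doc_freq: int) -> float:
    """Textbook IDF, log(N / df). Hypothetical helper for illustration only."""
    if doc_freq == 0:
        # log(N / df) grows without bound as df -> 0; report that as infinity
        return math.inf
    return math.log(n_docs / doc_freq)


n_docs = 4  # assumed toy corpus size

rare = idf(n_docs, 1)             # term found in a single document: positive weight
everywhere = idf(n_docs, n_docs)  # term found in every document: log(1) = 0
absent = idf(n_docs, 0)           # term found in no document: unbounded weight

print(rare, everywhere, absent)
```

A term that appears in every document therefore contributes nothing to the TF-IDF vector regardless of its TF, while a term with zero document frequency has no finite weight.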
Pages: 19
Related papers
  • [31] A Comparative Study on Term Weighting Schemes for Text Classification
    Mazyad, Ahmad
    Teytaud, Fabien
    Fonlupt, Cyril
    MACHINE LEARNING, OPTIMIZATION, AND BIG DATA, MOD 2017, 2018, 10710 : 100 - 108
  • [32] A Study on Text Classification: Term Weighting Algorithm Analysis
    Tseng, Kuan-Hua
    Lin, Chun-Hung Richard
    Liu, Jain-Shing
    Huang, Chih-Ming Andrew
    Wang, Yue-Han
    JOURNAL OF INTERNET TECHNOLOGY, 2021, 22 (02): : 311 - 325
  • [33] On Term Frequency Factor in Supervised Term Weighting Schemes for Text Classification
    Dogan, Turgut
    Uysal, Alper Kursat
    ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, 2019, 44 (11) : 9545 - 9560
  • [34] On Term Frequency Factor in Supervised Term Weighting Schemes for Text Classification
    Turgut Dogan
    Alper Kursat Uysal
    Arabian Journal for Science and Engineering, 2019, 44 : 9545 - 9560
  • [35] Using modified term frequency to improve term weighting for text classification
    Chen, Long
    Jiang, Liangxiao
    Li, Chaoqun
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2021, 101
  • [36] A new term-weighting scheme for text classification using the odds of positive and negative class probabilities
    Ko, Youngjoong
    JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 2015, 66 (12) : 2553 - 2565
  • [37] An improved term weighting scheme for vector space model
    Sun, YH
    He, PL
    Chen, ZG
    PROCEEDINGS OF THE 2004 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7, 2004, : 1692 - 1695
  • [38] A Symmetric Term Weighting Scheme for Text Categorization Based on Term Occurrence Probabilities
    Erenel, Zafer
    Altincay, Hakan
    Varoglu, Ekrem
    2009 FIFTH INTERNATIONAL CONFERENCE ON SOFT COMPUTING, COMPUTING WITH WORDS AND PERCEPTIONS IN SYSTEM ANALYSIS, DECISION AND CONTROL, 2010, : 215 - 218
  • [39] PU text classification enhanced by term frequency-inverse document frequency-improved weighting
    Peng, Tao
    Liu, Lu
    Zuo, Wanli
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2014, 26 (03): : 728 - 741
  • [40] An Extension of Topic Models for Text Classification: a Term Weighting Approach
    Lee, Seonggyu
    Kim, Jinho
    Myaeng, Sung-Hyon
    2015 INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP), 2015, : 217 - 224