Model-induced term-weighting schemes for text classification

Cited by: 14
Authors
Kim, Hyun Kyung [1]
Kim, Minyoung [1]
Affiliations
[1] Seoul Natl Univ Sci & Technol, Dept Elect & IT Media Engn, Seoul 139743, South Korea
Funding
National Research Foundation of Singapore
Keywords
Document/text classification; Feature/term weighting; Feature selection; Supervised learning; Sentiment analysis; Categorization
DOI
10.1007/s10489-015-0745-z
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The bag-of-words representation of text data is very popular for document classification. In the recent literature, it has been shown that properly weighting the term feature vector can improve the classification performance significantly beyond the original term-frequency based features. In this paper we demystify the success of the recent term-weighting strategies as well as provide possibly more reasonable modifications. We then propose novel term-weighting schemes that can be induced from the well-known document probabilistic models such as the Naive Bayes and the multinomial term model. Interestingly, some of the intuition-based term-weighting schemes coincide exactly with the proposed derivations. Our term-weighting schemes are tested on large-scale text classification problems/datasets where we demonstrate improved prediction performance over existing approaches.
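As a rough illustration of what a term weight induced from a probabilistic document model can look like, the sketch below computes a smoothed log-ratio of per-class multinomial term probabilities and uses it to rescale term frequencies. This is a generic Naive Bayes style weighting chosen for illustration, not the scheme derived in the paper; the function names, the smoothing parameter `alpha`, and the toy documents are all assumptions.

```python
import math
from collections import Counter


def nb_log_ratio_weights(docs, labels, alpha=1.0):
    """Return a per-term weight: log P(t | positive) - log P(t | negative).

    Illustrative only: class-conditional term probabilities come from a
    multinomial model with additive (Laplace) smoothing `alpha`.
    """
    vocab = sorted({t for tokens in docs for t in tokens})
    pos, neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (pos if y == 1 else neg).update(tokens)
    # Smoothed normalizers for each class.
    pos_total = sum(pos.values()) + alpha * len(vocab)
    neg_total = sum(neg.values()) + alpha * len(vocab)
    return {
        t: math.log((pos[t] + alpha) / pos_total)
        - math.log((neg[t] + alpha) / neg_total)
        for t in vocab
    }


def weight_document(tokens, weights):
    """Scale raw term frequencies by the learned term weights."""
    tf = Counter(tokens)
    return {t: tf[t] * weights.get(t, 0.0) for t in tf}


# Toy data (hypothetical): one positive and one negative document.
docs = [["good", "great", "film"], ["bad", "boring", "film"]]
labels = [1, 0]
weights = nb_log_ratio_weights(docs, labels)
features = weight_document(["good", "good", "film"], weights)
```

In this toy setup a term that is equally frequent in both classes (here "film") receives a weight of zero, while class-specific terms receive positive or negative weights; the weighted vector would then be passed to any standard classifier.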
Pages: 30-43
Page count: 14