Supervised term weighting centroid-based classifiers for text categorization

Authors
Tam T. Nguyen
Kuiyu Chang
Siu Cheung Hui
Affiliation
[1] Nanyang Technological University, School of Computer Engineering
Source
Knowledge and Information Systems | 2013 / Vol. 35
Keywords
Centroid classification; Support vector machines; Kullback–Leibler divergence; Jensen–Shannon divergence
DOI
Not available
Abstract
In this paper, we study the theoretical properties of the class feature centroid (CFC) classifier by considering the rate of change of each prototype vector with respect to individual dimensions (terms). We show that CFC is inherently biased toward the larger (dominant majority) classes, which invariably leads to poor performance on class-imbalanced data. CFC also aggressively prunes terms that appear across all classes, discarding some non-exclusive but useful terms. To overcome these limitations while retaining CFC's intrinsic and worthy design goals, we propose an improved centroid-based classifier that uses precise term-class distribution properties instead of the presence or absence of terms in classes. Specifically, terms are weighted based on the Kullback–Leibler (KL) divergence between pairs of class-conditional term probabilities; we call this the CFC–KL centroid classifier. We then generalize CFC–KL to handle multi-class data by replacing the KL measure with the multi-class Jensen–Shannon (JS) divergence; we call this classifier CFC–JS. Our proposed supervised term weighting schemes have been evaluated on five datasets; KL- and JS-weighted classifiers consistently outperformed baseline CFC and unweighted support vector machines (SVM). We also devise a word cloud visualization approach to highlight the important class-specific words picked out by our KL and JS term weighting schemes, which were otherwise obscured by unsupervised term weighting. The experimental and visualization results show that KL and JS term weighting not only notably improves centroid-based classifiers but also benefits SVM classifiers.
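The divergence-based weighting described in the abstract can be illustrated with a small sketch. This is not the authors' exact formulation, which the abstract does not give. It assumes each term is modeled per class as a Bernoulli "present in a document / absent" distribution with Laplace smoothing, and it scores a term by the generalized JS divergence among those per-class distributions. All function names (`kl_divergence`, `js_divergence`, `term_presence_probs`, `js_term_weights`) and the toy documents are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits for two discrete distributions of equal length."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(dists, weights=None):
    """Generalized Jensen-Shannon divergence among several distributions.

    Computed as the weighted sum of KL(p_i || m), where m is the weighted
    mixture of all p_i. Uniform weights are assumed unless given.
    """
    n = len(dists)
    if weights is None:
        weights = [1.0 / n] * n
    mixture = [
        sum(w * d[k] for w, d in zip(weights, dists))
        for k in range(len(dists[0]))
    ]
    return sum(w * kl_divergence(d, mixture) for w, d in zip(weights, dists))


def term_presence_probs(docs_by_class, vocab):
    """Laplace-smoothed P(term appears in a document | class).

    Assumption: documents are sets of tokens (a Bernoulli presence model).
    """
    probs = {}
    for label, docs in docs_by_class.items():
        n = len(docs)
        probs[label] = {
            t: (sum(t in d for d in docs) + 1) / (n + 2) for t in vocab
        }
    return probs


def js_term_weights(docs_by_class, vocab):
    """Score each term by how much its presence distribution differs across classes."""
    probs = term_presence_probs(docs_by_class, vocab)
    labels = sorted(probs)
    return {
        t: js_divergence([[probs[c][t], 1.0 - probs[c][t]] for c in labels])
        for t in vocab
    }


# Toy corpus (hypothetical): two classes, two documents each.
docs_by_class = {
    "sports": [{"match", "team", "the"}, {"team", "goal", "the"}],
    "finance": [{"stock", "market", "the"}, {"bank", "stock", "the"}],
}
vocab = {"the", "team", "goal", "stock"}

weights = js_term_weights(docs_by_class, vocab)
for term in sorted(weights, key=weights.get, reverse=True):
    print(f"{term}: {weights[term]:.4f}")
```

In this toy corpus, "the" occurs equally often in both classes and receives a weight of zero, while "team" (frequent in one class, absent in the other) is weighted above "goal" (only partly class-specific). A non-exclusive term such as "goal" still keeps a nonzero weight, which is the behavior the abstract contrasts with CFC's pruning of terms that occur across classes.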
Pages: 61-85
Page count: 24
Related papers
50 in total
  • [41] A new Chinese text feature selection method in centroid-based classifier
    Gu, Yijun
    Wang, Rong
    Wang, Jianhua
    Yu, Jiangde
    2008 INTERNATIONAL SYMPOSIUM ON INFORMATION PROCESSING AND 2008 INTERNATIONAL PACIFIC WORKSHOP ON WEB MINING AND WEB-BASED APPLICATION, 2008, : 88 - +
  • [42] A comparative study of centroid-based, Neighborhood-based and statistical approaches for effective document categorization
    Tam, V
    Santoso, A
    Setiono, R
    16TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL IV, PROCEEDINGS, 2002, : 235 - 238
  • [43] Centroid-Based Clustering with αβ-Divergences
    Sarmiento, Auxiliadora
    Fondon, Irene
    Duran-Diaz, Ivan
    Cruces, Sergio
    ENTROPY, 2019, 21 (02)
  • [44] A New Supervised Term Ranking Method for Text Categorization
    Mammadov, Musa
    Yearwood, John
    Zhao, Lei
    AI 2010: ADVANCES IN ARTIFICIAL INTELLIGENCE, 2010, 6464 : 102 - 111
  • [45] A supervised term selection technique for effective text categorization
    Basu, Tanmay
    Murthy, C. A.
    INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2016, 7 (05) : 877 - 892
  • [46] A supervised term selection technique for effective text categorization
    Tanmay Basu
    C. A. Murthy
    International Journal of Machine Learning and Cybernetics, 2016, 7 : 877 - 892
  • [47] Supervised Graph-Based Term Weighting Scheme for Effective Text Classification
    Shanavas, Niloofer
    Wang, Hui
    Lin, Zhiwei
    Hawe, Glenn
    ECAI 2016: 22ND EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2016, 285 : 1710 - 1711
  • [48] An improved centroid classifier for text categorization
    Tan, Songbo
    EXPERT SYSTEMS WITH APPLICATIONS, 2008, 35 (1-2) : 279 - 285
  • [49] RANDOM CENTROID INITIALIZATION FOR IMPROVING CENTROID-BASED CLUSTERING
    Romanuke V.V.
    Decision Making: Applications in Management and Engineering, 2023, 6 (02) : 734 - 746
  • [50] On Term Frequency Factor in Supervised Term Weighting Schemes for Text Classification
    Dogan, Turgut
    Uysal, Alper Kursat
    ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, 2019, 44 (11) : 9545 - 9560