Combining supervised term-weighting metrics for SVM text classification with extended term representation

Cited by: 65
Authors
Haddoud, Mounia [1, 2]
Mokhtari, Aicha [1]
Lecroq, Thierry [2]
Abdeddaim, Said [2]
Affiliations
[1] USTHB, RIIMA, BP 32, Algiers 16111, Algeria
[2] Univ Rouen, LITIS, F-76821 Mont St Aignan, France
Keywords
Text classification; Term weighting; Text representation; Support vector machines; Classifier combination; SCHEME;
DOI
10.1007/s10115-016-0924-1
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The accuracy of a text classification method based on an SVM learner depends on the weighting metric used to assign a weight to a term. Weighting metrics can be classified as supervised or unsupervised according to whether they use prior information on the number of documents belonging to each category. A supervised metric should be highly informative about the relation of a document term to a category, and discriminative in separating the positive documents from the negative documents for this category. In this paper, we propose 80 metrics never used for the term-weighting problem and compare them to 16 functions from the literature. A large number of these metrics were initially proposed for other data mining problems: feature selection, classification rules and term collocations. While many previous works have shown the merits of using a particular metric, our experience suggests that the results obtained by such metrics can be highly dependent on the label distribution in the corpus and on the performance measure used (microaveraged or macroaveraged F1-score). The solution we propose is to combine the metrics in order to improve the classification. More precisely, we show that an SVM classifier that combines the outputs of SVM classifiers using different metrics performs well in all situations. The second main contribution of this paper is an extended term representation for the vector space model that significantly improves the predictions of the text classifier.
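The distinction the abstract draws between supervised and unsupervised weighting can be illustrated with a minimal sketch. It assumes a toy tokenized corpus and one binary category, and it uses two weights known from the general literature: inverse document frequency (unsupervised) and relevance frequency (supervised). Whether either is among the metrics compared in the paper is not stated in this record, and the function names, corpus, and labels below are hypothetical.

```python
import math


def contingency(docs, labels, term):
    """Count documents by (term present?, document in category?).

    a: positive docs containing the term   b: positive docs without it
    c: negative docs containing the term   d: negative docs without it
    """
    a = b = c = d = 0
    for tokens, is_positive in zip(docs, labels):
        has_term = term in tokens
        if is_positive and has_term:
            a += 1
        elif is_positive:
            b += 1
        elif has_term:
            c += 1
        else:
            d += 1
    return a, b, c, d


def idf(a, b, c, d):
    """Unsupervised weight: only total document frequency matters."""
    n = a + b + c + d
    return math.log2(n / (a + c)) if (a + c) else 0.0


def rf(a, b, c, d):
    """Supervised weight (relevance frequency): uses the category split."""
    return math.log2(2 + a / max(1, c))


# Hypothetical toy corpus: two positive and two negative documents.
docs = [
    ["wheat", "price", "rise"],
    ["wheat", "export", "grain"],
    ["stock", "price", "fall"],
    ["bank", "stock", "rate"],
]
labels = [True, True, False, False]

for term in ("wheat", "price"):
    counts = contingency(docs, labels, term)
    print(term, counts, round(idf(*counts), 3), round(rf(*counts), 3))
```

Both terms occur in two of the four documents, so the unsupervised weight cannot tell them apart, while the supervised weight favors "wheat", which appears only in positive documents. Combining several such metrics, as the abstract describes, would mean training one classifier per weighting and feeding their outputs to a further classifier; that step is not shown here.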
Pages: 909-931
Page count: 23
Related papers
50 in total
  • [21] Term Similarity and Weighting Framework for Text Representation
    Sani, Sadiq
    Wiratunga, Nirmalie
    Massie, Stewart
    Lothian, Robert
    CASE-BASED REASONING RESEARCH AND DEVELOPMENT, ICCBR 2011, 2011, 6880 : 304 - 318
  • [22] Assessing the behavior and performance of a supervised term-weighting technique for topic-based retrieval
    Maisonnave, Mariano
    Delbianco, Fernando
    Tohme, Fernando
    Maguitman, Ana
    INFORMATION PROCESSING & MANAGEMENT, 2021, 58 (03)
  • [23] A Flexible Supervised Term-Weighting Technique and its Application to Variable Extraction and Information Retrieval
    Maisonnave, Mariano
    Delbianco, Fernando
    Tohme, Fernando
    Maguitman, Ana
    INTELIGENCIA ARTIFICIAL-IBEROAMERICAL JOURNAL OF ARTIFICIAL INTELLIGENCE, 2019, 22 (63) : 61 - 80
  • [24] Effective Text Classification Through Supervised Rough Set-Based Term Weighting
    Cekik, Rasim
    SYMMETRY-BASEL, 2025, 17 (01):
  • [25] Adaptable Term Weighting Framework for Text Classification
    Huynh, Dat
    Tran, Dat
    Ma, Wanli
    Sharma, Dharmendra
    COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, PT II, 2011, 6609 : 254 - 265
  • [26] A survey of term weighting schemes for text classification
    Alsaeedi, Abdullah
    INTERNATIONAL JOURNAL OF DATA MINING MODELLING AND MANAGEMENT, 2020, 12 (02) : 237 - 254
  • [27] Imbalanced text classification: A term weighting approach
    Liu, Ying
    Loh, Han Tong
    Sun, Aixin
    EXPERT SYSTEMS WITH APPLICATIONS, 2009, 36 (01) : 690 - 701
  • [28] An improved method of term weighting for text classification
    Jiang, Hua
    Li, Ping
    Hu, Xin
    Wang, Shuyan
    2009 IEEE INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTING AND INTELLIGENT SYSTEMS, PROCEEDINGS, VOL 1, 2009, : 294 - 298
  • [29] An improved term weighting scheme for text classification
    Tang, Zhong
    Li, Wenqiang
    Li, Yan
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2020, 32 (09):
  • [30] A new document representation based on global policy for supervised term weighting schemes in text categorization
    Jia, Longjia
    Zhang, Bangzuo
    MATHEMATICAL BIOSCIENCES AND ENGINEERING, 2022, 19 (05) : 5223 - 5240