An HMM-based over-sampling technique to improve text classification

Cited by: 15
Authors
Iglesias, E. L. [1]
Seara Vieira, A.
Borrajo, L.
Affiliations
[1] Univ Vigo, Dept Comp Sci, Escuela Super Ingn Informat, Orense 32004, Spain
Keywords
Hidden Markov Model; Text classification; Oversampling techniques; Categorization; Models
DOI
10.1016/j.eswa.2013.07.036
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper presents a novel over-sampling method based on document content to handle the class imbalance problem in text classification. The new technique, COS-HMM (Content-based Over-Sampling HMM), uses an HMM trained on a corpus to create new samples that resemble the existing documents. The HMM is treated as a document generator that produces synthetic instances modeled on the data it was trained with. To evaluate its performance, COS-HMM is tested with a Support Vector Machine (SVM) on two medical document corpora (OHSUMED and TREC Genomics) and is then compared with the Random Over-Sampling (ROS) and SMOTE techniques. Results suggest that applying over-sampling strategies improves the overall performance of the SVM in classifying documents. Based on the empirical and statistical studies, the new method clearly outperforms the baseline method (ROS) and performs better than SMOTE in the majority of tested cases. (C) 2013 Elsevier Ltd. All rights reserved.
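The two comparison techniques named in the abstract, ROS and SMOTE, can be illustrated with a minimal sketch. This is not the authors' COS-HMM implementation, and the record does not describe how either baseline was configured in the paper. The function names `random_over_sample` and `smote_like_sample`, the parameter `k`, and the use of dense feature vectors as plain Python lists are all assumptions made for illustration. The SMOTE-style function also assumes the minority class holds at least two samples.

```python
import random


def random_over_sample(minority, target_size, rng):
    """ROS sketch: duplicate randomly chosen minority samples up to target_size."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return list(minority) + extra


def smote_like_sample(minority, target_size, k, rng):
    """SMOTE-style sketch: interpolate between a sample and a near minority neighbour."""

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    while len(minority) + len(synthetic) < target_size:
        base = rng.choice(minority)
        # The k nearest other minority samples (hypothetical parameter k).
        neighbours = sorted(
            (m for m in minority if m is not base),
            key=lambda m: squared_distance(base, m),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return list(minority) + synthetic
```

ROS only repeats existing vectors, while the SMOTE-style function creates new points on the line segments between neighbouring minority samples. According to the abstract, COS-HMM instead generates synthetic instances from an HMM trained on the corpus, which this sketch does not attempt to reproduce.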
Pages: 7184-7192
Page count: 9