An HMM-based over-sampling technique to improve text classification

Cited by: 15

Authors:

Iglesias, E. L. [1]

Seara Vieira, A.

Borrajo, L.

Affiliation:

[1] Univ Vigo, Dept Comp Sci, Escuela Super Ingn Informat, Orense 32004, Spain

Source:

EXPERT SYSTEMS WITH APPLICATIONS | 2013, Vol. 40, Issue 18

Keywords:

Hidden Markov Model; Text classification; Oversampling techniques; CATEGORIZATION; MODELS

DOI:

10.1016/j.eswa.2013.07.036

Chinese Library Classification:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

This paper presents a novel over-sampling method based on document content to handle the class imbalance problem in text classification. The new technique, COS-HMM (Content-based Over-Sampling HMM), uses an HMM trained on a corpus to create new samples that resemble the existing documents. The HMM is treated as a document generator that produces synthetic instances based on the data it was trained on. To demonstrate its effectiveness, COS-HMM is tested with a Support Vector Machine (SVM) on two medical document corpora (OHSUMED and TREC Genomics) and compared with the Random Over-Sampling (ROS) and SMOTE techniques. Results suggest that applying over-sampling strategies improves the overall performance of the SVM in classifying documents. Based on the empirical and statistical studies, the new method clearly outperforms the baseline method (ROS) and performs better than SMOTE in the majority of tested cases. (C) 2013 Elsevier Ltd. All rights reserved.
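The abstract's idea of an HMM acting as a document generator can be illustrated with a minimal sketch. This is not the COS-HMM procedure itself: the abstract does not give the model structure or the training step, so the states, vocabulary, and probabilities below are hypothetical toy values chosen only to show how a discrete HMM emits a word sequence by sampling a hidden state, emitting a word, and moving to the next state.

```python
import random


def sample_document(start_p, trans_p, emit_p, length, rng):
    """Draw `length` words from a discrete HMM by ancestral sampling."""

    def draw(dist):
        # `dist` maps each outcome to its probability.
        outcomes = list(dist)
        weights = [dist[o] for o in outcomes]
        return rng.choices(outcomes, weights=weights, k=1)[0]

    words = []
    state = draw(start_p)                  # pick the initial hidden state
    for _ in range(length):
        words.append(draw(emit_p[state]))  # emit a word from the current state
        state = draw(trans_p[state])       # move to the next hidden state
    return words


# Hypothetical toy parameters (assumptions, not taken from the paper).
START = {"topic": 0.6, "filler": 0.4}
TRANS = {
    "topic": {"topic": 0.7, "filler": 0.3},
    "filler": {"topic": 0.5, "filler": 0.5},
}
EMIT = {
    "topic": {"gene": 0.5, "protein": 0.3, "cell": 0.2},
    "filler": {"the": 0.6, "of": 0.4},
}


def oversample(n_new, length, seed=0):
    """Generate `n_new` synthetic minority-class documents of `length` words."""
    rng = random.Random(seed)
    return [sample_document(START, TRANS, EMIT, length, rng) for _ in range(n_new)]


if __name__ == "__main__":
    for doc in oversample(3, 8, seed=1):
        print(" ".join(doc))
```

In an over-sampling setting, the generated documents would be added to the minority class before training the classifier; in practice the HMM parameters would be estimated from the existing minority-class documents rather than written by hand.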


Pages: 7184-7192

Page count: 9
