Text classification method based on self-training and LDA topic models

Cited by: 148
Authors
Pavlinek, Miha [1 ]
Podgorelec, Vili [1 ]
Affiliation
[1] Univ Maribor, Fac Elect Engn & Comp Sci, Inst Informat, Maribor, Slovenia
Keywords
Classification; Topic modeling; LDA; Semi-supervised learning; Self-training; UNLABELED DOCUMENTS; ALGORITHM; SOFTWARE; EXAMPLES;
DOI
10.1016/j.eswa.2017.03.020
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Supervised text classification methods are efficient when they can learn from reasonably sized labeled sets. When only a small set of labeled documents is available, however, semi-supervised methods become more appropriate. These methods compare distributions between labeled and unlabeled instances; it is therefore important to focus on the representation and its discriminative ability. In this paper we present the ST LDA method for text classification in a semi-supervised manner, with representations based on topic models. The proposed method comprises a semi-supervised text classification algorithm based on self-training and a model that determines parameter settings for any new document collection. Self-training is used to enlarge the small initial labeled set with the help of information from unlabeled data. We investigate how the topic-based representation affects prediction accuracy by applying the NBMN and SVM classification algorithms to an enlarged labeled set, and then compare the results with the same method on a typical TF-IDF representation. We also compare ST LDA with supervised classification methods and other well-known semi-supervised methods. Experiments were conducted on 11 very small initial labeled sets sampled from six publicly available document collections. The results show that our ST LDA method, when used in combination with NBMN, performed significantly better in terms of classification accuracy than other comparable methods and variations. In this manner, the ST LDA method proved to be a competitive classification method for different text collections when only a small set of labeled instances is available. As such, the proposed ST LDA method may well help to improve text classification tasks, which are essential in many advanced expert and intelligent systems, especially when labeled texts are scarce. (C) 2017 Elsevier Ltd. All rights reserved.
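The self-training scheme described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' exact ST LDA implementation: documents are mapped to LDA topic proportions, a Naive Bayes classifier (standing in for NBMN) is trained on the small labeled set, and unlabeled documents whose predictions exceed a confidence threshold are iteratively moved into the labeled set. The number of topics, the confidence threshold, and the round limit below are assumed parameter values; the paper's own model selects such settings per collection.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.naive_bayes import MultinomialNB


def self_train_lda(labeled_docs, labels, unlabeled_docs,
                   n_topics=5, threshold=0.9, max_rounds=10, seed=0):
    """Illustrative self-training on LDA topic features.

    n_topics, threshold, and max_rounds are assumed values, not the
    settings chosen by the ST LDA parameter model in the paper.
    """
    # Fit the vectorizer and topic model on all documents (labeled + unlabeled).
    docs = list(labeled_docs) + list(unlabeled_docs)
    vec = CountVectorizer()
    counts = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    topics = lda.fit_transform(counts)  # per-document topic proportions

    n_lab = len(labeled_docs)
    X_lab, y_lab = topics[:n_lab], np.asarray(labels)
    X_unl = topics[n_lab:]

    for _ in range(max_rounds):
        if len(X_unl) == 0:
            break
        clf = MultinomialNB().fit(X_lab, y_lab)
        proba = clf.predict_proba(X_unl)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # no prediction is confident enough to pseudo-label
        # Move high-confidence documents, with predicted labels, into the labeled set.
        pseudo = clf.classes_[proba.argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unl[confident]])
        y_lab = np.concatenate([y_lab, pseudo[confident]])
        X_unl = X_unl[~confident]

    # Final classifier trained on the enlarged labeled set.
    return MultinomialNB().fit(X_lab, y_lab), vec, lda
```

A new document is then classified by passing it through the same vectorizer and topic model before prediction, e.g. `clf.predict(lda.transform(vec.transform(["some text"])))`.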
Pages: 83-93 (11 pages)