Instance labeling in semi-supervised learning with meaning values of words

Cited by: 13
Authors
Altinel, Berna [1]
Ganiz, Murat Can [1]
Diri, Banu [2]
Affiliations
[1] Marmara Univ, Dept Comp Engn, Fac Engn, Istanbul, Turkey
[2] Yildiz Tech Univ, Dept Comp Engn, Fac Engn, Istanbul, Turkey
Keywords
Text classification; Semantic kernel; Semi-supervised learning; Instance labeling; Helmholtz principle
DOI
10.1016/j.engappai.2017.04.003
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
In supervised learning systems, only labeled samples are used to build a classifier, which is then used to predict the class labels of unlabeled samples. However, obtaining labeled data is expensive, time-consuming, and difficult in practice, because labeling a data set requires the effort of a human expert. Unlabeled data, on the other hand, are often plentiful and therefore relatively inexpensive and easy to obtain. Semi-supervised learning (SSL) methods strive to exploit this plentiful source of unlabeled examples to increase the learning capacity of the classifier, particularly when the number of labeled examples is limited. Since SSL techniques usually reach higher accuracy and require less human effort, they attract substantial attention in both practical applications and theoretical research. This study proposes a novel semi-supervised methodology. The algorithm uses a new method to predict the class labels of unlabeled examples in a corpus and incorporates them into the training set to build a better classifier. The approach depends on a meaning calculation, which computes the meaning scores of words within the scope of each class. The meaning computation is based on the Helmholtz principle and has been applied to various text mining tasks such as feature extraction, information retrieval, and document summarization. Nevertheless, according to the literature, the proposed method, ILBOM, is the first work to use meaning calculation in a semi-supervised way to construct a semantic smoothing kernel for Support Vector Machines (SVM). The proposed methodology is evaluated through various experiments on standard textual datasets. ILBOM's experimental results are compared with those of three baseline algorithms, including SVM with a linear kernel, one of the most frequently used algorithms in text classification. The experimental results show that labeling unlabeled instances based on the meaning scores of words to augment the training set is valuable and significantly increases classification accuracy on previously unseen test instances.
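The labeling step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the per-class word score below is a smoothed log-ratio used as a stand-in for the Helmholtz-based meaning measure, the semantic smoothing kernel and the SVM are left out, and every name and parameter (`class_word_scores`, `label_instance`, `augment_training_set`, `min_margin`) is hypothetical.

```python
import math
from collections import Counter


def class_word_scores(labeled_docs):
    """Score each word per class from the labeled documents.

    Stand-in score (an assumption, NOT the Helmholtz meaning measure):
    log of the word's smoothed relative frequency in the class divided by
    its smoothed relative frequency in the whole labeled corpus.
    """
    class_counts = {}
    total = Counter()
    for tokens, label in labeled_docs:
        class_counts.setdefault(label, Counter()).update(tokens)
        total.update(tokens)
    vocab_size = len(total)
    total_n = sum(total.values())
    scores = {}
    for label, counts in class_counts.items():
        class_n = sum(counts.values())
        scores[label] = {
            w: math.log(
                ((counts[w] + 1) / (class_n + vocab_size))
                / ((total[w] + 1) / (total_n + vocab_size))
            )
            for w in total
        }
    return scores


def label_instance(tokens, scores):
    """Return the best-scoring class and its margin over the runner-up."""
    doc_scores = {
        label: sum(word_scores.get(t, 0.0) for t in tokens)
        for label, word_scores in scores.items()
    }
    ranked = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_score = ranked[0]
    margin = best_score - ranked[1][1] if len(ranked) > 1 else float("inf")
    return best_label, margin


def augment_training_set(labeled_docs, unlabeled_docs, min_margin=0.5):
    """Add confidently labeled unlabeled documents to the training set.

    min_margin is a hypothetical confidence threshold for this sketch.
    """
    scores = class_word_scores(labeled_docs)
    augmented = list(labeled_docs)
    for tokens in unlabeled_docs:
        label, margin = label_instance(tokens, scores)
        if margin >= min_margin:
            augmented.append((tokens, label))
    return augmented


labeled = [
    (["goal", "match", "team"], "sport"),
    (["stock", "market", "bank"], "finance"),
]
unlabeled = [["team", "goal"], ["bank", "stock"], ["weather"]]
augmented = augment_training_set(labeled, unlabeled)
```

In this toy run the document containing only an unseen word scores the same for both classes, so it falls below the margin and stays out of the training set. A classifier would then be trained on `augmented` in place of the original labeled set.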
Pages: 152-163
Number of pages: 12