Instance labeling in semi-supervised learning with meaning values of words

Cited by: 13
Authors
Altinel, Berna [1]
Ganiz, Murat Can [1]
Diri, Banu [2]
Affiliations
[1] Marmara Univ, Dept Comp Engn, Fac Engn, Istanbul, Turkey
[2] Yildiz Tech Univ, Dept Comp Engn, Fac Engn, Istanbul, Turkey
Keywords
Text classification; Semantic kernel; Semi-supervised learning; Instance labeling; Helmholtz principle
DOI
10.1016/j.engappai.2017.04.003
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In supervised learning systems, only labeled samples are used to build a classifier, which is then used to predict the class labels of unlabeled samples. However, obtaining labeled data is expensive, time-consuming, and difficult in practice, because labeling a data set requires the effort of a human expert. Unlabeled data, on the other hand, are often plentiful and therefore relatively inexpensive and easy to obtain. Semi-supervised learning (SSL) methods strive to use this plentiful source of unlabeled examples to increase the learning capacity of the classifier, particularly when the number of labeled examples is limited. Because SSL techniques usually reach higher accuracy and require less human effort, they attract substantial attention in both practical applications and theoretical research. This study proposes a novel semi-supervised methodology, ILBOM. The algorithm uses a new method to predict the class labels of unlabeled examples in a corpus and incorporates them into the training set to build a better classifier. The approach depends on a meaning calculation, which computes the meaning scores of words within the scope of classes. Meaning computation is based on the Helmholtz principle and has been applied to various text mining tasks such as feature extraction, information retrieval, and document summarization. Nevertheless, according to the literature, ILBOM is the first work that uses meaning calculation in a semi-supervised way to construct a semantic smoothing kernel for Support Vector Machines (SVM). The proposed methodology is evaluated through various experiments on standard textual datasets. ILBOM's experimental results are compared with three baseline algorithms, including SVM with a linear kernel, one of the most frequently used algorithms in text classification. Experimental results show that labeling unlabeled instances based on the meaning scores of words to augment the training set is valuable and significantly increases classification accuracy on previously unseen test instances.
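
The instance-labeling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the meaning formula, so the score used here, -(1/m) log(C(k, m) / N^(m-1)), is an assumed Helmholtz-style form, where k is a word's count in the whole labeled corpus, m its count in one class, and N the ratio of corpus length to class length. The function names meaning_scores, label_instance, and augment are hypothetical, and the semantic smoothing kernel and the SVM training step are not shown.

```python
import math
from collections import Counter


def meaning_scores(docs, labels):
    """Score each word per class on the labeled documents.

    ASSUMPTION: the formula below is a Helmholtz-style stand-in chosen for
    illustration; the abstract does not give the exact formula.
    """
    corpus_counts = Counter()
    class_counts = {}
    for tokens, label in zip(docs, labels):
        corpus_counts.update(tokens)
        class_counts.setdefault(label, Counter()).update(tokens)
    corpus_len = sum(corpus_counts.values())

    scores = {}
    for label, counts in class_counts.items():
        n = corpus_len / sum(counts.values())  # N = corpus length / class length
        scores[label] = {
            # -(1/m) * log( C(k, m) / N^(m-1) )
            w: -(math.log(math.comb(corpus_counts[w], m)) - (m - 1) * math.log(n)) / m
            for w, m in counts.items()
        }
    return scores


def label_instance(tokens, scores):
    """Assign the class whose summed word scores are highest (unseen words add 0)."""
    totals = {c: sum(s.get(w, 0.0) for w in tokens) for c, s in scores.items()}
    return max(totals, key=totals.get)


def augment(labeled_docs, labels, unlabeled_docs):
    """Label the unlabeled documents and append them to the training set."""
    scores = meaning_scores(labeled_docs, labels)
    new_labels = [label_instance(d, scores) for d in unlabeled_docs]
    return labeled_docs + unlabeled_docs, labels + new_labels
```

Documents are passed as token lists. The augmented set returned by augment would then be used to train the final classifier.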
Pages: 152-163
Page count: 12