A high performance centroid-based classification approach for language identification

Cited by: 37
Authors
Takci, Hidayet [1]
Gungor, Tunga [2]
Affiliations
[1] GYTE, Dept Comp Engn, TR-41400 Gebze, Turkey
[2] Bogazici Univ, Dept Comp Engn, TR-34342 Istanbul, Turkey
Keywords
Language identification; Centroid-based classification; IDF (inverse document frequency); ICF (inverse class frequency); TEXT; SUPPORT;
DOI
10.1016/j.patrec.2012.06.012
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Centroid-based classification is a machine learning approach used in the text classification domain. The main advantage of centroid-based classifiers is their high performance in both the training stage and the classification stage. However, their success rate can be lower than that of other classifiers if good centroid values are not used. In this paper, we apply the centroid-based classification method to the language identification problem, which can be considered a sub-problem of text classification. We propose a novel method, named inverse class frequency, which updates the classical centroid values to improve their quality. We also use a feature set formed of individual characters, rather than words or n-gram sequences, to decrease the training and classification times. The experiments were performed on the ECI/MCI corpus, and the method was compared with other methods and previous studies. The results show that the proposed approach yields high success rates and works very efficiently for language identification. (c) 2012 Elsevier B.V. All rights reserved.
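The pipeline outlined in the abstract (single-character features, one centroid per language, an inverse class frequency reweighting) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the ICF formula, so the form `log(1 + K / cf)` used here, the cosine similarity measure, and all function names (`char_profile`, `train_centroids`, `identify`) are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def char_profile(text):
    """Relative frequency of each alphabetic character in `text`."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def centroid(texts):
    """Mean of the per-document character profiles of one language."""
    acc = defaultdict(float)
    for text in texts:
        for ch, w in char_profile(text).items():
            acc[ch] += w
    return {ch: w / len(texts) for ch, w in acc.items()}


def train_centroids(samples):
    """samples: dict mapping a language label to a list of training strings."""
    plain = {lang: centroid(texts) for lang, texts in samples.items()}
    # Class frequency: number of languages whose centroid contains the character.
    class_freq = Counter(ch for prof in plain.values() for ch in prof)
    k = len(plain)
    # ASSUMED ICF form (hypothetical, not taken from the paper).
    icf = {ch: math.log(1 + k / cf) for ch, cf in class_freq.items()}
    weighted = {
        lang: {ch: w * icf[ch] for ch, w in prof.items()}
        for lang, prof in plain.items()
    }
    return weighted, icf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(ch, 0.0) for ch, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(text, centroids, icf):
    """Return the label of the centroid most similar to `text`."""
    # Characters never seen in training get weight 0 (an assumption).
    profile = {ch: w * icf.get(ch, 0.0) for ch, w in char_profile(text).items()}
    return max(centroids, key=lambda lang: cosine(profile, centroids[lang]))
```

Training touches each character once and classification compares one short vector against one centroid per language, which is consistent with the efficiency argument in the abstract. Any accuracy figures would depend on the corpus and on the actual ICF definition in the paper.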
Pages: 2077-2084
Page count: 8