Wikipedia-based cross-language text classification

Cited by: 13
Authors
Mourino Garcia, Marcos Antonio [1]
Perez Rodriguez, Roberto [1]
Anido Rifon, Luis [1]
Affiliations
[1] Univ Vigo, Dept Telemat Engn, Telecommun Engn Sch, Campus Lagoas Marcosende, Vigo 36310, Spain
Keywords
Cross-language text classification; Wikipedia Miner; Bag of concepts; Bag of words; Hybrid; Document representation; KNOWLEDGE
DOI
10.1016/j.ins.2017.04.024
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This paper presents the application of a Wikipedia-based bag of concepts (WikiBoC) document representation to cross-language text classification (CLTC). Its main objective is to alleviate the major drawbacks of state-of-the-art CLTC approaches, which are typically based on the machine translation (MT) of documents represented as bags of words (BoW). We propose a technique called cross-language concept matching (CLCM), which converts concept-based representations of documents from one language to another using Wikipedia correspondences between concepts in different languages, and thus does not rely on automated full-text translation. We describe two proposals. The first uses the WikiBoC representation in conjunction with the CLCM technique (WikiBoC-CLCM) to classify documents written in a language L1 with an SVM algorithm trained on documents written in another language L2. The second is a hybrid model for representing documents that combines WikiBoC-CLCM with the classic BoW-MT approach. To evaluate the two proposals, we conducted several experiments with three cross-lingual corpora: the JRC-Acquis corpus and two purpose-built corpora composed of Wikipedia articles. The first proposal outperforms state-of-the-art approaches when training sequences are short, achieving performance increases of up to 233.33%. The second proposal outperforms state-of-the-art approaches over the whole range of training sequences, achieving performance increases of up to 23.78%. The results show the benefits of the WikiBoC-CLCM approach: concepts extracted from documents add useful information to the classifier, thus improving its performance. © 2017 Elsevier Inc. All rights reserved.
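The concept-matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the link table `ES_TO_EN`, the function names `match_concepts` and `hybrid_features`, the example weights, and the choice to combine the two representations by merging prefixed features are all assumptions made for illustration. The abstract does not say how concepts are weighted or how the hybrid model merges WikiBoC-CLCM with BoW-MT.

```python
from typing import Dict

# Hypothetical interlanguage link table (Spanish title -> English title).
# Real correspondences would come from Wikipedia's interlanguage links;
# these entries are illustrative only.
ES_TO_EN: Dict[str, str] = {
    "Aprendizaje automático": "Machine learning",
    "Clasificación de documentos": "Document classification",
}


def match_concepts(boc: Dict[str, float], links: Dict[str, str]) -> Dict[str, float]:
    """Map a bag of concepts into the target language.

    Concepts with no correspondence are dropped (an assumption), and
    weights are summed when two source concepts map to the same target.
    """
    mapped: Dict[str, float] = {}
    for concept, weight in boc.items():
        target = links.get(concept)
        if target is None:
            continue  # no cross-language correspondence available
        mapped[target] = mapped.get(target, 0.0) + weight
    return mapped


def hybrid_features(bow_mt: Dict[str, float], boc_mapped: Dict[str, float]) -> Dict[str, float]:
    """Merge machine-translated word features and matched concept features.

    Prefixes keep a word and a concept with the same surface string apart.
    Simple merging is an assumed combination scheme, not the paper's.
    """
    features = {f"w:{term}": value for term, value in bow_mt.items()}
    features.update({f"c:{concept}": value for concept, value in boc_mapped.items()})
    return features


if __name__ == "__main__":
    source_boc = {"Aprendizaje automático": 2.0, "Concepto sin enlace": 1.0}
    target_boc = match_concepts(source_boc, ES_TO_EN)
    print(target_boc)
    print(hybrid_features({"learning": 1.0}, target_boc))
```

The resulting feature dictionaries could then be vectorized and passed to any SVM implementation trained on documents in the target language. No full-text translation is involved in the concept-matching step itself.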
Pages: 12-28
Page count: 17