A co-classification approach to learning from multilingual corpora

Cited by: 25
Authors
Amini, Massih-Reza [1]
Goutte, Cyril [1]
Affiliations
[1] Natl Res Council Canada, Interact Language Technol Grp, Gatineau, PQ J8X 3X7, Canada
Keywords
Text categorization; Multilingual data; Logistic regression; Boosting;
DOI
10.1007/s10994-009-5151-5
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We address the problem of learning text categorization from a corpus of multilingual documents. We propose a multiview, co-regularization learning approach in which we consider each language as a separate source and minimize a joint loss that combines the monolingual classification losses in each language while ensuring that the categorization is consistent across languages. We derive training algorithms for logistic regression and boosting, and show that the resulting categorizers outperform models trained independently on each language and even, most of the time, models trained on the joint bilingual data. Experiments are carried out on a multilingual extension of the RCV2 corpus, which is available for benchmarking.
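To make the idea of a joint loss concrete, here is a minimal sketch for two language views and binary labels. It is an illustration under stated assumptions, not the authors' algorithm: the function names, the weighting parameter `lam`, and the squared difference between the two predicted probabilities used as the consistency term are all hypothetical choices made for this example.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(w, x):
    # Inner product of a weight vector and a feature vector.
    return sum(wi * xi for wi, xi in zip(w, x))

def log_loss(p, y):
    # Cross-entropy of predicted probability p against a label y in {0, 1}.
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def joint_loss(w1, w2, X1, X2, y, lam):
    """Mean over documents of: loss in view 1 + loss in view 2
    + lam * disagreement between the two views.

    ASSUMPTION: the disagreement term is the squared difference of the
    two predicted probabilities, chosen only for illustration.
    """
    total = 0.0
    for x1, x2, yi in zip(X1, X2, y):
        p1 = sigmoid(dot(w1, x1))  # prediction from the first language view
        p2 = sigmoid(dot(w2, x2))  # prediction from the second language view
        total += log_loss(p1, yi) + log_loss(p2, yi) + lam * (p1 - p2) ** 2
    return total / len(y)
```

Setting `lam` to 0 reduces this to two independent monolingual logistic regression losses; larger values penalize documents on which the two language-specific classifiers disagree.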
Pages: 105-121
Page count: 17