A statistical approach to crosslingual natural language tasks

Cited by: 25
Authors
Pinto, David [1, 2]
Civera, Jorge [2]
Barron-Cedeno, Alberto [2]
Juan, Alfons [2]
Rosso, Paolo [2]
Affiliations
[1] Benemerita Univ Autonoma Puebla, Fac Ciencias Computac, Puebla, Mexico
[2] Univ Politecn Valencia, Dept Sistemas Informat & Computac, Valencia, Spain
Source
JOURNAL OF ALGORITHMS-COGNITION INFORMATICS AND LOGIC | 2009 / Volume 64 / Issue 01
Keywords
Natural language processing; IBM translation models; Crosslingual data; Text classification; Information retrieval; Plagiarism analysis; TEXT; MODELS;
DOI
10.1016/j.jalgor.2009.02.005
Chinese Library Classification number
TP301 [Theory, methods];
Discipline classification code
081202;
Abstract
The existence of huge volumes of documents written in multiple languages on the Internet motivates the investigation of novel algorithmic approaches to deal with information of this kind. However, most crosslingual natural language processing (NLP) tasks take a decoupled approach in which monolingual NLP techniques are applied along with an independent translation process. This two-step approach is too sensitive to translation errors and, more generally, to the cumulative effect of errors. To solve this problem, we propose a direct probabilistic crosslingual NLP system that integrates both steps, translation and the specific NLP task, into a single one. To implement this integrated approach to crosslingual tasks, we use the statistical IBM 1 word alignment model (M1). The M1 model may show non-monotonic behaviour when aligning words from a sentence in a source language to words from another sentence in a different, target language. This is the case for languages with different word order. In English, for instance, adjectives appear before nouns, whereas in Spanish they typically follow them. The successful experimental results reported in three different tasks - text classification, information retrieval and plagiarism analysis - highlight the benefits of the statistical integrated approach proposed in this work. (C) 2009 Elsevier Inc. All rights reserved.
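For illustration only, the following is a minimal sketch of how the word translation probabilities of an IBM Model 1 style aligner can be estimated with EM, in the standard textbook formulation. It is not the authors' implementation. The toy English-Spanish corpus, the function names `train_ibm1` and `align`, and the `<null>` token are all assumptions introduced here.

```python
from collections import defaultdict

NULL = "<null>"  # hypothetical empty source word, as in the textbook model


def train_ibm1(pairs, iterations=10):
    """Estimate t(f | e) by EM from (source_tokens, target_tokens) pairs."""
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    # Uniform initialisation of t(f | e) over the target vocabulary.
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of each (f, e) link
        total = defaultdict(float)  # expected count of links leaving e
        for src, tgt in pairs:
            candidates = [NULL] + list(src)
            for f in tgt:
                # E-step: posterior over which source word generated f.
                z = sum(t[(f, e)] for e in candidates)
                for e in candidates:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        # M-step: renormalise the expected counts per source word.
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


def align(t, src, tgt):
    """For each target word, return the most probable source word."""
    candidates = [NULL] + list(src)
    return [max(candidates, key=lambda e: t.get((f, e), 0.0)) for f in tgt]


# Toy corpus (an assumption): Spanish places the adjective after the noun.
corpus = [
    (["green", "house"], ["casa", "verde"]),
    (["the", "house"], ["la", "casa"]),
    (["green"], ["verde"]),
]

if __name__ == "__main__":
    t = train_ibm1(corpus)
    print(align(t, ["green", "house"], ["casa", "verde"]))
```

Because Model 1 scores every source position equally, nothing in the sketch prefers links that preserve word order, which is why it can recover the crossed noun-adjective alignment in the first sentence pair.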
Pages: 51-60
Number of pages: 10