Leveraging comparable corpora for cross-lingual information retrieval in resource-lean language pairs

Cited by: 9
Authors
Shakery, Azadeh [1]
Zhai, ChengXiang [2]
Affiliations
[1] Univ Tehran, Dept Elect & Comp Engn, Coll Engn, Tehran, Iran
[2] Univ Illinois, Dept Comp Sci, Urbana, IL 61801 USA
Source
INFORMATION RETRIEVAL | 2013 / Vol. 16 / Issue 01
Keywords
Cross-language information retrieval; Comparable corpora; Probabilistic propagation; Language models; MACHINE TRANSLATION; MODELS
DOI
10.1007/s10791-012-9194-z
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Cross-language information retrieval (CLIR) has so far been studied under the assumption that rich linguistic resources, such as bilingual dictionaries or parallel corpora, are available. However, creating such high-quality resources is labor-intensive, and they are not always at hand. In this paper we investigate the feasibility of using only comparable corpora for CLIR, without relying on other linguistic resources. Comparable corpora are text documents in different languages that cover similar topics and are often naturally attainable (e.g., news articles published in different languages during the same time period). We adapt an existing cross-lingual word association mining method and incorporate it into a language modeling approach to cross-language retrieval. We investigate different strategies for estimating the target query language models. Our evaluation results on the TREC Arabic-English cross-lingual data show that the proposed method is effective for the CLIR task, demonstrating that it is feasible to perform cross-lingual information retrieval with just comparable corpora.
Pages: 1-29
Page count: 29