Simulating CLIR Translation Resource Scarcity using High-resource Languages

Cited by: 4
Authors
Bonab, Hamed [1]
Allan, James [1]
Sitaraman, Ramesh [1]
Affiliations
[1] Univ Massachusetts Amherst, Coll Informat & Comp Sci, Amherst, MA 01003 USA
Source
PROCEEDINGS OF THE 2019 ACM SIGIR INTERNATIONAL CONFERENCE ON THEORY OF INFORMATION RETRIEVAL (ICTIR'19) | 2019
Keywords
Low-resource languages; translation resources; language simulation
DOI
10.1145/3341981.3344236
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
We study the impact of translation resource scarcity on the performance of cross-language information retrieval (CLIR) systems. To do that, we develop a contrastive analysis framework that uses high-resource languages to simulate low-resource languages. In the framework, we focus on parallel translation corpora and aim to better understand the factors that impact CLIR performance. We argue that both low- and high-resource corpora are needed to develop that understanding. Hence, we take the approach of starting with a true low-resource language and systematically down-sampling a high-resource language to become an artificial low-resource language, which is the reverse perspective of existing research. We formalize the problem as the Resource Scarcity Simulation (RSS) problem. We model the problem with a family of set covering problems, formulate it with integer linear programming, and prove that it is NP-hard. We therefore provide two greedy algorithms with polynomial complexity. We compare our approach with alternative techniques using four high-resource languages (French, Italian, German, and Finnish) down-sampled to simulate two low-resource languages (Somali and Swahili). Our experimental results suggest that language families are important for the RSS problem. We simulate Somali with German and Swahili with Finnish, achieving 98% and 97% similarity in terms of CLIR performance, respectively.
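For readers unfamiliar with set covering, the following is a minimal sketch of the generic textbook greedy heuristic for set cover. It is not the authors' RSS algorithm, which the abstract does not specify. The function name `greedy_set_cover` and the toy data, in which each hypothetical sentence pair "covers" the words it contains, are assumptions made purely for illustration.

```python
from typing import Dict, Hashable, List, Set


def greedy_set_cover(
    universe: Set[Hashable],
    subsets: Dict[Hashable, Set[Hashable]],
) -> List[Hashable]:
    """Generic greedy set cover (illustrative only, not the paper's method).

    Repeatedly picks the subset that covers the most still-uncovered
    elements until every element of `universe` is covered.
    """
    uncovered = set(universe)
    chosen: List[Hashable] = []
    while uncovered:
        # Pick the subset with the largest overlap with what is still uncovered.
        best = max(
            subsets,
            key=lambda name: len(subsets[name] & uncovered),
            default=None,
        )
        gain = subsets[best] & uncovered if best is not None else set()
        if not gain:
            raise ValueError("universe cannot be covered by the given subsets")
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical toy data: word ids 1..5, and four sentence pairs
# identified by the word ids they contain.
words = {1, 2, 3, 4, 5}
sentence_pairs = {
    "s1": {1, 2, 3},
    "s2": {2, 4},
    "s3": {3, 4},
    "s4": {4, 5},
}
selected = greedy_set_cover(words, sentence_pairs)
print(selected)
```

Each iteration scans all subsets, so the running time is polynomial in the input size, which is the usual reason a greedy heuristic is chosen when the exact problem is NP-hard.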
Pages: 128-135
Page count: 8