Simulating CLIR Translation Resource Scarcity using High-resource Languages

Cited by: 4
Authors
Bonab, Hamed [1]
Allan, James [1]
Sitaraman, Ramesh [1]
Affiliations
[1] Univ Massachusetts Amherst, Coll Informat & Comp Sci, Amherst, MA 01003 USA
Source
PROCEEDINGS OF THE 2019 ACM SIGIR INTERNATIONAL CONFERENCE ON THEORY OF INFORMATION RETRIEVAL (ICTIR'19) | 2019
Keywords
Low-resource languages; translation resources; language simulation
DOI
10.1145/3341981.3344236
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
We study the impact of translation resource scarcity on the performance of cross-language information retrieval (CLIR) systems. To do that, we develop a contrastive analysis framework that uses high-resource languages to simulate low-resource languages. In the framework, we focus on parallel translation corpora and aim to better understand the factors that impact CLIR performance. We argue that both low- and high-resource corpora are needed to develop that understanding. Hence, we take the approach of starting with a true low-resource language and systematically down-sampling a high-resource language to become an artificial low-resource language, which is the reverse perspective of existing research. We formalize the problem as the Resource Scarcity Simulation (RSS) problem. We model the problem with a family of set covering problems, formulate it with integer linear programming, and prove that it is NP-hard. We therefore provide two greedy algorithms with polynomial complexity. We compare and analyze our approach with alternate techniques using four high-resource languages (French, Italian, German, and Finnish) down-sampled to simulate two low-resource languages (Somali and Swahili). Our experimental results suggest that language families are important for the RSS problem. We simulate Somali with German, and Swahili with Finnish, achieving similarity percentages of 98% and 97%, respectively, in terms of CLIR performance.
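The abstract names set covering and greedy algorithms but does not specify the two algorithms themselves. As a rough illustration only, the following is a minimal sketch of the textbook greedy heuristic for set cover, not the authors' method. The function name `greedy_set_cover` and the reading of subsets as sentence pairs covering vocabulary terms are assumptions made for this example.

```python
def greedy_set_cover(universe, subsets):
    """Textbook greedy set cover (illustrative; not the paper's RSS algorithms).

    universe: iterable of elements to cover (hypothetically, vocabulary terms).
    subsets:  list of sets (hypothetically, the terms found in each sentence pair).
    Returns the indices of the chosen subsets, in selection order.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the subset that covers the most still-uncovered elements.
        best = max(range(len(subsets)), key=lambda i: len(uncovered & subsets[i]))
        gain = uncovered & subsets[best]
        if not gain:
            raise ValueError("universe cannot be covered by the given subsets")
        chosen.append(best)
        uncovered -= gain
    return chosen


if __name__ == "__main__":
    terms = {1, 2, 3, 4, 5}
    sentence_terms = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
    print(greedy_set_cover(terms, sentence_terms))
```

Each iteration scans every subset once, so the running time is polynomial in the input size, which is the general property the abstract attributes to its greedy algorithms.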
Pages: 128-135
Page count: 8