Simulating CLIR Translation Resource Scarcity using High-resource Languages

Cited by: 4

Authors:

Bonab, Hamed [1]

Allan, James [1]

Sitaraman, Ramesh [1]

Affiliations:

[1] Univ Massachusetts Amherst, Coll Informat & Comp Sci, Amherst, MA 01003 USA

Source:

PROCEEDINGS OF THE 2019 ACM SIGIR INTERNATIONAL CONFERENCE ON THEORY OF INFORMATION RETRIEVAL (ICTIR'19) | 2019

Keywords:

Low-resource languages; translation resources; language simulation

DOI:

10.1145/3341981.3344236

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

We study the impact of translation resource scarcity on the performance of cross-language information retrieval (CLIR) systems. To do so, we develop a contrastive analysis framework that uses high-resource languages to simulate low-resource languages. In the framework, we focus on parallel translation corpora and aim to better understand the factors that impact CLIR performance. We argue that both low- and high-resource corpora are needed to develop that understanding. Hence, we take the approach of starting with a true low-resource language and systematically down-sampling a high-resource language to become an artificial low-resource language, which is the reverse perspective of existing research. We formalize the problem as the Resource Scarcity Simulation (RSS) problem. We model it as a family of set covering problems, formulate it with integer linear programming, and prove that it is NP-hard. We therefore provide two greedy algorithms with polynomial complexity. We compare and analyze our approach with alternate techniques using four high-resource languages (French, Italian, German, and Finnish) down-sampled to simulate two low-resource languages (Somali and Swahili). Our experimental results suggest that language families are important for the RSS problem. We simulate Somali with German, and Swahili with Finnish, achieving 98% and 97% similarity in terms of CLIR performance, respectively.
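For readers unfamiliar with the set covering view mentioned in the abstract, the following is a minimal sketch of the classical greedy set cover heuristic. It is not the paper's RSS algorithms, which this record does not describe. The function name `greedy_set_cover`, the toy `sentences`, and the `target` vocabulary are all hypothetical and chosen only for illustration.

```python
def greedy_set_cover(universe, subsets):
    """Classical greedy heuristic for set cover (illustrative only).

    Repeatedly picks the subset that covers the most still-uncovered
    elements. Returns the chosen subset indices and any elements that
    no subset could cover.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Index of the subset with the largest overlap with what is left;
        # ties go to the lowest index.
        best = max(range(len(subsets)), key=lambda i: len(subsets[i] & uncovered))
        gain = subsets[best] & uncovered
        if not gain:
            break  # the remaining elements appear in no subset
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Hypothetical toy data: each "sentence" is the set of word types it contains,
# and the goal is to cover a target vocabulary with few sentences.
sentences = [{"a", "b", "c"}, {"c", "d"}, {"d", "e"}, {"a"}]
target = {"a", "b", "c", "d", "e"}
chosen, missing = greedy_set_cover(target, sentences)
```

Each round scans all subsets once, so the running time is polynomial in the input size. That is the usual reason for preferring a greedy heuristic when the exact problem is NP-hard.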


Pages: 128-135

Page count: 8

