Parallel sentence extraction to improve cross-language information retrieval from Wikipedia

Cited by: 13

Authors:

Cheon, Juryong [1]

Ko, Youngjoong [2]

Affiliations:

[1] Dong A Univ, Busan, South Korea

[2] Sungkyunkwan Univ, 2066 Seobu Ro, Suwon 16419, Gyeonggi Do, South Korea

Source:

Journal of Information Science | 2021 / Vol. 47 / Issue 2

Funding:

National Research Foundation of Singapore;

Keywords:

Automatic parallel corpus construction; cross-language information retrieval; Wikipedia;

DOI:

10.1177/0165551521992754

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Translation language resources, such as bilingual word lists and parallel corpora, are important factors affecting the effectiveness of cross-language information retrieval (CLIR) systems. When large domain-appropriate parallel corpora are not available, developing an effective CLIR system is particularly difficult. Furthermore, creating a large parallel corpus is costly and requires considerable effort. Therefore, we demonstrate the construction of parallel corpora from Wikipedia as well as improved query translation, wherein the queries are used for a CLIR system. To do so, we first constructed a bilingual dictionary, termed WikiDic. Then, we evaluated individual language resources and combinations of them in terms of their ability to extract parallel sentences; the combination of our proposed WikiDic with the translation probability from the Web's bilingual example sentence pairs and WikiDic was found to be best suited to parallel sentence extraction. Finally, to evaluate the parallel corpus generated from this best combination of language resources, we compared its performance in query translation for CLIR to that of a manually created English-Korean parallel corpus. The corpus generated by our proposed method achieved better performance than the manually created corpus, demonstrating the effectiveness of the proposed method for automatic parallel corpus extraction. Not only can the method demonstrated herein inform the construction of other parallel corpora from readily available language resources, but the parallel sentence extraction method will also naturally improve as Wikipedia continues to be used and its content develops.


Pages: 281-293

Page count: 13
