Manipuri-English comparable corpus for cross-lingual studies

Cited by: 1
Authors
Laitonjam, Lenin [1, 2]
Singh, Sanasam Ranbir [1]
Affiliations
[1] Indian Inst Technol Guwahati, Dept Comp Sci & Engn, Gauhati, Assam, India
[2] Natl Inst Technol Mizoram, Dept Comp Sci & Engn, Aizawl, India
Keywords
Manipuri; Low-resource; Comparable corpus; Bilingual dictionary induction; Machine translation; GENERATION;
DOI
10.1007/s10579-021-09576-y
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203; 0835
Abstract
This paper presents Mni-EnCC, a temporally aligned Manipuri-English comparable corpus, to facilitate cross-lingual studies between Manipuri and English. Mni-EnCC was created by collating text from two news sources in Manipur that publish online, Sangai Express and Poknapham. Although both publishers produce Manipuri and English editions, the editions are not translations of each other. Almost all of the Manipuri editions are created using proprietary tools that generate text in customized, non-standard, non-Unicode encodings. We develop tools to transform the non-Unicode text into Unicode text to generate the Manipuri corpus. We then verify and time-align all the articles using a semi-automated process. Furthermore, the quality of Mni-EnCC is evaluated on two prominent cross-lingual tasks: bilingual dictionary induction and machine translation. Experimental observations provide encouraging results, making it a suitable dataset for future cross-lingual studies on the Manipuri-English language pair. With the objective of promoting cross-lingual studies in Manipuri-English, we also plan to release the corpus and the supporting Unicode conversion tool.
Pages: 377-413
Number of pages: 37