Low resource Twi-English parallel corpus for machine translation in multiple domains (Twi-2-ENG)

Citations: 1
Authors
Agyei, Emmanuel [1]
Zhang, Xiaoling [1]
Bannerman, Stephen [1]
Quaye, Ama Bonuah [1]
Yussi, Sophyani Banaamwini [1]
Agbesi, Victor Kwaku [1]
Affiliations
[1] Univ Elect Sci & Technol China, Chengdu, Peoples R China
Keywords
Twi-English parallel corpus; Machine translation; Low-resourced language; Linguistic;
DOI
10.1007/s10791-024-09451-8
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Although Ghana does not have a single common language for its citizens, the Twi dialect stands a chance of fulfilling this purpose. Twi is a low-resourced language, yet it is widely spoken beyond Ghana, in countries such as the Ivory Coast, Benin, and Nigeria, among other places. However, it continues to be seen as the perfect resource for Twi Machine Translation (MT) of ISO 639-3. The issue with the Twi-English parallel corpus is evident at the multiple-domain dataset level, partly due to the complex design structure and the scarcity of a digital Twi lexicon. This study introduces Twi-2-ENG, a large-scale multiple-domain Twi-to-English parallel corpus, together with a digital Twi dictionary and a lexicon version of Twi. It also used the Ghanaian Parliamentary Hansards, crowdsourcing, and digital Ghana news portals to crawl all the English sentences. The crawled news portals, the Twi New Testament Bible, and social media platforms together accumulated 5,765 parallel corpus sentences. The data-gathering method, covering translation, compilation, tokenization, and the final alignment of the Twi-English parallel sentences, is discussed, along with the technology employed in compiling and hosting the corpus. The results reveal that the manual contributions of qualified linguistic professionals and Twi translation specialists across the media spectrum, academia, and well-wishers add considerable volume to the Twi-2-ENG parallel corpus. Finally, all the sentences were curated with the help of a corpus manager, Sketch Engine, linguists, and professional translators to align and tokenize all texts, allowing professional Twi linguists to evaluate the corpus.
Pages: 13