Malayalam Natural Language Processing: Challenges in Building a Phrase-Based Statistical Machine Translation System

Cited by: 5

Authors:

Sebastian, Mary Priya ^{[1]}

Kumar, G. Santhosh ^{[2]}

Affiliations:

[1] Rajagiri Sch Engn & Technol Kerala, Dept Comp Sci, Kochi, Kerala, India

[2] Cochin Univ Sci & Technol, Nat Language Proc Lab, Dept Comp Sci, Cochin, Kerala, India

Source:

ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING | 2023 / Vol. 22 / Issue 04

Keywords:

Statistical Machine Translation; Malayalam; Machine Translation; Natural Language Processing; Dravidian language; alignments; English

DOI:

10.1145/3579163

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Statistical Machine Translation (SMT) is a preferred Machine Translation approach that converts text from one language into another by automatically learning translations from a parallel corpus. SMT has been successful in producing quality translations for many foreign languages, but only a few works have attempted it for South Indian languages. The article discusses experiments conducted with SMT for the Malayalam language and analyzes how the methods defined for SMT in foreign languages affect a Dravidian language, Malayalam. The baseline SMT model does not work for Malayalam due to its unique characteristics, such as its agglutinative nature and morphological richness. Hence, the challenge is to identify precisely where the SMT model has to be modified so that the baseline model accommodates these language peculiarities and gives better English-to-Malayalam translations. The alignments between English and Malayalam sentence pairs, which are subjected to the training process in SMT, play a crucial role in producing quality output translations. Therefore, this work focuses on improving the translation model of SMT by refining the alignments between English-Malayalam sentence pairs. The phrase alignment algorithms align the verb and noun phrases in the sentence pairs and develop a new set of alignments for the English-Malayalam sentence pairs. These alignment sets refine the alignments that Giza++ produces through the EM training algorithm. The improved Phrase-Based SMT model trained using these refined alignments resulted in better translation quality, as indicated by the AER and BLEU scores.


Pages: 51
