Linguistically enhanced word segmentation for better neural machine translation of low resource agglutinative languages

Cited by: 5
Authors
Chimalamarri, Santwana [1, 2]
Sitaram, Dinkar [1]
Affiliations
[1] PES Univ, Dept Comp Sci, Banashankari 3rd Stage, Bangalore 560085, Karnataka, India
[2] PES Univ, CO Ctr Cloud Comp & Big Data CCBD, Dept Comp Sci, Tech Pk B Block, Bangalore 560085, Karnataka, India
Keywords
Natural language processing; Machine translation; Word segmentation;
DOI
10.1007/s10772-021-09865-5
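Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronic and communication technology]
Discipline classification codes
0808; 0809
Abstract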
One of the challenges faced by neural machine translation (NMT) systems is the lack of standard parallel corpora for many language pairs, and inadequate data often results in poor translation quality. Morphological complexity and agglutination aggravate the problem, leading to unmanageable vocabulary sizes, rare words and data sparsity. Although sub-word algorithms such as BPE partly address this, translation systems still lag in their ability to capture the sentence and word structures associated with rich morphologies. This paper aims to address these issues by incorporating linguistically driven sub-word units into NMT systems. The approach is further enhanced by additional POS tag feature inputs. The proposed approach outperforms BPE-driven machine translation models by several BLEU points and also shows better recall when evaluated with the ROUGE metric. The results are evaluated on a morphologically complex Dravidian language pair, Kannada and Telugu.
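As a rough illustration of what a recall-oriented measure such as ROUGE captures, the sketch below computes clipped unigram recall (the ROUGE-1 recall formulation) between a reference and a candidate translation. It is not the paper's evaluation script: the function name `rouge1_recall`, the whitespace tokenization, and the toy English sentences are all assumptions made for this example.

```python
from collections import Counter


def rouge1_recall(reference: str, candidate: str) -> float:
    """Clipped unigram recall: overlapping unigrams / unigrams in the reference.

    Assumption: tokens are obtained by whitespace splitting; a real evaluation
    would use the tokenization appropriate to the language pair.
    """
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    if not ref_counts:
        return 0.0
    # Each reference token is credited at most as often as it appears
    # in the candidate (clipping).
    overlap = sum(min(count, cand_counts[token]) for token, count in ref_counts.items())
    return overlap / sum(ref_counts.values())


if __name__ == "__main__":
    # Toy sentences, hypothetical and not taken from the paper's data.
    print(rouge1_recall("the cat sat on the mat", "the cat on the mat"))
```

Recall here rewards a candidate for covering the reference's tokens, which complements the precision-oriented n-gram matching that BLEU is built on.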
Pages: 1047-1053
Page count: 7