Using syntax for improving phrase-based SMT in low-resource languages

Cited by: 2
Authors
Fadaei, Hakimeh [1]
Faili, Heshaam [1, 2]
Affiliations
[1] Univ Tehran, Sch Elect & Comp Engn, Coll Engn, Tehran, Iran
[2] Inst Res Fundamental Sci IPM, Sch Comp Sci, Tehran, Iran
Funding
National Science Foundation (United States);
Keywords
MODEL;
DOI
10.1093/llc/fqz033
Chinese Library Classification number
C [Social Sciences, General];
Discipline classification code
03 ; 0303 ;
Abstract
Data-driven approaches to machine translation, such as statistical and neural machine translation, suffer from sparsity when dealing with low-resource languages. In these cases, using other sources of information, including linguistic information, could alleviate the problem. In this article, we focus on the problem of word ordering in translation from a high-resource to a low-resource language and try to improve the quality by using syntactic information from the high-resource side. We propose syntactic features based on Tree Adjoining Grammar (TAG) to be employed in a phrase-based SMT model in order to improve the word ordering. In this work, a set of synchronous TAG rules is extracted and used to estimate the probability of the phrase orders suggested by the phrase-based model. The main idea of the article is to handle word ordering by using the extended domain of locality property of TAG and abstracting long-distance dependencies into a local view, which is a TAG elementary tree. The experiments on English-Persian and English-German translation showed that, by combining the proposed TAG-based reordering features with lexical and hierarchical reordering models, we gain significant improvements over the baseline and in comparison with a neural reordering model and a pre-reordering model.
Pages: 507-528
Page count: 22
Related papers
50 in total
  • [41] Improving stance detection accuracy in low-resource languages: a deep learning framework with ParsBERT
    Rahimi, Mohammad
    Kiani, Vahid
    INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS, 2024, : 517 - 535
  • [42] Improved Acoustic Modeling Of Low-Resource Languages Using Shared SGMM Parameters Of High-Resource Languages
    Joy, Neethu Mariam
    Abraham, Basil
    Navneeth, K.
    Umesh, S.
    2016 TWENTY SECOND NATIONAL CONFERENCE ON COMMUNICATION (NCC), 2016,
  • [43] Improving the Minimum Description Length Inference of Phrase-Based Translation Models
    Gonzalez-Rubio, Jesus
    Casacuberta, Francisco
    PATTERN RECOGNITION AND IMAGE ANALYSIS (IBPRIA 2015), 2015, 9117 : 219 - 227
  • [44] Multilingual Features Based Keyword Search for Very Low-Resource Languages
    Golik, Pavel
    Tueske, Zoltan
    Schlueter, Ralf
    Ney, Hermann
    16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 1260 - 1264
  • [45] Anchor-based Bilingual Word Embeddings for Low-Resource Languages
    Eder, Tobias
    Hangya, Viktor
    Fraser, Alexander
    ACL-IJCNLP 2021: THE 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 2, 2021, : 227 - 232
  • [46] Phrase Table Induction Using Monolingual Data for Low-Resource Statistical Machine Translation
    Marie, Benjamin
    Fujita, Atsushi
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2018, 17 (03)
  • [47] Syntax- and semantic-based reordering in hierarchical phrase-based statistical machine translation
    Kazemi, Arefeh
    Toral, Antonio
    Way, Andy
    Monadjemi, Amirhassan
    Nematbakhsh, Mohammadali
    EXPERT SYSTEMS WITH APPLICATIONS, 2017, 84 : 186 - 199
  • [48] Unsupervised SMT: an analysis of Indic languages and a low resource language
    Saxena, Shefali
    Chauhan, Shweta
    Arora, Paras
    Daniel, Philemon
    JOURNAL OF EXPERIMENTAL & THEORETICAL ARTIFICIAL INTELLIGENCE, 2024, 36 (06) : 865 - 884
  • [49] Detecting Social Media Manipulation in Low-Resource Languages
    Haider, Samar
    Luceri, Luca
    Deb, Ashok
    Badawy, Adam
    Peng, Nanyun
    Ferrara, Emilio
    COMPANION OF THE WORLD WIDE WEB CONFERENCE, WWW 2023, 2023, : 1358 - 1364
  • [50] Bootstrapping Transliteration with Constrained Discovery for Low-Resource Languages
    Upadhyay, Shyam
    Kodner, Jordan
    Roth, Dan
    2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 501 - 511