Exploiting bilingual lexicons to improve multilingual embedding-based document and sentence alignment for low-resource languages

Cited by: 0
Authors
Aloka Fernando
Surangika Ranathunga
Dilan Sachintha
Lakmali Piyarathna
Charith Rajitha
Affiliation
[1] University of Moratuwa, Department of Computer Science and Engineering
Keywords
Document alignment; Sentence alignment; Low-resource languages; Neural machine translation; Parallel corpus mining;
DOI
Not available
Abstract
Neural machine translation systems trained on low-resource languages produce sub-optimal results due to the scarcity of large parallel datasets. To alleviate this problem, parallel corpora can be mined from the web. Two key tasks in a parallel corpus mining pipeline are web document alignment and sentence alignment. Effective approaches for these tasks obtain vector representations of the documents (or sentences) belonging to the two languages and determine the alignment between the documents (or sentences) based on a semantic similarity scoring mechanism. Recently, document or sentence representations obtained from pre-trained multilingual language models (PMLMs) such as LASER, XLM-R and LaBSE have significantly improved the benchmark scores in diverse natural language processing tasks. In this study, we carry out an empirical analysis of the effectiveness of these PMLMs on the document and sentence alignment tasks in the context of the low-resource language pairs Sinhala–English, Tamil–English and Sinhala–Tamil. Further, we introduce a weighting mechanism based on small-scale bilingual lexicons to improve the semantic similarity measurement between sentences and documents. Our results show that both document and sentence alignment can be further improved using our weighting mechanism. We have also compiled a gold-standard evaluation benchmark dataset for the document alignment and sentence alignment tasks for the considered language pairs. This dataset (https://github.com/kdissa/comparable-corpus) and the source code (https://github.com/nlpcuom/parallel_corpus_mining) are publicly released.
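The general idea of combining embedding similarity with a bilingual lexicon can be illustrated with a minimal sketch. This is not the authors' weighting mechanism, whose exact form is not given in the abstract. It assumes a hypothetical multiplicative boost (`alpha`) applied to cosine similarity, a hypothetical overlap measure, toy 2-dimensional vectors standing in for PMLM embeddings, and placeholder source tokens (`s_water`, `s_drink`) rather than real Sinhala or Tamil words.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def lexicon_overlap(src_tokens, tgt_tokens, lexicon):
    """Fraction of source tokens with a lexicon translation present in the target.

    Hypothetical overlap measure, chosen only for illustration.
    """
    if not src_tokens:
        return 0.0
    tgt = set(tgt_tokens)
    hits = sum(1 for tok in src_tokens if lexicon.get(tok, set()) & tgt)
    return hits / len(src_tokens)

def weighted_similarity(src, tgt, lexicon, alpha=0.5):
    """Cosine similarity scaled by (1 + alpha * lexicon overlap).

    src and tgt are (tokens, embedding) pairs; alpha is an assumed parameter.
    """
    overlap = lexicon_overlap(src[0], tgt[0], lexicon)
    return cosine(src[1], tgt[1]) * (1.0 + alpha * overlap)

def align(sources, targets, lexicon, alpha=0.5):
    """For each source item, return the index of the best-scoring target."""
    return [
        max(range(len(targets)),
            key=lambda j: weighted_similarity(s, targets[j], lexicon, alpha))
        for s in sources
    ]

# Toy data: the embeddings alone slightly prefer the wrong target.
sources = [(["s_water", "s_drink"], [1.0, 0.0])]
targets = [
    (["eat", "rice"], [0.9, 0.1]),
    (["drink", "water"], [0.8, 0.2]),
]
lexicon = {"s_water": {"water"}, "s_drink": {"drink"}}

embedding_only = align(sources, targets, lexicon, alpha=0.0)
with_lexicon = align(sources, targets, lexicon, alpha=0.5)
```

In this toy case the embedding-only score picks the first target, while the lexicon evidence moves the choice to the second. A real pipeline would replace the toy vectors with embeddings from a model such as LaBSE and would likely use a more careful matching strategy than greedy per-source selection.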
Pages: 571-612
Number of pages: 41
Related papers
17 in total
  • [1] Exploiting bilingual lexicons to improve multilingual embedding-based document and sentence alignment for low-resource languages
    Fernando, Aloka
    Ranathunga, Surangika
    Sachintha, Dilan
    Piyarathna, Lakmali
    Rajitha, Charith
    KNOWLEDGE AND INFORMATION SYSTEMS, 2023, 65 (02): 571 - 612
  • [2] Multilingual Contextual Adapters To Improve Custom Word Recognition In Low-resource Languages
    Kulshreshtha, Devang
    Dingliwal, Saket
    Houston, Brady
    Bodapati, Sravan
    INTERSPEECH 2023, 2023: 3302 - 3306
  • [3] Sliding Window and Parallel LSTM with Attention and CNN for Sentence Alignment on Low-Resource Languages
    Tan, Tien-Ping
    Lim, Chai Kim
    Rahman, Wan Rose Eliza Abdul
    PERTANIKA JOURNAL OF SCIENCE AND TECHNOLOGY, 2022, 30 (01): 97 - +
  • [4] Multilingual Features Based Keyword Search for Very Low-Resource Languages
    Golik, Pavel
    Tueske, Zoltan
    Schlueter, Ralf
    Ney, Hermann
    16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015: 1260 - 1264
  • [5] Anchor-based Bilingual Word Embeddings for Low-Resource Languages
    Eder, Tobias
    Hangya, Viktor
    Fraser, Alexander
    ACL-IJCNLP 2021: THE 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 2, 2021: 227 - 232
  • [6] Exploiting Vocal-Source Features to Improve ASR Accuracy for Low-Resource Languages
    Fernandez, Raul
    Cui, Jia
    Rosenberg, Andrew
    Ramabhadran, Bhuvana
    Cui, Xiaodong
    15TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2014), VOLS 1-4, 2014: 805 - 809
  • [7] Crosslingual Transfer Learning for Low-Resource Languages Based on Multilingual Colexification Graphs
    Liu, Yihong
    Ye, Haotian
    Weissweiler, Leonie
    Pei, Renhao
    Schuetze, Hinrich
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023: 8376 - 8401
  • [8] Punctuation and Parallel Corpus Based Word Embedding Model for Low-Resource Languages
    Yuan, Yang
    Li, Xiao
    Yang, Ya-Ting
    INFORMATION, 2020, 11 (01)
  • [9] Parsing of low-resource languages from embedding of multilingual words: Application to Northern Sami and Komi-Zyrian
    Lim, KyungTae
    Partanen, Niko
    Poibeau, Thierry
    TRAITEMENT AUTOMATIQUE DES LANGUES, 2018, 59 (03): 67 - 91
  • [10] Phrase Table Combination Based on Symmetrization of Word Alignment for Low-Resource Languages
    Budiwati, Sari Dewi
    Siagian, Al Hafiz Akbar Maulana
    Fatyanosa, Tirana Noor
    Aritsugi, Masayoshi
    APPLIED SCIENCES-BASEL, 2021, 11 (04): 1 - 20