Exploring Bilingual Word Embeddings for Hiligaynon, a Low-Resource Language

Cited by: 0
Authors
Michel, Leah [1]
Hangya, Viktor [1]
Fraser, Alexander [1]
Affiliations
[1] Ludwig Maximilians Univ Munchen, Ctr Informat & Language Proc, Munich, Germany
Source
PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020) | 2020
Funding
European Research Council;
Keywords
bilingual word embeddings; bilingual lexicon induction; post-hoc mapping; low-resource languages; Hiligaynon;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
This paper investigates the use of bilingual word embeddings for mining Hiligaynon translations of English words. There is very little research on Hiligaynon (we found just one paper), an extremely low-resource language of Malayo-Polynesian origin with over 9 million speakers in the Philippines. We use a publicly available Hiligaynon corpus with only 300K words, and match it with a comparable corpus in English. As there are no bilingual resources available, we manually develop an English-Hiligaynon lexicon and use this to train bilingual word embeddings. However, we fail to mine accurate translations due to the small amount of data. To find out if the same holds true for a related language pair, we simulate the same low-resource setup on English to German and arrive at similar results. We then vary the size of the comparable English and German corpora to determine the minimum corpus size necessary to achieve competitive results. Further, we investigate the role of the seed lexicon. We show that with the same corpus size but with a smaller seed lexicon, performance can surpass results of previous studies. We release the lexicon of 1,200 English-Hiligaynon word pairs we created to encourage further investigation.
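The abstract does not say which mapping method the authors use. As a rough illustration of the general idea behind post-hoc mapping with a seed lexicon (one of the listed keywords), the sketch below assumes an orthogonal Procrustes mapping followed by cosine nearest-neighbor retrieval, which is one common choice and not necessarily the paper's. All names (`learn_orthogonal_map`, `induce_lexicon`, the `en_*` and `hil_*` placeholder tokens) and the random toy vectors are hypothetical; no real Hiligaynon data is used.

```python
import numpy as np

def normalize_rows(matrix):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

def learn_orthogonal_map(src_seed, tgt_seed):
    """Orthogonal Procrustes: the orthogonal W minimizing ||src_seed @ W - tgt_seed||_F."""
    u, _, vt = np.linalg.svd(src_seed.T @ tgt_seed)
    return u @ vt

def induce_lexicon(src_words, src_vecs, tgt_words, tgt_vecs, seed_pairs, queries):
    """Map source vectors into the target space and return each query's nearest target word."""
    src_index = {w: i for i, w in enumerate(src_words)}
    tgt_index = {w: i for i, w in enumerate(tgt_words)}
    src_unit = normalize_rows(src_vecs)
    tgt_unit = normalize_rows(tgt_vecs)
    # Stack the embeddings of the seed lexicon entries, row-aligned by pair.
    seed_src = np.array([src_unit[src_index[s]] for s, _ in seed_pairs])
    seed_tgt = np.array([tgt_unit[tgt_index[t]] for _, t in seed_pairs])
    mapping = learn_orthogonal_map(seed_src, seed_tgt)
    translations = {}
    for word in queries:
        # Cosine similarity of the mapped query against every target word.
        scores = tgt_unit @ (src_unit[src_index[word]] @ mapping)
        translations[word] = tgt_words[int(np.argmax(scores))]
    return translations

# Toy setup (assumption): the "target" space is an exact rotation of the "source" space.
rng = np.random.default_rng(0)
dim, vocab = 5, 8
src_words = [f"en_{i}" for i in range(vocab)]
tgt_words = [f"hil_{i}" for i in range(vocab)]
src_vecs = rng.normal(size=(vocab, dim))
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
tgt_vecs = src_vecs @ rotation

# Six seed pairs for training the map; the last two words are held out as queries.
seed_pairs = [(f"en_{i}", f"hil_{i}") for i in range(6)]
result = induce_lexicon(src_words, src_vecs, tgt_words, tgt_vecs,
                        seed_pairs, queries=["en_6", "en_7"])
print(result)
```

On this noise-free toy data the held-out words map back to their counterparts. With monolingual embeddings trained independently on small corpora the two spaces are only approximately related by a rotation, which is consistent with the abstract's finding that accuracy suffers when data is scarce.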
Pages: 2573-2580
Page count: 8
Related papers
50 in total
  • [1] Supervised Bilingual Word Embeddings for Low-Resource Language Pairs: Myanmar and Thai
    16TH INTERNATIONAL JOINT SYMPOSIUM ON ARTIFICIAL INTELLIGENCE AND NATURAL LANGUAGE PROCESSING (ISAI-NLP 2021), 2021
  • [2] Anchor-based Bilingual Word Embeddings for Low-Resource Languages
    Eder, Tobias
    Hangya, Viktor
    Fraser, Alexander
    ACL-IJCNLP 2021: THE 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 2, 2021: 227-232
  • [3] Cross-Lingual Word Embeddings for Low-Resource Language Modeling
    Adams, Oliver
    Makarucha, Adam
    Neubig, Graham
    Bird, Steven
    Cohn, Trevor
    15TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2017), VOL 1: LONG PAPERS, 2017: 937-947
  • [4] Augmenting Training Data for Low-Resource Neural Machine Translation via Bilingual Word Embeddings and BERT Language Modelling
    Ramesh, Akshai
    Uhana, Haque Usuf
    Parthasarathy, Venkatesh Balavadhani
    Haque, Rejwanul
    Way, Andy
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021
  • [5] Named Entity Recognition with Word Embeddings and Wikipedia Categories for a Low-Resource Language
    Das, Arjun
    Ganguly, Debasis
    Garain, Utpal
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2017, 16 (03)
  • [6] Word Embeddings in Low Resource Gujarati Language
    Joshi, Ishani
    Koringa, Purvi
    Mitra, Suman
    2019 INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION WORKSHOPS (ICDARW), VOL 5, 2019: 110-115
  • [7] Dirichlet-Smoothed Word Embeddings for Low-Resource Settings
    Jungmaier, Jakob
    Kassner, Nora
    Roth, Benjamin
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020: 3560-3565
  • [8] Learning Bilingual Lexicon for Low-Resource Language Pairs
    Zhu, ShaoLin
    Li, Xiao
    Yang, YaTing
    Wang, Lei
    Mi, ChengGang
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, NLPCC 2017, 2018, 10619: 760-770
  • [9] Regressing Word and Sentence Embeddings for Low-Resource Neural Machine Translation
    Unanue, I. J.
    Borzeshi, E. Z.
    Piccardi, M.
    IEEE Transactions on Artificial Intelligence, 2023, 4 (03): 450-463
  • [10] Linguistically-informed Training of Acoustic Word Embeddings for Low-resource Languages
    Yang, Zixiaofan
    Hirschberg, Julia
    INTERSPEECH 2019, 2019: 2678-2682