Punctuation and Parallel Corpus Based Word Embedding Model for Low-Resource Languages

Citations: 0
Authors
Yuan, Yang [1 ,2 ,3 ]
Li, Xiao [1 ,2 ,3 ]
Yang, Ya-Ting [1 ,2 ,3 ]
Affiliations
[1] Chinese Acad Sci, Xinjiang Tech Inst Phys & Chem, Urumqi 830011, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] Xinjiang Lab Minor Speech & Language Informat Pro, Urumqi 830011, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
word embedding; word alignment probability; distance attenuation function; Word2vec; GloVe;
DOI
10.3390/info11010024
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
To overcome data sparseness when training word embeddings for low-resource languages, we propose a word embedding model based on punctuation and a parallel corpus. Specifically, we build the global word-pair co-occurrence matrix using a punctuation-based distance attenuation function, and integrate it with intermediate word vectors generated from a small-scale bilingual parallel corpus to train the word embeddings. Experimental results show that, compared with widely used baseline models such as GloVe and Word2vec, our model significantly improves word embedding performance for low-resource languages. Trained on a restricted-scale English-Chinese corpus, our model improves by 0.71 percentage points on the word analogy task and achieves the best results on all of the word similarity tasks.
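The abstract does not state the form of the punctuation-based distance attenuation function, so the following is only a minimal sketch of the general idea, not the authors' formula. It assumes a GloVe-style 1/d distance weight that is further damped each time a punctuation mark lies between the two words. The names `cooccurrence`, `PUNCTUATION`, and `punct_decay`, and the default values, are hypothetical and chosen for illustration.

```python
from collections import defaultdict

# Hypothetical punctuation set; a real corpus would need a language-specific one.
PUNCTUATION = {",", ".", ";", ":", "!", "?"}


def cooccurrence(tokens, window=5, punct_decay=0.5):
    """Accumulate symmetric, weighted word-pair co-occurrence counts.

    A pair at word distance d contributes 1/d (the GloVe-style weighting),
    multiplied by punct_decay once for every punctuation mark between the
    two words. punct_decay is an assumed parameter, not taken from the paper.
    """
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        if center in PUNCTUATION:
            continue  # punctuation is never a vocabulary word here
        crossed = 0   # punctuation marks seen between center and context
        distance = 0  # distance measured in words, skipping punctuation
        for context in tokens[i + 1:]:
            if context in PUNCTUATION:
                crossed += 1
                continue
            distance += 1
            if distance > window:
                break
            weight = (1.0 / distance) * (punct_decay ** crossed)
            counts[(center, context)] += weight
            counts[(context, center)] += weight
    return dict(counts)


if __name__ == "__main__":
    matrix = cooccurrence(["a", "b", ",", "c"], window=2, punct_decay=0.5)
    for pair, weight in sorted(matrix.items()):
        print(pair, weight)
```

In this sketch, words separated by a comma still co-occur, but with a smaller weight than words in the same clause. The resulting counts could then feed a GloVe-style objective; how the paper combines them with the vectors from the parallel corpus is not described in the abstract and is not modeled here.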
Pages: 12
Related papers
50 in total
  • [21] USING WORD BURST ANALYSIS TO RESCORE KEYWORD SEARCH CANDIDATES ON LOW-RESOURCE LANGUAGES
    Richards, Justin
    Ma, Min
    Rosenberg, Andrew
    2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014,
  • [22] Model Transfer for Tagging Low-resource Languages using a Bilingual Dictionary
    Fang, Meng
    Cohn, Trevor
    PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 2, 2017, : 587 - 593
  • [23] Machine Reading Comprehension Model for Low-Resource Languages and Experimenting on Vietnamese
    Bach Hoang Tien Nguyen
    Dung Manh Nguyen
    Trang Thi Thu Nguyen
    ADVANCES AND TRENDS IN ARTIFICIAL INTELLIGENCE: THEORY AND PRACTICES IN ARTIFICIAL INTELLIGENCE, 2022, 13343 : 370 - 381
  • [24] Multilingual Speech Corpus in Low-Resource Eastern and Northeastern Indian Languages for Speaker and Language Identification
    Joyanta Basu
    Soma Khan
    Rajib Roy
    Tapan Kumar Basu
    Swanirbhar Majumder
    Circuits, Systems, and Signal Processing, 2021, 40 : 4986 - 5013
  • [25] Multilingual Speech Corpus in Low-Resource Eastern and Northeastern Indian Languages for Speaker and Language Identification
    Basu, Joyanta
    Khan, Soma
    Roy, Rajib
    Basu, Tapan Kumar
    Majumder, Swanirbhar
    CIRCUITS SYSTEMS AND SIGNAL PROCESSING, 2021, 40 (10) : 4986 - 5013
  • [26] Findings of the WMT 2019 Shared Task on Parallel Corpus Filtering for Low-Resource Conditions
    Koehn, Philipp
    Guzman, Francisco
    Chaudhary, Vishrav
    Pino, Juan
    FOURTH CONFERENCE ON MACHINE TRANSLATION (WMT 2019), VOL 3: SHARED TASK PAPERS, DAY 2, 2019, : 54 - 72
  • [27] Filtered Pseudo-parallel Corpus Improves Low-resource Neural Machine Translation
    Imankulova, Aizhan
    Sato, Takayuki
    Komachi, Mamoru
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2020, 19 (02)
  • [28] Enabling Medical Translation for Low-Resource Languages
    Musleh, Ahmad
    Durrani, Nadir
    Temnikova, Irina
    Nakov, Preslav
    Vogel, Stephan
    Alsaad, Osama
    COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, (CICLING 2016), PT II, 2018, 9624 : 3 - 16
  • [29] Discourse annotation guideline for low-resource languages
    Vargas, Francielle
    Schmeisser-Nieto, Wolfgang
    Rabinovich, Zohar
    Pardo, Thiago A. S.
    Benevenuto, Fabricio
    NATURAL LANGUAGE PROCESSING, 2025, 31 (02) : 700 - 743
  • [30] GlotLID: Language Identification for Low-Resource Languages
    Kargaran, Amir Hossein
    Imani, Ayyoob
    Yvon, Francois
    Schuetze, Hinrich
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 6155 - 6218