Punctuation and Parallel Corpus Based Word Embedding Model for Low-Resource Languages

Cited by: 0
Authors
Yuan, Yang [1, 2, 3]
Li, Xiao [1, 2, 3]
Yang, Ya-Ting [1, 2, 3]
Affiliations
[1] Chinese Acad Sci, Xinjiang Tech Inst Phys & Chem, Urumqi 830011, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] Xinjiang Lab Minor Speech & Language Informat Pro, Urumqi 830011, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
word embedding; word alignment probability; distance attenuation function; Word2vec; GloVe
DOI
10.3390/info11010024
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
To overcome data sparseness when training word embeddings for low-resource languages, we propose a word embedding model based on punctuation and a parallel corpus. Specifically, we build the global word-pair co-occurrence matrix using a punctuation-based distance attenuation function, and combine it with intermediate word vectors generated from a small-scale bilingual parallel corpus to train the word embeddings. Experimental results show that, compared with widely used baseline models such as GloVe and Word2vec, our model significantly improves word embedding quality for low-resource languages. Trained on a restricted-scale English-Chinese corpus, our model improves by 0.71 percentage points on the word analogy task and achieves the best results on all of the word similarity tasks.
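The abstract does not give the form of the punctuation-based distance attenuation function, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a GloVe-style base weight of 1/d for two words d positions apart, and assumes (purely for illustration) that each punctuation mark lying between the two words multiplies that weight by a hypothetical factor `punct_decay`. The function name `cooccurrence`, the `PUNCTUATION` set, and all parameters are likewise hypothetical.

```python
from collections import defaultdict

# Hypothetical punctuation inventory; a real corpus would need a fuller set.
PUNCTUATION = {",", ".", ";", ":", "!", "?"}


def cooccurrence(tokens, window=5, punct_decay=0.5):
    """Accumulate symmetric word-pair co-occurrence weights.

    Base weight is 1/d, where d is the distance in words (punctuation
    tokens are not counted toward d). Assumption for illustration only:
    every punctuation mark between the two words scales the weight by
    punct_decay.
    """
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        if center in PUNCTUATION:
            continue
        distance = 0   # number of words to the right of the center word
        crossed = 0    # punctuation marks seen between center and context
        for j in range(i + 1, len(tokens)):
            context = tokens[j]
            if context in PUNCTUATION:
                crossed += 1
                continue
            distance += 1
            if distance > window:
                break
            weight = (1.0 / distance) * (punct_decay ** crossed)
            # Store both directions so the matrix is symmetric.
            counts[(center, context)] += weight
            counts[(context, center)] += weight
    return dict(counts)


if __name__ == "__main__":
    matrix = cooccurrence(["a", "b", ",", "c"])
    for pair, weight in sorted(matrix.items()):
        print(pair, weight)
```

Under these assumptions, word pairs separated by a comma or full stop contribute less to the matrix than pairs inside the same clause, which is one plausible way to reduce noisy counts on a small corpus.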
Pages: 12
Related papers
50 in total
  • [11] Predicting Embedding Reliability in Low-Resource Settings Using Corpus Similarity Measures
    Dunn, Jonathan
    Li, Haipeng
    Sastre, Damian
    LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 6461 - 6470
  • [12] Improving a Multi-Source Neural Machine Translation Model with Corpus Extension for Low-Resource Languages
    Choi, Gyu-Hyeon
    Shin, Jong-Hun
    Kim, Young-Kil
    PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 900 - 904
  • [13] Neural machine translation for low-resource languages without parallel corpora
    Karakanta, Alina
    Dehdari, Jon
    van Genabith, Josef
    MACHINE TRANSLATION, 2018, 32 (1-2) : 167 - 189
  • [14] Building lexicon-based sentiment analysis model for low-resource languages
    Mohammed, Idi
    Prasad, Rajesh
    METHODSX, 2023, 11
  • [15] Linguistically-informed Training of Acoustic Word Embeddings for Low-resource Languages
    Yang, Zixiaofan
    Hirschberg, Julia
    INTERSPEECH 2019, 2019, : 2678 - 2682
  • [16] Multilingual Contextual Adapters To Improve Custom Word Recognition In Low-resource Languages
    Kulshreshtha, Devang
    Dingliwal, Saket
    Houston, Brady
    Bodapati, Sravan
    INTERSPEECH 2023, 2023, : 3302 - 3306
  • [17] Voice Activation for Low-Resource Languages
    Kolesau, Aliaksei
    Sesok, Dmitrij
    APPLIED SCIENCES-BASEL, 2021, 11 (14):
  • [18] Exploiting bilingual lexicons to improve multilingual embedding-based document and sentence alignment for low-resource languages
    Aloka Fernando
    Surangika Ranathunga
    Dilan Sachintha
    Lakmali Piyarathna
    Charith Rajitha
    Knowledge and Information Systems, 2023, 65 : 571 - 612
  • [19] Exploiting bilingual lexicons to improve multilingual embedding-based document and sentence alignment for low-resource languages
    Fernando, Aloka
    Ranathunga, Surangika
    Sachintha, Dilan
    Piyarathna, Lakmali
    Rajitha, Charith
    KNOWLEDGE AND INFORMATION SYSTEMS, 2023, 65 (02) : 571 - 612
  • [20] A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages
    Vania, Clara
    Kementchedjhieva, Yova
    Sogaard, Anders
    Lopez, Adam
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 1105 - 1116