Punctuation and Parallel Corpus Based Word Embedding Model for Low-Resource Languages

Authors
Yuan, Yang [1, 2, 3]
Li, Xiao [1, 2, 3]
Yang, Ya-Ting [1, 2, 3]
Affiliations
[1] Chinese Acad Sci, Xinjiang Tech Inst Phys & Chem, Urumqi 830011, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] Xinjiang Lab Minor Speech & Language Informat Pro, Urumqi 830011, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
word embedding; word alignment probability; distance attenuation function; Word2vec; GloVe;
DOI
10.3390/info11010024
Chinese Library Classification code
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
To mitigate the data sparsity that affects word embeddings trained on low-resource languages, we propose a word embedding model based on punctuation and a parallel corpus. Specifically, we build the global word-pair co-occurrence matrix using a punctuation-based distance attenuation function, and combine it with intermediate word vectors generated from a small-scale bilingual parallel corpus to train the word embeddings. Experimental results show that, compared with widely used baseline models such as GloVe and Word2vec, our model significantly improves word embedding performance for low-resource languages. Trained on a restricted-scale English-Chinese corpus, our model improves by 0.71 percentage points on the word analogy task and achieves the best results on all of the word similarity tasks.
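The abstract does not give the form of the punctuation-based distance attenuation function, so the following is only a minimal sketch of the general idea, not the authors' formula. It assumes a GloVe-style 1/d distance weight that is further multiplied by a penalty for every punctuation mark lying between the two words. The names `cooccurrence`, `PUNCTUATION`, `window`, and `punct_penalty` are hypothetical and chosen for illustration.

```python
from collections import defaultdict

# Assumed punctuation inventory; a real system would use a language-specific set.
PUNCTUATION = {",", ".", ";", ":", "!", "?"}


def cooccurrence(tokens, window=5, punct_penalty=0.5):
    """Accumulate symmetric, attenuated word-pair co-occurrence weights.

    Hypothetical attenuation: a pair at word distance d with k punctuation
    tokens between them contributes (1 / d) * punct_penalty ** k.
    Punctuation tokens are never counted as words and do not add to d.
    """
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        if center in PUNCTUATION:
            continue
        distance = 0  # number of words between center and context, plus one
        crossed = 0   # punctuation marks seen since the center word
        for j in range(i + 1, len(tokens)):
            context = tokens[j]
            if context in PUNCTUATION:
                crossed += 1
                continue
            distance += 1
            if distance > window:
                break
            weight = (1.0 / distance) * (punct_penalty ** crossed)
            counts[(center, context)] += weight
            counts[(context, center)] += weight
    return dict(counts)


tokens = ["cats", "sleep", ",", "dogs", "bark"]
matrix = cooccurrence(tokens)
for pair in sorted(matrix):
    print(pair, round(matrix[pair], 4))
```

In this sketch, words inside the same clause keep the plain 1/d weight, while pairs separated by a comma are down-weighted, which reflects the intuition that punctuation marks a weaker contextual link. The resulting weights could then feed a GloVe-style objective; how the paper integrates them with the vectors learned from the parallel corpus is not specified in this record.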
Pages: 12