Sub-word based unsupervised bilingual dictionary induction for Chinese-Uyghur

Cited by: 1
Authors
Aysa, Anwar [1]
Ablimit, Mijit [1]
Yilahun, Hankiz [2]
Hamdulla, Askar [2]
Affiliations
[1] Xinjiang Univ China, Sch Informat Sci & Engn, Xinjiang Key Lab Signal Detect & Proc, Urumqi 830017, Xinjiang, Peoples R China
[2] Xinjiang Univ China, Sch Informat Sci & Engn, Xinjiang Key Lab Multilingual Informat Technol, Urumqi 830017, Xinjiang, Peoples R China
Source
2022 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP 2022) | 2022
Keywords
bilingual dictionary; unsupervised learning; seed dictionary; morpheme sequence
DOI
10.1109/IALP57159.2022.9961257
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we focus on bilingual dictionary induction for the Chinese-Uyghur language pair. Correlating long-distance linguistic information usually requires cross-lingual supervision, which in turn often depends on parallel corpora to link seed lexicons, and parallel corpora are expensive to build. Uyghur is a low-resource language: only a small amount of text data is available, and its derivational morphology is rich and complex. In bilingual processing, aligning the most similar units and entity stems is the first step, so segmenting sentences into morpheme sequences is essential for cross-lingual tasks. Uyghur words consist of stems joined with several suffixes or prefixes, and the many complex affix combinations found in text produce a large number of derived words. This can easily increase the repetition rate of features in the text, which reduces the efficiency of bilingual dictionary extraction. In this work, we explore resource construction and granularity optimization for low-resource minority languages and learn cross-lingual word embeddings without supervision from parallel data. We propose a Chinese-Uyghur bilingual dictionary extraction method based on neural cross-lingual word embeddings and a multilingual morphological analyzer. Experiments show that the morpheme-sequence-based method significantly improves over the word-sequence baseline.
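The dictionary extraction step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which retrieval criterion the authors use, so this example assumes plain cosine nearest-neighbour search over embeddings that have already been mapped into a shared space. The function names (`cosine`, `induce_dictionary`) and the toy vectors are hypothetical placeholders, not data or code from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def induce_dictionary(src_emb, tgt_emb):
    """Pair each source unit with its most similar target unit.

    Assumes both embedding tables already live in one shared space
    (the unsupervised mapping step itself is not shown here).
    """
    dictionary = {}
    for src_unit, src_vec in src_emb.items():
        best = max(tgt_emb, key=lambda t: cosine(src_vec, tgt_emb[t]))
        dictionary[src_unit] = best
    return dictionary


# Toy placeholder vectors for illustration only; keys stand in for
# morphemes (or words) of the source and target languages.
src_emb = {"src_a": [1.0, 0.1], "src_b": [0.0, 1.0]}
tgt_emb = {"tgt_x": [0.9, 0.2], "tgt_y": [0.1, 0.8]}
pairs = induce_dictionary(src_emb, tgt_emb)
```

The same routine applies whether the keys are whole words or morphemes; per the abstract, the morpheme-level units are the ones reported to work better than the word-level baseline.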
Pages: 476-481
Page count: 6