Sub-word based unsupervised bilingual dictionary induction for Chinese-Uyghur

Cited by: 1

Authors:

Aysa, Anwar [1]

Ablimit, Mijit [1]

Yilahun, Hankiz [2]

Hamdulla, Askar [2]

Affiliations:

[1] Xinjiang Univ China, Sch Informat Sci & Engn, Xinjiang Key Lab Signal Detect & Proc, Urumqi 830017, Xinjiang, Peoples R China

[2] Xinjiang Univ China, Sch Informat Sci & Engn, Xinjiang Key Lab Multilingual Informat Technol, Urumqi 830017, Xinjiang, Peoples R China

Source:

2022 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP 2022) | 2022

Keywords:

bilingual dictionary; unsupervised learning; seed dictionary; morpheme sequence

DOI:

10.1109/IALP57159.2022.9961257

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

In this paper, we focus on bilingual dictionary induction for the Chinese-Uyghur language pair. Correlating long-distance linguistic information usually requires cross-lingual supervision, which often relies on parallel corpora to build seed lexicons, and parallel corpora are expensive. Uyghur is a low-resource language: only a small amount of text data is available, and its derivational morphology is rich and complex. In bilingual processing, the first step is to align the most similar units and entity stems, so segmenting sentences into morpheme sequences is essential for cross-lingual tasks. Uyghur words consist of stems joined with several suffixes or prefixes, and the many affix combinations produce a large number of derived words. This can easily increase the repetition rate of intentional features in the text, which affects the efficiency of bilingual dictionary extraction. In this work, we explore resource construction and granularity optimization for low-resource minority languages and learn cross-lingual word embeddings without supervision from parallel data. We propose a Chinese-Uyghur bilingual dictionary extraction method based on neural cross-lingual word embeddings and a multilingual morphological analyzer. Experiments show that the morpheme-sequence-based method improves significantly over the word-sequence baseline model.
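
Illustrative note: the abstract does not spell out how dictionary entries are read off the cross-lingual embeddings. One common retrieval step in this line of work is nearest-neighbor search by cosine similarity in a shared embedding space. The sketch below shows only that step and is not the authors' implementation. It assumes the Chinese and Uyghur embeddings have already been mapped into one space, and the vectors, the unit names (`zh_stem_a`, `ug_stem_x`, and so on), and the function names are hypothetical placeholders invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


def induce_dictionary(src_emb, tgt_emb):
    """Map each source unit to its most similar target unit.

    Assumption: both embedding tables already live in one shared space.
    """
    dictionary = {}
    for src_unit, src_vec in src_emb.items():
        best_unit = max(tgt_emb, key=lambda t: cosine(src_vec, tgt_emb[t]))
        dictionary[src_unit] = best_unit
    return dictionary


# Toy, made-up vectors; the keys are placeholders, not real corpus data.
zh_units = {
    "zh_stem_a": [0.9, 0.1, 0.0],
    "zh_stem_b": [0.0, 0.2, 0.9],
}
ug_units = {
    "ug_stem_x": [0.1, 0.1, 1.0],
    "ug_stem_y": [1.0, 0.0, 0.1],
    "ug_suffix_z": [0.0, 1.0, 0.0],
}

pairs = induce_dictionary(zh_units, ug_units)
print(pairs)
```

With morpheme-level units, as in the toy tables above, stems and affixes get separate vectors, so a stem can be matched without every inflected surface form needing its own entry.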


Pages: 476-481

Page count: 6

