Chinese-Uyghur Bilingual Lexicon Extraction Based on Weak Supervision

Cited by: 4

Authors:

Aysa, Anwar [1]

Ablimit, Mijit [1]

Yilahun, Hankiz [1]

Hamdulla, Askar [1]

Affiliations:

[1] Xinjiang Univ, Coll Informat Sci & Engn, Urumqi 830046, Peoples R China

Source:

INFORMATION | 2022 / Vol. 13 / Issue 04

Keywords:

bilingual dictionary; seed dictionary; cross-language word embedding

DOI:

10.3390/info13040175

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

Bilingual lexicon extraction is especially useful for low-resource languages, which can draw on the resources of high-resource languages. Uyghur is a derivative language whose language resources are scarce and noisy. Moreover, it is difficult to find bilingual resources that would let it use the linguistic knowledge of resource-rich languages such as Chinese or English. Little research addresses unsupervised extraction for the Chinese-Uyghur language pair, and existing methods mainly extract terms from translated parallel corpora. Accordingly, unsupervised knowledge extraction methods are effective, especially for low-resource languages. This paper proposes a method to extract a Chinese-Uyghur bilingual dictionary by combining the inter-word relationship matrix mapped by neural-network cross-language word embedding vectors. A seed dictionary serves as the weak supervision signal, and a small Chinese-Uyghur parallel data resource is used to map the word vectors of both languages into a unified vector space. Because the word units of the two languages do not align well, stems are used as the main linguistic units. The strong inter-word semantic relationships of the word vectors are used to associate Chinese and Uyghur semantic information. Two retrieval criteria, nearest-neighbor retrieval and cross-domain similarity local scaling (CSLS), are used to compute similarity and extract the bilingual dictionary. Experimental results show that the proposed Chinese-Uyghur bilingual dictionary extraction method reaches an accuracy of 65.06%. The method can help improve Chinese-Uyghur machine translation, automatic knowledge extraction, and multilingual translation.
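The pipeline outlined in the abstract (a seed dictionary that maps two embedding spaces into one, followed by nearest-neighbor or CSLS retrieval) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how the mapping is learned, so orthogonal Procrustes is assumed here as one common choice. The data is synthetic, and every function and variable name is hypothetical.

```python
import numpy as np

def normalize(m):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def learn_mapping(src_seed, tgt_seed):
    """Assumed mapping step: orthogonal Procrustes on the seed pairs.
    Returns the orthogonal W that minimizes ||src_seed @ W - tgt_seed||_F."""
    u, _, vt = np.linalg.svd(src_seed.T @ tgt_seed)
    return u @ vt

def nn_retrieve(src, tgt):
    """Nearest-neighbor retrieval: best target index for each source row."""
    return np.argmax(src @ tgt.T, axis=1)

def csls_retrieve(src, tgt, k=10):
    """CSLS retrieval: 2*cos(x, y) minus the mean similarity of x and of y
    to their k nearest neighbors in the other language."""
    sim = src @ tgt.T
    k = min(k, sim.shape[0], sim.shape[1])
    r_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # per source word
    r_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # per target word
    csls = 2 * sim - r_src[:, None] - r_tgt[None, :]
    return np.argmax(csls, axis=1)

# Synthetic stand-in for real embeddings: the "source" space is a rotated,
# slightly noisy copy of the "target" space, so word i translates to word i.
rng = np.random.default_rng(0)
n_words, dim, n_seed = 200, 32, 50
tgt_raw = rng.normal(size=(n_words, dim))
rotation = np.linalg.qr(rng.normal(size=(dim, dim)))[0]
src_raw = tgt_raw @ rotation.T + 0.01 * rng.normal(size=(n_words, dim))

src_vecs, tgt_vecs = normalize(src_raw), normalize(tgt_raw)

# The first n_seed pairs play the role of the seed dictionary.
W = learn_mapping(src_vecs[:n_seed], tgt_vecs[:n_seed])
mapped = normalize(src_vecs @ W)

# Evaluate on the words that were not in the seed dictionary.
gold = np.arange(n_seed, n_words)
nn_acc = float(np.mean(nn_retrieve(mapped[n_seed:], tgt_vecs) == gold))
csls_acc = float(np.mean(csls_retrieve(mapped[n_seed:], tgt_vecs) == gold))
print(f"NN accuracy: {nn_acc:.2f}, CSLS accuracy: {csls_acc:.2f}")
```

On real embeddings the two spaces are only approximately related by a rotation. CSLS is typically preferred over plain nearest-neighbor search in that setting because it penalizes "hub" words that are close to many queries.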

Pages: 18
