Bilingual word embedding fusion for robust unsupervised bilingual lexicon induction

Cited by: 3
Authors
Cao, Hailong [1 ]
Zhao, Tiejun [1 ]
Wang, Weixuan [2 ]
Peng, Wei [2 ]
Affiliations
[1] Harbin Inst Technol, Sch Comp Sci & Technol, Harbin 150001, Heilongjiang, Peoples R China
[2] Huawei Technol Co Ltd, Artificial Intelligence Applicat Res Ctr, Shenzhen, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
Unsupervised learning; Word translation; Unsupervised bilingual lexicon induction; Embedding fusion; Information fusion;
DOI
10.1016/j.inffus.2023.101818
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Great progress has been made in unsupervised bilingual lexicon induction (UBLI) by aligning source and target word embeddings trained independently on monolingual corpora. Most UBLI models assume that the embedding spaces of the two languages are approximately isomorphic (i.e., similar in geometric structure). Performance is therefore bounded by the degree of isomorphism, especially for etymologically and typologically distant languages, for which near-zero UBLI results have been reported. To address this problem, we propose a method that increases isomorphism through bilingual word embedding fusion: features from the source embeddings are integrated into the target embeddings, and vice versa, so that the resulting structures of the two embedding spaces resemble each other. The method requires no supervision and can be applied to any language pair. On a benchmark bilingual lexicon induction dataset, our approach achieves competitive or superior performance compared to state-of-the-art methods, with particularly strong results on distant languages.
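The fusion idea in the abstract can be sketched in code. The snippet below is a hypothetical illustration, not the authors' actual algorithm: it assumes an already-learned orthogonal mapping `W` between the spaces (in practice produced by an unsupervised aligner such as VecMap or MUSE), and fuses each embedding with the nearest embedding from the other language via a weighted average, so both spaces absorb features from each other.

```python
import numpy as np

# Toy data: source embeddings X (5 words) and target embeddings Z (6 words),
# both in a 4-dimensional space. A real system would load fastText vectors.
rng = np.random.default_rng(0)
d = 4
X = rng.normal(size=(5, d))
Z = rng.normal(size=(6, d))

# Placeholder orthogonal map from source space to target space.
# In practice this is learned without supervision; identity keeps the demo simple.
W = np.eye(d)

def nearest(queries, keys):
    """For each query vector, return its cosine-nearest key vector."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    return keys[np.argmax(q @ k.T, axis=1)]

def fuse(E, cross, alpha=0.5):
    """Blend each embedding with features drawn from the other space."""
    return alpha * E + (1 - alpha) * cross

# Source embeddings enriched with target-side features, and vice versa.
X_fused = fuse(X @ W, nearest(X @ W, Z))
Z_fused = fuse(Z, nearest(Z, X @ W))

print(X_fused.shape, Z_fused.shape)
```

With `alpha=1` the fusion degenerates to the original embeddings; smaller values pull the two spaces toward a shared geometry, which is the property the paper exploits to raise the degree of isomorphism before induction.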
Pages: 11