Persian Sentiment Analysis without Training Data Using Cross-Lingual Word Embeddings

Cited: 3
Authors
Aliramezani, Mohammad [1 ]
Doostmohammadi, Ehsan [1 ]
Bokaei, Mohammad Hadi [2 ]
Sameti, Hossein [3 ]
Affiliations
[1] Sharif Univ Technol, Computat Linguist Grp, Tehran, Iran
[2] ICT Res Inst, Informat Technol Dept, Tehran, Iran
[3] Sharif Univ Technol, Comp Engn Dept, Tehran, Iran
Source
2020 10TH INTERNATIONAL SYMPOSIUM ON TELECOMMUNICATIONS (IST) | 2020
Keywords
cross-lingual space; natural language processing; sentiment analysis; word embeddings;
DOI
10.1109/IST50524.2020.9345882
Chinese Library Classification (CLC)
TN [Electronic Technology; Communication Technology]
Subject Classification Code
0809
Abstract
In this paper, a low-cost Persian sentiment analysis system is built without any Persian training data. A cross-lingual method is proposed to overcome the shortage of labeled Persian sentiment analysis datasets by using English as a high-resource language. A cross-lingual model between English and Persian is trained to generate aligned word embeddings, which are then used as feature vectors in the sentiment model. The monolingual word embeddings used in the cross-lingual approach are English FastText and Persian GloVe. The VecMap method is used as the cross-lingual tool to align the English and Persian word embeddings in supervised mode, with a 5,000-word English-Persian bilingual dictionary serving as the supervision. Bilingual lexicon induction evaluation shows that English and Persian are properly aligned in the joint space. The proposed sentiment analysis model is trained on an English dataset and then tested on Persian using the aligned English-Persian word embeddings. The Amazon Fine Food Reviews dataset is used as training data, and the Persian Snapp Food dataset as test data. The model performs well on the sentiment analysis task even though no Persian data is used in the training procedure, achieving an F1-score of 78.16% on the Persian test set.
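The supervised alignment step the abstract describes can be illustrated with a minimal sketch. In its supervised mode, VecMap learns an orthogonal mapping between the two embedding spaces from a bilingual dictionary, which amounts to solving the orthogonal Procrustes problem via an SVD. The sketch below uses toy random vectors in place of the real FastText/GloVe embeddings and dictionary; all variable names are illustrative assumptions, not part of the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50        # toy embedding dimension (real FastText/GloVe are 300-d)
n_pairs = 5000  # size of the bilingual dictionary, matching the paper

# Simulate paired dictionary vectors: pretend the Persian vectors are an
# unknown rotation of the English ones plus a little noise.
X_en = rng.standard_normal((n_pairs, dim))            # English word vectors
true_rot, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
Y_fa = X_en @ true_rot + 0.01 * rng.standard_normal((n_pairs, dim))

# Orthogonal Procrustes: W = argmin_{W orthogonal} ||X_en W - Y_fa||_F,
# solved in closed form from the SVD of X_en^T Y_fa.
U, _, Vt = np.linalg.svd(X_en.T @ Y_fa)
W = U @ Vt

# After alignment, mapped English vectors sit near their Persian
# translations, so a sentiment classifier trained on English features can
# be applied directly to Persian inputs.
alignment_error = np.linalg.norm(X_en @ W - Y_fa) / np.linalg.norm(Y_fa)
print(round(alignment_error, 3))  # small relative residual
```

With real embeddings, the same mapped space is what lets the classifier train on Amazon Fine Food Reviews and evaluate on Snapp Food without any Persian labels.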
Pages: 78-82
Page count: 5