Overcoming Rare-Language Discrimination in Multi-Lingual Sentiment Analysis

Cited by: 4
Authors
Lampert, Jasmin [1]
Lampert, Christoph H. [2]
Affiliations
[1] AIT Austrian Inst Technol, Ctr Digital Safety & Secur, Competence Unit Data Sci & Artificial Intelligenc, Vienna, Austria
[2] Inst Sci & Technol Austria IST Austria, Machine Learning & Comp Vis Grp, Klosterneuburg, Austria
Source
2021 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA) | 2021
Keywords
sentiment analysis; algorithmic fairness; multi-lingual sentence embeddings; self-annotation; natural language processing; social media;
DOI
10.1109/BigData52589.2021.9672003
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The digitalization of almost all aspects of everyday life has made unprecedented amounts of data freely available on the Internet. In particular, social media platforms provide rich sources of user-generated data, though typically in unstructured form and with high diversity, for example written in many different languages. Automatically identifying meaningful information in such big data resources and extracting it efficiently is an ongoing challenge. A common step toward this is sentiment analysis, which forms the foundation for tasks such as opinion mining and trend prediction. Unfortunately, publicly available tools for this task exist almost exclusively for English-language texts. Consequently, a large fraction of Internet users who do not communicate in English is ignored in automated studies, a phenomenon called rare-language discrimination. In this work, we propose a technique to overcome this problem with a truly multi-lingual model that can be trained automatically, without linguistic knowledge or even the ability to read the many target languages. The main step is to combine self-annotation, specifically the use of emoticons as a proxy for labels, with multi-lingual sentence representations. To evaluate our method, we curated several large datasets from data obtained via the free Twitter streaming API. The results show that the proposed multi-lingual training achieves sentiment predictions of the same quality for rare languages as for frequent ones, and in particular clearly better than mono-lingual training achieves on the same data.
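The self-annotation step named in the abstract (emoticons as a proxy for sentiment labels) might look roughly like the following. This is only a minimal sketch, not the authors' implementation: the emoticon sets, the function name self_annotate, and the rule of discarding texts with no or conflicting emoticons are all assumptions made for illustration. The multi-lingual sentence embedding step is not shown.

```python
import re

# Hypothetical emoticon sets; the lists actually used in the paper are not given here.
POSITIVE = {":)", ":-)", ":D", "😊", "😀"}
NEGATIVE = {":(", ":-(", "😢", "😠"}

# Match longer emoticons first so that no shorter one shadows them.
_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(POSITIVE | NEGATIVE, key=len, reverse=True))
)


def self_annotate(text):
    """Return (cleaned_text, label) with label 1 = positive, 0 = negative.

    Returns None when the text has no emoticon, has emoticons of both
    polarities, or consists of emoticons only (assumed filtering rules).
    """
    found = set(_PATTERN.findall(text))
    has_pos = bool(found & POSITIVE)
    has_neg = bool(found & NEGATIVE)
    if has_pos == has_neg:  # neither polarity, or both at once
        return None
    # Remove the emoticons so a model cannot simply read the label off the input.
    cleaned = " ".join(_PATTERN.sub(" ", text).split())
    if not cleaned:
        return None
    return cleaned, (1 if has_pos else 0)
```

Because the labeling rule looks only at emoticons, it can be applied to text in any language without reading it, which is the property the abstract relies on. The cleaned texts would then be passed to a multi-lingual sentence encoder of one's choice before training a classifier.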
Pages: 5185-5192
Number of pages: 8