Twitter Dataset and Evaluation of Transformers for Turkish Sentiment Analysis

Cited by: 16
Authors
Koksal, Abdullatif [1]
Ozgur, Arzucan [1]
Affiliations
[1] Bogazici Univ, Bilgisayar Muhendisligi Bolumu, Istanbul, Turkey
Source
29TH IEEE CONFERENCE ON SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS (SIU 2021) | 2021
Keywords
sentiment analysis; Turkish dataset; Twitter; BounTi; transformers; BERT
DOI
10.1109/SIU53274.2021.9477814
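Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract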
Sentiment analysis is one of the key topics in Natural Language Processing and supports applications ranging from social media analysis to stock market prediction. Sentiment analysis datasets are generally collected by semi-supervision from shopping or review websites: they are constructed by mapping users' text reviews to the scores those users gave. However, these datasets might contain errors due to the automatic mapping, and they generally lack the characteristic features of social media text such as emojis, slang, and typos. To address these problems, one of the first manually annotated Turkish sentiment analysis datasets from Twitter is proposed. The BounTi dataset contains Turkish tweets about specific universities in Turkey. Furthermore, the performance of multilingual and Turkish transformer models such as mBERT, XLM-RoBERTa, and BERTurk is analyzed on this dataset. The best proposed model is based on BERTurk and achieves a macro-averaged recall of 0.729 on the test set. Finally, a social media analysis demonstration with the best model is performed on Turkish tweets about a food brand. The BounTi dataset, fine-tuned models, and related scripts are publicly released.
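For readers unfamiliar with the metric reported in the abstract, macro-averaged recall is the unweighted mean of the per-class recall values, so every sentiment class counts equally regardless of how many tweets it has. The following is a minimal sketch in Python, not the authors' evaluation script; the function name `macro_recall`, the three label names, and the toy predictions are assumptions made purely for illustration.

```python
from collections import defaultdict


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    true_pos = defaultdict(int)  # correctly predicted items per gold class
    support = defaultdict(int)   # number of gold items per class
    for gold, pred in zip(y_true, y_pred):
        support[gold] += 1
        if gold == pred:
            true_pos[gold] += 1
    recalls = [true_pos[c] / support[c] for c in support]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels (not taken from the BounTi dataset).
y_true = ["pos", "pos", "neg", "neg", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "pos", "neu"]
score = macro_recall(y_true, y_pred)
print(round(score, 3))
```

In this toy case the per-class recalls are 1/2, 2/3, and 1/1, and the score is their plain average. Because each class contributes equally, a model cannot reach a high score by predicting only the most frequent class.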
Pages: 4