RuSentiTweet: a sentiment analysis dataset of general domain tweets in Russian

Cited by: 4
Author
Smetanin, Sergey [1]
Affiliation
[1] National Research University Higher School of Economics, Graduate School of Business, Department of Business Informatics, Moscow, Russia
Keywords
Sentiment dataset; Sentiment analysis; Russian; Twitter; Recognition; Emotion
DOI
10.7717/peerj-cs.1039
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The Russian language is still not as well-resourced as English, especially in the field of sentiment analysis of Twitter content. Though several sentiment analysis datasets of tweets in Russian exist, they are all either automatically annotated or manually annotated by a single annotator. As a result, they either lack inter-annotator agreement or are annotated with a focus on a specific domain. In this article, we present RuSentiTweet, a new sentiment analysis dataset of general domain tweets in Russian. RuSentiTweet is currently the largest in its class for Russian, with 13,392 tweets manually annotated with moderate inter-rater agreement into five classes: Positive, Neutral, Negative, Speech Act, and Skip. As a source of data, we used Twitter Stream Grab, a historical collection of tweets obtained from the general Twitter API stream, which provides a 1% sample of public tweets. Additionally, we released a RuBERT-based sentiment classification model that achieved F1 = 0.6594 on the test subset.
Pages: 19