Social Media Topic Classification on Greek Reddit

Cited by: 3
Authors
Mastrokostas, Charalampos [1]
Giarelis, Nikolaos [1]
Karacapilidis, Nikos [1]
Affiliations
[1] Univ Patras, Ind Management & Informat Syst Lab, MEAD, Rion 26504, Greece
Keywords
Greek language; deep learning; large language models; machine learning; natural language processing; transformers; text classification; Greek NLP resources; social media
DOI
10.3390/info15090521
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Text classification (TC) is a subtask of natural language processing (NLP) that categorizes text pieces into predefined classes based on their textual content and thematic aspects. This process typically includes the training of a Machine Learning (ML) model on a labeled dataset, where each text example is associated with a specific class. Recent progress in Deep Learning (DL) enabled the development of deep neural transformer models, surpassing traditional ML ones. In any case, works of the topic classification literature prioritize high-resource languages, particularly English, while research efforts for low-resource ones, such as Greek, are limited. Taking the above into consideration, this paper presents: (i) the first Greek social media topic classification dataset; (ii) a comparative assessment of a series of traditional ML models trained on this dataset, utilizing an array of text vectorization methods including TF-IDF, classical word and transformer-based Greek embeddings; (iii) a fine-tuned GREEK-BERT-based TC model on the same dataset; (iv) key empirical findings demonstrating that transformer-based embeddings significantly increase the performance of traditional ML models, while our fine-tuned DL model outperforms previous ones. The dataset, the best-performing model, and the experimental code are made public, aiming to augment the reproducibility of this work and advance future research in the field.
Pages: 18