Social Media Topic Classification on Greek Reddit

Cited by: 3
Authors
Mastrokostas, Charalampos [1 ]
Giarelis, Nikolaos [1 ]
Karacapilidis, Nikos [1 ]
Affiliations
[1] Univ Patras, Ind Management & Informat Syst Lab, MEAD, Rion 26504, Greece
Keywords
Greek language; deep learning; large language models; machine learning; natural language processing; transformers; text classification; Greek NLP resources; social media;
DOI
10.3390/info15090521
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Subject classification code
0812
Abstract
Text classification (TC) is a subtask of natural language processing (NLP) that categorizes text pieces into predefined classes based on their textual content and thematic aspects. This process typically includes the training of a Machine Learning (ML) model on a labeled dataset, where each text example is associated with a specific class. Recent progress in Deep Learning (DL) has enabled the development of deep neural transformer models that surpass traditional ML ones. However, work in the topic classification literature prioritizes high-resource languages, particularly English, while research efforts for low-resource ones, such as Greek, remain limited. Taking the above into consideration, this paper presents: (i) the first Greek social media topic classification dataset; (ii) a comparative assessment of a series of traditional ML models trained on this dataset, utilizing an array of text vectorization methods, including TF-IDF, classical word embeddings, and transformer-based Greek embeddings; (iii) a GREEK-BERT-based TC model fine-tuned on the same dataset; (iv) key empirical findings demonstrating that transformer-based embeddings significantly increase the performance of traditional ML models, while our fine-tuned DL model outperforms previous ones. The dataset, the best-performing model, and the experimental code are made publicly available to support the reproducibility of this work and advance future research in the field.
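To make the two modelling routes described in the abstract concrete, the following is a minimal sketch (not the authors' released code) of the traditional ML route: TF-IDF vectorization feeding a linear classifier via scikit-learn. The toy Greek posts and topic labels are illustrative placeholders for the paper's Greek Reddit dataset.

```python
# Sketch of the traditional ML route: TF-IDF features + linear classifier.
# The texts/labels below are illustrative placeholders, not the paper's data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = [
    "Ο χθεσινός αγώνας ήταν συναρπαστικός.",            # sports
    "Ψάχνω προτάσεις για φθηνό laptop για σπουδές.",    # technology
    "Η νέα ταινία είχε εξαιρετική φωτογραφία.",         # entertainment
    "Ποια ομάδα θα πάρει φέτος το πρωτάθλημα;",         # sports
]
labels = ["sports", "technology", "entertainment", "sports"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(texts, labels)
print(clf.predict(["Ποιο κινητό αξίζει να αγοράσω;"]))
```

A similarly hedged sketch of the DL route fine-tunes the publicly released GREEK-BERT checkpoint (nlpaueb/bert-base-greek-uncased-v1) for sequence classification with the Hugging Face Trainer API; the label ids, hyperparameters, and tiny in-memory dataset are assumptions for illustration, not the paper's configuration.

```python
# Sketch of fine-tuning GREEK-BERT for topic classification with Hugging Face.
# Data, label ids, and hyperparameters are illustrative only.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

data = Dataset.from_dict({
    "text": [
        "Ο χθεσινός αγώνας ήταν συναρπαστικός.",
        "Ψάχνω προτάσεις για φθηνό laptop για σπουδές.",
        "Η νέα ταινία είχε εξαιρετική φωτογραφία.",
        "Ποια ομάδα θα πάρει φέτος το πρωτάθλημα;",
    ],
    "label": [0, 1, 2, 0],  # hypothetical topic ids: 0=sports, 1=technology, 2=entertainment
})

model_name = "nlpaueb/bert-base-greek-uncased-v1"  # GREEK-BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

encoded = data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="greek-bert-topic", num_train_epochs=1,
                         per_device_train_batch_size=2, logging_steps=1)
trainer = Trainer(model=model, args=args, train_dataset=encoded)
trainer.train()
```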
Pages: 18