Social Media Topic Classification on Greek Reddit

Citations: 1
Authors
Mastrokostas, Charalampos [1]
Giarelis, Nikolaos [1]
Karacapilidis, Nikos [1]
Affiliations
[1] Univ Patras, Ind Management & Informat Syst Lab, MEAD, Rion 26504, Greece
Keywords
Greek language; deep learning; large language models; machine learning; natural language processing; transformers; text classification; Greek NLP resources; social media
DOI
10.3390/info15090521
Chinese Library Classification (CLC) code
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Text classification (TC) is a subtask of natural language processing (NLP) that assigns pieces of text to predefined classes based on their textual content and thematic aspects. This process typically involves training a Machine Learning (ML) model on a labeled dataset, where each text example is associated with a specific class. Recent progress in Deep Learning (DL) has enabled the development of deep neural transformer models that surpass traditional ML ones. However, the topic classification literature prioritizes high-resource languages, particularly English, while research on low-resource languages, such as Greek, remains limited. To address this gap, this paper presents: (i) the first Greek social media topic classification dataset; (ii) a comparative assessment of a series of traditional ML models trained on this dataset, using an array of text vectorization methods including TF-IDF, classical word embeddings, and transformer-based Greek embeddings; (iii) a GREEK-BERT-based TC model fine-tuned on the same dataset; (iv) key empirical findings demonstrating that transformer-based embeddings significantly increase the performance of traditional ML models, while our fine-tuned DL model outperforms the previous ones. The dataset, the best-performing model, and the experimental code are made publicly available to support the reproducibility of this work and advance future research in the field.
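As a rough illustration of the TF-IDF route mentioned in the abstract, the following minimal sketch vectorizes a few posts with TF-IDF and classifies a new post by cosine similarity to per-class centroids. It is not the authors' pipeline: the nearest-centroid classifier stands in for the unspecified traditional ML models, the smoothed IDF formula is one common variant, and the toy posts, the topic labels (`sports`, `technology`), and all function names are invented for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer; real Greek text would need proper normalization.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency: log((1 + n) / (1 + df)) + 1.
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def vectorize(tokens, idf):
    # L2-normalized sparse TF-IDF vector; out-of-vocabulary terms are ignored.
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def train_centroids(texts, labels):
    # Average the TF-IDF vectors of each class into one centroid per class.
    docs = [tokenize(t) for t in texts]
    idf = fit_idf(docs)
    counts = Counter(labels)
    sums = defaultdict(lambda: defaultdict(float))
    for tokens, label in zip(docs, labels):
        for term, value in vectorize(tokens, idf).items():
            sums[label][term] += value
    centroids = {
        label: {term: value / counts[label] for term, value in vec.items()}
        for label, vec in sums.items()
    }
    return idf, centroids


def predict(text, idf, centroids):
    # Pick the class whose centroid has the largest dot product with the post.
    vec = vectorize(tokenize(text), idf)

    def score(label):
        return sum(v * centroids[label].get(t, 0.0) for t, v in vec.items())

    return max(sorted(centroids), key=score)


# Toy training posts (invented for illustration, not taken from the dataset).
texts = [
    "η ομάδα κέρδισε τον αγώνα",
    "ο προπονητής άλλαξε την ομάδα",
    "νέο κινητό με καλύτερη κάμερα",
    "ο επεξεργαστής του κινητού είναι γρήγορος",
]
labels = ["sports", "sports", "technology", "technology"]

idf, centroids = train_centroids(texts, labels)
print(predict("η ομάδα έπαιξε καλά", idf, centroids))
print(predict("το κινητό έχει κάμερα", idf, centroids))
```

Because the tokenizer only matches exact surface forms, inflected variants such as "κινητό" and "κινητού" are treated as unrelated terms. That limitation is one plausible reason embedding-based representations help for a morphologically rich language like Greek.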
Pages: 18