Multidimensional Affective Analysis for Low-Resource Languages: A Use Case with Guarani-Spanish Code-Switching Language

Cited by: 4
Authors
Aguero-Torales, Marvin M. [1, 3]
Lopez-Herrera, Antonio G. [1 ]
Vilares, David [2 ]
Affiliations
[1] Univ Granada, Dept Comp Sci & Artificial Intelligence, Calle Daniel Saucedo Aranda S-N, Granada 18071, Granada, Spain
[2] Univ A Coruna, Dept Comp Sci & Informat Technol, CITIC, Campus Elvina S-N, La Coruna 15008, A Coruna, Spain
[3] Global CoE Data Intelligence, Camino Cerro Gamos 1, Madrid 28224, Spain
Funding
European Research Council
Keywords
Natural language processing; Sentiment analysis; Affective analysis; Code-switching; Low-resource languages
DOI
10.1007/s12559-023-10165-0
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper focuses on text-based affective computing for Jopara, a code-switching language that combines Guarani and Spanish. First, we collected a dataset of tweets primarily written in Guarani and annotated them for three widely used dimensions in sentiment analysis: (a) emotion recognition, (b) humor detection, and (c) offensive language identification. Then, we developed several neural network models, including large language models specifically designed for Guarani, and compared their performance against off-the-shelf multilingual and Spanish pre-trained models for the aforementioned dimensions. Our experiments show that language models incorporating Guarani during pre-training or pre-fine-tuning consistently achieve the best results, despite limited resources (a single 24-GB GPU and only 800K tokens). Notably, even a Guarani BERT model with just two layers of Transformers shows a favorable balance between accuracy and computational power, likely due to the inherent low-resource nature of the task. We present a comprehensive overview of corpus creation and model development for low-resource languages like Guarani, particularly in the context of its code-switching with Spanish, resulting in Jopara. Our findings shed light on the challenges and strategies involved in analyzing affective language in such linguistic contexts.
Pages: 1391-1406
Page count: 16