Spanish MEACorpus 2023: A multimodal speech-text corpus for emotion analysis in Spanish from natural environments

Cited by: 4
Authors
Pan, Ronghao [1 ]
Garcia-Diaz, Jose Antonio [1 ]
Rodriguez-Garcia, Miguel Angel [2]
Valencia-Garcia, Rafael [1]
Affiliations
[1] Univ Murcia, Dept Informat & Sistemas, Campus Espinardo, Murcia 30100, Murcia, Spain
[2] Univ Rey Juan Carlos, Dept Ciencias Comp, Calle Tulipan s/n, Mostoles 28933, Madrid, Spain
Keywords
Multimodal emotion analysis; Deep learning; Speech emotion analysis; Transformers; Text classification; Natural language processing; Facial expressions; Recognition
DOI
10.1016/j.csi.2024.103856
Chinese Library Classification (CLC)
TP3 [Computing technology; computer technology]
Discipline code
0812
Abstract
In human-computer interaction, emotion recognition provides a deeper understanding of the user's emotions, enabling empathetic and effective responses based on the user's emotional state. Although deep learning models have improved emotion recognition, it remains an active area of research. One important limitation is that most emotion recognition systems use only text as input, ignoring features such as voice intonation. Another limitation is the small number of datasets available for multimodal emotion recognition. In addition, most published datasets contain emotions that are simulated by professionals and produce limited results in real-world scenarios. For other languages, such as Spanish, hardly any datasets are available. Therefore, our contributions to emotion recognition are as follows. First, we compile and annotate a new corpus for multimodal emotion recognition in Spanish (Spanish MEACorpus 2023), which contains 13.16 h of speech divided into 5129 segments labeled according to Ekman's six basic emotions. The dataset is extracted from YouTube videos recorded in natural environments. Second, we explore several deep learning models for emotion recognition using text- and audio-based features. Third, we evaluate different multimodal techniques to build a multimodal recognition system that improves on the results of the unimodal models, achieving a Macro F1-score of 87.745% using a late-fusion approach with a concatenation strategy.
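The fusion step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the random "encoders" and the embedding dimensions (768 for text, 512 for audio) are hypothetical stand-ins for the transformer-based text and audio models the paper evaluates; the only point shown is how the unimodal representations are concatenated before a final classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

def text_encoder(n_segments: int, dim: int = 768) -> np.ndarray:
    """Stand-in for a text transformer: one embedding per speech segment."""
    return rng.standard_normal((n_segments, dim))

def audio_encoder(n_segments: int, dim: int = 512) -> np.ndarray:
    """Stand-in for an audio model: one embedding per speech segment."""
    return rng.standard_normal((n_segments, dim))

def fuse_by_concatenation(text_emb: np.ndarray, audio_emb: np.ndarray) -> np.ndarray:
    """Late fusion with a concatenation strategy: join the unimodal
    representations along the feature axis; a classifier over the
    fused vector would then predict one of Ekman's six emotions."""
    return np.concatenate([text_emb, audio_emb], axis=1)

n = 4  # number of speech segments in a toy batch
fused = fuse_by_concatenation(text_encoder(n), audio_encoder(n))
print(fused.shape)  # (4, 1280): 768 text dims + 512 audio dims
```

A classifier head (e.g. a small feed-forward network) trained on `fused` completes the multimodal pipeline; the concatenation itself is the fusion strategy the abstract refers to.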
Pages: 13