Multimodal Emotion Recognition on RAVDESS Dataset Using Transfer Learning

Cited: 57
Authors
Luna-Jimenez, Cristina [1 ]
Griol, David [2 ]
Callejas, Zoraida [2 ]
Kleinlein, Ricardo [1 ]
Montero, Juan M. [1 ]
Fernandez-Martinez, Fernando [1 ]
Affiliations
[1] Univ Politecn Madrid, Grp Tecnol Habla & Aprendizaje Automat THAU Grp, Informat Proc & Telecommun Ctr, ETSI Telecomunicac, Avda Complutense 30, Madrid 28040, Spain
[2] Univ Granada, Dept Software Engn, CITIC UGR, Periodista Daniel Saucedo Aranda S-N, Granada 18071, Spain
Keywords
audio-visual emotion recognition; human-computer interaction; computational paralinguistics; spatial transformers; transfer learning; speech emotion recognition; facial emotion recognition; sentiment analysis; audio
DOI
10.3390/s21227665
Chinese Library Classification (CLC)
O65 [Analytical Chemistry]
Subject Classification Codes
070302; 081704
Abstract
Emotion recognition is attracting the attention of the research community because of the many areas where it can be applied, such as healthcare and road-safety systems. In this paper, we propose a multimodal emotion recognition system that relies on speech and facial information. For the speech-based modality, we evaluated several transfer-learning techniques, specifically embedding extraction and fine-tuning. The best accuracy was achieved by fine-tuning the CNN-14 of the PANNs framework, confirming that training is more robust when it does not start from scratch and the source and target tasks are similar. For the facial emotion recognizer, we propose a framework consisting of a Spatial Transformer Network pre-trained on saliency maps and facial images, followed by a bi-LSTM with an attention mechanism. Error analysis showed that frame-based systems can struggle when applied directly to a video-based task, even after domain adaptation, which opens a new line of research into correcting this mismatch while still exploiting the knowledge embedded in these pre-trained models. Finally, by combining the two modalities with a late-fusion strategy, we achieved 80.08% accuracy on the RAVDESS dataset under a subject-wise 5-fold cross-validation (5-CV) evaluation, classifying eight emotions. The results reveal that both modalities carry relevant information about the user's emotional state and that combining them improves system performance.
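The paper's models are not reproduced here, but the two protocol details named in the abstract, late fusion of per-modality posteriors and subject-wise 5-CV, can be sketched briefly. The Python snippet below is a minimal illustration assuming each modality already outputs per-clip class posteriors; the placeholder arrays, the equal 0.5 fusion weight, and the actor grouping are assumptions for illustration, not the authors' implementation.

import numpy as np
from sklearn.model_selection import GroupKFold

# Placeholder posteriors for the 1440 RAVDESS speech clips (24 actors x 60
# clips, 8 emotion classes); in practice these would come from the trained
# speech and face models.
rng = np.random.default_rng(0)
speech_probs = rng.random((1440, 8))
face_probs = rng.random((1440, 8))
labels = rng.integers(0, 8, 1440)        # placeholder emotion labels
actors = np.repeat(np.arange(24), 60)    # actor ID per clip

def late_fusion(p_speech, p_face, w=0.5):
    # Weighted average of the two modalities' class posteriors;
    # w = 0.5 is an assumed, not reported, fusion weight.
    return w * p_speech + (1.0 - w) * p_face

# Subject-wise 5-fold CV: GroupKFold guarantees that no actor appears
# in both the training and the test folds.
cv = GroupKFold(n_splits=5)
fold_accs = []
for train_idx, test_idx in cv.split(speech_probs, labels, groups=actors):
    fused = late_fusion(speech_probs[test_idx], face_probs[test_idx])
    fold_accs.append((fused.argmax(axis=1) == labels[test_idx]).mean())
print(f"mean subject-wise 5-CV accuracy: {np.mean(fold_accs):.4f}")

With random placeholders the printed accuracy is near chance (12.5% for eight classes); the point of the sketch is the evaluation structure, not the number.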
Pages: 29