MULTIMODAL TRANSFORMER WITH LEARNABLE FRONTEND AND SELF ATTENTION FOR EMOTION RECOGNITION

Cited by: 15

Authors:

Dutta, Soumya

Ganapathy, Sriram

Affiliations:

Source:

2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022

Keywords:

Multi-modal emotion recognition; Transformer networks; self-attention models; learnable front-end; sentiment analysis; fusion

DOI:

10.1109/ICASSP43922.2022.9747723

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

In this work, we propose a novel approach for multi-modal emotion recognition from conversations using speech and text. The audio representations are learned jointly with a learnable audio front-end (LEAF) model feeding to a CNN-based classifier. The text representations are derived from pre-trained bidirectional encoder representations from transformers (BERT) along with a gated recurrent unit (GRU) network. Both the textual and audio representations are separately processed using a bidirectional GRU network with self-attention. Further, the multi-modal information extraction is achieved using a transformer that takes the textual and audio embeddings at the utterance level as input. The experiments are performed on the IEMOCAP database, where we show that the proposed framework improves over the current state-of-the-art results under all the common test settings. This is primarily due to the improved emotion recognition performance achieved in the audio domain. Further, we also show that the model is more robust to textual errors caused by an automatic speech recognition (ASR) system.
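The abstract mentions self-attention over recurrent outputs followed by fusion of utterance-level audio and text embeddings. The following is a minimal sketch of that general idea only, not the authors' implementation: it assumes a simple scaled dot-product attention pooling with a hypothetical query vector, uses toy two-dimensional inputs, and leaves out the LEAF front-end, BERT, the GRU layers, and the fusion transformer itself. All names (`attention_pool`, `query`, `utterance_pair`) are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Collapse a sequence of vectors into a single vector.

    frames: list of T vectors (each a list of D floats), standing in for
            the per-step outputs of a recurrent encoder.
    query:  hypothetical learned vector of D floats that scores each step
            (an assumption for illustration, not taken from the paper).
    """
    dim = len(query)
    # Scaled dot-product score of every step against the query.
    scores = [
        sum(f[d] * query[d] for d in range(dim)) / math.sqrt(dim)
        for f in frames
    ]
    weights = softmax(scores)
    # Weighted sum of the steps gives one utterance-level embedding.
    pooled = [
        sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)
    ]
    return pooled, weights


# Toy stand-ins for encoder outputs of one utterance (made-up numbers).
audio_frames = [[0.1, 0.3], [0.9, 0.2], [0.4, 0.8]]
text_tokens = [[0.5, 0.5], [0.2, 0.7]]

audio_vec, audio_w = attention_pool(audio_frames, query=[1.0, 0.0])
text_vec, text_w = attention_pool(text_tokens, query=[0.0, 1.0])

# One embedding per modality: the kind of two-element input that a
# multi-modal fusion model could consume at the utterance level.
utterance_pair = [audio_vec, text_vec]
```

Pooling each modality to a fixed-size vector first keeps the fusion stage independent of the differing audio and text sequence lengths; whether the paper pools in exactly this way is not stated in the abstract.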


Pages: 6917-6921

Page count: 5
