MULTIMODAL TRANSFORMER WITH LEARNABLE FRONTEND AND SELF ATTENTION FOR EMOTION RECOGNITION

Cited by: 15
Authors
Dutta, Soumya
Ganapathy, Sriram
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
Multi-modal emotion recognition; Transformer networks; self-attention models; learnable front-end; SENTIMENT ANALYSIS; FUSION;
DOI
10.1109/ICASSP43922.2022.9747723
Chinese Library Classification (CLC)
O42 [Acoustics];
Subject Classification Codes
070206 ; 082403 ;
Abstract
In this work, we propose a novel approach to multi-modal emotion recognition from conversations using speech and text. The audio representations are learned jointly with a learnable audio front-end (LEAF) model feeding a CNN-based classifier. The text representations are derived from pre-trained bidirectional encoder representations from transformers (BERT) combined with a gated recurrent unit (GRU) network. The textual and audio representations are each processed by a bidirectional GRU network with self-attention. Multi-modal information extraction is then achieved with a transformer that takes the textual and audio embeddings as input at the utterance level. Experiments on the IEMOCAP database show that the proposed framework improves over the current state-of-the-art results under all the common test settings, primarily due to the improved emotion recognition performance achieved in the audio domain. We further show that the model is more robust to textual errors introduced by an automatic speech recognition (ASR) system.
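The pipeline described in the abstract (per-modality bidirectional GRUs with self-attention, followed by a transformer fusing utterance-level audio and text embeddings) can be sketched as below. This is a minimal illustrative sketch in PyTorch, not the authors' implementation: all dimensions, layer counts, head counts, and the 4-class output are assumptions, and the LEAF+CNN and BERT+GRU feature extractors are abstracted away as precomputed per-utterance embeddings.

```python
# Hypothetical sketch of the fusion stage described in the paper.
# Audio/text inputs are assumed to be precomputed utterance embeddings
# (e.g. from a LEAF+CNN audio branch and a BERT+GRU text branch).
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Bidirectional GRU over a sequence of utterance embeddings,
    followed by multi-head self-attention (assumed configuration)."""
    def __init__(self, in_dim, hidden=128, heads=4):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)

    def forward(self, x):                    # x: (batch, seq, in_dim)
        h, _ = self.gru(x)                   # (batch, seq, 2*hidden)
        out, _ = self.attn(h, h, h)          # self-attention across utterances
        return out

class FusionTransformer(nn.Module):
    """Transformer encoder over concatenated audio/text utterance embeddings."""
    def __init__(self, audio_dim, text_dim, hidden=128, n_classes=4):
        super().__init__()
        self.audio_enc = ModalityEncoder(audio_dim, hidden)
        self.text_enc = ModalityEncoder(text_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=4 * hidden, nhead=4,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(4 * hidden, n_classes)

    def forward(self, audio, text):
        a = self.audio_enc(audio)            # (batch, seq, 2*hidden)
        t = self.text_enc(text)              # (batch, seq, 2*hidden)
        fused = self.fusion(torch.cat([a, t], dim=-1))
        return self.head(fused)              # per-utterance emotion logits

model = FusionTransformer(audio_dim=64, text_dim=768)
audio = torch.randn(2, 10, 64)   # assumed audio feature dim per utterance
text = torch.randn(2, 10, 768)   # assumed text feature dim (BERT-sized)
logits = model(audio, text)
print(logits.shape)              # torch.Size([2, 10, 4])
```

The key design point mirrored here is late, utterance-level fusion: each modality is contextualized independently before the transformer attends across the concatenated embeddings.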
Pages: 6917-6921 (5 pages)
Related Papers
50 items in total
[41] Lian, Hailun; Lu, Cheng; Li, Sunan; Zhao, Yan; Tang, Chuangao; Zong, Yuan. A Survey of Deep Learning-Based Multimodal Emotion Recognition: Speech, Text, and Face. ENTROPY, 2023, 25 (10).
[42] Khalane, Aaishwarya; Makwana, Rikesh; Shaikh, Talal; Ullah, Abrar. Evaluating significant features in context-aware multimodal emotion recognition with XAI methods. EXPERT SYSTEMS, 2025, 42 (01).
[43] Khan, Mustaqeem; Tran, Phuong-Nam; Pham, Nhat Truong; El Saddik, Abdulmotaleb; Othmani, Alice. MemoCMT: multimodal emotion recognition using cross-modal transformer-based feature fusion. SCIENTIFIC REPORTS, 2025, 15 (01).
[44] Ahmed, Naveed; Al Aghbari, Zaher; Girija, Shini. A systematic survey on multimodal emotion recognition using learning algorithms. INTELLIGENT SYSTEMS WITH APPLICATIONS, 2023, 17.
[45] Pan, Jiahui; Fang, Weijie; Zhang, Zhihang; Chen, Bingzhi; Zhang, Zheng; Wang, Shuihua. Multimodal Emotion Recognition Based on Facial Expressions, Speech, and EEG. IEEE OPEN JOURNAL OF ENGINEERING IN MEDICINE AND BIOLOGY, 2024, 5: 396-403.
[46] Thebaud, Thomas; Favaro, Anna; Guan, Yaohan; Yang, Yuchen; Singh, Prabhav; Villalba, Jesus; Moro-Velazquez, Laureano; Dehak, Najim. Multimodal Emotion Recognition Harnessing the Complementarity of Speech, Language, and Vision. PROCEEDINGS OF THE 26TH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2024, 2024: 684-689.
[47] Abbas, Rizwan; Schuller, Bjorn W.; Li, Xuewei; Lin, Chi; Li, Xi. Emotion recognition in live broadcasting: a multimodal deep learning framework. MULTIMEDIA SYSTEMS, 2025, 31 (03).
[48] Zhang, Yuqing; Xie, Dongliang; Luo, Dawei; Sun, Baosheng. Modality emotion semantic correlation analysis for multimodal emotion recognition. COMPUTERS & ELECTRICAL ENGINEERING, 2025, 126.
[49] Wu, L.; Liu, Q.; Zhang, D.; Wang, J.; Li, S.; Zhou, G. Multimodal Emotion Recognition with Auxiliary Sentiment Information. Acta Scientiarum Naturalium Universitatis Pekinensis, 2020, 56 (01): 75-81.
[50] Chen, Bingzhi; Cao, Qi; Hou, Mixiao; Zhang, Zheng; Lu, Guangming; Zhang, David. Multimodal Emotion Recognition With Temporal and Semantic Consistency. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2021, 29: 3592-3603.