MULTIMODAL TRANSFORMER WITH LEARNABLE FRONTEND AND SELF ATTENTION FOR EMOTION RECOGNITION

Cited by: 11
Authors
Dutta, Soumya
Ganapathy, Sriram
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022
Keywords
Multi-modal emotion recognition; Transformer networks; self-attention models; learnable front-end; SENTIMENT ANALYSIS; FUSION;
DOI
10.1109/ICASSP43922.2022.9747723
Chinese Library Classification (CLC) number
O42 [Acoustics];
Subject classification codes
070206; 082403;
Abstract
In this work, we propose a novel approach for multi-modal emotion recognition from conversations using speech and text. The audio representations are learned jointly with a learnable audio front-end (LEAF) model feeding into a CNN-based classifier. The text representations are derived from pre-trained bidirectional encoder representations from transformers (BERT) along with a gated recurrent unit (GRU) network. Both the textual and audio representations are separately processed using a bidirectional GRU network with self-attention. Further, the multi-modal information extraction is achieved using a transformer that takes the textual and audio embeddings at the utterance level as input. The experiments are performed on the IEMOCAP database, where we show that the proposed framework improves over the current state-of-the-art results under all the common test settings. This is primarily due to the improved emotion recognition performance achieved in the audio domain. Further, we also show that the model is more robust to textual errors caused by an automatic speech recognition (ASR) system.
Pages: 6917 - 6921
Page count: 5
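
The abstract mentions self-attention over recurrent outputs to obtain utterance-level embeddings but does not give the exact formulation. As a rough illustration only, the sketch below shows one common form of attention pooling: each time step is scored against a query vector, the scores are normalized with a softmax, and the steps are averaged with those weights. The function name `attention_pool`, the dot-product scoring, and the `query` vector are assumptions made for this example and are not taken from the paper.

```python
import math


def attention_pool(frames, query):
    """Collapse a sequence of frame vectors into a single utterance vector.

    frames: list of equal-length lists of floats (one vector per time step).
    query:  list of floats, a hypothetical learned scoring vector
            (an assumption for this sketch, not the paper's parameterization).
    Returns (pooled_vector, attention_weights).
    """
    # Dot-product score of each frame against the query.
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]

    # Numerically stable softmax over time steps.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the frames, one output value per feature dimension.
    dim = len(frames[0])
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights


# Toy input: three 2-dimensional frames and an all-zero query,
# which gives every frame the same attention weight.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 0.0]
pooled, weights = attention_pool(frames, query)
```

In a setup like the one described, one such pooled vector per modality (audio and text) could then be passed to the fusion transformer; how the authors actually combine them is not specified in the abstract.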