An End-to-End Transformer with Progressive Tri-Modal Attention for Multi-modal Emotion Recognition

Cited by: 0
Authors
Wu, Yang [1 ]
Peng, Pai [1 ]
Zhang, Zhenyu [1 ]
Zhao, Yanyan [1 ]
Qin, Bing [1 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Peoples R China
Keywords
Multi-modal emotion recognition; Multi-modal transformer; Feature fusion
DOI
10.1007/978-981-99-8540-1_32
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent work on multi-modal emotion recognition is moving towards end-to-end models, which, unlike two-phase pipelines, can extract task-specific features supervised directly by the target task. In this paper, we propose ME2ET, a novel multi-modal end-to-end transformer for emotion recognition that effectively models tri-modal feature interactions among the textual, acoustic, and visual modalities at both the low level and the high level. At the low level, we propose progressive tri-modal attention, which models tri-modal feature interactions through a two-pass strategy and further leverages these interactions to significantly reduce computation and memory complexity by shortening the input token length. At the high level, we introduce a tri-modal feature fusion layer that explicitly aggregates the semantic representations of the three modalities. Experimental results on the CMU-MOSEI and IEMOCAP datasets show that ME2ET achieves state-of-the-art performance. Further in-depth analysis demonstrates the effectiveness, efficiency, and interpretability of the proposed tri-modal attention, which helps our model achieve better performance while significantly reducing computation and memory cost. (Our code is available at https://github.com/SCIR-MSA-Team/UFMAC.)
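The token-length-reduction idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' ME2ET implementation (see the linked repository for that); it is a hypothetical NumPy example in which a few learned query tokens cross-attend over each modality's long token sequence, pooling it into a short fixed-length summary before fusion, so that downstream attention operates on far fewer tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # scaled dot-product attention: each query attends over keys_values
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def compress_modality(tokens, n_queries, rng):
    # a small set of query tokens (randomly initialized here; learned in
    # a real model) pools a long token sequence into n_queries summaries
    queries = rng.standard_normal((n_queries, tokens.shape[-1]))
    return cross_attention(queries, tokens)

rng = np.random.default_rng(0)
d = 16
text  = rng.standard_normal((50, d))   # 50 textual tokens
audio = rng.standard_normal((200, d))  # 200 acoustic frames
video = rng.standard_normal((120, d))  # 120 visual frames

# compress each modality to 4 summary tokens, then fuse by concatenation
fused = np.concatenate(
    [compress_modality(m, 4, rng) for m in (text, audio, video)], axis=0
)
print(fused.shape)  # (12, 16): 12 fused tokens instead of 370
```

Since subsequent attention layers scale quadratically in token count, shrinking 370 input tokens to 12 summaries is where the computation and memory savings claimed in the abstract would come from; the progressive two-pass strategy and fusion layer of the actual paper are more elaborate than this sketch.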
Pages: 396-408
Page count: 13
Related Papers
50 entries
  • [1] DeepVANet: A Deep End-to-End Network for Multi-modal Emotion Recognition
    Zhang, Yuhao
    Hossain, Md Zakir
    Rahman, Shafin
    HUMAN-COMPUTER INTERACTION, INTERACT 2021, PT III, 2021, 12934 : 227 - 237
  • [2] Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
    Prakash, Aditya
    Chitta, Kashyap
    Geiger, Andreas
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 7073 - 7083
  • [3] TransOrga: End-To-End Multi-modal Transformer-Based Organoid Segmentation
    Qin, Yiming
    Li, Jiajia
    Chen, Yulong
    Wang, Zikai
    Huang, Yu-An
    You, Zhuhong
    Hu, Lun
    Hu, Pengwei
    Tan, Feng
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, ICIC 2023, PT III, 2023, 14088 : 460 - 472
  • [4] Multi-Modal Data Augmentation for End-to-End ASR
    Renduchintala, Adithya
    Ding, Shuoyang
    Wiesner, Matthew
    Watanabe, Shinji
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 2394 - 2398
  • [5] END-TO-END MULTI-MODAL SPEECH RECOGNITION WITH AIR AND BONE CONDUCTED SPEECH
    Chen, Junqi
    Wang, Mou
    Zhang, Xiao-Lei
    Huang, Zhiyong
    Rahardja, Susanto
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6052 - 6056
  • [6] End-to-end Knowledge Retrieval with Multi-modal Queries
    Luo, Man
    Fang, Zhiyuan
    Gokhale, Tejas
    Yang, Yezhou
    Baral, Chitta
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 8573 - 8589
  • [7] End-to-end Multi-modal Video Temporal Grounding
    Chen, Yi-Wen
    Tsai, Yi-Hsuan
    Yang, Ming-Hsuan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [8] Multi-modal Attention for Speech Emotion Recognition
    Pan, Zexu
    Luo, Zhaojie
    Yang, Jichen
    Li, Haizhou
    INTERSPEECH 2020, 2020, : 364 - 368
  • [9] End-to-End Multi-Modal Speech Recognition on an Air and Bone Conducted Speech Corpus
    Wang, Mou
    Chen, Junqi
    Zhang, Xiao-Lei
    Rahardja, Susanto
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 513 - 524
  • [10] End-to-End Multi-Modal Behavioral Context Recognition in a Real-Life Setting
    Saeed, Aaqib
    Ozcelebi, Tanir
    Trajanovski, Stojan
    Lukkien, Johan J.
    2019 22ND INTERNATIONAL CONFERENCE ON INFORMATION FUSION (FUSION 2019), 2019,