ATTA-NET: ATTENTION AGGREGATION NETWORK FOR AUDIO-VISUAL EMOTION RECOGNITION

Cited: 6
Authors
Fan, Ruijia [1 ]
Liu, Hong [1 ]
Li, Yidi [1 ,2 ]
Guo, Peini [1 ]
Wang, Guoquan [1 ]
Wang, Ti [1 ]
Affiliations
[1] Peking Univ, Shenzhen Grad Sch, Natl Key Lab Gen Artificial Intelligence, Shenzhen, Peoples R China
[2] Taiyuan Univ Technol, Coll Comp Sci & Technol, Taiyuan, Peoples R China
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024) | 2024
Funding
National Natural Science Foundation of China;
Keywords
Emotion Recognition; Audio-Visual Fusion; Attention Aggregation; Auxiliary Optimization;
DOI
10.1109/ICASSP48485.2024.10447640
CLC Classification
O42 [Acoustics];
Discipline Codes
070206 ; 082403 ;
Abstract
In video-based emotion recognition, effective multi-modal fusion techniques are essential to leverage the complementary relationship between audio and visual modalities. Recent attention-based fusion methods have been widely adopted to capture modal-shared properties. However, they often ignore the modal-specific properties of the audio and visual modalities and the misalignment of modal-shared emotional semantic features. In this paper, an Attention Aggregation Network (AttA-NET) is proposed to address these challenges. An attention aggregation module is introduced to effectively capture modal-shared properties. This module comprises similarity-aware enhancement blocks and a contrastive loss that facilitates aligning audio and visual semantic features. Moreover, an auxiliary uni-modal classifier is introduced to obtain modal-specific properties, through which intra-modal discriminative features are fully extracted. Under joint optimization of uni-modal and multi-modal classification losses, modal-specific information can be infused into the fused representation. Extensive experiments on the RAVDESS and PKU-ER datasets validate the superiority of AttA-NET. The code is available at: https://github.com/NariFan2002/AttA-NET.
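The abstract describes a joint objective combining a multi-modal classification loss, auxiliary uni-modal classification losses, and a contrastive loss that aligns audio and visual semantic features. The exact formulation is in the paper and code; the sketch below is only an illustrative reconstruction under common assumptions: an InfoNCE-style symmetric contrastive term over paired audio/visual embeddings, and illustrative weighting hyper-parameters `lam` and `mu` that are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize rows to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def contrastive_alignment_loss(audio, visual, temperature=0.1):
    """InfoNCE-style loss pulling matched audio/visual pairs together and
    pushing mismatched pairs apart (a stand-in for the paper's contrastive
    alignment term; the exact form used by AttA-NET is an assumption here)."""
    a = l2_normalize(audio)
    v = l2_normalize(visual)
    logits = a @ v.T / temperature            # (B, B) cross-modal similarities
    idx = np.arange(len(a))                   # diagonal entries are matched pairs

    def ce(lg):
        # cross-entropy with the matched pair as the positive class
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # symmetric: audio-to-visual rows and visual-to-audio columns
    return 0.5 * (ce(logits) + ce(logits.T))

def cross_entropy(logits, targets):
    """Standard multi-class cross-entropy over raw logits."""
    lg = logits - logits.max(axis=1, keepdims=True)
    logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def joint_loss(fused_logits, audio_logits, visual_logits,
               audio_emb, visual_emb, targets, lam=0.5, mu=0.5):
    """Joint optimization target: multi-modal loss plus auxiliary uni-modal
    losses and the contrastive alignment term (lam, mu are illustrative)."""
    return (cross_entropy(fused_logits, targets)
            + lam * (cross_entropy(audio_logits, targets)
                     + cross_entropy(visual_logits, targets))
            + mu * contrastive_alignment_loss(audio_emb, visual_emb))
```

In this sketch, correctly aligned audio/visual pairs yield a lower contrastive term than shuffled pairs, which is the behavior the alignment loss is meant to enforce.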
Pages: 8030-8034
Page count: 5
References
24 in total
[1]   Speech-Visual Emotion Recognition by Fusing Shared and Specific Features [J].
Chen, Guanghui ;
Jiao, Shuang .
IEEE SIGNAL PROCESSING LETTERS, 2023, 30 :678-682
[2]   Self-attention fusion for audiovisual emotion recognition with incomplete data [J].
Chumachenko, Kateryna ;
Iosifidis, Alexandros ;
Gabbouj, Moncef .
2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, :2822-2828
[3]   Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss [J].
Franceschini, Riccardo ;
Fini, Enrico ;
Beyan, Cigdem ;
Conti, Alessandro ;
Arrigoni, Federica ;
Ricci, Elisa .
2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, :2589-2596
[4]   Computer Network Intrusion Anomaly Detection with Recurrent Neural Network [J].
Fu, Zeyuan .
MOBILE INFORMATION SYSTEMS, 2022, 2022
[5]   CEPSTRAL ANALYSIS TECHNIQUE FOR AUTOMATIC SPEAKER VERIFICATION [J].
FURUI, S .
IEEE TRANSACTIONS ON ACOUSTICS SPEECH AND SIGNAL PROCESSING, 1981, 29 (02) :254-272
[6]   Audio-Visual Fusion Network Based on Conformer for Multimodal Emotion Recognition [J].
Guo, Peini ;
Chen, Zhengyan ;
Li, Yidi ;
Liu, Hong .
ARTIFICIAL INTELLIGENCE, CICAI 2022, PT II, 2022, 13605 :315-326
[7]   MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis [J].
Hazarika, Devamanyu ;
Zimmermann, Roger ;
Poria, Soujanya .
MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, :1122-1131
[8]  
Huang J, 2020, INT CONF ACOUST SPEE, P3507, DOI [10.1109/ICASSP40776.2020.9053762, 10.1109/icassp40776.2020.9053762]
[9]   Audio and Video-based Emotion Recognition using Multimodal Transformers [J].
John, Vijay ;
Kawanishi, Yasutomo .
2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, :2582-2588
[10]   MMTM: Multimodal Transfer Module for CNN Fusion [J].
Joze, Hamid Reza Vaezi ;
Shaban, Amirreza ;
Iuzzolino, Michael L. ;
Koishida, Kazuhito .
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020, :13286-13296