ATTA-NET: ATTENTION AGGREGATION NETWORK FOR AUDIO-VISUAL EMOTION RECOGNITION

Cited: 6
Authors
Fan, Ruijia [1 ]
Liu, Hong [1 ]
Li, Yidi [1 ,2 ]
Guo, Peini [1 ]
Wang, Guoquan [1 ]
Wang, Ti [1 ]
Affiliations
[1] Peking Univ, Shenzhen Grad Sch, Natl Key Lab Gen Artificial Intelligence, Shenzhen, Peoples R China
[2] Taiyuan Univ Technol, Coll Comp Sci & Technol, Taiyuan, Peoples R China
Source
2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024) | 2024
Funding
National Natural Science Foundation of China;
Keywords
Emotion Recognition; Audio-Visual Fusion; Attention Aggregation; Auxiliary Optimization;
DOI
10.1109/ICASSP48485.2024.10447640
CLC Classification
O42 [Acoustics];
Discipline Codes
070206 ; 082403 ;
Abstract
In video-based emotion recognition, effective multi-modal fusion techniques are essential to leverage the complementary relationship between audio and visual modalities. Recent attention-based fusion methods have been widely adopted to capture modal-shared properties. However, they often ignore the modal-specific properties of the audio and visual modalities and the misalignment of modal-shared emotional semantic features. In this paper, an Attention Aggregation Network (AttA-NET) is proposed to address these challenges. An attention aggregation module is introduced to effectively capture modal-shared properties. This module comprises similarity-aware enhancement blocks and a contrastive loss that facilitates aligning audio and visual semantic features. Moreover, an auxiliary uni-modal classifier is introduced to obtain modal-specific properties, through which intra-modal discriminative features are fully extracted. Under joint optimization of uni-modal and multi-modal classification losses, modal-specific information can be infused into the fused representation. Extensive experiments on the RAVDESS and PKU-ER datasets validate the superiority of AttA-NET. The code is available at: https://github.com/NariFan2002/AttA-NET.
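The abstract describes a joint objective combining a multi-modal classification loss, auxiliary uni-modal classification losses, and a contrastive loss that aligns audio and visual semantic features. The exact formulation is in the paper and code; the sketch below is only an illustrative reconstruction under common assumptions: an InfoNCE-style symmetric contrastive term over paired audio/visual embeddings, and illustrative weighting hyper-parameters `lam` and `mu` that are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize rows to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def contrastive_alignment_loss(audio, visual, temperature=0.1):
    """InfoNCE-style loss pulling matched audio/visual pairs together and
    pushing mismatched pairs apart (a stand-in for the paper's contrastive
    alignment term; the exact form used by AttA-NET is an assumption here)."""
    a = l2_normalize(audio)
    v = l2_normalize(visual)
    logits = a @ v.T / temperature            # (B, B) cross-modal similarities
    idx = np.arange(len(a))                   # diagonal entries are matched pairs

    def ce(lg):
        # cross-entropy with the matched pair as the positive class
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # symmetric: audio-to-visual rows and visual-to-audio columns
    return 0.5 * (ce(logits) + ce(logits.T))

def cross_entropy(logits, targets):
    """Standard multi-class cross-entropy over raw logits."""
    lg = logits - logits.max(axis=1, keepdims=True)
    logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def joint_loss(fused_logits, audio_logits, visual_logits,
               audio_emb, visual_emb, targets, lam=0.5, mu=0.5):
    """Joint optimization target: multi-modal loss plus auxiliary uni-modal
    losses and the contrastive alignment term (lam, mu are illustrative)."""
    return (cross_entropy(fused_logits, targets)
            + lam * (cross_entropy(audio_logits, targets)
                     + cross_entropy(visual_logits, targets))
            + mu * contrastive_alignment_loss(audio_emb, visual_emb))
```

In this sketch, correctly aligned audio/visual pairs yield a lower contrastive term than shuffled pairs, which is the behavior the alignment loss is meant to enforce.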
Pages: 8030-8034
Page count: 5
References
24 in total
[1]   Speech-Visual Emotion Recognition by Fusing Shared and Specific Features [J].
Chen, Guanghui ;
Jiao, Shuang .
IEEE SIGNAL PROCESSING LETTERS, 2023, 30 :678-682
[2]   Self-attention fusion for audiovisual emotion recognition with incomplete data [J].
Chumachenko, Kateryna ;
Iosifidis, Alexandros ;
Gabbouj, Moncef .
2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, :2822-2828
[3]   Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss [J].
Franceschini, Riccardo ;
Fini, Enrico ;
Beyan, Cigdem ;
Conti, Alessandro ;
Arrigoni, Federica ;
Ricci, Elisa .
2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, :2589-2596
[4]   Computer Network Intrusion Anomaly Detection with Recurrent Neural Network [J].
Fu, Zeyuan .
MOBILE INFORMATION SYSTEMS, 2022, 2022
[5]   CEPSTRAL ANALYSIS TECHNIQUE FOR AUTOMATIC SPEAKER VERIFICATION [J].
FURUI, S .
IEEE TRANSACTIONS ON ACOUSTICS SPEECH AND SIGNAL PROCESSING, 1981, 29 (02) :254-272
[6]   Audio-Visual Fusion Network Based on Conformer for Multimodal Emotion Recognition [J].
Guo, Peini ;
Chen, Zhengyan ;
Li, Yidi ;
Liu, Hong .
ARTIFICIAL INTELLIGENCE, CICAI 2022, PT II, 2022, 13605 :315-326
[7]   MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis [J].
Hazarika, Devamanyu ;
Zimmermann, Roger ;
Poria, Soujanya .
MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, :1122-1131
[8]  
Huang J, 2020, INT CONF ACOUST SPEE, P3507, DOI [10.1109/ICASSP40776.2020.9053762, 10.1109/icassp40776.2020.9053762]
[9]   Audio and Video-based Emotion Recognition using Multimodal Transformers [J].
John, Vijay ;
Kawanishi, Yasutomo .
2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, :2582-2588
[10]   MMTM: Multimodal Transfer Module for CNN Fusion [J].
Joze, Hamid Reza Vaezi ;
Shaban, Amirreza ;
Iuzzolino, Michael L. ;
Koishida, Kazuhito .
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020, :13286-13296