Multimodal emotion recognition based on audio and text by using hybrid attention networks

Cited by: 25
Authors
Zhang, Shiqing [1 ]
Yang, Yijiao [1 ,2 ]
Chen, Chen [1 ]
Liu, Ruixin [1 ,2 ]
Tao, Xin [1 ]
Guo, Wenping [1 ]
Xu, Yicheng [3 ]
Zhao, Xiaoming [1 ]
Affiliations
[1] Taizhou Univ, Inst Intelligent Informat Proc, Taizhou 318000, Zhejiang, Peoples R China
[2] Zhejiang Univ Sci & Technol, Sch Sci, Hangzhou 310023, Zhejiang, Peoples R China
[3] Taizhou Vocat & Tech Coll, Sch Informat Technol Engn, Taizhou 318000, Zhejiang, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multimodal emotion recognition; Deep learning; Local intra-modal attention; Cross-modal attention; Global inter-modal attention; NEURAL-NETWORKS; SPEECH; FEATURES;
DOI
10.1016/j.bspc.2023.105052
CLC number
R318 [Biomedical Engineering];
Discipline code
0831;
Abstract
Multimodal Emotion Recognition (MER) has recently become a popular and challenging topic. The key challenge in MER is how to effectively fuse multimodal information. Most prior works do not fully exploit intra-modal and inter-modal attention mechanisms to jointly learn intra-modal and inter-modal emotionally salient information, which limits the performance of MER. To address this problem, this paper proposes a new MER framework based on audio and text using Hybrid Attention Networks (MER-HAN). The proposed MER-HAN combines three attention mechanisms, i.e., local intra-modal attention, cross-modal attention, and global inter-modal attention, to effectively learn intra-modal and inter-modal emotionally salient features for MER. Specifically, an Audio and Text Encoder (ATE) block, equipped with deep learning techniques and the local intra-modal attention mechanism, is first designed to learn high-level audio and text feature representations from the corresponding audio and text sequences. Then, a Cross-Modal Attention (CMA) block is presented to jointly capture high-level shared feature representations across the audio and text modalities. Finally, a Multimodal Emotion Classification (MEC) block with the global inter-modal attention mechanism produces the final MER results. Extensive experiments on two public multimodal emotion datasets, i.e., the IEMOCAP and MELD datasets, demonstrate the advantage of the proposed MER-HAN on MER tasks.
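To make the three attention stages described in the abstract concrete, the following is a minimal PyTorch-style sketch of such a pipeline: BiGRU encoders with local intra-modal attention (the ATE block), cross-modal attention between audio and text (the CMA block), and global inter-modal attention for fusion and classification (the MEC block). All module choices, dimensions, and names are illustrative assumptions and do not reproduce the authors' implementation.

import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    # Multi-head attention where one sequence queries another; with
    # query == context it reduces to intra-modal (self-)attention.
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_seq, context_seq):
        attended, _ = self.attn(query_seq, context_seq, context_seq)
        return self.norm(query_seq + attended)  # residual + layer norm

class MERHANSketch(nn.Module):
    # Hypothetical sketch of the MER-HAN pipeline, not the authors' code.
    def __init__(self, audio_dim=128, text_dim=300, hidden=128, num_classes=4):
        super().__init__()
        dim = 2 * hidden  # BiGRU output dimension
        # ATE block: BiGRU encoders plus local intra-modal attention.
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True, bidirectional=True)
        self.text_enc = nn.GRU(text_dim, hidden, batch_first=True, bidirectional=True)
        self.audio_local = CrossModalAttention(dim)
        self.text_local = CrossModalAttention(dim)
        # CMA block: each modality attends to the other one.
        self.audio_to_text = CrossModalAttention(dim)
        self.text_to_audio = CrossModalAttention(dim)
        # MEC block: global inter-modal attention weighs the two pooled
        # modality vectors, followed by the emotion classifier.
        self.global_attn = nn.Linear(dim, 1)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio, text):
        a, _ = self.audio_enc(audio)                  # (B, Ta, dim)
        t, _ = self.text_enc(text)                    # (B, Tt, dim)
        a = self.audio_local(a, a)                    # local intra-modal attention
        t = self.text_local(t, t)
        a_ctx = self.audio_to_text(a, t)              # audio queries, text context
        t_ctx = self.text_to_audio(t, a)              # text queries, audio context
        a_vec, t_vec = a_ctx.mean(dim=1), t_ctx.mean(dim=1)
        stacked = torch.stack([a_vec, t_vec], dim=1)  # (B, 2, dim)
        weights = torch.softmax(self.global_attn(stacked), dim=1)
        fused = (weights * stacked).sum(dim=1)        # global inter-modal fusion
        return self.classifier(fused)

For example, MERHANSketch()(torch.randn(8, 200, 128), torch.randn(8, 40, 300)) returns logits of shape (8, 4); in practice the audio input would be frame-level acoustic features and the text input word embeddings.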
Pages: 10
Related papers
50 records in total
  • [1] Multimodal Speech Emotion Recognition using Cross Attention with Aligned Audio and Text
    Lee, Yoonhyung
    Yoon, Seunghyun
    Jung, Kyomin
    INTERSPEECH 2020, 2020, : 2717 - 2721
  • [2] MULTIMODAL SPEECH EMOTION RECOGNITION USING AUDIO AND TEXT
    Yoon, Seunghyun
    Byun, Seokhyun
    Jung, Kyomin
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 112 - 118
  • [3] Multimodal Emotion Recognition Using Transfer Learning on Audio and Text Data
    Deng, James J.
    Leung, Clement H. C.
    Li, Yuanxi
    COMPUTATIONAL SCIENCE AND ITS APPLICATIONS, ICCSA 2021, PT III, 2021, 12951 : 552 - 563
  • [4] A multimodal hierarchical approach to speech emotion recognition from audio and text
    Singh, Prabhav
    Srivastava, Ridam
    Rana, K. P. S.
    Kumar, Vineet
    KNOWLEDGE-BASED SYSTEMS, 2021, 229
  • [5] Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects
    Zhang, Shiqing
    Yang, Yijiao
    Chen, Chen
    Zhang, Xingnan
    Leng, Qingming
    Zhao, Xiaoming
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 237
  • [6] Audio-Visual Attention Networks for Emotion Recognition
    Lee, Jiyoung
    Kim, Sunok
    Kim, Seungryong
    Sohn, Kwanghoon
    AVSU'18: PROCEEDINGS OF THE 2018 WORKSHOP ON AUDIO-VISUAL SCENE UNDERSTANDING FOR IMMERSIVE MULTIMEDIA, 2018, : 27 - 32
  • [7] Audio and Video Bimodal Emotion Recognition in Social Networks Based on Improved AlexNet Network and Attention Mechanism
    Liu, Min
    Tang, Jun
    JOURNAL OF INFORMATION PROCESSING SYSTEMS, 2021, 17 (04): : 754 - 771
  • [8] A Survey of Deep Learning-Based Multimodal Emotion Recognition: Speech, Text, and Face
    Lian, Hailun
    Lu, Cheng
    Li, Sunan
    Zhao, Yan
    Tang, Chuangao
    Zong, Yuan
    ENTROPY, 2023, 25 (10)
  • [9] Group Gated Fusion on Attention-based Bidirectional Alignment for Multimodal Emotion Recognition
    Liu, Pengfei
    Li, Kun
    Meng, Helen
    INTERSPEECH 2020, 2020, : 379 - 383
  • [10] End-to-End Multimodal Emotion Recognition Using Deep Neural Networks
    Tzirakis, Panagiotis
    Trigeorgis, George
    Nicolaou, Mihalis A.
    Schuller, Bjorn W.
    Zafeiriou, Stefanos
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2017, 11 (08) : 1301 - 1309