Discriminative feature learning based on multi-view attention network with diffusion joint loss for speech emotion recognition

Cited: 3
Authors
Liu, Yang [1]
Chen, Xin [1]
Song, Yuan [1]
Li, Yarong [1]
Wang, Shengbei [2]
Yuan, Weitao [2]
Li, Yongwei [3]
Zhao, Zhen [1]
Affiliations
[1] Qingdao Univ Sci & Technol, Sch Informat Sci & Technol, Qingdao 266061, Peoples R China
[2] Tiangong Univ, Sch Comp Sci & Technol, Tianjin 300387, Peoples R China
[3] Chinese Acad Sci, Inst Psychol, CAS Key Lab Behav Sci, Beijing 100089, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Speech emotion recognition; Multi-view attention network; Diffusion joint loss;
DOI
10.1016/j.engappai.2024.109219
Chinese Library Classification (CLC)
TP [automation technology; computer technology];
Discipline code
0812;
Abstract
In speech emotion recognition, existing models often struggle to distinguish emotions with high similarity. In this paper, we propose a novel architecture that integrates a multi-view attention network (MVAN) with a diffusion joint loss to alleviate this confusion by placing a stronger focus on emotions that are hard to classify accurately. First, we use logarithmic Mel-spectrograms (log-Mels) together with their deltas and delta-deltas as three-channel input features to minimize external interference. Then, we design the MVAN to extract effective multi-time-scale emotion features: channel and spatial attention selectively localize the regions of the input features related to the target emotion, a multi-time-view bidirectional long short-term memory (BiLSTM) network extracts shallow edge features and deep semantic features, and multi-scale self-attention fuses these features through cross-scale attention fusion to obtain multi-time-scale emotion features. Finally, a diffusion joint loss strategy is introduced to separate emotional embeddings with high similarity, using complex emotion triplets generated in a diffusing fashion. We evaluated the proposed method on the Interactive Emotional Dyadic Motion Capture (IEMOCAP), Institute of Automation, Chinese Academy of Sciences (CASIA), and Berlin Emotional Speech Database (EMODB) corpora. The results show significant improvements over existing methods: 86.87% WA, 86.60% UA, and 86.82% WF1 on IEMOCAP; 70.74% WA, 70.74% UA, and 70.25% WF1 on CASIA; and 93.65% WA, 91.13% UA, and 92.26% WF1 on EMODB. Our code and model are available at https://github.com/Littleznnz/MVAN-DiffSEG.
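As a minimal sketch of the three-channel front end described in the abstract, the log-Mels and their first- and second-order temporal derivatives can be stacked with standard librosa calls. The sampling rate, FFT size, hop length, and Mel-band count below are illustrative assumptions, not the paper's reported configuration.

import librosa
import numpy as np

def three_channel_logmel(path, sr=16000, n_mels=64, n_fft=1024, hop_length=256):
    # Load audio and compute a Mel power spectrogram (parameter values are assumptions).
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)     # log-Mel spectrogram
    delta = librosa.feature.delta(log_mel, order=1)    # first temporal derivative
    delta2 = librosa.feature.delta(log_mel, order=2)   # second temporal derivative
    return np.stack([log_mel, delta, delta2], axis=0)  # shape: (3, n_mels, frames)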
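The diffusion joint loss is the paper's own contribution; as a generic sketch of the joint formulation such a strategy builds on, a cross-entropy classification term can be combined with a triplet margin term over emotion embeddings. The class name JointLoss and the weight alpha are hypothetical, and plain triplet inputs stand in for the paper's diffusion-generated complex triplets.

import torch
import torch.nn as nn

class JointLoss(nn.Module):
    # Hypothetical joint objective: classification loss plus a metric term that
    # pushes apart embeddings of easily confused emotions. The diffusion-based
    # triplet generation described in the abstract is not reproduced here.
    def __init__(self, margin=1.0, alpha=0.5):
        super().__init__()
        self.ce = nn.CrossEntropyLoss()
        self.triplet = nn.TripletMarginLoss(margin=margin)
        self.alpha = alpha  # assumed trade-off weight between the two terms

    def forward(self, logits, labels, anchor, positive, negative):
        return self.ce(logits, labels) + self.alpha * self.triplet(anchor, positive, negative)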
Pages: 15
Related papers
41 records in total
[1] Akinpelu S., Viriri S., Adegun A. An enhanced speech emotion recognition using vision transformer. Scientific Reports, 2024, 14(1).
[2] Bhangale K., Kothandaraman M. Speech emotion recognition using generative adversarial network and deep convolutional neural network. Circuits, Systems, and Signal Processing, 2024, 43(4): 2341-2384.
[3] Burkhardt F. A database of German emotional speech. INTERSPEECH, 2005, 5: 1517.
[4] Busso C., Bulut M., Lee C.-C., Kazemzadeh A., Mower E., Kim S., Chang J.N., Lee S., Narayanan S.S. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008, 42(4): 335-359.
[5] Chen Q., Huang G. A novel dual attention-based BLSTM with hybrid features in speech emotion recognition. Engineering Applications of Artificial Intelligence, 2021, 102.
[6] Gao Y., Liu J., Wang L., Dang J. Metric learning based feature representation with gated fusion model for speech emotion recognition. INTERSPEECH 2021: 4503-4507.
[7] Grassucci E., Marinoni C., Rodriguez A., Comminiello D. Diffusion models for audio semantic communication. IEEE ICASSP 2024: 13136-13140.
[8] Haider F., Luz S. Affect recognition through scalogram and multi-resolution cochleagram features. INTERSPEECH 2021: 4478-4482.
[9] Huang J., Li Y., Tao J., Lian Z. Speech emotion recognition from variable-length inputs with triplet loss function. INTERSPEECH 2018: 3673-3677.
[10] Jiang P., Fu H., Tao H., Lei P., Zhao L. Parallelized convolutional recurrent neural network with spectral features for speech emotion recognition. IEEE Access, 2019, 7: 90368-90377.