Enhancing Speech Emotion Recognition With Conditional Emotion Feature Diffusion and Progressive Interleaved Learning Strategy

Times cited: 0

Authors:

Liu, Yang ^{[1]}

Chen, Xin ^{[1]}

Peng, Zhichao ^{[2]}

Li, Yongwei ^{[3]}

Li, Xingfeng ^{[4]}

Song, Peng ^{[5]}

Unoki, Masashi ^{[6]}

Zhao, Zhen ^{[1]}

Affiliations:

[1] Qingdao Univ Sci & Technol, Sch Informat Sci & Technol, Qingdao 266061, Peoples R China

[2] Hunan Univ Humanities Sci & Technol, Sch Informat, Loudi 417000, Peoples R China

[3] Chinese Acad Sci, Inst Psychol, CAS Key Lab Behav Sci, Beijing 100089, Peoples R China

[4] City Univ Macau, Fac Data Sci, Macau 999078, Peoples R China

[5] Yantai Univ, Sch Comp & Control Engn, Yantai 264005, Peoples R China

[6] Japan Adv Inst Sci & Technol, Sch Informat Sci, Nomi 9231292, Japan

Source:

IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2025 / Vol. 33

Funding:

National Natural Science Foundation of China;

Keywords:

Feature extraction; Emotion recognition; Training; Three-dimensional displays; Diffusion models; Speech enhancement; Electronic mail; Speech recognition; Semantics; Face recognition; Speech emotion recognition; multi-resolution features; diffusion model; GENERATION;

DOI:

10.1109/TASLPRO.2025.3561606

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206; 082403;

Abstract:

Speech emotion recognition (SER) aims to identify the speaker's emotional states in specific utterances accurately. However, existing methods still face feature confusion when attempting to recognize certain emotions because traditional acoustic feature extraction methods fail to capture dynamic emotional changes, blurring emotional boundaries. Additionally, existing classification networks (CNs) are constrained by fixed learning strategies, hindering their ability to capture subtle emotional nuances and resulting in label confusion. To address these two issues, we introduce 3D multiresolution modulation filtered cochleogram (MMCG) features by computing the deltas and delta-deltas of MMCG features to enhance the dynamic emotional changes and produce distinct emotional boundaries. We then customize a conditional emotion feature diffusion (CEFD) module, which progressively diffuses features based on emotional context to retain emotional nuances effectively and reduce reliance on conditioned information. In addition, a confidence filtering module is used to filter diffused features based on confidence-based posterior probabilities to ensure enhanced feature discrimination. We design a flexible training strategy named the progressive interleaved learning strategy (PILS) to learn further complex emotional nuances, which consists of two alternating stages: fine-tuning the CN parameters and supervising the CEFD output. Testing on the IEMOCAP, CASIA, and EMODB corpora demonstrates significant performance improvements in SER.
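The abstract describes building 3D features by adding deltas and delta-deltas to the MMCG features. A minimal sketch of that stacking step follows, assuming the standard regression-based delta formula and a random placeholder array in place of real MMCG features. The function names, the window `width`, and the array sizes are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def delta(features: np.ndarray, width: int = 2) -> np.ndarray:
    """Regression-based delta along the time axis.

    features: array of shape (channels, frames).
    width: frames used on each side of the current frame (assumed value).
    """
    # Repeat the edge frames so the output keeps the same number of frames.
    padded = np.pad(features, ((0, 0), (width, width)), mode="edge")
    frames = features.shape[1]
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros(features.shape, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[:, width + n : width + n + frames]
        behind = padded[:, width - n : width - n + frames]
        out += n * (ahead - behind)
    return out / denom


def stack_static_delta_delta2(features: np.ndarray) -> np.ndarray:
    """Stack static, delta, and delta-delta maps into shape (3, channels, frames)."""
    d1 = delta(features)
    d2 = delta(d1)  # delta-delta is the delta of the delta
    return np.stack([features, d1, d2], axis=0)


# Placeholder input: 64 frequency channels by 100 frames of random values,
# standing in for MMCG features, whose extraction is not shown here.
rng = np.random.default_rng(0)
placeholder = rng.standard_normal((64, 100))
cube = stack_static_delta_delta2(placeholder)
print(cube.shape)
```

The three stacked maps can then be treated like the channels of an image by a convolutional classifier, which is one common way such 3D inputs are consumed.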

Pages: 1787-1800

Page count: 14

