Dynamic-Static Cross Attentional Feature Fusion Method for Speech Emotion Recognition

Cited by: 1
Authors
Dong, Ke [1]
Peng, Hao [2,3]
Che, Jie [1]
Affiliations
[1] Hefei Univ Technol, Hefei, Peoples R China
[2] Dalian Univ Technol, Dalian, Peoples R China
[3] Newcastle Univ, Newcastle, NSW, Australia
Source
MULTIMEDIA MODELING, MMM 2023, PT II | 2023, Vol. 13834
Keywords
Speech Emotion Recognition; Attention Mechanism; Feature Fusion; Multi-view Learning; Cross-corpus
DOI
10.1007/978-3-031-27818-1_29
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Dynamic-static fusion features play an important role in speech emotion recognition (SER). However, dynamic and static features are generally fused by simple addition or serial concatenation, which may discard part of the underlying emotional information. To address this issue, we propose a dynamic-static cross attentional feature fusion method (SD-CAFF) built on a cross attentional feature fusion mechanism (Cross AFF) that extracts superior deep dynamic-static fusion features. Specifically, Cross AFF fuses, in parallel, the deep features produced by a CNN/LSTM feature extraction module, which extracts deep static and deep dynamic features from acoustic features (MFCC, Delta, and Delta-delta). In addition to the SD-CAFF framework, we employ multi-task learning during training to further improve emotion recognition accuracy. Experiments on IEMOCAP show that SD-CAFF achieves a WA of 75.78% and a UA of 74.89%, outperforming current SOTA methods. Furthermore, SD-CAFF achieves competitive cross-corpus performance on MSP-IMPROV (WA: 56.77%; UA: 56.30%).
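The abstract does not give the exact formulation of Cross AFF, so the following is only a hypothetical numpy sketch of symmetric cross-attentional fusion between two feature streams: each stream (a stand-in for the CNN "static" features and the LSTM "dynamic" features) attends to the other via scaled dot-product attention, and the two attended views are averaged in parallel. All function names and the averaging step are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats):
    """Scaled dot-product attention where one view queries the other.
    q_feats: (Tq, d), kv_feats: (Tkv, d) -> attended output (Tq, d)."""
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)   # (Tq, Tkv)
    return softmax(scores, axis=-1) @ kv_feats   # (Tq, d)

def cross_aff(static_feats, dynamic_feats):
    """Hypothetical parallel cross-attentional fusion: static attends to
    dynamic, dynamic attends to static, and the two results are averaged."""
    s2d = cross_attention(static_feats, dynamic_feats)
    d2s = cross_attention(dynamic_feats, static_feats)
    return 0.5 * (s2d + d2s)

rng = np.random.default_rng(0)
static = rng.normal(size=(50, 64))    # stand-in for CNN (static) features, 50 frames
dynamic = rng.normal(size=(50, 64))   # stand-in for LSTM (dynamic) features, 50 frames
fused = cross_aff(static, dynamic)
print(fused.shape)  # (50, 64)
```

In a real system the two inputs would come from trained CNN and LSTM branches over MFCC/Delta/Delta-delta frames, and the fusion would use learned projections rather than raw features; this sketch only illustrates the parallel cross-attention pattern the abstract describes.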
Pages: 350-361 (12 pages)
Related Papers
20 total
[1]   MSP-IMPROV: An Acted Corpus of Dyadic Interactions to Study Emotion Perception [J].
Busso, Carlos ;
Parthasarathy, Srinivas ;
Burmania, Alec ;
AbdelWahab, Mohammed ;
Sadoughi, Najmeh ;
Provost, Emily Mower .
IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2017, 8 (01) :67-80
[2]   IEMOCAP: interactive emotional dyadic motion capture database [J].
Busso, Carlos ;
Bulut, Murtaza ;
Lee, Chi-Chun ;
Kazemzadeh, Abe ;
Mower, Emily ;
Kim, Samuel ;
Chang, Jeannette N. ;
Lee, Sungbok ;
Narayanan, Shrikanth S. .
LANGUAGE RESOURCES AND EVALUATION, 2008, 42 (04) :335-359
[3]   HIERARCHICAL NETWORK BASED ON THE FUSION OF STATIC AND DYNAMIC FEATURES FOR SPEECH EMOTION RECOGNITION [J].
Cao, Qi ;
Hou, Mixiao ;
Chen, Bingzhi ;
Zhang, Zheng ;
Lu, Guangming .
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, :6334-6338
[4]   Dynamic ReLU [J].
Chen, Yinpeng ;
Dai, Xiyang ;
Liu, Mengchen ;
Chen, Dongdong ;
Yuan, Lu ;
Liu, Zicheng .
COMPUTER VISION - ECCV 2020, PT XIX, 2020, 12364 :351-367
[5]   Attentional Feature Fusion [J].
Dai, Yimian ;
Gieseke, Fabian ;
Oehmcke, Stefan ;
Wu, Yiquan ;
Barnard, Kobus .
2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WACV 2021, 2021, :3559-3568
[6]   Gender differences in emotion recognition: Impact of sensory modality and emotional category [J].
Lambrecht, Lena ;
Kreifelts, Benjamin ;
Wildgruber, Dirk .
COGNITION & EMOTION, 2014, 28 (03) :452-469
[7]   Multi-Task Semi-Supervised Adversarial Autoencoding for Speech Emotion Recognition [J].
Latif, Siddique ;
Rana, Rajib ;
Khalifa, Sara ;
Jurdak, Raja ;
Epps, Julien ;
Schuller, Bjoern W. .
IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2022, 13 (02) :992-1004
[8]
Liu JX, 2020, INT CONF ACOUST SPEE, P7174, DOI 10.1109/ICASSP40776.2020.9053192
[9]   ATDA: Attentional temporal dynamic activation for speech emotion recognition [J].
Liu, Lu-Yao ;
Liu, Wen-Zhe ;
Zhou, Jian ;
Deng, Hui-Yuan ;
Feng, Lin .
KNOWLEDGE-BASED SYSTEMS, 2022, 243
[10]  
Lv Huilian, 2020, ICDSP 2020: Proceedings of the 2020 4th International Conference on Digital Signal Processing, P169, DOI 10.1145/3408127.3408192