Dynamic-Static Cross Attentional Feature Fusion Method for Speech Emotion Recognition

Cited by: 1
Authors
Dong, Ke [1]
Peng, Hao [2,3]
Che, Jie [1]
Affiliations
[1] Hefei Univ Technol, Hefei, Peoples R China
[2] Dalian Univ Technol, Dalian, Peoples R China
[3] Newcastle Univ, Newcastle, NSW, Australia
Source
MULTIMEDIA MODELING, MMM 2023, PT II | 2023, Vol. 13834
Keywords
Speech Emotion Recognition; Attention Mechanism; Feature Fusion; Multi-view Learning; Cross-corpus
DOI
10.1007/978-3-031-27818-1_29
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Dynamic-static fusion features play an important role in speech emotion recognition (SER). However, dynamic and static features are generally fused by simple addition or serial concatenation, which may discard part of the underlying emotional information. To address this issue, we propose a dynamic-static cross attentional feature fusion method (SD-CAFF) built on a cross attentional feature fusion mechanism (Cross AFF) that extracts superior deep dynamic-static fusion features. Specifically, Cross AFF fuses, in parallel, the deep features produced by a CNN/LSTM feature extraction module, which extracts deep static and deep dynamic features from acoustic features (MFCC, Delta, and Delta-delta). In addition to the SD-CAFF framework, we employ multi-task learning during training to further improve emotion recognition accuracy. Experiments on IEMOCAP show that SD-CAFF achieves a WA of 75.78% and a UA of 74.89%, outperforming current SOTA methods. Furthermore, SD-CAFF achieves competitive cross-corpus performance on MSP-IMPROV (WA: 56.77%; UA: 56.30%).
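The abstract does not give the exact formulation of Cross AFF, so the following is only a hypothetical numpy sketch of symmetric cross-attentional fusion between two feature streams: each stream (a stand-in for the CNN "static" features and the LSTM "dynamic" features) attends to the other via scaled dot-product attention, and the two attended views are averaged in parallel. All function names and the averaging step are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats):
    """Scaled dot-product attention where one view queries the other.
    q_feats: (Tq, d), kv_feats: (Tkv, d) -> attended output (Tq, d)."""
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)   # (Tq, Tkv)
    return softmax(scores, axis=-1) @ kv_feats   # (Tq, d)

def cross_aff(static_feats, dynamic_feats):
    """Hypothetical parallel cross-attentional fusion: static attends to
    dynamic, dynamic attends to static, and the two results are averaged."""
    s2d = cross_attention(static_feats, dynamic_feats)
    d2s = cross_attention(dynamic_feats, static_feats)
    return 0.5 * (s2d + d2s)

rng = np.random.default_rng(0)
static = rng.normal(size=(50, 64))    # stand-in for CNN (static) features, 50 frames
dynamic = rng.normal(size=(50, 64))   # stand-in for LSTM (dynamic) features, 50 frames
fused = cross_aff(static, dynamic)
print(fused.shape)  # (50, 64)
```

In a real system the two inputs would come from trained CNN and LSTM branches over MFCC/Delta/Delta-delta frames, and the fusion would use learned projections rather than raw features; this sketch only illustrates the parallel cross-attention pattern the abstract describes.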
Pages: 350-361 (12 pages)
Related Papers
20 total
[1]   MSP-IMPROV: An Acted Corpus of Dyadic Interactions to Study Emotion Perception [J].
Busso, Carlos ;
Parthasarathy, Srinivas ;
Burmania, Alec ;
AbdelWahab, Mohammed ;
Sadoughi, Najmeh ;
Provost, Emily Mower .
IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2017, 8 (01) :67-80
[2]   IEMOCAP: interactive emotional dyadic motion capture database [J].
Busso, Carlos ;
Bulut, Murtaza ;
Lee, Chi-Chun ;
Kazemzadeh, Abe ;
Mower, Emily ;
Kim, Samuel ;
Chang, Jeannette N. ;
Lee, Sungbok ;
Narayanan, Shrikanth S. .
LANGUAGE RESOURCES AND EVALUATION, 2008, 42 (04) :335-359
[3]   HIERARCHICAL NETWORK BASED ON THE FUSION OF STATIC AND DYNAMIC FEATURES FOR SPEECH EMOTION RECOGNITION [J].
Cao, Qi ;
Hou, Mixiao ;
Chen, Bingzhi ;
Zhang, Zheng ;
Lu, Guangming .
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, :6334-6338
[4]   Dynamic ReLU [J].
Chen, Yinpeng ;
Dai, Xiyang ;
Liu, Mengchen ;
Chen, Dongdong ;
Yuan, Lu ;
Liu, Zicheng .
COMPUTER VISION - ECCV 2020, PT XIX, 2020, 12364 :351-367
[5]   Attentional Feature Fusion [J].
Dai, Yimian ;
Gieseke, Fabian ;
Oehmcke, Stefan ;
Wu, Yiquan ;
Barnard, Kobus .
2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WACV 2021, 2021, :3559-3568
[6]   Gender differences in emotion recognition: Impact of sensory modality and emotional category [J].
Lambrecht, Lena ;
Kreifelts, Benjamin ;
Wildgruber, Dirk .
COGNITION & EMOTION, 2014, 28 (03) :452-469
[7]   Multi-Task Semi-Supervised Adversarial Autoencoding for Speech Emotion Recognition [J].
Latif, Siddique ;
Rana, Rajib ;
Khalifa, Sara ;
Jurdak, Raja ;
Epps, Julien ;
Schuller, Bjoern W. .
IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, 2022, 13 (02) :992-1004
[8]
Liu JX, 2020, INT CONF ACOUST SPEE, P7174, DOI 10.1109/ICASSP40776.2020.9053192
[9]   ATDA: Attentional temporal dynamic activation for speech emotion recognition [J].
Liu, Lu-Yao ;
Liu, Wen-Zhe ;
Zhou, Jian ;
Deng, Hui-Yuan ;
Feng, Lin .
KNOWLEDGE-BASED SYSTEMS, 2022, 243
[10]  
Lv Huilian, 2020, ICDSP 2020: Proceedings of the 2020 4th International Conference on Digital Signal Processing, P169, DOI 10.1145/3408127.3408192