Speech Emotion Recognition via CNN-Transformer and multidimensional attention mechanism

Cited by: 0
Authors
Tang, Xiaoyu [1 ,2 ]
Huang, Jiazheng [1 ]
Lin, Yixin [1 ]
Dang, Ting [3 ]
Cheng, Jintao [2 ]
Affiliations
[1] South China Normal Univ, Fac Engn, Sch Elect & Informat Engn, Foshan 528225, Guangdong, Peoples R China
[2] South China Normal Univ, Xingzhi Coll, Guangzhou 510000, Guangdong, Peoples R China
[3] Univ Melbourne, Melbourne, Australia
Funding
National Natural Science Foundation of China;
Keywords
Speech emotion recognition; Temporal-channel-spatial attention; Local-global feature fusion; Lightweight convolution transformer; FEATURES;
DOI
10.1016/j.specom.2025.103242
Chinese Library Classification
O42 [Acoustics];
Discipline Codes
070206; 082403;
Abstract
Speech Emotion Recognition (SER) is crucial in human-machine interaction. Previous approaches have predominantly focused on local spatial or channel information while neglecting the temporal information in speech. In this paper, to model local and global information at different levels of granularity and to capture temporal, spatial, and channel dependencies in speech signals, we propose a Speech Emotion Recognition network based on a CNN-Transformer and multidimensional attention mechanisms. Specifically, a stack of CNN blocks captures local information in speech from a time-frequency perspective, and a temporal-channel-spatial attention mechanism enhances features across these three dimensions. Moreover, we model local and global dependencies of the feature sequences using large-kernel depthwise separable convolutions and lightweight Transformer modules. We evaluate the proposed method on the IEMOCAP and Emo-DB datasets and show that it significantly outperforms state-of-the-art methods. Code is available at https://github.com/SCNU-RISLAB/CNN-Transforemr-and-MultidimensionalAttention-Mechanism.
Pages: 13
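
To make the architecture summarized in the abstract concrete, below is a minimal PyTorch sketch of a temporal-channel-spatial attention block operating on (batch, channel, time, frequency) spectrogram feature maps. The module name, kernel sizes, reduction ratio, and ordering of the three attention stages are illustrative assumptions, not the paper's exact design.

import torch
import torch.nn as nn

class MultiDimAttention(nn.Module):
    """Sketch: re-weight a (B, C, T, F) feature map along the channel,
    temporal, and spatial axes in sequence (assumed ordering)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: squeeze over (T, F), excite per channel.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Temporal attention: score each frame from a pooled 1-D descriptor.
        self.temporal_gate = nn.Sequential(
            nn.Conv1d(1, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        # Spatial attention: a 2-D gate over the time-frequency plane.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1) Channel re-weighting: (B, C, 1, 1) gate, broadcast over T and F.
        x = x * self.channel_gate(x)
        # 2) Temporal re-weighting: average over C and F -> (B, 1, T) gate.
        t_desc = x.mean(dim=(1, 3)).unsqueeze(1)
        x = x * self.temporal_gate(t_desc).unsqueeze(-1)
        # 3) Spatial re-weighting: mean+max channel descriptors -> (B, 2, T, F).
        s_desc = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * self.spatial_gate(s_desc)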
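
In the same spirit, here is a hedged sketch of the local-global fusion step: a large-kernel depthwise separable 1-D convolution supplies wide local context cheaply, and a single lightweight Transformer encoder layer models global dependencies across frames. The feature dimension, kernel size, and head count are assumptions chosen for illustration.

import torch
import torch.nn as nn

class LocalGlobalBlock(nn.Module):
    def __init__(self, dim: int = 128, kernel_size: int = 31, heads: int = 4):
        super().__init__()
        # Depthwise conv (groups=dim) with a large kernel captures wide local
        # context; the 1x1 pointwise conv mixes channels (depthwise separable).
        self.local = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
            nn.Conv1d(dim, dim, kernel_size=1),
            nn.BatchNorm1d(dim),
            nn.GELU(),
        )
        # One shallow encoder layer stands in for the "lightweight Transformer".
        self.global_mixer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim,
            batch_first=True, norm_first=True,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim) sequence of frame-level features.
        x = x + self.local(x.transpose(1, 2)).transpose(1, 2)  # local, residual
        return self.global_mixer(x)                            # global mixing

Chaining MultiDimAttention over the CNN feature maps and LocalGlobalBlock over the flattened frame sequence mirrors the local-then-global ordering the abstract describes.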