Speech Emotion Recognition via CNN-Transformer and multidimensional attention mechanism

Times Cited: 0
Authors
Tang, Xiaoyu [1 ,2 ]
Huang, Jiazheng [1 ]
Lin, Yixin [1 ]
Dang, Ting [3 ]
Cheng, Jintao [2 ]
Affiliations
[1] South China Normal Univ, Fac Engn, Sch Elect & Informat Engn, Foshan 528225, Guangdong, Peoples R China
[2] South China Normal Univ, Xingzhi Coll, Guangzhou 510000, Guangdong, Peoples R China
[3] Univ Melbourne, Melbourne, Australia
Funding
National Natural Science Foundation of China;
Keywords
Speech emotion recognition; Temporal-channel-spatial attention; Local-global feature fusion; Lightweight convolution transformer; FEATURES;
DOI
10.1016/j.specom.2025.103242
Chinese Library Classification
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
Speech Emotion Recognition (SER) is crucial in human-machine interaction. Previous approaches have predominantly focused on local spatial or channel information while neglecting the temporal information in speech. In this paper, to model local and global information at different levels of granularity in speech and to capture temporal, spatial, and channel dependencies in speech signals, we propose a Speech Emotion Recognition network based on a CNN-Transformer and multi-dimensional attention mechanisms. Specifically, a stack of CNN blocks captures local information in speech from a time-frequency perspective. In addition, a temporal-channel-spatial attention mechanism enhances features across three dimensions. Moreover, we model local and global dependencies of feature sequences using large-kernel depthwise separable convolutions and lightweight Transformer modules. We evaluate the proposed method on the IEMOCAP and Emo-DB datasets and show that our approach significantly outperforms state-of-the-art methods. Code: https://github.com/SCNU-RISLAB/CNN-Transforemr-and-MultidimensionalAttention-Mechanism.
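The temporal-channel-spatial attention described above can be sketched in NumPy. This is an illustrative approximation, not the paper's actual module: the mean-pooling and sigmoid gating choices, the axis layout `(time, channel, frequency)`, and the function name `multidim_attention` are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multidim_attention(x):
    """Gate a feature map along time, channel, and spatial axes in turn.

    x: feature map of shape (T, C, F) = (time, channel, frequency).
    Each stage pools over the other axes and rescales the map with
    sigmoid weights, so features are emphasized per-dimension.
    """
    # Temporal attention: pool over (C, F), weight each time step.
    t_w = sigmoid(x.mean(axis=(1, 2)))          # shape (T,)
    x = x * t_w[:, None, None]
    # Channel attention: pool over (T, F), weight each channel.
    c_w = sigmoid(x.mean(axis=(0, 2)))          # shape (C,)
    x = x * c_w[None, :, None]
    # Spatial attention: pool over channels, weight each (T, F) cell.
    s_w = sigmoid(x.mean(axis=1))               # shape (T, F)
    x = x * s_w[:, None, :]
    return x

feat = np.random.randn(8, 4, 16)                # toy (time, channel, freq) map
out = multidim_attention(feat)
```

Because every gate lies in (0, 1), the output preserves the input shape while attenuating each element according to all three dimensions at once.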
Pages: 13