CCTG-NET: Contextualized Convolutional Transformer-GRU Network for speech emotion recognition

Cited by: 4
Authors
Tellai M. [1]
Mao Q. [1, 2]
Affiliations
[1] Department of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, Jiangsu Province
[2] Jiangsu Engineering Research Center of Big Data Ubiquitous Perception and Intelligent Agriculture Applications, Zhenjiang, Jiangsu Province
Funding
National Natural Science Foundation of China;
Keywords
Conversational speech; Convolutional neural network; Gated recurrent unit; Mel-spectrogram; Speech emotion recognition; Transformer;
DOI
10.1007/s10772-023-10080-7
Abstract
Speech is a crucial aspect of human-to-human interaction and plays a fundamental role in the advancement of human–computer interaction (HCI) systems. Developing an accurate speech emotion recognition (SER) system for human conversations is a critical yet challenging task. Existing state-of-the-art (SOTA) research in SER primarily focuses on modeling vocal information within individual conversational speech utterances, overlooking the significance of incorporating transactional information from the interaction context. In this paper, we present a novel Contextualized Convolutional Transformer-GRU Network (CCTG-Net) for recognizing speech emotions from Mel-spectrogram features, effectively integrating contextual information for emotion recognition. Our experiments are conducted on the widely used emotional benchmark dataset IEMOCAP. Compared to SOTA methods in four-class emotion recognition, our proposed model achieves a weighted accuracy (WA) of 88.4% and an unweighted accuracy (UA) of 89.1%. This marks a substantial 3.0% improvement in UA while maintaining an optimal balance between performance and complexity. © 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
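The record gives only the component names (CNN front end over Mel-spectrograms, Transformer encoder, GRU), not the implementation. The following is a minimal PyTorch sketch of such a Conv + Transformer + GRU pipeline for four-class SER; the layer counts, kernel sizes, pooling choices, and the class name ConvTransformerGRU are illustrative assumptions, not the paper's exact CCTG-Net configuration.

import torch
import torch.nn as nn

class ConvTransformerGRU(nn.Module):
    def __init__(self, n_mels=64, n_classes=4, d_model=128):
        super().__init__()
        # CNN stage: local spectro-temporal feature extraction.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),          # halve both mel and time axes
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        conv_dim = 64 * (n_mels // 4)      # channels x reduced mel bins
        self.proj = nn.Linear(conv_dim, d_model)
        # Transformer stage: self-attention over the time axis.
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=256, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        # GRU stage: sequential aggregation into an utterance representation.
        self.gru = nn.GRU(d_model, d_model, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, mel):                    # mel: (batch, 1, n_mels, time)
        x = self.conv(mel)                     # (batch, 64, n_mels/4, time/4)
        x = x.permute(0, 3, 1, 2).flatten(2)   # (batch, time/4, conv_dim)
        x = self.proj(x)
        x = self.transformer(x)
        _, h = self.gru(x)                     # final hidden state per direction
        h = torch.cat([h[0], h[1]], dim=-1)    # (batch, 2 * d_model)
        return self.classifier(h)              # logits over the emotion classes

# Example: a batch of 8 utterances, 64 mel bins, 400 frames.
logits = ConvTransformerGRU()(torch.randn(8, 1, 64, 400))
print(logits.shape)                            # torch.Size([8, 4])

The permute-and-flatten step converts the CNN's 2-D feature maps into a frame-level sequence, so the Transformer attends over time and the GRU condenses that contextualized sequence into a single utterance-level vector for classification.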
Pages: 1099-1116
Number of pages: 17
References
65 references in total
[1]  
Afrillia Y., Mawengkang H., Ramli M., Fhonna R.P., Performance measurement of Mel frequency cepstral coefficient (MFCC) method in learning system of Al-Qur’an based on nagham pattern recognition, Journal of Physics: Conference Series, 930, (2017)
[2]  
Aftab A., Morsali A., Ghaemmaghami S., Champagne B., LIGHT-SERNET: A lightweight fully convolutional neural network for speech emotion recognition, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6912-6916, (2022)
[3]  
Anagnostopoulos C.-N., Iliou T., Giannoukos I., Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011, Artificial Intelligence Review, 43, 2, pp. 155-177, (2015)
[4]  
Araujo A., Norris W., Sim J., Computing receptive fields of convolutional neural networks, Distill, 4, 11, (2019)
[5]  
Barsade S.G., The ripple effect: Emotional contagion and its influence on group behavior, Administrative Science Quarterly, 47, 4, pp. 644-675, (2002)
[6]  
Bingol M.C., Aydogmus O., Performing predefined tasks using the human–robot interaction on speech recognition for an industrial robot, Engineering Applications of Artificial Intelligence, 95, (2020)
[7]  
Bone D., Lee C.-C., Chaspari T., Gibson J., Narayanan S., Signal processing and machine learning for mental health research and clinical applications [perspectives], IEEE Signal Processing Magazine, 34, 5, pp. 195-196, (2017)
[8]  
Busso C., Bulut M., Lee C.-C., Kazemzadeh A., Mower E., Kim S., Chang J.N., Lee S., Narayanan S.S., IEMOCAP: Interactive emotional dyadic motion capture database, Language Resources and Evaluation, 42, pp. 335-359, (2008)
[9]  
Chen M., He X., Yang J., Zhang H., 3-D convolutional recurrent neural networks with attention model for speech emotion recognition, IEEE Signal Processing Letters, 25, 10, pp. 1440-1444, (2018)
[10]  
Chung J., Gulcehre C., Cho K., Bengio Y., Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint arXiv:1412.3555, (2014)