A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition

Cited by: 171
Authors
Mustaqeem [1]
Kwon, Soonil [1]
Affiliations
[1] Sejong Univ, Dept Software, Interact Technol Lab, Seoul 05006, South Korea
Keywords
artificial intelligence; emotion recognition; neural networks; noise removal; spectrogram; signals enhancement
DOI
10.3390/s20010183
Chinese Library Classification number
O65 [Analytical Chemistry]
Subject classification codes
070302; 081704
Abstract
Speech is the most significant mode of communication among human beings and, captured with a microphone sensor, a potential method for human-computer interaction (HCI). Quantifiable emotion recognition from speech signals using these sensors is an emerging area of research in HCI, with applications such as human-robot interaction, virtual reality, behavior assessment, healthcare, and emergency call centers, where the speaker's emotional state is determined from an individual's speech. In this paper, we present two major contributions: (i) increasing the accuracy of speech emotion recognition (SER) compared to the state of the art and (ii) reducing the computational complexity of the presented SER model. We propose an artificial intelligence-assisted deep stride convolutional neural network (DSCNN) architecture using the plain nets strategy to learn salient and discriminative features from spectrograms of speech signals that are enhanced in prior steps to perform better. Local hidden patterns are learned in convolutional layers that use special strides, rather than pooling layers, to down-sample the feature maps, and global discriminative features are learned in fully connected layers. A SoftMax classifier is used for the classification of emotions in speech. The proposed technique is evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) datasets, where it improves accuracy by 7.85% and 4.5%, respectively, with the model size reduced by 34.5 MB. These results demonstrate the effectiveness and significance of the proposed SER technique and its applicability in real-world applications.
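The down-sampling idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' DSCNN: the 64x64 input size, the single 3x3 kernel, the stride of 2, the four emotion classes, and the random weights are all assumptions chosen for illustration. It only shows how a strided convolution shrinks a spectrogram-like feature map without a pooling layer, followed by one fully connected layer and a softmax.

```python
import numpy as np

def conv2d_strided(x, kernel, stride=2):
    """Valid 2-D cross-correlation of a single channel with a given stride."""
    kh, kw = kernel.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(x[r:r + kh, c:c + kw] * kernel)
    return out

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
spectrogram = rng.random((64, 64))       # assumed input size, stand-in data
kernel = rng.standard_normal((3, 3))     # assumed 3x3 filter, random weights

# Stride 2 roughly halves each spatial dimension, replacing a pooling layer.
features = np.maximum(conv2d_strided(spectrogram, kernel, stride=2), 0.0)  # ReLU

# One fully connected layer over the flattened map, then softmax.
num_classes = 4                          # assumed number of emotion classes
fc_weights = rng.standard_normal((num_classes, features.size)) * 0.01
probs = softmax(fc_weights @ features.ravel())

print(features.shape, probs.shape)
```

With a 3x3 kernel and stride 2, each output dimension is (64 - 3) // 2 + 1, so the feature map is reduced in the convolution itself and no separate pooling step is needed.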
Pages: 15