3D CNN-Based Speech Emotion Recognition Using K-Means Clustering and Spectrograms

Cited by: 108
Authors
Hajarolasvadi, Noushin [1 ]
Demirel, Hasan [1 ]
Affiliation
[1] Eastern Mediterranean Univ, Dept Elect & Elect Engn, Via Mersin 10, TR-99628 Gazimagusa, North Cyprus, Turkey
Keywords
speech emotion recognition; 3D convolutional neural networks; deep learning; k-means clustering; spectrograms; convolutional neural networks
DOI
10.3390/e21050479
Chinese Library Classification
O4 [Physics]
Subject Classification Code
0702
Abstract
Detecting human intentions and emotions helps improve human-robot interaction. Emotion recognition has been a challenging research direction in the past decade. This paper proposes an emotion recognition system based on the analysis of speech signals. First, we split each speech signal into overlapping frames of equal length. Next, for each frame we extract an 88-dimensional vector of audio features including Mel Frequency Cepstral Coefficients (MFCC), pitch, and intensity. In parallel, the spectrogram of each frame is generated. In the final preprocessing step, we apply k-means clustering to the extracted features of all frames of each audio signal and select the k most discriminant frames, called keyframes, to summarize the speech signal. The sequence of spectrograms corresponding to the keyframes is then encapsulated in a 3D tensor. These tensors are used to train and test a 3D Convolutional Neural Network (CNN) with 10-fold cross-validation. The proposed 3D CNN has two convolutional layers and one fully connected layer. Experiments are conducted on the Surrey Audio-Visual Expressed Emotion (SAVEE), Ryerson Multimedia Laboratory (RML), and eNTERFACE'05 databases. The results are superior to state-of-the-art methods reported in the literature.
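The preprocessing pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: librosa and scikit-learn are assumed, a small descriptor (13 MFCCs plus mean pitch and mean RMS intensity) stands in for the paper's 88-dimensional feature vector, and a "keyframe" is read as the frame nearest each k-means centroid, a detail the abstract leaves open. The function name select_keyframes and all hyperparameters (frame length, hop, k, n_fft) are illustrative.

```python
# Sketch of the keyframe-selection preprocessing (assumptions noted above).
import numpy as np
import librosa
from sklearn.cluster import KMeans

def select_keyframes(path, k=9, frame_sec=0.5, hop_sec=0.25, sr=16000):
    """Summarize one utterance as a 3D tensor of k keyframe spectrograms."""
    y, sr = librosa.load(path, sr=sr)
    frame_len, hop_len = int(frame_sec * sr), int(hop_sec * sr)
    # Split the signal into overlapping frames of equal length.
    # Assumes the clip is long enough to yield at least k frames.
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop_len).T

    # Per-frame features: MFCCs, pitch, and an RMS-based intensity proxy
    # (a stand-in for the paper's 88-dimensional feature vector).
    feats = []
    for f in frames:
        mfcc = librosa.feature.mfcc(y=f, sr=sr, n_mfcc=13).mean(axis=1)
        f0 = librosa.yin(f, fmin=50, fmax=400, sr=sr)   # pitch contour
        rms = librosa.feature.rms(y=f)                  # intensity proxy
        feats.append(np.concatenate([mfcc, [f0.mean()], [rms.mean()]]))
    feats = np.asarray(feats)

    # Cluster all frames of the utterance and keep the frame nearest each
    # centroid (one reading of "k most discriminant frames").
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(feats)
    key_idx = sorted(int(np.argmin(np.linalg.norm(feats - c, axis=1)))
                     for c in km.cluster_centers_)

    # Encapsulate the keyframes' log-power spectrograms in a 3D tensor.
    specs = [librosa.power_to_db(np.abs(librosa.stft(frames[i], n_fft=512)) ** 2)
             for i in key_idx]
    return np.stack(specs)              # shape: (k, n_freq_bins, n_time_bins)
```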
Pages: 17
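The abstract fixes only the depth of the network: two convolutional layers followed by a single fully connected layer. Below is a minimal PyTorch sketch consistent with that description; the input shape (k = 9 keyframe spectrograms, each 128 x 128), channel counts, kernel sizes, pooling, and the 6-class output (as in RML and eNTERFACE'05) are assumptions, not the paper's reported settings.

```python
# Minimal 3D CNN sketch: two conv layers + one fully connected layer,
# under assumed shapes and hyperparameters.
import torch
import torch.nn as nn

class Emotion3DCNN(nn.Module):
    def __init__(self, n_classes=6, k=9, freq=128, time=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1),   # conv layer 1
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                      # pool freq/time only
            nn.Conv3d(16, 32, kernel_size=3, padding=1),  # conv layer 2
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
        )
        # Single fully connected layer mapping flattened features to classes.
        self.classifier = nn.Linear(32 * k * (freq // 4) * (time // 4), n_classes)

    def forward(self, x):               # x: (batch, 1, k, freq, time)
        x = self.features(x)
        return self.classifier(x.flatten(1))

# Usage: one tensor of k keyframe spectrograms per utterance.
model = Emotion3DCNN()
logits = model(torch.randn(4, 1, 9, 128, 128))   # -> shape (4, 6)
```

In practice, each tensor returned by the preprocessing step would be normalized, resized to the assumed 128 x 128 spectrogram shape, and fed to this network under the paper's 10-fold cross-validation protocol.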