3D Convolutional Neural Network for Speech Emotion Recognition With Its Realization on Intel CPU and NVIDIA GPU

Cited by: 6
Authors
Falahzadeh, Mohammad Reza [1 ]
Farsa, Edris Zaman [2 ]
Harimi, Ali [3 ]
Ahmadi, Arash [4 ]
Abraham, Ajith [5 ,6 ]
Affiliations
[1] Islamic Azad Univ, Dept Elect Engn, Cent Tehran Branch, Tehran 1477893855, Iran
[2] Islamic Azad Univ, Dept Comp Engn, Sanandaj Branch, Sanandaj 6134937333, Iran
[3] Islamic Azad Univ, Dept Elect Engn, Shahrood Branch, Shahrood 3619943189, Iran
[4] Carleton Univ, Dept Elect, Ottawa, ON K1S 5B6, Canada
[5] Machine Intelligence Res Labs MIR Labs, Auburn, WA 98071 USA
[6] Innopolis Univ, Ctr Artificial Intelligence, Innopolis 420500, Russia
Source
IEEE ACCESS | 2022, Vol. 10
Keywords
Three-dimensional displays; Speech recognition; Mutual information; Emotion recognition; Tensors; Image reconstruction; Feature extraction; 3D convolutional neural networks (3D CNNs); speech emotion recognition; reconstructed phase space; 3D tensor; CLASSIFICATION; MODEL;
DOI
10.1109/ACCESS.2022.3217226
Chinese Library Classification (CLC) Number
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Convolutional neural networks (CNNs) have been widely adopted owing to their high precision and remarkable ability to solve intricate problems in industry and academia. Speech emotion recognition is an interesting application of CNNs in the field of audio processing. In this paper, a speech emotion recognition system based on a 3D CNN is proposed to analyze and classify emotions. In the proposed method, the three-dimensional reconstructed phase spaces of the speech signals are computed, and the emotion-related patterns formed in these spaces are converted into 3D tensors. The 3D CNN, applied to the EMO-DB and eNTERFACE05 datasets in a speaker-independent setting, achieves accuracies of 90.40% and 82.20%, respectively. When gender recognition is incorporated, the accuracy increases to 94.42% on EMO-DB and 88.47% on eNTERFACE05. The realization of the introduced 3D CNN on both an Intel CPU and an NVIDIA GPU is also explored. For both datasets, with and without gender recognition, the GPU-based implementation runs faster than the CPU-based execution (in Python).
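As a hedged illustration of the front end described in the abstract, the sketch below builds a 3D tensor from the three-dimensional reconstructed phase space of a speech signal via time-delay embedding. The delay `tau`, the grid resolution `bins`, and the normalization are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch: 3D reconstructed phase space (RPS) of a speech signal,
# rasterized into a cube of visit counts. `tau` and `bins` are assumed values.
import numpy as np

def rps_tensor(signal, tau=10, bins=32):
    """Embed `signal` into a 3D phase space and rasterize it into a cube."""
    x = signal[:-2 * tau]        # x(t)
    y = signal[tau:-tau]         # x(t + tau)
    z = signal[2 * tau:]         # x(t + 2*tau)
    points = np.stack([x, y, z], axis=1)
    # Count how often the trajectory visits each cell of a bins^3 grid.
    tensor, _ = np.histogramdd(points, bins=bins, range=[(-1.0, 1.0)] * 3)
    # Normalize so tensors are comparable across utterances.
    return tensor / max(tensor.max(), 1e-8)

# Example: a synthetic 1-second utterance at 16 kHz, scaled to [-1, 1].
sig = np.random.randn(16000)
sig = sig / np.abs(sig).max()
cube = rps_tensor(sig)           # shape (32, 32, 32)
print(cube.shape)
```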
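A second sketch, also under stated assumptions, shows a minimal 3D CNN over such tensors together with a crude CPU-versus-GPU inference timing comparison in PyTorch. The layer sizes, the 7-class output, and the batch size are hypothetical and do not reproduce the paper's architecture or its benchmark protocol.

```python
# Minimal sketch: tiny 3D CNN classifier plus a rough CPU-vs-GPU timing check.
import time
import torch
import torch.nn as nn

class Tiny3DCNN(nn.Module):
    def __init__(self, n_classes=7):              # 7 classes is an assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8 * 8, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

def time_inference(device, batch=8, reps=20):
    model = Tiny3DCNN().to(device).eval()
    x = torch.randn(batch, 1, 32, 32, 32, device=device)
    with torch.no_grad():
        for _ in range(3):                         # warm-up iterations
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(reps):
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - t0) / reps

print("CPU s/batch:", time_inference(torch.device("cpu")))
if torch.cuda.is_available():
    print("GPU s/batch:", time_inference(torch.device("cuda")))
```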
Pages: 112460-112471 (12 pages)