Emotions Understanding Model from Spoken Language using Deep Neural Networks and Mel-Frequency Cepstral Coefficients

Cited by: 24
Authors
de Pinto, Marco Giuseppe [1 ]
Polignano, Marco [1 ]
Lops, Pasquale [1 ]
Semeraro, Giovanni [1 ]
Affiliations
[1] Univ Bari Aldo Moro, Via E Orabona 4, Bari, Italy
Source
2020 IEEE INTERNATIONAL CONFERENCE ON EVOLVING AND ADAPTIVE INTELLIGENT SYSTEMS (EAIS) | 2020
Keywords
emotion detection; natural language understanding; sentiment analysis; deep learning; machine learning; classification; mel-frequency cepstral coefficients; cnn; ravdess;
DOI
10.1109/eais48028.2020.9122698
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The ability to understand people through spoken language is a skill that many human beings take for granted. The same task is far harder for machines, because of the large number of variables that alter the speech waveform while people talk to each other. A sub-task of speech understanding is the detection of the emotions expressed by the speaker, and this is the main focus of our contribution. In particular, we present a classification model of emotions elicited by speech, based on deep convolutional neural networks (CNNs). For this purpose, we focused on the audio recordings available in the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset. The model has been trained to classify eight emotions (neutral, calm, happy, sad, angry, fearful, disgust, surprise), which correspond to those proposed by Ekman plus the neutral and calm ones. We use the F1 score as the evaluation metric, obtaining a weighted average of 0.91 on the test set, with the best performance on the "Angry" class (0.95). Our worst result was observed for the "Sad" class, with a score of 0.87 that nevertheless improves on the state of the art. To support future development and the replicability of the results, the source code of the proposed model is available in the following GitHub repository: https://github.com/marcogdepinto/Emotion-Classification-Ravdess
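The abstract describes a pipeline of mel-frequency cepstral coefficients (MFCCs) extracted from RAVDESS audio and fed to a CNN for eight-class emotion classification. Below is a minimal sketch of such a pipeline, assuming 40 time-averaged MFCC coefficients and a small 1D CNN; the layer sizes, hyperparameters, and file paths are illustrative assumptions, not the authors' exact configuration (see their GitHub repository for the reference implementation).

```python
# Sketch: MFCC feature extraction + 1D-CNN emotion classifier (illustrative only).
# Assumes 40 MFCCs averaged over time; layer sizes are NOT the authors' exact setup.
import numpy as np
import librosa
from tensorflow.keras import layers, models

NUM_CLASSES = 8  # neutral, calm, happy, sad, angry, fearful, disgust, surprise

def extract_mfcc(path: str, n_mfcc: int = 40) -> np.ndarray:
    """Load an audio file and return its time-averaged MFCC vector."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
    return mfcc.mean(axis=1)                                # shape: (n_mfcc,)

def build_model(n_mfcc: int = 40) -> models.Model:
    """Small 1D CNN over the MFCC vector (one feature channel)."""
    model = models.Sequential([
        layers.Input(shape=(n_mfcc, 1)),
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example usage (wav_paths and labels are placeholders for RAVDESS files):
# X = np.stack([extract_mfcc(p) for p in wav_paths])[..., np.newaxis]
# y = np.array(labels)  # integer emotion labels 0-7
# model = build_model()
# model.fit(X, y, epochs=50, batch_size=32, validation_split=0.2)
```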
Pages: 5
References
16 in total
[1] Anonymous. ISMIR, 2000.
[2] Bengio Y. The Handbook of Brain Theory and Neural Networks, 1995, 3361.
[3] Davis S. B., Mermelstein P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 1980, 28(4): 357-366.
[4] Ekman P. Handbook of Cognition and Emotion, 1999, p. 45.
[5] Goodfellow I. Adaptive Computation and Machine Learning series, 2016, p. 1.
[6] Haynes J.-D., Rees G. Decoding mental states from brain activity in humans. Nature Reviews Neuroscience, 2006, 7(7): 523-534.
[7] Iqbal A., Dong P., Kim C. M., Jang H. Decoding neural responses in mouse visual cortex through a deep neural network. 2019 International Joint Conference on Neural Networks (IJCNN), 2019.
[8] Jannat R., Tynes I., LaLime L., Adorno J., Canavan S. Ubiquitous emotion recognition using audio and video data. Proceedings of the 2018 ACM International Joint Conference on Pervasive and Ubiquitous Computing and the 2018 ACM International Symposium on Wearable Computers (UbiComp/ISWC'18 Adjunct), 2018: 956-959.
[9] Livingstone S. R., Russo F. A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLOS ONE, 2018, 13(5).
[10] Muda L. Journal of Computing, 2010, Vol. 2.