Emotion Recognition from Speech using Artificial Neural Networks and Recurrent Neural Networks

Times Cited: 0
Authors
Sharma, Shambhavi [1 ]
Affiliations
[1] Amity Univ Uttar Pradesh, CSE, Amity Sch Engn & Technol, Noida, India
Source
2021 11TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING, DATA SCIENCE & ENGINEERING (CONFLUENCE 2021) | 2021
Keywords
Speech emotion recognition; Mel Frequency Cepstral Coefficient; Long Short-Term Memory; Multi-Layer Perceptron; Recurrent neural network; Artificial neural networks; Deep learning;
DOI
10.1109/Confluence51648.2021.9377192
CLC Number
TP [Automation Technology; Computer Technology]
Discipline Code
0812
Abstract
This paper presents a comparative study of two classifiers built for speech emotion recognition. Perceiving a person's feelings has always been an intriguing task. These feelings can be expressed through facial expressions, speech, actions, and so forth, and speech is the most widely used form of communication. Speech is an elaborate form of communication carrying many details, which convey several kinds of information such as the gist of the message, the tone of the speaker, the language used, background noise, any musical sound, emotions, etc. Speech emotion recognition technology is becoming mainstream with the advancement of "Voice User Interface" technology, which enables computers to interact with humans by applying speech analysis to understand a person's instructions and carry out the required tasks and commands. There is always an emotion attached to a piece of speech, but recognizing that emotion remains a complex research problem, mainly because the way emotions are perceived from audio differs from person to person. I have created two models for speech emotion recognition, using Mel Frequency Cepstral Coefficients (MFCC) for feature extraction from the audio files. The first model, built with a Multi-Layer Perceptron (MLP) classifier, achieved an accuracy of 57.29 percent. The second model, built with Long Short-Term Memory (LSTM), achieved a good accuracy of 92.88 percent. The RAVDESS dataset was used for classification.
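The abstract outlines the pipeline (MFCC features extracted from RAVDESS audio, fed to an MLP classifier and to an LSTM network) without implementation details. Below is a minimal sketch of such a pipeline; the 40-coefficient MFCC setting, layer sizes, training epochs, train/test split, and the dataset directory layout are illustrative assumptions, not values reported in the paper.

# Minimal sketch: MFCC features from RAVDESS audio, then (a) an MLP classifier
# and (b) an LSTM network. All hyperparameters below are illustrative assumptions.
import glob
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

# RAVDESS filenames encode the emotion label as the third hyphen-separated field
# (codes 01-08: neutral, calm, happy, sad, angry, fearful, disgust, surprised).
NUM_CLASSES = 8

def extract_mfcc(path, n_mfcc=40):
    """Load an audio file and return its time-averaged MFCC vector of shape (n_mfcc,)."""
    signal, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfcc.T, axis=0)

features, labels = [], []
for path in glob.glob("RAVDESS/Actor_*/*.wav"):          # hypothetical dataset layout
    emotion_code = path.split("/")[-1].split("-")[2]      # e.g. "03" -> happy
    features.append(extract_mfcc(path))
    labels.append(int(emotion_code) - 1)                  # map codes 01-08 to 0-7

X = np.array(features)
y = np.array(labels)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# (a) MLP classifier on the averaged MFCC vectors.
mlp = MLPClassifier(hidden_layer_sizes=(300,), max_iter=500, random_state=42)
mlp.fit(X_train, y_train)
print("MLP accuracy:", accuracy_score(y_test, mlp.predict(X_test)))

# (b) LSTM: treat the 40 averaged MFCC coefficients as a length-40 sequence of
# scalars (a common simplification in tutorial implementations).
X_train_seq = X_train[..., np.newaxis]
X_test_seq = X_test[..., np.newaxis]
lstm = Sequential([
    LSTM(128, input_shape=(X_train_seq.shape[1], 1)),
    Dropout(0.3),
    Dense(64, activation="relu"),
    Dense(NUM_CLASSES, activation="softmax"),
])
lstm.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
lstm.fit(X_train_seq, y_train, epochs=50, batch_size=32, validation_split=0.1, verbose=0)
print("LSTM accuracy:", lstm.evaluate(X_test_seq, y_test, verbose=0)[1])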
Pages: 153-158
Number of Pages: 6
Related Papers
13 in total
[1] An S. M., 2017, ASIAPAC SIGN INFO PR, p. 1563, DOI 10.1109/APSIPA.2017.8282282
[2] Ayadi Moataz El, SURVEY SPEECH EMOTIO
[3] Basu Saikat, 2017, 2017 2nd International Conference on Communication and Electronics Systems (ICCES), Proceedings, p. 333, DOI 10.1109/CESYS.2017.8321292
[4] Bombatkar A., 2014, EMOTION RECOGNITION
[5] Burkhardt F., 2005, INTERSPEECH, v. 5, p. 1517
[6] Cao Houwei, Cooper David G., Keutmann Michael K., Gur Ruben C., Nenkova Ani, Verma Ragini. CREMA-D: Crowd-Sourced Emotional Multimodal Actors Dataset. IEEE Transactions on Affective Computing, 2014, 5(4): 377-390
[7] Hastie T., 2009, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, DOI 10.1007/978-0-387-84858-7
[8] Likitha M. S., 2017, 2017 2nd IEEE International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), p. 2257, DOI 10.1109/WiSPNET.2017.8300161
[9] Livingstone Steven R., Russo Frank A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLOS ONE, 2018, 13(5)
[10] Tischler M. A., 2007, APPL EMOTION RECOGNI