MFF-SAug: Multi feature fusion with spectrogram augmentation of speech emotion recognition using convolution neural network

Cited by: 30
Authors
Jothimani, S. [1 ]
Premalatha, K. [1 ]
Affiliations
[1] Bannari Amman Inst Technol, Dept Comp Sci & Engn, Sathyamangalam 638401, India
Keywords
Augmentation; Contrastive loss; MFCC; RMS; Speech emotion recognition; ZCR; Accuracy
DOI
10.1016/j.chaos.2022.112512
CLC classification
O1 [Mathematics]
Subject classification codes
0701; 070101
Abstract
Speech Emotion Recognition (SER) is a complex task because of the feature selection needed to reflect emotion in human speech. SER plays a vital role in Human-Computer Interaction (HCI) and remains very challenging. Traditional methods provide inconsistent feature extraction for emotion recognition. The primary motive of this paper is to improve the accuracy of classifying eight emotions from the human voice. The proposed MFF-SAug approach enhances emotion prediction from speech through Noise Removal, White Noise Injection, and Pitch Tuning. On the pre-processed speech signals, the feature extraction techniques Mel Frequency Cepstral Coefficients (MFCC), Zero Crossing Rate (ZCR), and Root Mean Square (RMS) are applied and combined to achieve substantial performance for emotion recognition. Augmentation is applied to the raw speech for a contrastive loss that maximizes agreement between differently augmented samples in the latent space and reconstructs the loss of the input representation for better prediction accuracy. A state-of-the-art Convolutional Neural Network (CNN) is proposed for enhanced speech representation learning and voice emotion classification. Further, the MFF-SAug method is compared with a CNN + LSTM model. The experimental analysis was carried out on the RAVDESS, CREMA, SAVEE, and TESS datasets. The classifier achieved a robust representation for speech emotion recognition, with accuracies of 92.6 %, 89.9 %, 84.9 %, and 99.6 % on the RAVDESS, CREMA, SAVEE, and TESS datasets, respectively.
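To illustrate the kind of feature extraction and augmentation the abstract describes, the following is a minimal NumPy sketch, not the authors' code: the function names, frame sizes, and noise factor are illustrative assumptions, ZCR and RMS are computed from their standard definitions, and MFCC extraction (omitted here) would typically be done with an audio library such as librosa.

```python
import numpy as np

def zero_crossing_rate(y, frame_length=2048, hop_length=512):
    # Fraction of adjacent-sample sign changes within each frame.
    frames = np.lib.stride_tricks.sliding_window_view(y, frame_length)[::hop_length]
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

def rms_energy(y, frame_length=2048, hop_length=512):
    # Root-mean-square amplitude per frame.
    frames = np.lib.stride_tricks.sliding_window_view(y, frame_length)[::hop_length]
    return np.sqrt(np.mean(frames ** 2, axis=1))

def inject_white_noise(y, noise_factor=0.005, seed=None):
    # White Noise Injection: add scaled Gaussian noise (noise_factor is an assumed value).
    rng = np.random.default_rng(seed)
    return y + noise_factor * rng.standard_normal(len(y))

# Tiny demo signal: 1 s of a 440 Hz sine at 16 kHz.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
y = np.sin(2 * np.pi * 440 * t)

zcr = zero_crossing_rate(y)
rms = rms_energy(y)
y_aug = inject_white_noise(y, seed=0)

# Multi-feature fusion by simple concatenation (one plausible fusion scheme;
# the paper combines these with MFCC before feeding the CNN).
features = np.concatenate([zcr, rms])
```

In a full pipeline, the augmented waveform `y_aug` would feed a second, differently augmented view of the same utterance for the contrastive objective, while the fused feature vector goes to the CNN classifier.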
Pages: 18