MFF-SAug: Multi feature fusion with spectrogram augmentation of speech emotion recognition using convolution neural network

Cited by: 27
Authors
Jothimani, S. [1 ]
Premalatha, K. [1 ]
Institutions
[1] Bannari Amman Inst Technol, Dept Comp Sci & Engn, Sathyamangalam 638401, India
Keywords
Augmentation; Contrastive loss; MFCC; RMS; Speech emotion recognition; ZCR; Accuracy
DOI
10.1016/j.chaos.2022.112512
Chinese Library Classification
O1 [Mathematics]
Discipline codes
0701; 070101
Abstract
Speech Emotion Recognition (SER) is a complex task because of the feature selection needed to capture emotion from human speech. SER plays a vital role in, and remains very challenging for, Human-Computer Interaction (HCI). Traditional methods provide inconsistent feature extraction for emotion recognition. The primary aim of this paper is to improve the classification accuracy for eight emotions from the human voice. The proposed MFF-SAug approach enhances emotion prediction from speech through Noise Removal, White Noise Injection, and Pitch Tuning. On the pre-processed speech signals, the feature extraction techniques Mel Frequency Cepstral Coefficients (MFCC), Zero Crossing Rate (ZCR), and Root Mean Square (RMS) are applied and combined to achieve substantial performance for emotion recognition. Augmentation is applied to the raw speech with a contrastive loss that maximizes agreement between differently augmented samples in the latent space, together with a reconstruction loss on the input representation, yielding more accurate predictions. A state-of-the-art Convolution Neural Network (CNN) is proposed for enhanced speech representation learning and voice emotion classification. Further, the MFF-SAug method is compared with a CNN + LSTM model. The experimental analysis was carried out on the RAVDESS, CREMA, SAVEE, and TESS datasets, where the classifier achieved a robust representation for speech emotion recognition with accuracies of 92.6 %, 89.9 %, 84.9 %, and 99.6 %, respectively.
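Two of the hand-crafted features the abstract fuses, ZCR and RMS, can be sketched in plain NumPy (MFCCs are normally computed with a DSP library such as librosa and are omitted here). This is a minimal illustration of frame-wise extraction and stacking into one fused feature matrix; the function names, frame length, and hop size below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def frame_signal(y, frame_len=2048, hop=512):
    # Split the waveform into overlapping frames of `frame_len` samples.
    n = 1 + max(0, (len(y) - frame_len) // hop)
    return np.stack([y[i * hop : i * hop + frame_len] for i in range(n)])

def zcr(frames):
    # Zero Crossing Rate: fraction of adjacent-sample sign changes per frame.
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def rms(frames):
    # Root Mean Square energy per frame.
    return np.sqrt(np.mean(frames ** 2, axis=1))

# Synthetic 1-second 220 Hz tone at 22.05 kHz stands in for a speech clip.
sr = 22050
t = np.linspace(0, 1, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220 * t)

frames = frame_signal(y)
# Fuse the per-frame features into one (n_features, n_frames) matrix,
# as the paper does before feeding the CNN (MFCC rows would be stacked too).
features = np.stack([zcr(frames), rms(frames)], axis=0)
print(features.shape)
```

The fused matrix can then be treated as a 2-D input to a CNN, with one row per feature stream and one column per analysis frame.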
Pages: 18
Related papers
50 records total
  • [21] Speech Emotion Recognition Using Multi-granularity Feature Fusion Through Auditory Cognitive Mechanism
    Xu, Cong
    Li, Haifeng
    Bo, Hongjian
    Ma, Lin
    COGNITIVE COMPUTING - ICCC 2019, 2019, 11518 : 117 - 131
  • [22] Multimodal speech emotion recognition and classification using convolutional neural network techniques
    Christy, A.
    Vaithyasubramanian, S.
    Jesudoss, A.
    Praveena, M. D. Anto
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2020, 23 (02) : 381 - 388
  • [24] Speech Emotion Recognition Using Neural Network and Wavelet Features
    Roy, Tanmoy
    Marwala, Tshilidzi
    Chakraverty, S.
    RECENT TRENDS IN WAVE MECHANICS AND VIBRATIONS, WMVC 2018, 2020, : 427 - 438
  • [25] Speech Emotion Recognition using Convolution Neural Networks and Deep Stride Convolutional Neural Networks
    Wani, Taiba Majid
    Gunawan, Teddy Surya
    Qadri, Syed Asif Ahmad
    Mansor, Hasmah
    Kartiwi, Mira
    Ismail, Nanang
    PROCEEDING OF 2020 6TH INTERNATIONAL CONFERENCE ON WIRELESS AND TELEMATICS (ICWT), 2020,
  • [26] Convolution neural network based automatic speech emotion recognition using Mel-frequency Cepstrum coefficients
    Manju D. Pawar
    Rajendra D. Kokate
    Multimedia Tools and Applications, 2021, 80 : 15563 - 15587
  • [27] Speech emotion recognition based on multi-dimensional feature extraction and multi-scale feature fusion
    Yu, Lingli
    Xu, Fengjun
    Qu, Yundong
    Zhou, Kaijun
    APPLIED ACOUSTICS, 2024, 216
  • [28] Graph-Based Multi-Feature Fusion Method for Speech Emotion Recognition
    Liu, Xueyu
    Lin, Jie
    Wang, Chao
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2024, 38 (16)
  • [29] Pre-trained Deep Convolution Neural Network Model With Attention for Speech Emotion Recognition
    Zhang, Hua
    Gou, Ruoyun
    Shang, Jili
    Shen, Fangyao
    Wu, Yifan
    Dai, Guojun
    FRONTIERS IN PHYSIOLOGY, 2021, 12
  • [30] End-to-end speech emotion recognition using a novel context-stacking dilated convolution neural network
    Duowei Tang
    Peter Kuppens
    Luc Geurts
    Toon van Waterschoot
    EURASIP Journal on Audio, Speech, and Music Processing, 2021