MFF-SAug: Multi feature fusion with spectrogram augmentation of speech emotion recognition using convolution neural network

Cited by: 27
Authors:
Jothimani, S. [1 ]
Premalatha, K. [1 ]
Affiliations:
[1] Bannari Amman Inst Technol, Dept Comp Sci & Engn, Sathyamangalam 638401, India
Keywords:
Augmentation; Contrastive loss; MFCC; RMS; Speech emotion recognition; ZCR; Accuracy
DOI:
10.1016/j.chaos.2022.112512
CLC classification: O1 [Mathematics]
Subject classification: 0701; 070101
Abstract
Speech Emotion Recognition (SER) is a complex task because of the feature selection needed to capture emotion in human speech. SER plays a vital role in Human-Computer Interaction (HCI) and remains very challenging. Traditional methods provide inconsistent feature extraction for emotion recognition. The primary motive of this paper is to improve the classification accuracy for eight emotions from the human voice. The proposed MFF-SAug method enhances emotion prediction from speech through noise removal, white noise injection, and pitch tuning. On the pre-processed speech signals, the feature extraction techniques Mel Frequency Cepstral Coefficients (MFCC), Zero Crossing Rate (ZCR), and Root Mean Square (RMS) are applied, and their outputs are combined to achieve substantial performance for emotion recognition. Augmentation is applied to the raw speech for a contrastive loss that maximizes agreement between differently augmented samples in the latent space, and a reconstruction loss on the input representation improves prediction accuracy. A state-of-the-art Convolutional Neural Network (CNN) is proposed for enhanced speech representation learning and voice emotion classification. Further, the MFF-SAug method is compared with a CNN + LSTM model. The experimental analysis was carried out on the RAVDESS, CREMA, SAVEE, and TESS datasets, where the classifier achieved a robust representation for speech emotion recognition with accuracies of 92.6 %, 89.9 %, 84.9 %, and 99.6 %, respectively.
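Two of the features named in the abstract, Zero Crossing Rate and Root Mean Square energy, along with white-noise injection, can be sketched directly in NumPy. The snippet below is a minimal illustration, not the paper's implementation; the function names, the target SNR, and the test tone are illustrative assumptions. MFCCs would typically be computed with a library such as librosa (`librosa.feature.mfcc`) rather than by hand.

```python
import numpy as np

def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs whose signs differ.
    signs = np.sign(frame)
    return np.mean(signs[:-1] != signs[1:])

def rms_energy(frame):
    # Root mean square of the sample amplitudes.
    return np.sqrt(np.mean(frame ** 2))

def add_white_noise(signal, snr_db=20.0, rng=None):
    # Inject Gaussian white noise at a target signal-to-noise ratio (dB).
    rng = np.random.default_rng(0) if rng is None else rng
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Demo on a synthetic 441 Hz tone sampled at 16 kHz for one second.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 441 * t)

zcr = zero_crossing_rate(tone)   # ~2 * 441 / 16000 for a pure tone
rms = rms_energy(tone)           # ~1 / sqrt(2) for a unit-amplitude sine
noisy = add_white_noise(tone, snr_db=20.0)
```

For a pure tone, ZCR approximates twice the frequency divided by the sample rate, which is why it is a cheap pitch-related cue to fuse alongside MFCC and RMS.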
Pages: 18