Speech Emotion Recognition Using Generative Adversarial Network and Deep Convolutional Neural Network

Cited by: 7
Authors
Bhangale, Kishor [1 ]
Kothandaraman, Mohanaprasad [1 ]
Affiliations
[1] VIT, SENSE, Chennai, India
Keywords
Data augmentation; Deep learning; Deep convolutional neural network; Generative adversarial network; Multi-taper Mel frequency spectrogram; Speech processing; Speech emotion recognition; Features; Classifiers
DOI
10.1007/s00034-023-02562-5
Chinese Library Classification (CLC)
TM (Electrical Engineering); TN (Electronics and Communication Technology);
Subject Classification
0808; 0809;
Abstract
Interest in speech emotion recognition (SER) has increased recently because of vast innovations in human-computer interaction and affective computing. In recent years, numerous deep learning-based schemes presented for SER have shown significant improvement over traditional machine learning approaches. Most deep learning-based SER systems face challenges due to the data imbalance problem caused by unequal class samples in the database. Moreover, two-dimensional CNNs for SER typically take traditional MFCC features as input, which degrades the quality of the deep attributes because of the higher variance, limited frequency resolution, and spectral leakage of traditional MFCCs. This paper proposes a novel Multi-taper Mel Frequency Logarithmic Spectrogram to enrich the effectiveness of a Deep Convolutional Neural Network for SER. Further, a Generative Adversarial Network is used for speech emotion data augmentation during training to deal with the data scarcity problem in SER. The performance of the proposed SER scheme is validated on the Berlin EmoDB and RAVDESS datasets. The proposed method provides SER accuracy of 96.65% and 97.12% on EmoDB and RAVDESS, respectively, a significant improvement over recent techniques.
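The multi-taper feature named in the abstract can be illustrated with a short sketch. The Python code below is a minimal example of the standard multi-taper recipe, not the authors' implementation: several DPSS (Slepian) tapered periodograms are averaged before Mel filtering and log compression, which reduces the variance and spectral leakage of a single-window (e.g., Hamming) spectrogram. All parameter values (n_fft=512, hop=160, n_tapers=6, n_mels=40) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
import librosa
from scipy.signal import stft
from scipy.signal.windows import dpss

def multitaper_mel_log_spectrogram(x, sr, n_fft=512, hop=160,
                                   n_tapers=6, n_mels=40):
    """Multi-taper Mel frequency logarithmic spectrogram (illustrative sketch)."""
    # DPSS tapers and their eigenvalue concentration ratios; averaging over
    # orthogonal tapers lowers the estimator variance compared with one window.
    tapers, ratios = dpss(n_fft, NW=(n_tapers + 1) / 2.0, Kmax=n_tapers,
                          return_ratios=True)
    power = None
    for taper, weight in zip(tapers, ratios):
        # One STFT per taper; the taper array is used directly as the window.
        _, _, Z = stft(x, fs=sr, window=taper, nperseg=n_fft,
                       noverlap=n_fft - hop)
        p = weight * np.abs(Z) ** 2
        power = p if power is None else power + p
    power /= ratios.sum()                       # weighted average over tapers
    # Mel filterbank projection followed by log compression.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return np.log(mel_fb @ power + 1e-10)

# Usage: produces an (n_mels x frames) log-Mel image suitable as 2-D CNN input.
# y, sr = librosa.load("speech.wav", sr=16000)
# S = multitaper_mel_log_spectrogram(y, sr)
```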
Pages: 2341-2384
Page count: 44