Speech Emotion Recognition Using Generative Adversarial Network and Deep Convolutional Neural Network

Cited by: 7
Authors
Bhangale, Kishor [1 ]
Kothandaraman, Mohanaprasad [1 ]
Affiliations
[1] VIT, SENSE, Chennai, India
Keywords
Data augmentation; Deep learning; Deep convolutional neural network; Generative adversarial network; Multi-taper Mel frequency spectrogram; Speech processing; Speech emotion recognition; Features; Classifiers
DOI
10.1007/s00034-023-02562-5
Chinese Library Classification (CLC)
TM (Electrical Engineering); TN (Electronics and Communication Technology);
Subject Classification
0808; 0809;
Abstract
Interest in speech emotion recognition (SER) has increased recently because of vast innovations in human-computer interaction and affective computing. In recent years, numerous deep learning-based schemes presented for SER have shown significant improvement over traditional machine learning approaches. Most deep learning-based SER systems face challenges due to the data imbalance problem caused by unequal class samples in the database. Moreover, two-dimensional CNNs for SER typically take traditional MFCC features as input, which degrades the quality of the deep attributes because of the higher variance, limited frequency resolution, and spectral leakage of traditional MFCCs. This paper proposes a novel Multi-taper Mel Frequency Logarithmic Spectrogram to enrich the effectiveness of a Deep Convolutional Neural Network for SER. Further, a Generative Adversarial Network is used for speech emotion data augmentation during training to deal with the data scarcity problem in SER. The performance of the proposed SER scheme is validated on the Berlin EmoDB and RAVDESS datasets. The proposed method provides SER accuracy of 96.65% and 97.12% on EmoDB and RAVDESS, respectively, a significant improvement over recent techniques.
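The multi-taper feature named in the abstract can be illustrated with a short sketch. The Python code below is a minimal example of the standard multi-taper recipe, not the authors' implementation: several DPSS (Slepian) tapered periodograms are averaged before Mel filtering and log compression, which reduces the variance and spectral leakage of a single-window (e.g., Hamming) spectrogram. All parameter values (n_fft=512, hop=160, n_tapers=6, n_mels=40) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
import librosa
from scipy.signal import stft
from scipy.signal.windows import dpss

def multitaper_mel_log_spectrogram(x, sr, n_fft=512, hop=160,
                                   n_tapers=6, n_mels=40):
    """Multi-taper Mel frequency logarithmic spectrogram (illustrative sketch)."""
    # DPSS tapers and their eigenvalue concentration ratios; averaging over
    # orthogonal tapers lowers the estimator variance compared with one window.
    tapers, ratios = dpss(n_fft, NW=(n_tapers + 1) / 2.0, Kmax=n_tapers,
                          return_ratios=True)
    power = None
    for taper, weight in zip(tapers, ratios):
        # One STFT per taper; the taper array is used directly as the window.
        _, _, Z = stft(x, fs=sr, window=taper, nperseg=n_fft,
                       noverlap=n_fft - hop)
        p = weight * np.abs(Z) ** 2
        power = p if power is None else power + p
    power /= ratios.sum()                       # weighted average over tapers
    # Mel filterbank projection followed by log compression.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return np.log(mel_fb @ power + 1e-10)

# Usage: produces an (n_mels x frames) log-Mel image suitable as 2-D CNN input.
# y, sr = librosa.load("speech.wav", sr=16000)
# S = multitaper_mel_log_spectrogram(y, sr)
```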
Pages: 2341-2384
Page count: 44