Emotional Speech Generation: An Approach Using Convolutional Neural Networks (CNN) Based Generative Adversarial Network

Cited: 0
Authors
Anver, S. R. [1 ]
Deepambika, V. A. [2 ]
Rahiman, M. Abdul [3 ]
Santhosh, R. [4 ]
Affiliations
[1] LBS Coll Engn, Dept Comp Sci & Engn, Kasaragod, Kerala, India
[2] LBS Inst Technol Women, Dept Elect & Commun Engn, Trivandrum, Kerala, India
[3] LBS Ctr Sci & Technol, Trivandrum, Kerala, India
[4] Karpagam Acad Higher Educ, Dept Comp Sci & Engn, Coimbatore, Tamil Nadu, India
Keywords
Speech emotion generation; Generative Adversarial Networks; Convolutional neural network; Mel spectrograms; Min-max normalization;
DOI
10.1007/s00034-025-03224-4
CLC Classification Number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Subject Classification Number
0808 ; 0809 ;
Abstract
The objective of emotional speech generation is to create synthetic speech that convincingly conveys specific emotions, enhancing the emotional quality of human-computer interaction. However, existing techniques often fail to capture subtle emotional nuances, producing speech that feels inauthentic. Additionally, many models lack the robustness to perform well across varied emotional contexts, which limits their adaptability, and some generate overly exaggerated or artificial emotional responses that diminish their effectiveness in real-world scenarios. This research explores the use of Generative Adversarial Networks (GANs) combined with Convolutional Neural Networks (CNNs) for emotional speech generation. The pipeline begins with audio preprocessing, in which the signal is converted to a Mel spectrogram, denoised, and min-max normalized. A CNN-based GAN then performs feature extraction, and the combined CNN-GAN model classifies emotions such as fear, anger, sadness, and happiness from the extracted features. The proposed method was evaluated on two datasets, RAVDESS and IEMOCAP. Results show that this approach can effectively detect speech emotions, achieving an average accuracy of 99% on both datasets.
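The min-max normalization step mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, and the random matrix merely stands in for a real Mel spectrogram (e.g. the output of `librosa.feature.melspectrogram`).

```python
import numpy as np

def min_max_normalize(spec: np.ndarray, lo: float = 0.0, hi: float = 1.0) -> np.ndarray:
    """Linearly rescale a (Mel) spectrogram into the range [lo, hi]."""
    s_min, s_max = spec.min(), spec.max()
    if s_max == s_min:
        # Constant input: map everything to the lower bound.
        return np.full_like(spec, lo)
    return lo + (spec - s_min) * (hi - lo) / (s_max - s_min)

# Placeholder for a Mel spectrogram: 128 Mel bands x 64 time frames.
mel = np.random.default_rng(0).normal(size=(128, 64))
norm = min_max_normalize(mel)
```

Normalizing the spectrogram to a fixed range keeps the GAN's inputs on a consistent scale regardless of the recording's loudness, which typically stabilizes training.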
Pages: 23
References
30 records
[1]   Transformer-Based Multilingual Speech Emotion Recognition Using Data Augmentation and Feature Fusion [J].
Al-onazi, Badriyya B. ;
Nauman, Muhammad Asif ;
Jahangir, Rashid ;
Malik, Muhmmad Mohsin ;
Alkhammash, Eman H. ;
Elshewey, Ahmed M. .
APPLIED SCIENCES-BASEL, 2022, 12 (18)
[2]   Recognizing Semi-Natural and Spontaneous Speech Emotions Using Deep Neural Networks [J].
Amjad, Ammar ;
Khan, Lal ;
Ashraf, Noman ;
Mahmood, Muhammad Bilal ;
Chang, Hsien-Tsung .
IEEE ACCESS, 2022, 10 :37149-37163
[3]   When Old Meets New: Emotion Recognition from Speech Signals [J].
Arano, Keith April ;
Gloor, Peter ;
Orsenigo, Carlotta ;
Vercellis, Carlo .
COGNITIVE COMPUTATION, 2021, 13 (03) :771-783
[4]   Bakhshi A., 2021, Speech Emotion Recognition Using Deep Neural Networks
[5]   Benita R., 2024, arXiv:2310.01381
[6]   Speech Emotion Recognition Using Generative Adversarial Network and Deep Convolutional Neural Network [J].
Bhangale, Kishor ;
Kothandaraman, Mohanaprasad .
CIRCUITS SYSTEMS AND SIGNAL PROCESSING, 2024, 43 (04) :2341-2384
[7]   Multimodal Emotion Recognition Using Feature Fusion: An LLM-Based Approach [J].
Chandraumakantham, Omkumar ;
Gowtham, N. ;
Zakariah, Mohammed ;
Almazyad, Abdulaziz .
IEEE ACCESS, 2024, 12 :108052-108071
[8]   Dikbiyik E., 2025, IEEE ACCESS, V13, P64330, DOI 10.1109/ACCESS.2025.3559339
[9]   Du C., 2025, Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, P1
[10]   Speech Driven Talking Face Generation From a Single Image and an Emotion Condition [J].
Eskimez, Sefik Emre ;
Zhang, You ;
Duan, Zhiyao .
IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 24 :3480-3490