Mel-Spectrograms Based LSTM Model for Speech Emotion Recognition

Times Cited: 0
Authors
Bhuyan, Hemanta Kumar [1 ]
Brahma, Biswajit [2 ]
Kamila, Nilayam Kumar [3 ]
Peram, Subbarao [1 ]
Leelambika, Bannaravuri [1 ]
Sahu, Amaresh [4 ]
Affiliations
[1] Vignans Fdn Sci Technol & Res, Dept Informat Technol, Guntur 522213, India
[2] McKesson Corp, Dept Life Sci, San Francisco, CA 94555 USA
[3] Capital One Serv, Dept Retail Bank Technol, Wilmington, DE 19801 USA
[4] Ajay Binay Inst Technol, Dept MCA, Cuttack 753014, India
Keywords
emotion recognition; deep learning; multimodal features; MFCC; DenseNet; audio processing; REPRESENTATIONS; FEATURES;
DOI
10.18280/ts.420312
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Emotion recognition from audio data holds immense potential for human-computer interaction (HCI), affective computing, and psychological health monitoring. This paper presents a deep learning approach that leverages multimodal features extracted from audio signals. We propose a model that addresses the limitations of existing methods by combining Mel-Frequency Cepstral Coefficients (MFCCs) with high-level representations extracted from a pre-trained DenseNet architecture. MFCCs provide a compact representation of the audio signal's spectral characteristics, capturing crucial emotional cues such as pitch and intensity, while the patterns learned by the pre-trained DenseNet transfer to audio emotion recognition, enabling the model to identify subtle emotional nuances that are difficult to capture with traditional feature engineering. Our deep learning model, composed of dense layers, achieves robust performance in classifying emotions across diverse categories. We use a Mel-spectrogram-based LSTM model for speech emotion recognition that effectively identifies various emotions. We rigorously evaluate the proposed approach on the TESS dataset, where it achieves an accuracy of 100%, demonstrating the effectiveness of the multimodal approach in extracting and interpreting emotional cues from audio data.
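As a rough illustration of the audio front end the abstract describes, the following is a minimal numpy sketch of computing a log-mel spectrogram (the per-frame features an LSTM would consume) and MFCCs from a waveform. The filterbank parameters (16 kHz sample rate, 512-point FFT, 40 mel bands, 13 coefficients) are illustrative assumptions, not the paper's reported settings.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style Hz -> mel conversion
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters mapping an (n_fft//2+1)-bin power spectrum to n_mels bands."""
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges_hz / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, ctr):          # rising edge of the triangle
            fb[m - 1, k] = (k - lo) / max(ctr - lo, 1)
        for k in range(ctr, hi):          # falling edge
            fb[m - 1, k] = (hi - k) / max(hi - ctr, 1)
    return fb

def log_mel_spectrogram(y, sr, n_fft=512, hop=256, n_mels=40):
    """Frame the signal, take an FFT power spectrum, and warp it onto the mel scale."""
    n_frames = 1 + (len(y) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([y[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2       # (n_frames, n_fft//2+1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T     # (n_frames, n_mels)
    return np.log(mel + 1e-10)                            # log compression

def mfcc(log_mel, n_coeff=13):
    """DCT-II of the log-mel bands; the low-order coefficients are the classic MFCCs."""
    n = log_mel.shape[1]
    basis = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :] * np.arange(n_coeff)[:, None])
    return log_mel @ basis.T                              # (n_frames, n_coeff)

# 1 s of a 440 Hz tone at 16 kHz as a stand-in for a speech clip
sr = 16000
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
lm = log_mel_spectrogram(y, sr)   # sequence of mel frames, one time step each
cc = mfcc(lm)
print(lm.shape, cc.shape)
```

In a model like the one described, each row of `lm` would be fed to the LSTM as one time step, with the final hidden state passed to dense layers for emotion classification.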
Pages: 1353-1365
Page Count: 13
Related Papers
26 records
[1]   Evaluating Self-Supervised Speech Representations for Speech Emotion Recognition [J].
Atmaja, Bagus Tris ;
Sasou, Akira .
IEEE ACCESS, 2022, 10 :124396-124407
[2]   Diagnosis system for cancer disease using a single setting approach [J].
Bhuyan, Hemanta Kumar ;
Vijayaraj, A. ;
Ravi, Vinayakumar .
MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (30) :46241-46267
[3]   An Integrated Framework with Deep Learning for Segmentation and Classification of Cancer Disease [J].
Bhuyan, Hemanta Kumar ;
Ravi, Vinayakumar .
INTERNATIONAL JOURNAL ON ARTIFICIAL INTELLIGENCE TOOLS, 2023, 32 (02)
[4]   Development of secrete images in image transferring system [J].
Bhuyan, Hemanta Kumar ;
Vijayaraj, A. ;
Ravi, Vinayakumar .
MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (05) :7529-7552
[5]   Disease analysis using machine learning approaches in healthcare system [J].
Bhuyan, Hemanta Kumar ;
Ravi, Vinayakumar ;
Bramha, Biswajit ;
Kamila, Nilayam Kumar .
HEALTH AND TECHNOLOGY, 2022, 12 (05) :987-1005
[6]   K-Means Clustering-Based Kernel Canonical Correlation Analysis for Multimodal Emotion Recognition in Human-Robot Interaction [J].
Chen, Luefeng ;
Wang, Kuanlin ;
Li, Min ;
Wu, Min ;
Pedrycz, Witold ;
Hirota, Kaoru .
IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, 2023, 70 (01) :1016-1024
[7]  
Gavali M.P., 2023, 2023 IEEE INT C ART, P1, DOI 10.1109/AIBThings58340.2023.1029246
[8]  
Hochreiter S, 1997, NEURAL COMPUT, V9, P1735, DOI 10.1162/neco.1997.9.1.1
[9]   Convolutional-Recurrent Neural Networks With Multiple Attention Mechanisms for Speech Emotion Recognition [J].
Jiang, Pengxu ;
Xu, Xinzhou ;
Tao, Huawei ;
Zhao, Li ;
Zou, Cairong .
IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS, 2022, 14 (04) :1564-1573
[10]   RobinNet: A Multimodal Speech Emotion Recognition System With Speaker Recognition for Social Interactions [J].
Khurana, Yash ;
Gupta, Swamita ;
Sathyaraj, R. ;
Raja, S. P. .
IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, 2022, 11 (01) :478-487