CochleaSpecNet: An Attention-Based Dual Branch Hybrid CNN-GRU Network for Speech Emotion Recognition Using Cochleagram and Spectrogram

Cited by: 2
Authors
Namey, Atkia Anika [1 ]
Akter, Khadija [1 ]
Hossain, Md. Azad [1 ]
Dewan, M. Ali Akber [2 ]
Affiliations
[1] Chittagong Univ Engn & Technol, Dept Elect & Telecommun Engn, Chattogram 4349, Bangladesh
[2] Athabasca Univ, Fac Sci & Technol, Sch Comp & Informat Syst, Athabasca, AB T9S 3A3, Canada
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
Feature extraction; Convolutional neural networks; Accuracy; Speech recognition; Emotion recognition; Data models; Mel frequency cepstral coefficient; Data mining; Time-frequency analysis; Speech enhancement; Speech emotion; cochleagram; spectrogram; hybrid network; multi-head attention; DEEP;
DOI
10.1109/ACCESS.2024.3517733
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
As one of the primary media of communication, speech carries essential information about a speaker's emotional state. Accurate emotion recognition is crucial for enhancing human-machine interaction, highlighting the importance of a strong Speech Emotion Recognition (SER) system. An SER system classifies the human emotional state from a speaker's utterances into categories such as sad, happy, neutral, angry, surprise, and calm. This research introduces a novel SER approach that utilizes cochleagram and spectrogram features to capture relevant speech patterns for the classifier network. The network integrates a hybrid model that combines Convolutional Neural Networks (CNN) for feature extraction with Gated Recurrent Units (GRU) to handle temporal dependencies. Furthermore, to improve the performance of this network, a multi-head attention mechanism has been incorporated after the GRU layer. Despite increasing interest in SER, there is a notable lack of studies using Bangla-language datasets, revealing a significant gap in current research. To address this gap, the model has been evaluated on the augmented BanglaSER (Bangla Speech Emotion Recognition) dataset, on which it achieved a notable accuracy of 92.04% in categorizing five distinct emotions: angry, surprise, happy, neutral, and sad. Additionally, to further evaluate the performance of the SER model, the English-language RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) dataset has also been applied to the proposed model, yielding 82.40% accuracy in classifying eight emotions that include fear, disgust, and calm along with the emotions of BanglaSER. Moreover, a comparative analysis of the proposed model with existing SER approaches is carried out to demonstrate its stability and robustness. Feeding two individual features as inputs into the attention-guided hybrid neural network showcases the efficacy of the proposed SER system, offering a promising approach for precise and efficient emotion categorization from speech signals.
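The dual-branch architecture described in the abstract can be pictured with a minimal PyTorch sketch: one CNN-GRU branch for the cochleagram, one for the spectrogram, multi-head self-attention after each GRU, and a shared classifier on the fused embeddings. The layer widths, kernel sizes, input resolution (128 frequency bins by 256 frames), number of attention heads, and mean-pooling fusion are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn

class Branch(nn.Module):
    # One branch: a 2D CNN over a time-frequency image, then a GRU over time,
    # followed by multi-head self-attention across the time steps.
    def __init__(self, gru_hidden=128, n_heads=4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.gru = nn.GRU(input_size=64 * 32, hidden_size=gru_hidden,
                          batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(embed_dim=2 * gru_hidden,
                                          num_heads=n_heads, batch_first=True)

    def forward(self, x):                      # x: (batch, 1, 128 freq bins, T frames)
        f = self.cnn(x)                        # (batch, 64, 32, T/4)
        f = f.permute(0, 3, 1, 2).flatten(2)   # (batch, T/4, 64*32) time-major sequence
        h, _ = self.gru(f)                     # (batch, T/4, 2*gru_hidden)
        a, _ = self.attn(h, h, h)              # self-attention over time steps
        return a.mean(dim=1)                   # pooled branch embedding

class DualBranchSketch(nn.Module):
    # Hypothetical fusion of the two branches; the paper's fusion strategy and
    # classifier head may differ from this concatenation plus linear layer.
    def __init__(self, num_classes=5):
        super().__init__()
        self.cochlea_branch = Branch()
        self.spec_branch = Branch()
        self.classifier = nn.Linear(2 * 256, num_classes)

    def forward(self, cochleagram, spectrogram):
        z = torch.cat([self.cochlea_branch(cochleagram),
                       self.spec_branch(spectrogram)], dim=1)
        return self.classifier(z)              # emotion logits

# Usage example: a batch of 4 utterances, each rendered as a 128x256 cochleagram
# and a 128x256 spectrogram; 5 classes match the BanglaSER label set.
model = DualBranchSketch(num_classes=5)
logits = model(torch.randn(4, 1, 128, 256), torch.randn(4, 1, 128, 256))
print(logits.shape)                            # torch.Size([4, 5])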
Pages: 190760-190774
Page count: 15