DSTCNet: Deep Spectro-Temporal-Channel Attention Network for Speech Emotion Recognition

Cited by: 8

Authors:

Guo, Lili [1,2]

Ding, Shifei [1,2]

Wang, Longbiao [3,4]

Dang, Jianwu [3,5]

Affiliations:

[1] China Univ Min & Technol, Sch Comp Sci & Technol, Xuzhou 221116, Jiangsu, Peoples R China

[2] Mine Digitizat Engn Res Ctr, Minist Educ, Xuzhou 221116, Jiangsu, Peoples R China

[3] Tianjin Univ, Coll Intelligence & Comp, Tianjin Key Lab Cognit Comp & Applicat, Tianjin 300350, Peoples R China

[4] Huiyan Technol Tianjin Co Ltd, Tianjin 300350, Peoples R China

[5] Pengcheng Lab, Shenzhen 518055, Peoples R China

Source:

IEEE Transactions on Neural Networks and Learning Systems | 2025, Vol. 36, No. 1

Funding:

National Natural Science Foundation of China

Keywords:

Channel attention; representation learning; spectro-temporal attention; speech emotion recognition (SER)

DOI:

10.1109/TNNLS.2023.3304516

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Speech emotion recognition (SER) plays an important role in human-computer interaction, which can provide better interactivity to enhance user experiences. Existing approaches tend to directly apply deep learning networks to distinguish emotions. Among them, the convolutional neural network (CNN) is the most commonly used method to learn emotional representations from spectrograms. However, CNN does not explicitly model features' associations in the spectral-, temporal-, and channel-wise axes or their relative relevance, which will limit the representation learning. In this article, we propose a deep spectro-temporal-channel network (DSTCNet) to improve the representational ability for speech emotion. The proposed DSTCNet integrates several spectro-temporal-channel (STC) attention modules into a general CNN. Specifically, we propose the STC module that infers a 3-D attention map along the dimensions of time, frequency, and channel. The STC attention can focus more on the regions of crucial time frames, frequency ranges, and feature channels. Finally, experiments were conducted on the Berlin emotional database (EmoDB) and interactive emotional dyadic motion capture (IEMOCAP) databases. The results reveal that our DSTCNet can outperform the traditional CNN-based and several state-of-the-art methods.
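As a rough illustration of what a 3-D attention map over channel, frequency, and time might look like, here is a minimal NumPy sketch. The abstract does not say how DSTCNet computes its attention, so everything here is an assumption made for illustration: the function name `stc_attention_sketch`, the (channels, frequency bins, time frames) layout, and the choice to build the map by average-pooling along each axis, summing the three descriptors via broadcasting, and squashing with a sigmoid. It has no learned parameters and is not the authors' module.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function, mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def stc_attention_sketch(features):
    """Apply a toy 3-D attention map to a CNN feature map.

    features: array of shape (channels, freq_bins, time_frames).
    Returns (reweighted_features, attention), both with the input shape.

    Hypothetical construction, not taken from the paper: one descriptor
    per axis is obtained by average pooling over the other two axes, and
    the three descriptors are broadcast-summed into a full 3-D map.
    """
    if features.ndim != 3:
        raise ValueError("expected an array of shape (channels, freq, time)")

    channel_desc = features.mean(axis=(1, 2), keepdims=True)  # (C, 1, 1)
    freq_desc = features.mean(axis=(0, 2), keepdims=True)     # (1, F, 1)
    time_desc = features.mean(axis=(0, 1), keepdims=True)     # (1, 1, T)

    # Broadcasting the sum yields one weight per (channel, freq, time) cell.
    attention = sigmoid(channel_desc + freq_desc + time_desc)  # (C, F, T)

    # Element-wise reweighting of the original feature map.
    return features * attention, attention


# Example: 4 feature channels, 8 frequency bins, 16 time frames.
rng = np.random.default_rng(0)
example_features = rng.standard_normal((4, 8, 16))
reweighted, attention_map = stc_attention_sketch(example_features)
```

In a trained network each descriptor would normally pass through learned layers before being combined; the parameter-free version above only shows the shapes involved and how a single map can weight time frames, frequency ranges, and feature channels at once.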


Pages: 188-197

Page count: 10
