DSTCNet: Deep Spectro-Temporal-Channel Attention Network for Speech Emotion Recognition

Cited by: 8
Authors
Guo, Lili [1, 2]
Ding, Shifei [1, 2]
Wang, Longbiao [3, 4]
Dang, Jianwu [3, 5]
Affiliations
[1] China Univ Min & Technol, Sch Comp Sci & Technol, Xuzhou 221116, Jiangsu, Peoples R China
[2] Mine Digitizat Engn Res Ctr, Minist Educ, Xuzhou 221116, Jiangsu, Peoples R China
[3] Tianjin Univ, Coll Intelligence & Comp, Tianjin Key Lab Cognit Comp & Applicat, Tianjin 300350, Peoples R China
[4] Huiyan Technol Tianjin Co Ltd, Tianjin 300350, Peoples R China
[5] Pengcheng Lab, Shenzhen 518055, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Channel attention; representation learning; spectro-temporal attention; speech emotion recognition (SER)
DOI
10.1109/TNNLS.2023.3304516
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Speech emotion recognition (SER) plays an important role in human-computer interaction, which can provide better interactivity to enhance user experiences. Existing approaches tend to directly apply deep learning networks to distinguish emotions. Among them, the convolutional neural network (CNN) is the most commonly used method to learn emotional representations from spectrograms. However, CNN does not explicitly model features' associations in the spectral-, temporal-, and channel-wise axes or their relative relevance, which will limit the representation learning. In this article, we propose a deep spectro-temporal-channel network (DSTCNet) to improve the representational ability for speech emotion. The proposed DSTCNet integrates several spectro-temporal-channel (STC) attention modules into a general CNN. Specifically, we propose the STC module that infers a 3-D attention map along the dimensions of time, frequency, and channel. The STC attention can focus more on the regions of crucial time frames, frequency ranges, and feature channels. Finally, experiments were conducted on the Berlin emotional database (EmoDB) and interactive emotional dyadic motion capture (IEMOCAP) databases. The results reveal that our DSTCNet can outperform the traditional CNN-based and several state-of-the-art methods.
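The abstract describes the STC module only at a high level: it infers a 3-D attention map over time, frequency, and channel and uses it to emphasize the important regions of a CNN feature map. The following is a minimal sketch of that general idea, not the authors' implementation. The function name `stc_attention_sketch`, the weights `w_channel` and `w_st`, the pooling choices, and the way the channel and spectro-temporal terms are combined are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function, mapping logits into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def stc_attention_sketch(features, w_channel, w_st):
    """Re-weight a feature map with a 3-D attention map (illustrative only).

    features:  array of shape (C, F, T) = (channels, frequency bins, time frames)
    w_channel: hypothetical (C, C) weight matrix for the channel branch
    w_st:      hypothetical (C,) weight vector for the spectro-temporal branch
    """
    # Channel branch (assumed): average over frequency and time, then a linear map.
    channel_desc = features.mean(axis=(1, 2))            # shape (C,)
    channel_logits = w_channel @ channel_desc            # shape (C,)

    # Spectro-temporal branch (assumed): weighted sum over channels.
    st_logits = np.einsum("c,cft->ft", w_st, features)   # shape (F, T)

    # Broadcast both branches into one 3-D map with values in (0, 1).
    attention = sigmoid(channel_logits[:, None, None] + st_logits[None, :, :])

    # Scale every time-frequency-channel position by its attention weight.
    return features * attention, attention


rng = np.random.default_rng(0)
C, F, T = 4, 8, 10
features = rng.standard_normal((C, F, T))
w_channel = 0.1 * rng.standard_normal((C, C))
w_st = 0.1 * rng.standard_normal(C)

refined, attention = stc_attention_sketch(features, w_channel, w_st)
print(refined.shape, attention.shape)
```

In a trained network the weights would be learned jointly with the CNN. Here they are random, so the sketch only shows the tensor shapes and the element-wise re-weighting.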
Pages: 188-197
Page count: 10