Self-attention transfer networks for speech emotion recognition

Cited by: 3
Authors
Ziping ZHAO [1]
Keru WANG [1]
Zhongtian BAO [1]
Zixing ZHANG [2]
Nicholas CUMMINS [3,4]
Shihuang SUN [5]
Haishuai WANG [5]
Jianhua TAO [6]
Björn W. SCHULLER [1,2,3]
Affiliations
[1] College of Computer and Information Engineering, Tianjin Normal University
[2] GLAM (Group on Language, Audio & Music), Imperial College London
[3] Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg
[4] Department of Biostatistics and Health Informatics, IoPPN, King's College London
[5] Department of Computer Science and Engineering, Fairfield University
[6] National Laboratory of Pattern Recognition, CASIA
Funding
EU Horizon 2020; National Natural Science Foundation of China
Keywords
DOI
Not available
CLC number
TN912.34 [Speech recognition and equipment]
Subject classification
Abstract
Background A crucial element of human-machine interaction, the automatic detection of emotional states from human speech has long been regarded as a challenging task for machine learning models. One vital challenge in speech emotion recognition (SER) is learning robust and discriminative representations from speech. Although machine learning methods have been widely applied in SER research, the inadequate amount of available annotated data has become a bottleneck impeding the extended application of such techniques (e.g., deep neural networks). To address this issue, we present a deep learning method that combines knowledge transfer and self-attention for SER tasks. Herein, we apply the log-Mel spectrogram with deltas and delta-deltas as inputs. Moreover, given that emotions are time-dependent, we apply temporal convolutional neural networks to model the variations in emotions. We further introduce an attention transfer mechanism based on a self-attention algorithm to learn long-term dependencies. The self-attention transfer network (SATN) in our proposed approach takes advantage of attention transfer to learn attention from speech recognition and then transfers this knowledge into SER. An evaluation on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset demonstrates the effectiveness of the proposed model.
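To make the pipeline described in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of the stated inputs (log-Mel spectrogram stacked with deltas and delta-deltas) together with a scaled dot-product self-attention block and a plausible L2 attention-transfer loss between a speech-recognition teacher and an SER student. It assumes librosa and PyTorch; the names logmel_with_deltas, SelfAttention, and attention_transfer_loss are hypothetical.

```python
# Minimal sketch of SATN-style inputs and attention transfer.
# Assumption: not the authors' code; librosa and PyTorch available.
import librosa
import numpy as np
import torch
import torch.nn as nn


def logmel_with_deltas(wav_path, sr=16000, n_mels=64):
    """Log-Mel spectrogram stacked with its deltas and delta-deltas,
    giving a 3-channel time-frequency input as described in the abstract."""
    y, _ = librosa.load(wav_path, sr=sr)
    logmel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
    return np.stack([logmel,
                     librosa.feature.delta(logmel),
                     librosa.feature.delta(logmel, order=2)])  # (3, n_mels, T)


class SelfAttention(nn.Module):
    """Scaled dot-product self-attention over the time axis, capturing the
    long-term dependencies mentioned in the abstract."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):  # x: (batch, time, dim)
        attn = torch.softmax(
            self.q(x) @ self.k(x).transpose(1, 2) * self.scale, dim=-1)
        return attn @ self.v(x), attn  # keep the attention map for transfer


def attention_transfer_loss(student_attn, teacher_attn):
    """L2 distance between student (SER) and teacher (speech recognition)
    attention maps; one plausible form of the attention-transfer objective."""
    return torch.mean((student_attn - teacher_attn) ** 2)
```

In such a setup, the teacher's attention maps would be precomputed by a pretrained speech-recognition model on the same utterances, and the transfer loss would be added to the emotion-classification loss during training.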
Pages: 43-54 (12 pages)