Improve Accuracy of Speech Emotion Recognition with Attention Head Fusion

Cited by: 44
Authors
Xu, Mingke [1 ]
Zhang, Fan [2 ]
Khan, Samee U. [3 ]
Affiliations
[1] Nanjing Tech Univ, Comp Sci & Technol, Nanjing, Jiangsu, Peoples R China
[2] IBM Massachusetts Lab, IBM Watson Grp, Littleton, MA USA
[3] North Dakota State Univ, Elect & Comp Eng, Fargo, ND USA
Source
2020 10TH ANNUAL COMPUTING AND COMMUNICATION WORKSHOP AND CONFERENCE (CCWC) | 2020
Funding
U.S. National Science Foundation
Keywords
speech emotion recognition; convolutional neural network; attention mechanism; pattern recognition; machine learning; CLASSIFICATION; MODEL;
DOI
10.1109/ccwc47524.2020.9031207
Chinese Library Classification
TP301 [Theory, Methods]
Discipline code
081202
Abstract
Speech Emotion Recognition (SER) refers to the use of machines to recognize a speaker's emotions from their speech. SER has broad application prospects in fields such as criminal investigation and medical care. However, the complexity of emotion makes it hard to recognize, and current SER models still do not recognize human emotions accurately. In this paper, we propose a multi-head self-attention based method, which we call head fusion, to improve the recognition accuracy of SER. With this method, an attention layer can generate an attention map with multiple attention points instead of the common attention map with a single attention point. We implemented an attention-based convolutional neural network (ACNN) model with this method and conducted experiments and evaluations on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus, obtaining 76.18% weighted accuracy (WA) and 76.36% unweighted accuracy (UA) on the improvised data, an improvement of about 6% over the previous state-of-the-art SER model.
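The abstract's head-fusion idea can be sketched as follows: several self-attention heads each compute their own attention map, and the per-head maps are fused into a single map carrying multiple attention points. This is a minimal illustrative sketch only; the tensor shapes, the random stand-in projections, and the fuse-by-averaging choice are assumptions, not the authors' exact design.

```python
# Sketch of "head fusion": fuse several heads' attention maps into one map
# with multiple attention points. Shapes and mean-fusion are assumptions.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def head_fusion_attention(features, num_heads=4, d_head=16, seed=0):
    """features: (T, D) frame-level speech features, e.g. spectrogram rows."""
    rng = np.random.default_rng(seed)
    T, D = features.shape
    fused = np.zeros((T, T))
    for _ in range(num_heads):
        # Random projections stand in for learned query/key weights.
        Wq = rng.standard_normal((D, d_head)) / np.sqrt(D)
        Wk = rng.standard_normal((D, d_head)) / np.sqrt(D)
        q, k = features @ Wq, features @ Wk
        attn = softmax(q @ k.T / np.sqrt(d_head))  # one head's (T, T) map
        fused += attn / num_heads                  # average heads into one map
    return fused  # single map whose rows can peak at multiple time steps

x = np.random.default_rng(1).standard_normal((8, 32))
fused_map = head_fusion_attention(x)
print(fused_map.shape)                            # (8, 8)
print(np.allclose(fused_map.sum(axis=-1), 1.0))   # True: rows still sum to 1
```

Averaging row-stochastic maps keeps each row a valid attention distribution, which is one simple way a fused map can attend to several points at once.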
Pages: 1058-1064
Page count: 7