Improve Accuracy of Speech Emotion Recognition with Attention Head Fusion

Cited by: 44
Authors
Xu, Mingke [1 ]
Zhang, Fan [2 ]
Khan, Samee U. [3 ]
Affiliations
[1] Nanjing Tech Univ, Comp Sci & Technol, Nanjing, Jiangsu, Peoples R China
[2] IBM Massachusetts Lab, IBM Watson Grp, Littleton, MA USA
[3] North Dakota State Univ, Elect & Comp Eng, Fargo, ND USA
Source
2020 10TH ANNUAL COMPUTING AND COMMUNICATION WORKSHOP AND CONFERENCE (CCWC) | 2020
Funding
U.S. National Science Foundation
Keywords
speech emotion recognition; convolutional neural network; attention mechanism; pattern recognition; machine learning; CLASSIFICATION; MODEL;
DOI
10.1109/ccwc47524.2020.9031207
Chinese Library Classification
TP301 [Theory, Methods]
Discipline code
081202
Abstract
Speech Emotion Recognition (SER) refers to the use of machines to recognize a speaker's emotions from their speech. SER has broad application prospects in fields such as criminal investigation and medical care. However, the complexity of emotion makes it hard to recognize, and current SER models still do not recognize human emotions accurately. In this paper, we propose a multi-head self-attention based method, which we call head fusion, to improve the recognition accuracy of SER. With this method, an attention layer can generate an attention map with multiple attention points instead of the common attention map with a single attention point. We implemented an attention-based convolutional neural network (ACNN) model with this method and conducted experiments and evaluations on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus, obtaining 76.18% weighted accuracy (WA) and 76.36% unweighted accuracy (UA) on the improvised data, an improvement of about 6% over the previous state-of-the-art SER model.
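The abstract's head-fusion idea can be sketched as follows: several self-attention heads each compute their own attention map, and the per-head maps are fused into a single map carrying multiple attention points. This is a minimal illustrative sketch only; the tensor shapes, the random stand-in projections, and the fuse-by-averaging choice are assumptions, not the authors' exact design.

```python
# Sketch of "head fusion": fuse several heads' attention maps into one map
# with multiple attention points. Shapes and mean-fusion are assumptions.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def head_fusion_attention(features, num_heads=4, d_head=16, seed=0):
    """features: (T, D) frame-level speech features, e.g. spectrogram rows."""
    rng = np.random.default_rng(seed)
    T, D = features.shape
    fused = np.zeros((T, T))
    for _ in range(num_heads):
        # Random projections stand in for learned query/key weights.
        Wq = rng.standard_normal((D, d_head)) / np.sqrt(D)
        Wk = rng.standard_normal((D, d_head)) / np.sqrt(D)
        q, k = features @ Wq, features @ Wk
        attn = softmax(q @ k.T / np.sqrt(d_head))  # one head's (T, T) map
        fused += attn / num_heads                  # average heads into one map
    return fused  # single map whose rows can peak at multiple time steps

x = np.random.default_rng(1).standard_normal((8, 32))
fused_map = head_fusion_attention(x)
print(fused_map.shape)                            # (8, 8)
print(np.allclose(fused_map.sum(axis=-1), 1.0))   # True: rows still sum to 1
```

Averaging row-stochastic maps keeps each row a valid attention distribution, which is one simple way a fused map can attend to several points at once.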
Pages: 1058-1064
Page count: 7