Speech Emotion Recognition Based on Self-Attention Weight Correction for Acoustic and Text Features

Times Cited: 11
Authors
Santoso, Jennifer [1]
Yamada, Takeshi [1]
Ishizuka, Kenkichi [2]
Hashimoto, Taiichi [2]
Makino, Shoji [1,3]
Affiliations
[1] University of Tsukuba, Degree Programs in Systems and Information Engineering, Tsukuba, Ibaraki 305-8573, Japan
[2] RevComm Inc., Tokyo 150-0002, Japan
[3] Waseda University, Graduate School of Information, Production and Systems, Fukuoka 808-0135, Japan
Keywords
Feature extraction; Speech recognition; Acoustics; Emotion recognition; Data mining; Text recognition; Speech emotion recognition; confidence measure; automatic speech recognition; self-attention mechanism
DOI
10.1109/ACCESS.2022.3219094
Chinese Library Classification (CLC) Code
TP [automation technology, computer technology]
Discipline Classification Code
0812
Abstract
Speech emotion recognition (SER) is essential for understanding a speaker's intention. Recently, several groups have attempted to improve SER performance by using a bidirectional long short-term memory (BLSTM) network to extract features from speech sequences and a self-attention mechanism to focus on the important parts of those sequences. SER also benefits from combining the information in speech with text, which can be obtained automatically with an automatic speech recognizer (ASR), further improving its performance. However, ASR performance deteriorates in the presence of emotion in speech. Although fine-tuning the ASR on emotional speech can mitigate this, it incurs a high computational cost and discards cues to the presence of emotion in speech segments that can be helpful for SER. To solve these problems, we propose a BLSTM- and self-attention-based SER method using self-attention weight correction (SAWC) with confidence measures. SAWC is applied to the acoustic and text feature extractors to adjust the importance weights of speech segments and words that have a high possibility of ASR error: it reduces the importance of misrecognized words in the text features while emphasizing the importance of the speech segments containing those words in the acoustic features. Experimental results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset reveal that the proposed method achieves a weighted average accuracy of 76.6%, outperforming other state-of-the-art methods. Furthermore, we investigated the behavior of the proposed SAWC in each of the feature extractors.
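As a rough illustration of the correction described in the abstract (not part of the original record), the following minimal Python sketch adjusts the two sets of self-attention weights from per-word ASR confidence measures. The function name, the confidence threshold, the scaling rule, and the word-to-frame alignment input are all assumptions made for illustration; the paper's exact correction formula may differ.

import numpy as np

def correct_attention_weights(text_attn, acoustic_attn, word_confidences,
                              word_to_frames, threshold=0.5):
    # text_attn        : (num_words,)  self-attention weights from the text branch
    # acoustic_attn    : (num_frames,) self-attention weights from the acoustic branch
    # word_confidences : (num_words,)  ASR confidence measure per recognized word
    # word_to_frames   : list mapping each word index to the acoustic frame indices
    #                    it spans (from the ASR time alignment)
    # threshold        : hypothetical confidence value below which a word is treated
    #                    as a likely recognition error
    text_attn = np.asarray(text_attn, dtype=float).copy()
    acoustic_attn = np.asarray(acoustic_attn, dtype=float).copy()
    for w, conf in enumerate(word_confidences):
        if conf < threshold:
            # De-emphasize the unreliable word in the text branch ...
            text_attn[w] *= conf
            # ... and emphasize the speech segment that produced it, since
            # misrecognition often coincides with emotional speech.
            for f in word_to_frames[w]:
                acoustic_attn[f] *= (2.0 - conf)
    # Renormalize so each set of weights again sums to one.
    text_attn /= text_attn.sum()
    acoustic_attn /= acoustic_attn.sum()
    return text_attn, acoustic_attn

# Example: word 1 has low ASR confidence, so its text weight shrinks and the
# frames it spans gain weight in the acoustic branch.
t, a = correct_attention_weights([0.3, 0.4, 0.3], [0.25, 0.25, 0.25, 0.25],
                                 [0.9, 0.2, 0.8], [[0], [1, 2], [3]])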
Pages: 115732-115743
Number of pages: 12
References (49 in total)
[1] Amiriparian S., 2021, arXiv.
[2] Ando, Atsushi; Masumura, Ryo; Kamiyama, Hosana; Kobashikawa, Satoshi; Aono, Yushi; Toda, Tomoki. Customer Satisfaction Estimation in Contact Center Calls Based on a Hierarchical Multi-Task Model. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 715-728.
[3] [Anonymous], 1999, Proceedings of Artificial Neural Networks in Engineering.
[4] Bahdanau D., 2016, arXiv, DOI arXiv:1409.0473.
[5] Busso, Carlos; Bulut, Murtaza; Lee, Chi-Chun; Kazemzadeh, Abe; Mower, Emily; Kim, Samuel; Chang, Jeannette N.; Lee, Sungbok; Narayanan, Shrikanth S. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008, 42(4): 335-359.
[6] Chen, Chengxin; Zhang, Pengyuan. CTA-RNN: Channel and Temporal-wise Attention RNN Leveraging Pre-trained ASR Embeddings for Speech Emotion Recognition. Interspeech 2022, 2022: 4730-4734.
[7] Chen, Ming; Zhao, Xudong. A Multi-scale Fusion Framework for Bimodal Speech Emotion Recognition. Interspeech 2020, 2020: 374-378.
[8] Devillers L., 2003, 2003 International Conference on Multimedia and Expo, Vol. III, Proceedings, p. 549.
[9] Devlin J., 2019, 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Vol. 1, p. 4171.
[10] Fayek H. M., 2015, 2015 9th International Conference on Signal Processing and Communication Systems (ICSPCS).