WISE: Word-Level Interaction-Based Multimodal Fusion for Speech Emotion Recognition

Cited by: 16
Authors
Shen, Guang [1 ]
Lai, Riwei [1 ]
Chen, Rui [1 ]
Zhang, Yu [2 ]
Zhang, Kejia [1 ]
Han, Qilong [1 ]
Song, Hongtao [1 ]
Affiliations
[1] Harbin Engn Univ, Coll Comp Sci & Technol, Harbin, Peoples R China
[2] Southern Univ Sci & Technol, Dept Comp Sci & Engn, Shenzhen, Peoples R China
Source
INTERSPEECH 2020 | 2020
Keywords
Speech emotion recognition; dynamic interaction mechanism; hierarchical representation; deep multimodal fusion; neural networks
DOI
10.21437/Interspeech.2020-3131
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification codes
100104; 100213
Abstract
Despite its numerous real-world applications, speech emotion recognition remains a technically challenging problem. Effectively leveraging the multiple modalities inherent in speech data (e.g., audio and text) is key to accurate classification. Existing studies normally fuse multimodal features at the utterance level and largely neglect the dynamic interplay of features from different modalities at a fine-grained level over time. In this paper, we explicitly model dynamic interactions between audio and text at the word level via interaction units placed between two long short-term memory networks representing audio and text. We also devise a hierarchical representation of audio information at the frame, phoneme, and word levels, which substantially improves the expressiveness of the resulting audio features. We finally propose WISE, a novel word-level interaction-based multimodal fusion framework for speech emotion recognition, to accommodate the aforementioned components. We evaluate WISE on the public benchmark IEMOCAP corpus and demonstrate that it outperforms state-of-the-art methods.
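Note: the abstract outlines the core mechanism of two modality-specific LSTMs whose hidden states exchange information through interaction units at every word step. Below is a minimal, hypothetical PyTorch sketch of that general idea; the class names, gating form, and dimensions are illustrative assumptions and not the authors' implementation (which additionally builds the word-level audio features hierarchically from frame- and phoneme-level representations).

```python
# Hypothetical sketch of word-level interaction-based fusion between an audio
# LSTM and a text LSTM. InteractionUnit, WordLevelFusion, and all dimensions
# are illustrative assumptions, not the WISE architecture itself.
import torch
import torch.nn as nn


class InteractionUnit(nn.Module):
    """Gated exchange of information between the two modality states at one word step."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h_audio, h_text):
        g = torch.sigmoid(self.gate(torch.cat([h_audio, h_text], dim=-1)))
        # Each modality's state is refreshed with a gated share of the other's.
        return h_audio + g * h_text, h_text + (1 - g) * h_audio


class WordLevelFusion(nn.Module):
    def __init__(self, audio_dim, text_dim, hidden_dim, num_classes=4):
        super().__init__()
        self.audio_cell = nn.LSTMCell(audio_dim, hidden_dim)
        self.text_cell = nn.LSTMCell(text_dim, hidden_dim)
        self.interact = InteractionUnit(hidden_dim)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, audio_words, text_words):
        # audio_words, text_words: (batch, num_words, feature_dim), already
        # aligned per word (e.g., word-level audio features pooled from lower
        # levels, and word embeddings for the transcript).
        batch, num_words, _ = audio_words.shape
        h_a = torch.zeros(batch, self.audio_cell.hidden_size, device=audio_words.device)
        c_a = torch.zeros_like(h_a)
        h_t = torch.zeros_like(h_a)
        c_t = torch.zeros_like(h_a)
        for t in range(num_words):
            h_a, c_a = self.audio_cell(audio_words[:, t], (h_a, c_a))
            h_t, c_t = self.text_cell(text_words[:, t], (h_t, c_t))
            # Word-level cross-modal interaction before moving to the next word.
            h_a, h_t = self.interact(h_a, h_t)
        return self.classifier(torch.cat([h_a, h_t], dim=-1))


# Example: 8 utterances, 20 words each, 64-dim audio and 300-dim text features.
model = WordLevelFusion(audio_dim=64, text_dim=300, hidden_dim=128)
logits = model(torch.randn(8, 20, 64), torch.randn(8, 20, 300))
print(logits.shape)  # torch.Size([8, 4])
```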
Pages: 369-373
Number of pages: 5
References
30 records in total
[1] Busso, Carlos; Bulut, Murtaza; Lee, Chi-Chun; Kazemzadeh, Abe; Mower, Emily; Kim, Samuel; Chang, Jeannette N.; Lee, Sungbok; Narayanan, Shrikanth S. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008, 42(4): 335-359.
[2] Chen, Mingyi; He, Xuanji; Yang, Jing; Zhang, Han. 3-D Convolutional Recurrent Neural Networks With Attention Model for Speech Emotion Recognition. IEEE Signal Processing Letters, 2018, 25(10): 1440-1444.
[3] Cho, Jaejin; Pappagari, Raghavendra; Kulkarni, Purva; Villalba, Jesus; Carmiel, Yishay; Dehak, Najim. Deep neural networks for emotion recognition combining audio and transcripts. Proceedings of INTERSPEECH 2018, 2018: 247-251.
[4] Cho, Kyunghyun; et al. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), 2014: 1724.
[5] Chung, Junyoung; et al. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv:1412.3555, 2014.
[6] Eyben, Florian; et al. openSMILE: the Munich versatile and fast open-source audio feature extractor. Proceedings of the ACM International Conference on Multimedia, 2010: 1459.
[7] Ghosal, Deepanway; et al. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), 2018: 3454.
[8] Gu, Yue; Lyu, Xinyu; Sun, Weijia; Li, Weitian; Chen, Shuhong; Li, Xinyu; Marsic, Ivan. Mutual Correlation Attentive Factors in Dyadic Fusion Networks for Speech Emotion Recognition. Proceedings of the 27th ACM International Conference on Multimedia (MM '19), 2019: 157-165.
[9] Han, Wenjing; Ruan, Huabin; Chen, Xiaomin; Wang, Zhixiang; Li, Haifeng; Schuller, Bjoern. Towards Temporal Modelling of Categorical Speech Emotion Recognition. Proceedings of INTERSPEECH 2018, 2018: 932-936.
[10] Hochreiter, Sepp; Schmidhuber, Juergen. Long Short-Term Memory. Neural Computation, 1997, 9(8): 1735-1780.