Learning Alignment for Multimodal Emotion Recognition from Speech

Cited by: 59
Authors
Xu, Haiyang [1 ]
Zhang, Hui [1 ]
Han, Kun [2 ]
Wang, Yun [3 ]
Peng, Yiping [1 ]
Li, Xiangang [1 ]
Affiliations
[1] DiDi Chuxing, Beijing, Peoples R China
[2] DiDi Res Amer, Mountain View, CA USA
[3] Peking Univ, Beijing, Peoples R China
Source
INTERSPEECH 2019 | 2019
Keywords
Emotion Recognition; Multimodal; Attention; Alignment; Classification
DOI
10.21437/Interspeech.2019-3247
CLC Number
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject Classification Codes
100104; 100213
Abstract
Speech emotion recognition is challenging because humans convey emotions in subtle and complex ways. For emotion recognition on human speech, one can either extract emotion-related features from audio signals or employ speech recognition techniques to generate text from speech and then apply natural language processing to analyze the sentiment. Further, although emotion recognition can benefit from audio-textual multimodal information, it is not trivial to build a system that learns from multiple modalities. One can build separate models for the two input sources and combine them at the decision level, but this approach ignores the interaction between speech and text in the temporal domain. In this paper, we propose to use an attention mechanism to learn the alignment between speech frames and text words, aiming to produce more accurate multimodal feature representations. The aligned multimodal features are fed into a sequential model for emotion recognition. We evaluate the approach on the IEMOCAP dataset, and the experimental results show that the proposed approach achieves state-of-the-art performance on the dataset.
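The abstract describes a cross-modal attention step that soft-aligns each transcript word with speech frames before a sequential classifier. Below is a minimal PyTorch sketch of that idea. The layer sizes, the dot-product attention scoring, the BiLSTM head, the four-class output (a common IEMOCAP setup), and all names are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignedMultimodalEmotion(nn.Module):
    # Hypothetical module sketching attention-based audio-text alignment;
    # dimensions and components are assumptions, not the authors' model.
    def __init__(self, audio_dim=40, text_dim=300, hidden_dim=128, num_classes=4):
        super().__init__()
        # Project both modalities into a shared space for attention scoring.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Sequential model over the aligned multimodal features.
        self.rnn = nn.LSTM(2 * hidden_dim, hidden_dim,
                           batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, audio, text):
        # audio: (B, T_a, audio_dim) speech-frame features, e.g. filterbanks
        # text:  (B, T_w, text_dim)  word embeddings from the transcript
        a = self.audio_proj(audio)                       # (B, T_a, H)
        w = self.text_proj(text)                         # (B, T_w, H)
        # Each word attends over all speech frames: a soft alignment.
        scores = torch.bmm(w, a.transpose(1, 2))         # (B, T_w, T_a)
        align = F.softmax(scores, dim=-1)                # attention weights
        attended_audio = torch.bmm(align, a)             # (B, T_w, H)
        # Concatenate each word with its aligned audio summary.
        fused = torch.cat([w, attended_audio], dim=-1)   # (B, T_w, 2H)
        out, _ = self.rnn(fused)                         # (B, T_w, 2H)
        # Mean-pool over words, then classify the utterance's emotion.
        return self.classifier(out.mean(dim=1))

# Usage with random tensors standing in for real features:
model = AlignedMultimodalEmotion()
logits = model(torch.randn(2, 500, 40), torch.randn(2, 20, 300))
print(logits.shape)  # torch.Size([2, 4])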
Pages: 3569-3573
Number of pages: 5
Related Papers
50 records in total
  • [1] Learning Mutual Correlation in Multimodal Transformer for Speech Emotion Recognition
    Wang, Yuhua
    Shen, Guang
    Xu, Yuezhu
    Li, Jiahang
    Zhao, Zhengdao
    INTERSPEECH 2021, 2021, : 4518 - 4522
  • [2] Masked Graph Learning With Recurrent Alignment for Multimodal Emotion Recognition in Conversation
    Meng, Tao
    Zhang, Fuchen
    Shou, Yuntao
    Shao, Hongen
    Ai, Wei
    Li, Keqin
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 4298 - 4312
  • [3] MMER: Multimodal Multi-task Learning for Speech Emotion Recognition
    Ghosh, Sreyan
    Tyagi, Utkarsh
    Ramaneswaran, S.
    Srivastava, Harshvardhan
    Manocha, Dinesh
    INTERSPEECH 2023, 2023, : 1209 - 1213
  • [4] Learning deep multimodal affective features for spontaneous speech emotion recognition
    Zhang, Shiqing
    Tao, Xin
    Chuang, Yuelong
    Zhao, Xiaoming
    SPEECH COMMUNICATION, 2021, 127 : 73 - 81
  • [5] Towards the explainability of Multimodal Speech Emotion Recognition
    Kumar, Puneet
    Kaushik, Vishesh
    Raman, Balasubramanian
    INTERSPEECH 2021, 2021, : 1748 - 1752
  • [6] Speech emotion recognition using multimodal feature fusion with machine learning approach
    Panda, Sandeep Kumar
    Jena, Ajay Kumar
    Panda, Mohit Ranjan
    Panda, Susmita
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (27) : 42763 - 42781
  • [7] Multimodal emotion recognition from expressive faces, body gestures and speech
    Caridakis, George
    Castellano, Ginevra
    Kessous, Loic
    Raouzaiou, Amaryllis
    Malatesta, Lori
    Asteriadis, Stelios
    Karpouzis, Kostas
    ARTIFICIAL INTELLIGENCE AND INNOVATIONS 2007: FROM THEORY TO APPLICATIONS, 2007 : 375+
  • [8] Annotations from speech and heart rate: impact on multimodal emotion recognition
    Sharma, Kaushal
    Chanel, Guillaume
    PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2023, 2023, : 51 - 59
  • [9] A multimodal hierarchical approach to speech emotion recognition from audio and text
    Singh, Prabhav
    Srivastava, Ridam
    Rana, K. P. S.
    Kumar, Vineet
    KNOWLEDGE-BASED SYSTEMS, 2021, 229