Multimodal Emotion Recognition Harnessing the Complementarity of Speech, Language, and Vision

Cited by: 0
Authors
Thebaud, Thomas [1 ,2 ]
Favaro, Anna [1 ]
Guan, Yaohan [1 ]
Yang, Yuchen [1 ]
Singh, Prabhav [1 ]
Villalba, Jesus [1 ,2 ]
Moro-Velazquez, Laureano [1 ,2 ]
Dehak, Najim [1 ,2 ]
Affiliations
[1] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
[2] Johns Hopkins Univ, Human Language Technol Ctr Excellence, Baltimore, MD USA
Source
PROCEEDINGS OF THE 26TH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2024 | 2024
Keywords
Foundational models; Fusion; Multimodal Emotion Recognition
DOI
10.1145/3678957.3689332
CLC Classification Code
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
In audiovisual emotion recognition, a central challenge is designing neural network architectures that can effectively harness and integrate multimodal information. This study introduces a methodology for the Empathic Virtual Agent Challenge (EVAC) built on state-of-the-art speech, language, and image models. Specifically, we leverage pre-trained models, including multilingual variants fine-tuned on French data for each modality, and combine them through late fusion. Through extensive experimentation and validation, we demonstrate that our approach achieves competitive results on the challenge dataset. Our findings show that multimodal approaches outperform unimodal methods on both the Core Affect Presence and Intensity task and the Appraisal Dimensions task, confirming the benefit of integrating diverse modalities. These results underscore the importance of leveraging multiple sources of information to capture nuanced emotional states more accurately and robustly in real-world applications.
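The abstract describes combining pre-trained speech, language, and vision encoders via late fusion. The sketch below illustrates the general idea only: the module name, embedding dimensions, hidden size, and number of output dimensions are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Hypothetical late-fusion head: concatenates per-modality embeddings
    (e.g., from pre-trained speech, language, and vision encoders) and maps
    them to emotion scores (dimensions are illustrative assumptions)."""

    def __init__(self, speech_dim=1024, text_dim=768, vision_dim=512, num_outputs=8):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(speech_dim + text_dim + vision_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_outputs),  # one score per affect/appraisal dimension
        )

    def forward(self, speech_emb, text_emb, vision_emb):
        # Late fusion: each modality is encoded independently upstream;
        # only the resulting embeddings are combined here.
        fused = torch.cat([speech_emb, text_emb, vision_emb], dim=-1)
        return self.fusion(fused)

# Toy usage with random tensors standing in for encoder outputs.
head = LateFusionHead()
scores = head(torch.randn(4, 1024), torch.randn(4, 768), torch.randn(4, 512))
print(scores.shape)  # torch.Size([4, 8])
```

The design choice late fusion refers to is that each encoder can be fine-tuned or frozen independently, and only the compact fused representation is trained jointly for the downstream emotion tasks.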
Pages: 684-689
Page count: 6