Disentangled Representation Learning for Multimodal Emotion Recognition

Cited by: 125
Authors
Yang, Dingkang [1 ]
Huang, Shuai [1 ]
Kuang, Haopeng [1 ]
Du, Yangtao [1 ,2 ]
Zhang, Lihua [1 ,3 ]
Affiliations
[1] Fudan Univ, Acad Engn & Technol, Shanghai, Peoples R China
[2] Artif Intelligence & Unmanned Syst Engn Res Ctr Jilin Prov, Minist Educ Engn Res Ctr AI & Robot, Changchun, Peoples R China
[3] Ji Hua Lab, Jilin Prov Key Lab Intelligence Sci & Engn, Foshan, Peoples R China
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Funding
National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
disentangled representation learning; emotion recognition; adversarial learning; multimodal fusion;
DOI
10.1145/3503161.3547754
Chinese Library Classification (CLC)
TP39 [Applications of Computers];
Subject Classification Codes
081203; 0835;
Abstract
Multimodal emotion recognition aims to identify human emotions from text, audio, and visual modalities. Previous methods either explore correlations between different modalities or design sophisticated fusion strategies. However, distribution gaps and information redundancy often exist across heterogeneous modalities, so the learned multimodal representations may be unrefined. Motivated by these observations, we propose a Feature-Disentangled Multimodal Emotion Recognition (FDMER) method, which learns common and private feature representations for each modality. Specifically, we design common and private encoders to project each modality into modality-invariant and modality-specific subspaces, respectively. The modality-invariant subspace aims to explore the commonality among different modalities and sufficiently reduce the distribution gap. The modality-specific subspaces attempt to enhance diversity and capture the unique characteristics of each modality. A modality discriminator is then introduced to guide the parameter learning of the common and private encoders in an adversarial manner. We achieve the modality consistency and disparity constraints by designing tailored losses for the above subspaces. Furthermore, we present a cross-modal attention fusion module that learns adaptive weights for obtaining effective multimodal representations. The final representation is used for different downstream tasks. Experimental results show that FDMER outperforms state-of-the-art methods on two multimodal emotion recognition benchmarks. Moreover, we further verify the effectiveness of our model via experiments on the multimodal humor detection task.
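The abstract describes the architecture only at a high level. The sketch below is one plausible way such a design could be wired up in PyTorch: per-modality common and private encoders, a modality discriminator trained adversarially through gradient reversal, and attention-based fusion of the disentangled features. It is an illustration under assumed feature dimensions and layer names (HIDDEN, FDMERSketch, GradReverse are all hypothetical), not the authors' FDMER implementation, which additionally uses tailored consistency and disparity losses that are not shown here.

```python
# Minimal sketch of a common/private disentanglement architecture with an
# adversarial modality discriminator and cross-modal attention fusion.
# All sizes and names are hypothetical; this is not the paper's code.
import torch
import torch.nn as nn

HIDDEN = 128  # hypothetical shared representation size


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output


class FDMERSketch(nn.Module):
    def __init__(self, dims):  # dims: {"text": 300, "audio": 74, "visual": 35}
        super().__init__()
        self.modalities = list(dims)
        # One common (shared-subspace) and one private encoder per modality.
        self.common = nn.ModuleDict({m: nn.Linear(d, HIDDEN) for m, d in dims.items()})
        self.private = nn.ModuleDict({m: nn.Linear(d, HIDDEN) for m, d in dims.items()})
        # The discriminator tries to tell which modality a common feature came
        # from; gradient reversal pushes the common encoders to be invariant.
        self.discriminator = nn.Sequential(
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, len(dims))
        )
        # Attention over all common + private features stands in for the
        # cross-modal attention fusion module.
        self.attn = nn.MultiheadAttention(HIDDEN, num_heads=4, batch_first=True)
        self.head = nn.Linear(HIDDEN, 1)  # e.g. a sentiment/emotion score

    def forward(self, inputs):  # inputs: {modality: (batch, dim) tensor}
        common = {m: self.common[m](x) for m, x in inputs.items()}
        private = {m: self.private[m](x) for m, x in inputs.items()}
        # Adversarial branch: discriminator logits on gradient-reversed features.
        disc_logits = {m: self.discriminator(GradReverse.apply(c)) for m, c in common.items()}
        # Stack the disentangled features and fuse them with attention.
        feats = torch.stack(
            [common[m] for m in self.modalities] + [private[m] for m in self.modalities],
            dim=1,
        )  # (batch, 2 * num_modalities, HIDDEN)
        fused, _ = self.attn(feats, feats, feats)
        pooled = fused.mean(dim=1)
        return self.head(pooled), disc_logits


# Usage with random tensors standing in for text/audio/visual features.
dims = {"text": 300, "audio": 74, "visual": 35}  # hypothetical feature sizes
model = FDMERSketch(dims)
batch = {m: torch.randn(8, d) for m, d in dims.items()}
prediction, disc_logits = model(batch)  # (8, 1) and a dict of (8, 3) logits
```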
Pages: 1642-1651
Number of pages: 10