Disentangled Representation Learning for Multimodal Emotion Recognition

Cited: 89
Authors
Yang, Dingkang [1 ]
Huang, Shuai [1 ]
Kuang, Haopeng [1 ]
Du, Yangtao [1 ,2 ]
Zhang, Lihua [1 ,3 ]
Affiliations
[1] Fudan Univ, Acad Engn & Technol, Shanghai, Peoples R China
[2] Artif Intelligence & Unmanned Syst Engn Res Ctr J, Minist Educ, Engn Res Ctr Ai & Robot, Changchun, Peoples R China
[3] Ji Hua Lab, Jilin Prov Key Lab Intelligence Sci & Engn, Foshan, Peoples R China
Source
Proceedings of the 30th ACM International Conference on Multimedia (MM 2022), 2022
Funding
National Key R&D Program of China; National Natural Science Foundation of China
Keywords
disentangled representation learning; emotion recognition; adversarial learning; multimodal fusion;
DOI
10.1145/3503161.3547754
Chinese Library Classification (CLC)
TP39 [Computer Applications]
Discipline Codes
081203; 0835
Abstract
Multimodal emotion recognition aims to identify human emotions from text, audio, and visual modalities. Previous methods either explore correlations between different modalities or design sophisticated fusion strategies. However, a distribution gap and information redundancy often exist across heterogeneous modalities, so the learned multimodal representations may be unrefined. Motivated by these observations, we propose a Feature-Disentangled Multimodal Emotion Recognition (FDMER) method, which learns common and private feature representations for each modality. Specifically, we design common and private encoders to project each modality into modality-invariant and modality-specific subspaces, respectively. The modality-invariant subspace aims to explore the commonality among different modalities and to sufficiently reduce the distribution gap. The modality-specific subspaces attempt to enhance diversity and capture the unique characteristics of each modality. A modality discriminator is then introduced to guide the parameter learning of the common and private encoders in an adversarial manner. We impose modality consistency and disparity constraints by designing tailored losses for the above subspaces. Furthermore, we present a cross-modal attention fusion module that learns adaptive weights for obtaining effective multimodal representations. The final representation is used for different downstream tasks. Experimental results show that FDMER outperforms state-of-the-art methods on two multimodal emotion recognition benchmarks. Moreover, we further verify the effectiveness of our model via experiments on the multimodal humor detection task.
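The abstract only outlines the architecture at a high level. As a rough illustration of the common/private-encoder and adversarial modality-discriminator idea (not the authors' implementation), a minimal PyTorch sketch might look as follows; the layer sizes, the gradient-reversal trick, the concatenation fusion, and the class count are assumptions introduced for illustration.

```python
# A minimal, hedged sketch of the disentanglement idea described in the
# abstract: a shared "common" encoder, per-modality "private" encoders, and a
# modality discriminator trained adversarially via gradient reversal. All
# layer sizes, the gradient-reversal trick, the naive concatenation fusion,
# and the 7-class head are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips gradients in the backward pass so
    the common encoder learns to fool the modality discriminator."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class FDMERSketch(nn.Module):
    def __init__(self, dims, hidden=128, num_classes=7):
        super().__init__()
        self.modalities = list(dims)
        # Per-modality adapters handle different input dimensions (assumption).
        self.adapters = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in dims.items()})
        # One shared encoder maps every modality into a modality-invariant subspace.
        self.common_encoder = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        # One private encoder per modality captures modality-specific characteristics.
        self.private_encoders = nn.ModuleDict(
            {m: nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU()) for m in dims}
        )
        # The discriminator tries to identify the source modality of a common
        # feature; gradient reversal pushes the common encoder to erase it.
        self.discriminator = nn.Linear(hidden, len(dims))
        self.classifier = nn.Linear(hidden * 2 * len(dims), num_classes)

    def forward(self, inputs, lambd=1.0):
        commons, privates, disc_logits, disc_labels = [], [], [], []
        for idx, m in enumerate(self.modalities):
            h = self.adapters[m](inputs[m])            # (batch, hidden)
            c = self.common_encoder(h)                 # modality-invariant part
            p = self.private_encoders[m](h)            # modality-specific part
            disc_logits.append(self.discriminator(GradReverse.apply(c, lambd)))
            disc_labels.append(torch.full((c.size(0),), idx, dtype=torch.long, device=c.device))
            commons.append(c)
            privates.append(p)
        # Placeholder fusion: concatenation instead of the paper's cross-modal attention.
        fused = torch.cat(commons + privates, dim=-1)
        emotion_logits = self.classifier(fused)
        adv_loss = F.cross_entropy(torch.cat(disc_logits), torch.cat(disc_labels))
        return emotion_logits, adv_loss


# Example with random features standing in for text/audio/visual inputs.
dims = {"text": 300, "audio": 74, "visual": 35}
model = FDMERSketch(dims)
batch = {m: torch.randn(8, d) for m, d in dims.items()}
logits, adv_loss = model(batch)
```

The actual FDMER model additionally imposes the consistency and disparity losses on the two subspaces and replaces the concatenation above with the cross-modal attention fusion module; see the paper (DOI 10.1145/3503161.3547754) for the exact formulations.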
Pages: 1642-1651
Number of pages: 10