Disentanglement Translation Network for multimodal sentiment analysis

Cited by: 17
Authors
Zeng, Ying [1 ]
Yan, Wenjun [1 ]
Mai, Sijie [1 ]
Hu, Haifeng [1 ]
Affiliations
[1] Sun Yat-sen University, School of Electronics and Information Technology, Guangzhou 510006, People's Republic of China
Keywords
Multimodal representation learning; Disentanglement learning; Multimodal sentiment analysis; Feature reconstruction; Fusion
DOI
10.1016/j.inffus.2023.102031
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Obtaining an effective joint representation has always been the goal of multimodal tasks. However, a distributional gap inevitably exists due to the heterogeneous nature of different modalities, which places a burden on the fusion process and on the learning of multimodal representations. The imbalance of modality dominance further aggravates this problem, as inferior modalities may contain much redundancy that introduces additional variation. To address these issues, we propose a Disentanglement Translation Network (DTN) with Slack Reconstruction to capture desirable information properties, obtain a unified feature distribution, and reduce redundancy. Specifically, an encoder-decoder-based disentanglement framework is adopted to decouple the unimodal representations into modality-common and modality-specific subspaces, which explore cross-modal commonality and diversity, respectively. In the encoding stage, to narrow the discrepancy, a two-stage translation is devised to integrate with the disentanglement learning framework. The first stage aims to learn a modality-invariant embedding for modality-common information through an adversarial learning strategy, capturing the commonality shared across modalities. The second stage considers the modality-specific information that reveals diversity. To relieve the burden of multimodal fusion, we realize Specific-Common Distribution Matching to further unify the distribution of the desirable information. In the decoding and reconstruction stage, we propose Slack Reconstruction to strike a balance between retaining discriminative information and reducing redundancy. Although the commonly used reconstruction loss with its strict constraint lowers the risk of information loss, it easily leads to the preservation of redundant information. In contrast, Slack Reconstruction imposes a more relaxed constraint so that redundancy is not forced to be retained, while simultaneously exploring inter-sample relationships. The proposed method aids multimodal fusion by learning the desired properties and obtaining a more uniform distribution for cross-modal data, and reduces information redundancy to further ensure feature effectiveness. Extensive experiments on multimodal sentiment analysis demonstrate the effectiveness of the proposed method. The code is available at https://github.com/zengy268/DTN.
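To make the abstract's ideas concrete, below is a minimal PyTorch sketch of two of the described components: adversarial learning of modality-invariant common embeddings, and a relaxed ("slack") reconstruction penalty. All module names, the gradient-reversal formulation of the adversarial stage, and the margin form of the slack loss are illustrative assumptions, not the authors' implementation; the actual code is at https://github.com/zengy268/DTN.

```python
# Illustrative sketch only; hyper-parameters and loss forms are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Gradient reversal layer, a common way to train an encoder
    adversarially against a modality discriminator."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -grad


class DisentangleEncoder(nn.Module):
    """Splits one modality's features into a modality-common and a
    modality-specific embedding (one such encoder per modality)."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.common = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        self.specific = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())

    def forward(self, x):
        return self.common(x), self.specific(x)


def adversarial_common_loss(discriminator, h_common, modality_id):
    """First-stage translation: the discriminator tries to identify
    which modality a common embedding came from; the reversed gradient
    pushes the encoder toward modality-invariant embeddings."""
    logits = discriminator(GradReverse.apply(h_common))
    target = torch.full((h_common.size(0),), modality_id,
                        dtype=torch.long, device=h_common.device)
    return F.cross_entropy(logits, target)


def slack_reconstruction_loss(x_rec, x, margin=0.1):
    """A relaxed reconstruction penalty: per-sample errors below
    `margin` are not penalized, so the decoder is not forced to
    reproduce redundant detail. The margin form is an assumption;
    the paper additionally exploits inter-sample relationships."""
    err = F.mse_loss(x_rec, x, reduction="none").mean(dim=-1)
    return F.relu(err - margin).mean()


if __name__ == "__main__":
    enc = DisentangleEncoder(in_dim=32, hid_dim=16)
    disc = nn.Linear(16, 3)          # 3 modalities: text / audio / vision
    dec = nn.Linear(16, 32)          # toy decoder back to the input space
    x = torch.randn(8, 32)           # a batch of one modality's features
    h_common, h_specific = enc(x)
    adv = adversarial_common_loss(disc, h_common, modality_id=0)
    rec = slack_reconstruction_loss(dec(h_common + h_specific), x)
    print(adv.item(), rec.item())
```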
Pages: 12