Inter-Intra Modal Representation Augmentation With Trimodal Collaborative Disentanglement Network for Multimodal Sentiment Analysis

Cited by: 12
Authors
Chen, Chen [1 ]
Hong, Hansheng [2 ]
Guo, Jie [1 ]
Song, Bin [1 ]
Affiliations
[1] Xidian Univ, State Key Lab Integrated Serv Networks, Xian 710071, Peoples R China
[2] Guangdong OPPO Mobile Telecommun Corp, Dongguan 523860, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Feature extraction; Transformers; Sentiment analysis; Task analysis; Federated learning; Collaboration; Data models; Multimodal sentiment analysis; multimodal fusion; transformers; data augmentation;
DOI
10.1109/TASLP.2023.3263801
CLC classification number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Multimodal Sentiment Analysis (MSA) is a challenging research area because humans express emotional cues across multiple modalities, such as language, facial expressions, and speech. Feature representation and fusion are the most crucial tasks in MSA research. However, most existing methods ignore the need to eliminate potentially irrelevant features, both from the original features of each modality and from the cross-modal common feature. Moreover, the features extracted from all modalities contain cluttered background noise and various occlusion noise, which negatively affects feature alignment. Unlike these methods, we propose a novel Trimodal Collaborative Disentanglement Network (TCDN) in this paper to solve these problems. TCDN obtains effective sentiment results through two mechanisms: i) trimodal collaboration uses the L1-norm to eliminate irrelevant features and unify the characteristics of the three modalities (inter-modal); ii) the disentanglement network introduces adversarial noise by combining the original features of each single modality with the common representation, alleviating the background noise within each modality (intra-modal). To the best of our knowledge, this inter-intra modal feature augmentation method is the first work to obtain the common representation through data augmentation. Extensive experiments on two benchmark datasets, MOSI and MOSEI, demonstrate the superiority of the TCDN model over state-of-the-art methods.
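The two mechanisms summarized above can be made concrete with a short sketch. The PyTorch code below is a minimal illustration only, not the authors' implementation: the feature dimensions, the mean-based common representation, the L1 penalty on the projected features (the inter-modal step), the noise term derived from the common representation (the intra-modal augmentation step), and the loss weight 0.01 are all hypothetical choices for illustration. The actual TCDN architecture and noise construction in the paper may differ.

import torch
import torch.nn as nn

class TrimodalSketch(nn.Module):
    """Toy illustration of the two ideas in the abstract (NOT the paper's model):
    (i) inter-modal: project text/audio/vision into a shared space and apply an
        L1 penalty so irrelevant feature dimensions are driven toward zero;
    (ii) intra-modal: perturb each modality's features with noise built from the
        common representation, as a form of feature augmentation.
    All layer sizes are hypothetical placeholders."""

    def __init__(self, d_text=768, d_audio=74, d_vision=35, d_common=128):
        super().__init__()
        self.proj = nn.ModuleDict({
            "text": nn.Linear(d_text, d_common),
            "audio": nn.Linear(d_audio, d_common),
            "vision": nn.Linear(d_vision, d_common),
        })
        self.head = nn.Linear(d_common, 1)  # regression head (e.g., MOSI score)

    def forward(self, feats):
        # feats: dict of modality name -> (batch, d_modality) utterance features
        shared = {m: self.proj[m](x) for m, x in feats.items()}
        # (i) common representation: here simply the mean of the three projections
        common = torch.stack(list(shared.values()), dim=0).mean(dim=0)
        # L1 penalty encourages sparsity in the shared space, i.e. it suppresses
        # dimensions that carry no cross-modal sentiment signal
        l1_penalty = sum(z.abs().mean() for z in shared.values())
        # (ii) intra-modal augmentation: add noise modulated by the common
        # representation to each modality's projected features
        augmented = {m: z + 0.1 * torch.randn_like(common) * common
                     for m, z in shared.items()}
        fused = torch.stack(list(augmented.values()), dim=0).mean(dim=0)
        return self.head(fused).squeeze(-1), l1_penalty

# Usage: total loss = task loss + lambda * L1 penalty (lambda = 0.01 is hypothetical)
model = TrimodalSketch()
batch = {"text": torch.randn(4, 768),
         "audio": torch.randn(4, 74),
         "vision": torch.randn(4, 35)}
pred, l1 = model(batch)
loss = nn.functional.l1_loss(pred, torch.zeros(4)) + 0.01 * l1
loss.backward()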
Pages: 1476-1488
Number of pages: 13
相关论文
共 46 条
[1]  
Cheng JY, 2021, 2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), P2447
[2]  
Devlin J, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P4171
[3]   SlowFast Networks for Video Recognition [J].
Feichtenhofer, Christoph ;
Fan, Haoqi ;
Malik, Jitendra ;
He, Kaiming .
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, :6201-6210
[4]  
Frid-Adar M, 2018, I S BIOMED IMAGING, P289, DOI 10.1109/ISBI.2018.8363576
[5]   MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis [J].
Hazarika, Devamanyu ;
Zimmermann, Roger ;
Poria, Soujanya .
MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, :1122-1131
[6]   Deep Residual Learning for Image Recognition [J].
He, Kaiming ;
Zhang, Xiangyu ;
Ren, Shaoqing ;
Sun, Jian .
2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, :770-778
[7]   Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks [J].
Li, Xiujun ;
Yin, Xi ;
Li, Chunyuan ;
Zhang, Pengchuan ;
Hu, Xiaowei ;
Zhang, Lei ;
Wang, Lijuan ;
Hu, Houdong ;
Dong, Li ;
Wei, Furu ;
Choi, Yejin ;
Gao, Jianfeng .
COMPUTER VISION - ECCV 2020, PT XXX, 2020, 12375 :121-137
[8]   SSD: Single Shot MultiBox Detector [J].
Liu, Wei ;
Anguelov, Dragomir ;
Erhan, Dumitru ;
Szegedy, Christian ;
Reed, Scott ;
Fu, Cheng-Yang ;
Berg, Alexander C. .
COMPUTER VISION - ECCV 2016, PT I, 2016, 9905 :21-37
[9]  
Liu YH, 2019, Arxiv, DOI arXiv:1907.11692
[10]  
Liu Z, 2018, PROCEEDINGS OF THE 56TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL), VOL 1, P2247