Dynamically Shifting Multimodal Representations via Hybrid-Modal Attention for Multimodal Sentiment Analysis

Cited by: 10
Authors
Lin, Ronghao [1 ]
Hu, Haifeng [1 ]
Affiliation
[1] Sun Yat Sen Univ, Sch Elect & Informat Technol, Guangzhou 510006, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Acoustics; Visualization; Feature extraction; Task analysis; Logic gates; Sentiment analysis; Multi-stage fusion framework; intra- and inter-modality dynamics; multimodal representations shifting; hybrid-modal attention; PREDICTION; LANGUAGE; SPEECH; FUSION;
DOI
10.1109/TMM.2023.3303711
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
In the field of multimodal machine learning, multimodal sentiment analysis has been an active area of research. The predominant approaches focus on learning efficient multimodal representations that capture both intra- and inter-modality information. However, the heterogeneous nature of the different modalities poses great challenges to multimodal representation learning. In this article, we propose a multi-stage fusion framework that dynamically fine-tunes multimodal representations via a hybrid-modal attention mechanism. Previous methods mostly fine-tune only the textual representation, owing to the success of models pre-trained on large corpora, and neglect the inconsistency among the different modality spaces. We therefore design a module called the Multimodal Shifting Gate (MSG) that fine-tunes all three modalities by modeling inter-modality dynamics and shifting the representations accordingly. We also apply a module named Masked Bimodal Adjustment (MBA) to the textual modality to alleviate the inconsistency of parameter spaces and reduce the modality gap. In addition, we utilize syntactic-level and semantic-level textual features taken from different layers of the Transformer model to fully capture the intra-modality dynamics. Moreover, we construct a Shifting HuberLoss to robustly incorporate the variation of the shifting value into the training process. Extensive experiments on public datasets, including CMU-MOSI and CMU-MOSEI, demonstrate the efficacy of our approach.
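As an illustration of the kind of gated representation shifting described in the abstract, the following PyTorch sketch shows a minimal shifting gate together with a Huber penalty on the shift magnitude. The class name, layer shapes, feature dimensions, and loss placement are assumptions made for illustration only, not the paper's exact MSG module or Shifting HuberLoss.

import torch
import torch.nn as nn

class ShiftingGateSketch(nn.Module):
    # Minimal sketch (assumption-based, not the authors' exact design):
    # the textual representation h_t is shifted by a vector computed from
    # the acoustic (h_a) and visual (h_v) representations.
    def __init__(self, d_text, d_audio, d_visual):
        super().__init__()
        self.gate_a = nn.Linear(d_text + d_audio, d_text)
        self.gate_v = nn.Linear(d_text + d_visual, d_text)
        self.proj_a = nn.Linear(d_audio, d_text)
        self.proj_v = nn.Linear(d_visual, d_text)
        self.scale = nn.Parameter(torch.tensor(1.0))

    def forward(self, h_t, h_a, h_v):
        # Gates decide how much each non-text modality may move the text vector.
        g_a = torch.sigmoid(self.gate_a(torch.cat([h_t, h_a], dim=-1)))
        g_v = torch.sigmoid(self.gate_v(torch.cat([h_t, h_v], dim=-1)))
        shift = g_a * self.proj_a(h_a) + g_v * self.proj_v(h_v)
        return h_t + self.scale * shift, shift

# Hypothetical usage: a Huber loss on the shift magnitude keeps the
# objective robust to occasionally large shifts, in the spirit of the
# Shifting HuberLoss mentioned in the abstract.
gate = ShiftingGateSketch(d_text=768, d_audio=74, d_visual=47)
h_t, h_a, h_v = torch.randn(8, 768), torch.randn(8, 74), torch.randn(8, 47)
h_shifted, shift = gate(h_t, h_a, h_v)
shift_penalty = nn.HuberLoss()(shift.norm(dim=-1), torch.zeros(shift.size(0)))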
Pages: 2740-2755
Number of pages: 16