CCMA: CapsNet for audio-video sentiment analysis using cross-modal attention

Cited by: 2
Authors
Li, Haibin [1 ]
Guo, Aodi [1 ]
Li, Yaqian [1 ]
Affiliations
[1] Yanshan Univ, Key Lab Ind Comp Control Engn Hebei Prov, Qinhuangdao 066004, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Sentiment analysis; Audio-video bimodal; Positional embedding; Capsule network; Cross-modal fusion
DOI
10.1007/s00371-024-03453-9
Chinese Library Classification (CLC)
TP31 [Computer Software]
Discipline Classification Codes
081202; 0835
Abstract
Multimodal sentiment analysis is a challenging research area that aims to exploit complementary multimodal information to analyze a speaker's sentiment tendencies. To fuse heterogeneous multimodal data from different information sources effectively, current state-of-the-art models have developed a variety of fusion strategies centered mainly on the text modality, while research on audio-visual bimodal fusion remains relatively scarce. In this paper, we therefore propose CCMA, a sentiment analysis framework based on audio and video bimodality. We first preprocess the raw data and retain modality-specific temporal information through positional embedding. On the one hand, to address the imbalance of modal contributions, we use a capsule network on the video side and 1D convolution on the audio side to better represent each modality's features. On the other hand, we argue that explicit inter-modal interaction is the most effective way to fuse cross-modal information, and we design a cross-modal attention interaction module in which modal information interacts explicitly, improving fusion quality. Experiments on two popular sentiment analysis datasets, RAVDESS and CMU-MOSEI, show that our model achieves higher accuracy than competing methods, demonstrating the effectiveness of our approach.
Pages: 1609-1620 (12 pages)
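
As a rough, hypothetical illustration of the cross-modal attention interaction described in the abstract (the sketch below is not taken from the paper; the module name, dimensions, and choice of PyTorch are assumptions), one modality's features serve as queries attending over the other's:

import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """One direction of explicit cross-modal interaction: the query
    modality attends over the key/value modality (hypothetical sketch)."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_mod, key_mod):
        # query_mod: (batch, T_q, dim), e.g. video-side capsule features
        # key_mod:   (batch, T_k, dim), e.g. audio-side 1D-conv features
        fused, _ = self.attn(query_mod, key_mod, key_mod)
        return self.norm(query_mod + fused)  # residual connection + norm

# Bidirectional interaction: video attends to audio and vice versa;
# shapes and feature dimensions here are placeholders, not the paper's.
video = torch.randn(2, 50, 128)
audio = torch.randn(2, 80, 128)
video_enriched = CrossModalAttention()(video, audio)
audio_enriched = CrossModalAttention()(audio, video)

In a design of this kind, each direction enriches one stream with context from the other before the representations are combined for final prediction.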