Multi-layer cross-modality attention fusion network for multimodal sentiment analysis

Cited by: 0
Authors
Yin Z. [1 ]
Du Y. [1 ]
Liu Y. [1 ]
Wang Y. [1 ]
Affiliations
[1] Faculty of Information Technology, Beijing University of Technology, Pingleyuan 100, Beijing
Funding
National Natural Science Foundation of China
Keywords
Cross-modality attention; Multimodal feature fusion; Multimodal sentiment analysis; Semantic alignment;
DOI
10.1007/s11042-023-17685-9
Abstract
Sentiment analysis aims to detect the sentiment polarity of the massive volume of opinions and reviews emerging on the internet. With the growth of multimodal information on social media, such as text, image, audio, and video, multimodal sentiment analysis has attracted increasing attention in recent years; our work focuses on text and image data. Previous works usually ignore the semantic alignment between text and image and cannot capture the interaction between them, which degrades sentiment polarity prediction. To address these problems, we propose a novel multimodal sentiment analysis model, LXMERT-MMSA, based on a cross-modality attention mechanism. Each single-modality feature is encoded by a multi-layer Transformer encoder to extract the deep semantic information implied in the text and image. Moreover, the cross-modality attention mechanism enables the model to fuse the text and image features effectively and to obtain rich semantic information through alignment, improving the model's ability to capture the semantic relation between text and image. Using accuracy and F1 score as evaluation metrics, experimental results on the MVSA-multiple and Twitter datasets show that our proposed model outperforms the previous SOTA model, and ablation results further demonstrate that the model makes good use of multimodal features. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
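The abstract describes the cross-modality attention fusion only at a high level. The following is a minimal PyTorch sketch of the general technique (one modality's features querying the other's, with residual connection and layer normalization), not the paper's actual LXMERT-MMSA implementation; all module names, dimensions, and hyperparameters are illustrative assumptions.

# Minimal sketch of a cross-modality attention fusion layer (illustrative only).
import torch
import torch.nn as nn

class CrossModalityAttentionLayer(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        # Queries come from one modality, keys/values from the other.
        self.txt_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_txt = nn.LayerNorm(dim)
        self.norm_img = nn.LayerNorm(dim)

    def forward(self, txt: torch.Tensor, img: torch.Tensor):
        # txt: (batch, num_tokens, dim), img: (batch, num_regions, dim)
        txt_ctx, _ = self.txt_to_img(query=txt, key=img, value=img)
        img_ctx, _ = self.img_to_txt(query=img, key=txt, value=txt)
        # Residual connection + layer norm, as in a standard Transformer block.
        txt = self.norm_txt(txt + txt_ctx)
        img = self.norm_img(img + img_ctx)
        return txt, img

if __name__ == "__main__":
    layer = CrossModalityAttentionLayer()
    txt = torch.randn(2, 32, 768)   # e.g. 32 word-piece tokens
    img = torch.randn(2, 36, 768)   # e.g. 36 detected image regions
    fused_txt, fused_img = layer(txt, img)
    print(fused_txt.shape, fused_img.shape)

Stacking several such layers after the per-modality Transformer encoders would yield a multi-layer fusion network in the spirit described above; the exact depth and fusion head used by the paper are not specified in this record.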
Pages: 60171-60187
Page count: 16