Align vision-language semantics by multi-task learning for multi-modal summarization

Cited by: 0
Authors
Cui C. [1 ]
Liang X. [2 ]
Wu S. [3 ]
Li Z. [2 ]
Affiliations
[1] School of Cyber Science and Technology, Beihang University, Beijing
[2] School of Computer Science and Engineering, Beihang University, Beijing
[3] Cloud Xiaowei, Tencent, Beijing
Funding
National Natural Science Foundation of China
Keywords
Multi-modal summarization; Multi-task learning; Semantic alignment;
DOI
10.1007/s00521-024-09908-3
Abstract
Most current multi-modal summarization methods follow a cascaded manner: an off-the-shelf object detector first extracts visual features, which are then fused with language representations so that the decoder can generate the text summary. However, this cascaded design employs separate encoders for the different modalities, which makes it hard to learn a joint vision-language representation. It also ignores the semantic alignment between paragraphs and images, which is crucial for a precise summary. To tackle these issues, we propose ViL-Sum, which jointly models paragraph-level Vision-Language Semantic Alignment and Multi-Modal Summarization. ViL-Sum contains two components for learning and aligning multi-modal semantics: a joint multi-modal encoder, and two well-designed auxiliary tasks for multi-task learning, namely image reordering and image selection. Specifically, the joint multi-modal encoder converts images into visual embeddings and appends them to the text embeddings as the encoder input. The reordering task guides the model to learn paragraph-level semantic alignment, and the selection task guides it to select summary-related images for the final summary. Experimental results show that ViL-Sum outperforms current state-of-the-art methods on most automatic and manual evaluation metrics. In further analysis, we find that the two well-designed tasks and the joint multi-modal encoder effectively guide the model to learn reasonable paragraph-image and summary-image relations. © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024.
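The abstract describes two ideas that can be sketched concretely: visual embeddings appended to text embeddings as a single encoder input, and a multi-task objective combining the summarization loss with the reordering and selection losses. The following is a minimal illustrative sketch, not the paper's implementation; all function names, shapes, and loss weights are assumptions for illustration only.

```python
import numpy as np

def build_joint_input(text_emb: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Sketch of the joint multi-modal encoder input: image embeddings
    (already projected to the text embedding dimension) are appended
    after the text token embeddings to form one input sequence."""
    assert text_emb.shape[1] == image_embs.shape[1], "embedding dims must match"
    return np.concatenate([text_emb, image_embs], axis=0)

def multi_task_loss(l_summarize: float, l_reorder: float, l_select: float,
                    w_reorder: float = 0.5, w_select: float = 0.5) -> float:
    """Sketch of a multi-task objective: the summarization loss plus
    weighted auxiliary losses for image reordering and image selection.
    The weights here are illustrative, not values from the paper."""
    return l_summarize + w_reorder * l_reorder + w_select * l_select
```

For example, five text token embeddings followed by three image embeddings of dimension 4 yield a joint sequence of length 8, which a single encoder can then attend over across modalities.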
Pages: 15653–15666
Number of pages: 13