Align vision-language semantics by multi-task learning for multi-modal summarization

Cited by: 0
Authors
Cui C. [1 ]
Liang X. [2 ]
Wu S. [3 ]
Li Z. [2 ]
Affiliations
[1] School of Cyber Science and Technology, Beihang University, Beijing
[2] School of Computer Science and Engineering, Beihang University, Beijing
[3] Cloud Xiaowei, Tencent, Beijing
Funding
National Natural Science Foundation of China
Keywords
Multi-modal summarization; Multi-task learning; Semantic alignment;
DOI
10.1007/s00521-024-09908-3
Abstract
Most current multi-modal summarization methods follow a cascaded manner: an off-the-shelf object detector first extracts visual features, which are then fused with language representations so that the decoder can generate the text summary. However, this cascaded design employs separate encoders for the different modalities, which makes it hard to learn a joint vision-language representation. It also ignores the semantic alignment between paragraphs and images, which is crucial for a precise summary. To tackle these issues, we propose ViL-Sum, which jointly models paragraph-level Vision-Language Semantic Alignment and Multi-Modal Summarization. ViL-Sum contains two components for learning and aligning multi-modal semantics: a joint multi-modal encoder, and two well-designed auxiliary tasks for multi-task learning, namely image reordering and image selection. Specifically, the joint multi-modal encoder converts images into visual embeddings and appends them to the text embeddings as the encoder input. The reordering task guides the model to learn paragraph-level semantic alignment, and the selection task guides it to select summary-related images for the final summary. Experimental results show that ViL-Sum outperforms current state-of-the-art methods on most automatic and manual evaluation metrics. In further analysis, we find that the two well-designed tasks and the joint multi-modal encoder effectively guide the model to learn reasonable paragraph-image and summary-image relations. © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024.
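The abstract describes two ideas that can be sketched concretely: visual embeddings appended to text embeddings as a single encoder input, and a multi-task objective combining the summarization loss with the reordering and selection losses. The following is a minimal illustrative sketch, not the paper's implementation; all function names, shapes, and loss weights are assumptions for illustration only.

```python
import numpy as np

def build_joint_input(text_emb: np.ndarray, image_embs: np.ndarray) -> np.ndarray:
    """Sketch of the joint multi-modal encoder input: image embeddings
    (already projected to the text embedding dimension) are appended
    after the text token embeddings to form one input sequence."""
    assert text_emb.shape[1] == image_embs.shape[1], "embedding dims must match"
    return np.concatenate([text_emb, image_embs], axis=0)

def multi_task_loss(l_summarize: float, l_reorder: float, l_select: float,
                    w_reorder: float = 0.5, w_select: float = 0.5) -> float:
    """Sketch of a multi-task objective: the summarization loss plus
    weighted auxiliary losses for image reordering and image selection.
    The weights here are illustrative, not values from the paper."""
    return l_summarize + w_reorder * l_reorder + w_select * l_select
```

For example, five text token embeddings followed by three image embeddings of dimension 4 yield a joint sequence of length 8, which a single encoder can then attend over across modalities.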
Pages: 15653–15666
Number of pages: 13