Learning Combinatorial Prompts for Universal Controllable Image Captioning

Authors
Wang, Zhen [1]
Xiao, Jun [1]
Zhuang, Yueting [1]
Gao, Fei [2]
Shao, Jian [1]
Chen, Long [3]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] Zhejiang Univ Technol, Hangzhou, Peoples R China
[3] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Image captioning; Controllable image captioning (CIC); Prompt learning; Pretrained model
DOI
10.1007/s11263-024-02179-4
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Controllable Image Captioning (CIC), i.e., generating natural language descriptions of images under the guidance of given control signals, is one of the most promising directions toward next-generation captioning systems. To date, various kinds of control signals for CIC have been proposed, ranging from content-related control to structure-related control. However, due to the format and target gaps between different control signals, all existing CIC works (or architectures) focus on only one control signal and overlook the human-like combinatorial ability. By "combinatorial", we mean that humans can easily meet multiple needs (or constraints) simultaneously when generating descriptions. To this end, we propose a novel prompt-based framework for CIC by learning Combinatorial Prompts, dubbed ComPro. Specifically, we directly utilize the pretrained language model GPT-2 (Radford et al., OpenAI blog 1:9, 2019) as our language model, which helps to bridge the gap between different signal-specific CIC architectures. Then, we reformulate CIC as a prompt-guided sentence generation problem and propose a new lightweight prompt generation network to generate the combinatorial prompts for different kinds of control signals. For different control signals, we further design a new mask attention mechanism to realize the prompt-based CIC. Due to its simplicity, ComPro can be further extended to more kinds of combined control signals by concatenating these prompts. Extensive experiments on two prevalent CIC benchmarks verify the effectiveness and efficiency of ComPro on both single and combined control signals.
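The idea of combining control signals by concatenating their prompts can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names `concat_prompts` and `prefix_attention_mask`, the toy two-dimensional prompt vectors, and the mask itself, which is a generic prefix-style causal mask and not the paper's mask attention mechanism (the abstract does not give its details).

```python
from typing import List

Vector = List[float]


def concat_prompts(prompts: List[List[Vector]]) -> List[Vector]:
    """Join per-signal prompts (each a list of vectors) into one prefix.

    Hypothetical helper: the abstract only states that prompts are
    concatenated, not how they are produced or ordered.
    """
    combined: List[Vector] = []
    for prompt in prompts:
        combined.extend(prompt)
    return combined


def prefix_attention_mask(prefix_len: int, caption_len: int) -> List[List[bool]]:
    """Build a boolean mask where True means "query may attend to key".

    Assumption: a generic prefix-style mask. Every position sees the whole
    prompt prefix; caption positions additionally see themselves and
    earlier caption positions (causal). This is NOT the paper's mask.
    """
    total = prefix_len + caption_len
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < prefix_len:
                mask[q][k] = True  # prompt tokens are visible to everyone
            elif q >= prefix_len and k <= q:
                mask[q][k] = True  # causal attention within the caption
    return mask


# Toy prompts for two hypothetical control signals (values are made up).
length_prompt = [[0.1, 0.2], [0.3, 0.4]]
content_prompt = [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]

prefix = concat_prompts([length_prompt, content_prompt])
mask = prefix_attention_mask(len(prefix), caption_len=3)
```

In this sketch, adding a further control signal only lengthens the prefix; the language model and the mask construction stay unchanged, which is one plausible reading of why concatenation makes the approach easy to extend.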
Pages: 129-150
Number of pages: 22