Learning Combinatorial Prompts for Universal Controllable Image Captioning

Authors
Wang, Zhen [1]
Xiao, Jun [1]
Zhuang, Yueting [1]
Gao, Fei [2]
Shao, Jian [1]
Chen, Long [3]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] Zhejiang Univ Technol, Hangzhou, Peoples R China
[3] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Image captioning; Controllable image captioning (CIC); Prompt learning; Pretrained model
DOI
10.1007/s11263-024-02179-4
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Controllable Image Captioning (CIC), which generates natural language descriptions of images under the guidance of given control signals, is one of the most promising directions toward next-generation captioning systems. To date, various kinds of control signals for CIC have been proposed, ranging from content-related control to structure-related control. However, due to the format and target gaps between different control signals, all existing CIC works (or architectures) focus on only one control signal and overlook the human-like combinatorial ability. By "combinatorial", we mean that humans can easily meet multiple needs (or constraints) simultaneously when generating descriptions. To this end, we propose a novel prompt-based framework for CIC that learns Combinatorial Prompts, dubbed ComPro. Specifically, we directly use the pretrained language model GPT-2 (Radford et al., OpenAI blog 1:9, 2019) as our language model, which helps bridge the gap between different signal-specific CIC architectures. We then reformulate CIC as a prompt-guided sentence generation problem and propose a new lightweight prompt generation network to generate the combinatorial prompts for different kinds of control signals. For different control signals, we further design a new mask attention mechanism to realize prompt-based CIC. Owing to its simplicity, ComPro can be further extended to more kinds of combined control signals by concatenating these prompts. Extensive experiments on two prevalent CIC benchmarks verify the effectiveness and efficiency of ComPro on both single and combined control signals.
Pages: 129-150
Number of pages: 22