Learning Combinatorial Prompts for Universal Controllable Image Captioning

Cited by: 0
Authors
Wang, Zhen [1]
Xiao, Jun [1]
Zhuang, Yueting [1]
Gao, Fei [2]
Shao, Jian [1]
Chen, Long [3]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] Zhejiang Univ Technol, Hangzhou, Peoples R China
[3] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Image captioning; Controllable image captioning (CIC); Prompt learning; Pretrained model
DOI
10.1007/s11263-024-02179-4
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Controllable Image Captioning (CIC), which generates natural language descriptions of images under the guidance of given control signals, is one of the most promising directions toward next-generation captioning systems. To date, various kinds of control signals for CIC have been proposed, ranging from content-related control to structure-related control. However, due to the format and target gaps between different control signals, all existing CIC works (or architectures) focus on only one control signal and overlook the human-like combinatorial ability. By "combinatorial", we mean that humans can easily meet multiple needs (or constraints) simultaneously when generating descriptions. To this end, we propose a novel prompt-based framework for CIC that learns Combinatorial Prompts, dubbed ComPro. Specifically, we directly use the pretrained language model GPT-2 (Radford et al., OpenAI blog 1:9, 2019) as our language model, which helps bridge the gap between different signal-specific CIC architectures. We then reformulate CIC as a prompt-guided sentence generation problem and propose a new lightweight prompt generation network to generate the combinatorial prompts for different kinds of control signals. For different control signals, we further design a new mask attention mechanism to realize prompt-based CIC. Due to its simplicity, ComPro can be further extended to more kinds of combined control signals by concatenating these prompts. Extensive experiments on two prevalent CIC benchmarks verify the effectiveness and efficiency of ComPro on both single and combined control signals.
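The abstract states that prompts for several control signals can be combined by concatenation and that a mask attention mechanism governs how they are used, but it gives no details. The following is only an illustrative sketch under stated assumptions, not the ComPro implementation: the function names, the signal names ("length", "content"), and the specific masking rule (each prompt block attends only to itself, caption tokens attend to every prompt and causally to earlier caption tokens) are all hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def concat_prompts(
    prompts: Dict[str, List[Vector]], order: List[str]
) -> Tuple[List[Vector], Dict[str, Tuple[int, int]]]:
    """Concatenate per-signal prompt vectors into one prefix.

    Returns the prefix and, for each signal, its (start, end) span in it.
    """
    prefix: List[Vector] = []
    spans: Dict[str, Tuple[int, int]] = {}
    for name in order:
        start = len(prefix)
        prefix.extend(prompts[name])
        spans[name] = (start, len(prefix))
    return prefix, spans


def build_mask(
    spans: Dict[str, Tuple[int, int]], num_caption_tokens: int
) -> List[List[int]]:
    """Build a 0/1 attention mask; mask[i][j] == 1 lets position i attend to j.

    Assumed rule (hypothetical, not taken from the paper):
    - a prompt token attends only to tokens of its own signal's block;
    - a caption token attends to all prompt tokens and, causally, to
      caption tokens up to and including itself.
    """
    prefix_len = max((end for _, end in spans.values()), default=0)
    total = prefix_len + num_caption_tokens
    mask = [[0] * total for _ in range(total)]
    for start, end in spans.values():
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = 1
    for i in range(prefix_len, total):
        for j in range(i + 1):
            mask[i][j] = 1
    return mask


# Toy usage: two hypothetical control signals with 2 and 1 prompt vectors.
toy_prompts = {
    "length": [[0.1, 0.2], [0.3, 0.4]],
    "content": [[0.5, 0.6]],
}
prefix, spans = concat_prompts(toy_prompts, ["length", "content"])
mask = build_mask(spans, num_caption_tokens=2)
```

In this sketch, adding another control signal only means appending another block to the prefix and another span to the mask, which mirrors the extensibility the abstract attributes to concatenation. In a real system the prefix would be fed to the language model as input embeddings; that step is omitted here.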
Pages: 129-150
Page count: 22