Learning Combinatorial Prompts for Universal Controllable Image Captioning

Authors
Wang, Zhen [1]
Xiao, Jun [1]
Zhuang, Yueting [1]
Gao, Fei [2]
Shao, Jian [1]
Chen, Long [3]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] Zhejiang Univ Technol, Hangzhou, Peoples R China
[3] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Image captioning; Controllable image captioning (CIC); Prompt learning; Pretrained model;
DOI
10.1007/s11263-024-02179-4
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Controllable Image Captioning (CIC), which generates natural language descriptions of images under the guidance of given control signals, is one of the most promising directions toward next-generation captioning systems. To date, various kinds of control signals for CIC have been proposed, ranging from content-related control to structure-related control. However, due to the format and target gaps between different control signals, all existing CIC works (or architectures) focus on only one control signal and overlook the human-like combinatorial ability. By "combinatorial", we mean that humans can easily meet multiple needs (or constraints) simultaneously when generating descriptions. To this end, we propose a novel prompt-based framework for CIC that learns Combinatorial Prompts, dubbed ComPro. Specifically, we directly use the pretrained language model GPT-2 (Radford et al., OpenAI blog 1:9, 2019) as our language model, which helps to bridge the gap between different signal-specific CIC architectures. We then reformulate CIC as a prompt-guided sentence generation problem and propose a new lightweight prompt generation network to generate the combinatorial prompts for different kinds of control signals. For different control signals, we further design a new mask attention mechanism to realize the prompt-based CIC. Owing to its simplicity, ComPro can be further extended to more kinds of combined control signals by concatenating these prompts. Extensive experiments on two prevalent CIC benchmarks verify the effectiveness and efficiency of ComPro on both single and combined control signals.
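The idea of combining control signals by concatenating their prompts can be illustrated with a small sketch. This is not the authors' implementation: the prompt tokens (`<len_2>`, `<r1>`, `<r2>`), the function names, and the mask layout are all hypothetical. The mask shown is a generic prefix-causal mask (every position sees the whole prompt, caption tokens see only earlier caption tokens), used here as a stand-in because the abstract does not specify the details of the mask attention mechanism.

```python
from typing import Dict, List


def combine_prompts(prompts: Dict[str, List[str]], order: List[str]) -> List[str]:
    """Concatenate per-signal prompt tokens in a fixed order (hypothetical helper)."""
    combined: List[str] = []
    for name in order:
        combined.extend(prompts[name])
    return combined


def prefix_causal_mask(num_prompt: int, num_caption: int) -> List[List[int]]:
    """Build a mask where mask[i][j] == 1 means position i may attend to position j.

    Assumption: a generic prefix-causal layout, not the paper's mask.
    """
    total = num_prompt + num_caption
    mask: List[List[int]] = []
    for i in range(total):
        row: List[int] = []
        for j in range(total):
            if j < num_prompt:
                # Every position may attend to the whole combined prompt.
                row.append(1)
            else:
                # Only caption positions attend to caption tokens, and only causally.
                row.append(1 if (i >= num_prompt and j <= i) else 0)
        mask.append(row)
    return mask


# Hypothetical prompts for two control signals: caption length and image regions.
prompts = {"length": ["<len_2>"], "region": ["<r1>", "<r2>"]}
combined = combine_prompts(prompts, ["length", "region"])
mask = prefix_causal_mask(num_prompt=len(combined), num_caption=2)
```

In this sketch, adding a further control signal only lengthens the prompt prefix, while the language model and the caption side of the mask stay unchanged.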
Pages: 129-150
Page count: 22