CDZL: a controllable diversity zero-shot image caption model using large language models

Times cited: 0
Authors
Zhao, Xin [1, 2]
Kong, Weiwei [1, 2]
Liu, Zongyao [1, 2]
Wang, Menghao [1, 2]
Li, Yiwen [1, 2]
Affiliations
[1] Xian Univ Posts & Telecommun, Xian 710121, Shaanxi, Peoples R China
[2] Shaanxi Key Lab Network Data Anal & Intelligent Pr, Xian 710121, Shaanxi, Peoples R China
Keywords
Zero-shot; Image caption; Large language models; Diversity; Controllability;
DOI
10.1007/s11760-025-03871-9
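Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification code
0808; 0809
Abstract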
Zero-shot image caption generation enables machines to produce descriptions without meticulously curated training data. To address the slow generation and low caption quality of current zero-shot image captioning models, this paper proposes CDZL (Controlled Diverse Zero-Shot Image Captioning Model), a controllable and diverse zero-shot image caption generation model based on a large language model. CDZL does not update any parameters; instead, it iteratively merges the target-word distributions predicted by several models to generate diverse image captions. Adding control signals makes the generated captions controllable, and leveraging the knowledge of large language models aligns the captions more closely with the images. During iteration, the model borrows the idea of Metropolis-Hastings sampling: it rejects samples with excessively low scores, which reduces the number of iterations and speeds up generation. Experimental results show that the method outperforms current state-of-the-art (SOTA) methods on most evaluation metrics and generates captions faster.
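
The two mechanisms named in the abstract, merging word distributions from several models and rejecting low-scoring samples in a Metropolis-Hastings-style loop, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the log-linear fusion rule, the simplified acceptance ratio (which omits the proposal correction of full Metropolis-Hastings), the `floor` threshold, and the names `fuse`, `mh_accept`, `refine`, `position_dists`, and `score` are all assumptions made for illustration.

```python
import math
import random


def fuse(dists, weights):
    """Log-linear fusion of word distributions over one shared vocabulary (assumed rule)."""
    vocab = list(dists[0])
    logits = [sum(w * math.log(d[t] + 1e-12) for d, w in zip(dists, weights))
              for t in vocab]
    top = max(logits)  # subtract the maximum for numerical stability
    unnorm = [math.exp(x - top) for x in logits]
    total = sum(unnorm)
    return {t: u / total for t, u in zip(vocab, unnorm)}


def mh_accept(new_score, old_score, rng, floor=0.05):
    """Reject very low scores outright; otherwise accept with probability min(1, new/old)."""
    if new_score < floor:  # early rejection, hypothetical threshold
        return False
    if old_score <= 0:     # nothing to compare against, so take the candidate
        return True
    return rng.random() < min(1.0, new_score / old_score)


def refine(caption, position_dists, score, steps, rng):
    """Resample one position per step from a fused distribution and keep accepted edits."""
    current = list(caption)
    for _ in range(steps):
        pos = rng.randrange(len(current))
        fused = position_dists(current, pos)  # hypothetical callback returning {word: prob}
        words = list(fused)
        candidate = list(current)
        candidate[pos] = rng.choices(words, weights=[fused[w] for w in words])[0]
        if mh_accept(score(candidate), score(current), rng):
            current = candidate
    return current
```

In this sketch, `position_dists` would call `fuse` on the per-model predictions for the chosen position, and `score` stands in for whatever image-text matching score is used; both are placeholders, and no model parameters are updated anywhere in the loop.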
Pages: 8