CDZL: a controllable diversity zero-shot image caption model using large language models

Cited by: 0
Authors
Zhao, Xin [1 ,2 ]
Kong, Weiwei [1 ,2 ]
Liu, Zongyao [1 ,2 ]
Wang, Menghao [1 ,2 ]
Li, Yiwen [1 ,2 ]
Affiliations
[1] Xian Univ Posts & Telecommun, Xian 710121, Shaanxi, Peoples R China
[2] Shaanxi Key Lab Network Data Anal & Intelligent Pr, Xian 710121, Shaanxi, Peoples R China
Keywords
Zero-shot; Image caption; Large language models; Diversity; Controllability;
DOI
10.1007/s11760-025-03871-9
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Classification Codes
0808 ; 0809 ;
Abstract
Zero-shot image caption generation enables machines to produce descriptions without meticulously curated training data. To address the shortcomings of current zero-shot image captioning models, namely slow generation and low-quality captions, this paper proposes a controllable, diverse zero-shot image captioning model based on large language models (Controlled Diverse Zero-Shot Image Captioning Model, CDZL). CDZL updates no parameters; instead, it iteratively merges the predicted target-word distributions of several models to generate diverse image captions, and by adding control signals it makes caption generation controllable. Leveraging the knowledge embedded in large language models, CDZL produces captions that align more closely with the image content. During the iterative process, the model adopts the idea of Metropolis-Hastings sampling: candidate samples with excessively low scores are rejected, which reduces the number of iterations and accelerates generation. Experimental results show that our method outperforms current state-of-the-art (SOTA) methods on most evaluation metrics while generating captions faster.
Pages: 8
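The abstract's combination of merged caption scoring with Metropolis-Hastings-style rejection can be illustrated with a small sketch. The snippet below is not the authors' implementation: the scorers are stubbed placeholders (in the paper they would come from a large language model and an image-text matching model), and all names (`fluency_score`, `image_match_score`, `mh_caption_search`) and parameters are hypothetical, chosen only to show the acceptance/rejection loop described in the abstract.

```python
import math
import random

# Hypothetical scorers: in practice these would be a language model's caption
# fluency score and an image-text matching score (e.g. from a CLIP-like model).
# They are stubbed here so the sketch is self-contained and runnable.
def fluency_score(caption):
    # Placeholder: reward captions with little word repetition.
    words = caption.split()
    return len(set(words)) / max(len(words), 1)

def image_match_score(caption, image_keywords):
    # Placeholder: fraction of assumed image keywords mentioned in the caption.
    words = set(caption.lower().split())
    return sum(k in words for k in image_keywords) / max(len(image_keywords), 1)

def caption_score(caption, image_keywords, alpha=0.5):
    # Merge the two signals into one score; alpha balances fluency vs. image match.
    return alpha * fluency_score(caption) + (1 - alpha) * image_match_score(caption, image_keywords)

def propose(caption, vocabulary, rng):
    # Propose a local edit: replace one randomly chosen word with a vocabulary word.
    words = caption.split()
    i = rng.randrange(len(words))
    words[i] = rng.choice(vocabulary)
    return " ".join(words)

def mh_caption_search(init_caption, image_keywords, vocabulary,
                      steps=200, temperature=0.1, min_score=0.05, seed=0):
    """Metropolis-Hastings-style refinement of a draft caption.

    Candidates scoring below `min_score` are rejected outright, mirroring the
    abstract's idea of discarding very low-scoring samples to save iterations.
    """
    rng = random.Random(seed)
    current = init_caption
    current_score = caption_score(current, image_keywords)
    for _ in range(steps):
        candidate = propose(current, vocabulary, rng)
        cand_score = caption_score(candidate, image_keywords)
        if cand_score < min_score:
            continue  # early rejection of clearly bad samples
        # Symmetric proposal, so the acceptance probability depends only on scores.
        accept_prob = min(1.0, math.exp((cand_score - current_score) / temperature))
        if rng.random() < accept_prob:
            current, current_score = candidate, cand_score
    return current, current_score

if __name__ == "__main__":
    best, score = mh_caption_search(
        init_caption="a dog sitting on grass",
        image_keywords={"dog", "grass", "park", "running"},
        vocabulary=["dog", "grass", "park", "running", "a", "in", "the", "on"],
        steps=300,
    )
    print(best, round(score, 3))
```

Under these assumptions, diversity comes from the stochastic proposals and controllability could be added by folding a control signal into the scoring function; the early-rejection threshold plays the role the abstract assigns to discarding low-scoring samples.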