CDZL: a controllable diversity zero-shot image caption model using large language models

Cited by: 0
Authors
Zhao, Xin [1 ,2 ]
Kong, Weiwei [1 ,2 ]
Liu, Zongyao [1 ,2 ]
Wang, Menghao [1 ,2 ]
Li, Yiwen [1 ,2 ]
Affiliations
[1] Xian Univ Posts & Telecommun, Xian 710121, Shaanxi, Peoples R China
[2] Shaanxi Key Lab Network Data Anal & Intelligent Pr, Xian 710121, Shaanxi, Peoples R China
Keywords
Zero-shot; Image caption; Large language models; Diversity; Controllability;
DOI
10.1007/s11760-025-03871-9
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Classification Codes
0808 ; 0809 ;
Abstract
Zero-shot image caption generation enables machines to produce descriptions without meticulously curated training data. To address the shortcomings of current zero-shot image captioning models, namely slow generation and low-quality captions, this paper proposes a controllable, diverse zero-shot image captioning model based on large language models (Controlled Diverse Zero-Shot Image Captioning Model, CDZL). CDZL updates no parameters; instead, it iteratively merges the predicted target-word distributions of several models to generate diverse image captions, and by adding control signals it makes caption generation controllable. Leveraging the knowledge embedded in large language models, CDZL produces captions that align more closely with the image content. During the iterative process, the model adopts the idea of Metropolis-Hastings sampling: candidate samples with excessively low scores are rejected, which reduces the number of iterations and accelerates generation. Experimental results show that our method outperforms current state-of-the-art (SOTA) methods on most evaluation metrics while generating captions faster.
Pages: 8
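The abstract's combination of merged caption scoring with Metropolis-Hastings-style rejection can be illustrated with a small sketch. The snippet below is not the authors' implementation: the scorers are stubbed placeholders (in the paper they would come from a large language model and an image-text matching model), and all names (`fluency_score`, `image_match_score`, `mh_caption_search`) and parameters are hypothetical, chosen only to show the acceptance/rejection loop described in the abstract.

```python
import math
import random

# Hypothetical scorers: in practice these would be a language model's caption
# fluency score and an image-text matching score (e.g. from a CLIP-like model).
# They are stubbed here so the sketch is self-contained and runnable.
def fluency_score(caption):
    # Placeholder: reward captions with little word repetition.
    words = caption.split()
    return len(set(words)) / max(len(words), 1)

def image_match_score(caption, image_keywords):
    # Placeholder: fraction of assumed image keywords mentioned in the caption.
    words = set(caption.lower().split())
    return sum(k in words for k in image_keywords) / max(len(image_keywords), 1)

def caption_score(caption, image_keywords, alpha=0.5):
    # Merge the two signals into one score; alpha balances fluency vs. image match.
    return alpha * fluency_score(caption) + (1 - alpha) * image_match_score(caption, image_keywords)

def propose(caption, vocabulary, rng):
    # Propose a local edit: replace one randomly chosen word with a vocabulary word.
    words = caption.split()
    i = rng.randrange(len(words))
    words[i] = rng.choice(vocabulary)
    return " ".join(words)

def mh_caption_search(init_caption, image_keywords, vocabulary,
                      steps=200, temperature=0.1, min_score=0.05, seed=0):
    """Metropolis-Hastings-style refinement of a draft caption.

    Candidates scoring below `min_score` are rejected outright, mirroring the
    abstract's idea of discarding very low-scoring samples to save iterations.
    """
    rng = random.Random(seed)
    current = init_caption
    current_score = caption_score(current, image_keywords)
    for _ in range(steps):
        candidate = propose(current, vocabulary, rng)
        cand_score = caption_score(candidate, image_keywords)
        if cand_score < min_score:
            continue  # early rejection of clearly bad samples
        # Symmetric proposal, so the acceptance probability depends only on scores.
        accept_prob = min(1.0, math.exp((cand_score - current_score) / temperature))
        if rng.random() < accept_prob:
            current, current_score = candidate, cand_score
    return current, current_score

if __name__ == "__main__":
    best, score = mh_caption_search(
        init_caption="a dog sitting on grass",
        image_keywords={"dog", "grass", "park", "running"},
        vocabulary=["dog", "grass", "park", "running", "a", "in", "the", "on"],
        steps=300,
    )
    print(best, round(score, 3))
```

Under these assumptions, diversity comes from the stochastic proposals and controllability could be added by folding a control signal into the scoring function; the early-rejection threshold plays the role the abstract assigns to discarding low-scoring samples.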