CDZL: a controllable diversity zero-shot image caption model using large language models

Cited by: 0

Authors:

Zhao, Xin [1,2]

Kong, Weiwei [1,2]

Liu, Zongyao [1,2]

Wang, Menghao [1,2]

Li, Yiwen [1,2]

Affiliations:

[1] Xian Univ Posts & Telecommun, Xian 710121, Shaanxi, Peoples R China

[2] Shaanxi Key Lab Network Data Anal & Intelligent Pr, Xian 710121, Shaanxi, Peoples R China

Source:

SIGNAL IMAGE AND VIDEO PROCESSING | 2025 / Vol. 19 / Issue 04

Keywords:

Zero-shot; Image caption; Large language models; Diversity; Controllability;

DOI:

10.1007/s11760-025-03871-9

Chinese Library Classification (CLC):

TM [Electrical engineering]; TN [Electronics and communication technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

Zero-shot image caption generation enables machines to produce descriptions without meticulously curated training data. To address the slow generation and low caption quality of current zero-shot image caption models, this paper proposes CDZL (Controlled Diverse Zero-Shot Image Captioning Model), a controllable and diverse zero-shot image caption generation model based on large language models. CDZL does not update any parameters; instead, it iteratively merges the target-word distributions predicted by several models to generate diverse image captions. Adding control signals makes the generated captions controllable, and the knowledge of large language models brings the captions into closer alignment with the images. During the iterative process, the model borrows the idea of Metropolis-Hastings sampling: it rejects samples with excessively low scores, which reduces the number of iterations and speeds up generation. Experimental results show that the method outperforms current state-of-the-art (SOTA) methods on most evaluation metrics and generates captions faster.
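
The two ideas named in the abstract, merging word distributions from several models and rejecting low-scoring samples in a Metropolis-Hastings style, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names (`fuse`, `accept`, `refine_word`), the weighted geometric fusion rule, the `floor` threshold, the uniform proposal, and the toy distributions `P_LM` and `P_IMG`.

```python
import random


def fuse(p_lm, p_img, alpha=0.5):
    """Merge two word distributions with a weighted geometric mean.

    This fusion rule is an assumption; the abstract only says that
    distributions from several models are merged.
    """
    shared = sorted(set(p_lm) & set(p_img))
    raw = {w: (p_lm[w] ** alpha) * (p_img[w] ** (1.0 - alpha)) for w in shared}
    total = sum(raw.values())
    return {w: v / total for w, v in raw.items()}


def accept(current_score, proposed_score, rng, floor=0.05):
    """Metropolis-style acceptance with an early rejection threshold.

    Proposals scoring below `floor` (a hypothetical cutoff) are rejected
    outright; otherwise the usual min(1, ratio) rule for a symmetric
    proposal applies.
    """
    if proposed_score < floor:
        return False
    return rng.random() < min(1.0, proposed_score / current_score)


def refine_word(current, fused, steps, rng):
    """Repeatedly propose a replacement word uniformly and accept or reject it."""
    words = sorted(fused)
    for _ in range(steps):
        proposal = rng.choice(words)  # uniform, hence symmetric, proposal
        if accept(fused[current], fused[proposal], rng):
            current = proposal
    return current


# Toy distributions over three candidate words (illustrative values only).
P_LM = {"dog": 0.5, "cat": 0.3, "car": 0.2}     # stand-in for a language model
P_IMG = {"dog": 0.7, "cat": 0.29, "car": 0.01}  # stand-in for an image-text matcher
FUSED = fuse(P_LM, P_IMG)

if __name__ == "__main__":
    print(refine_word("cat", FUSED, steps=20, rng=random.Random(0)))
```

In this toy setting the fused score of "car" falls below the cutoff, so it is never accepted, which is the kind of early rejection the abstract credits for the reduced iteration count.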

Pages: 8