Image Captioning with Masked Diffusion Model

Cited by: 0
Authors
Tian, Weidong [1 ]
Xu, Wenzheng [1 ]
Zhao, Junxiang [1 ]
Zhao, Zhongqiu [1 ,2 ,3 ]
Affiliations
[1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei, Peoples R China
[2] HFUT, Intelligent Mfg Inst, Hefei, Peoples R China
[3] Guangxi Acad Sci, Nanning, Guangxi, Peoples R China
Source
ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT VIII, ICIC 2024 | 2024 / Vol. 14869
Funding
National Natural Science Foundation of China;
Keywords
Image Captioning; Diffusion Model; Time Varying Mask; Features Fusion; CLIP;
DOI
10.1007/978-981-97-5603-2_18
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Some image captioning models adopt a non-autoregressive approach that generates every word independently, thereby speeding up the generation process. However, this strategy often sacrifices the quality of the generated captions. This paper addresses the issue by proposing a novel diffusion model based on a non-autoregressive approach for image captioning. Our model integrates a time-varying masking mechanism that gradually adds masks during the reverse diffusion process to selectively guide the use of image features. In addition, to further improve generation quality, we introduce the CLIP model and fuse its output with regional features, injecting semantic information into the image representation. This combined use of visual and semantic information helps generate richer and more accurate captions. To validate the performance of our model, we conducted extensive experiments and ablation studies on the MSCOCO benchmark. The results show that our masked diffusion model combined with CLIP achieves highly competitive performance in caption generation: it significantly improves generation speed while also delivering satisfactory caption quality. This study highlights the potential applications and importance of our approach in the field of image captioning.
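The abstract's mechanism can be illustrated with a short, hypothetical sketch. The snippet below is not the authors' implementation; it assumes PyTorch, a linear mask schedule, a Transformer decoder, and a simple projection-and-concatenation fusion of CLIP global features with detector region features (all module names, dimensions, and the schedule are assumptions), purely to show how a time-varying mask could gate caption tokens at each reverse-diffusion step while fused visual features act as conditioning memory.

```python
# Minimal sketch (not the authors' code): one reverse-diffusion step with a
# time-varying token mask and CLIP + region feature fusion as conditioning.
# Module names, dimensions, and the linear mask schedule are assumptions.
import torch
import torch.nn as nn

class MaskedDiffusionCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, clip_dim=512,
                 region_dim=2048, num_timesteps=20, mask_id=0):
        super().__init__()
        self.num_timesteps = num_timesteps
        self.mask_id = mask_id
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Fuse CLIP global semantics with detector region features
        # (assumed: project each to d_model, then concatenate along the sequence axis).
        self.clip_proj = nn.Linear(clip_dim, d_model)
        self.region_proj = nn.Linear(region_dim, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=3)
        self.out = nn.Linear(d_model, vocab_size)

    def mask_ratio(self, t):
        # Time-varying schedule: more tokens remain masked at larger t (assumed linear).
        return t / self.num_timesteps

    def denoise_step(self, tokens, t, clip_feat, region_feats):
        # tokens: (B, L) current noisy caption; clip_feat: (B, clip_dim);
        # region_feats: (B, R, region_dim)
        ratio = self.mask_ratio(t)
        mask = torch.rand(tokens.shape, device=tokens.device) < ratio
        masked_tokens = tokens.masked_fill(mask, self.mask_id)
        # Visual memory: CLIP semantics fused with region features.
        memory = torch.cat([self.clip_proj(clip_feat).unsqueeze(1),
                            self.region_proj(region_feats)], dim=1)
        hidden = self.decoder(self.token_emb(masked_tokens), memory)
        logits = self.out(hidden)
        # Predict all words in parallel (non-autoregressive); keep new predictions
        # only at the positions that were masked for this step.
        pred = logits.argmax(dim=-1)
        return torch.where(mask, pred, tokens)
```

In this sketch, running denoise_step from t = T down to 1, starting from a fully masked sequence, would progressively reveal the caption; because every position is predicted in parallel at each step, decoding avoids the word-by-word loop of autoregressive captioners.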
Pages: 216-227
Page count: 12