Image Captioning with Masked Diffusion Model

Cited by: 0
Authors
Tian, Weidong [1 ]
Xu, Wenzheng [1 ]
Zhao, Junxiang [1 ]
Zhao, Zhongqiu [1 ,2 ,3 ]
Affiliations
[1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei, Peoples R China
[2] HFUT, Intelligent Mfg Inst, Hefei, Peoples R China
[3] Guangxi Acad Sci, Nanning, Guangxi, Peoples R China
Source
ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT VIII, ICIC 2024 | 2024 / Vol. 14869
Funding
National Natural Science Foundation of China;
Keywords
Image Captioning; Diffusion Model; Time Varying Mask; Features Fusion; CLIP;
DOI
10.1007/978-981-97-5603-2_18
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Some image captioning models adopt a non-autoregressive approach that generates every word independently, thereby speeding up the generation process. However, this strategy often sacrifices the quality of the generated captions. This paper addresses the issue by proposing a novel diffusion model based on a non-autoregressive approach for image captioning. Our model integrates a time-varying masking mechanism that gradually adds masks during the reverse diffusion process to selectively guide the use of image features. In addition, to further improve generation quality, we introduce the CLIP model and fuse its output with regional features, injecting semantic information into the image representation. This combined use of visual and semantic information helps generate richer and more accurate captions. To validate the performance of our model, we conducted extensive experiments and ablation studies on the MSCOCO benchmark. The results show that our masked diffusion model combined with CLIP achieves highly competitive performance in caption generation: it significantly improves generation speed while also delivering satisfactory caption quality. This study highlights the potential applications and importance of our approach in the field of image captioning.
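The abstract's mechanism can be illustrated with a short, hypothetical sketch. The snippet below is not the authors' implementation; it assumes PyTorch, a linear mask schedule, a Transformer decoder, and a simple projection-and-concatenation fusion of CLIP global features with detector region features (all module names, dimensions, and the schedule are assumptions), purely to show how a time-varying mask could gate caption tokens at each reverse-diffusion step while fused visual features act as conditioning memory.

```python
# Minimal sketch (not the authors' code): one reverse-diffusion step with a
# time-varying token mask and CLIP + region feature fusion as conditioning.
# Module names, dimensions, and the linear mask schedule are assumptions.
import torch
import torch.nn as nn

class MaskedDiffusionCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, clip_dim=512,
                 region_dim=2048, num_timesteps=20, mask_id=0):
        super().__init__()
        self.num_timesteps = num_timesteps
        self.mask_id = mask_id
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Fuse CLIP global semantics with detector region features
        # (assumed: project each to d_model, then concatenate along the sequence axis).
        self.clip_proj = nn.Linear(clip_dim, d_model)
        self.region_proj = nn.Linear(region_dim, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=3)
        self.out = nn.Linear(d_model, vocab_size)

    def mask_ratio(self, t):
        # Time-varying schedule: more tokens remain masked at larger t (assumed linear).
        return t / self.num_timesteps

    def denoise_step(self, tokens, t, clip_feat, region_feats):
        # tokens: (B, L) current noisy caption; clip_feat: (B, clip_dim);
        # region_feats: (B, R, region_dim)
        ratio = self.mask_ratio(t)
        mask = torch.rand(tokens.shape, device=tokens.device) < ratio
        masked_tokens = tokens.masked_fill(mask, self.mask_id)
        # Visual memory: CLIP semantics fused with region features.
        memory = torch.cat([self.clip_proj(clip_feat).unsqueeze(1),
                            self.region_proj(region_feats)], dim=1)
        hidden = self.decoder(self.token_emb(masked_tokens), memory)
        logits = self.out(hidden)
        # Predict all words in parallel (non-autoregressive); keep new predictions
        # only at the positions that were masked for this step.
        pred = logits.argmax(dim=-1)
        return torch.where(mask, pred, tokens)
```

In this sketch, running denoise_step from t = T down to 1, starting from a fully masked sequence, would progressively reveal the caption; because every position is predicted in parallel at each step, decoding avoids the word-by-word loop of autoregressive captioners.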
Pages: 216-227
Page count: 12