Weakly supervised grounded image captioning with semantic matching

Cited: 1
Authors
Du, Sen [1 ]
Zhu, Hong [1 ]
Lin, Guangfeng [2 ]
Liu, Yuanyuan [1 ]
Wang, Dong [1 ]
Shi, Jing [1 ]
Wu, Zhong [1 ]
Affiliations
[1] Xian Univ Technol, Automat & Informat Engn, 5 South Jinhua Rd, Xian 710048, Shaanxi, Peoples R China
[2] Xian Univ Technol, Informat Sci Dept, 5 South Jinhua Rd, Xian 710048, Shaanxi, Peoples R China
Keywords
Image captioning; Visual grounding; Semantic matching; Attention; Transformer; Language; Guidance
DOI
10.1007/s10489-024-05389-y
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual attention has been widely adopted in many tasks, such as image captioning, where it improves not only captioning performance but also the rationality of the generated captions. Rationality here means keeping attention on the correct image regions while generating the corresponding words or phrases, which is critical for alleviating object hallucination. Recently, many studies have sought to improve grounding accuracy by linking generated object words or phrases to the appropriate image regions. However, collecting word-region alignments is expensive and such annotations are limited, and the generated object words may not even appear in the annotated sentences. To address this challenge, we propose a weakly supervised grounded image captioning method. Specifically, we design a region-word matching block that estimates match scores between candidate nouns and all image regions. Compared with manual annotations, these match scores may contain mistakes. To make the captioning model tolerant of such mistakes, we design a reinforcement loss that takes both attention weights and match scores into account, allowing the model to generate more accurate and better-grounded sentences. Experimental results on two commonly used benchmark datasets (MSCOCO, Flickr30k) demonstrate the superiority of the proposed blocks, and extensive ablation studies validate their effectiveness and robustness. Last but not least, our blocks can be plugged into a variety of captioning models and require neither additional labels nor extra time at inference.
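The abstract describes two components: a region-word matching block that scores candidate nouns against image regions, and a loss term coupling the model's attention weights with those (possibly noisy) match scores. The following is a minimal illustrative sketch of that idea only, not the authors' implementation: it assumes cosine similarity with a softmax over regions for the matching block, and a negative KL-divergence term as a stand-in grounding reward; the function names `region_word_match_scores` and `grounding_reward` are hypothetical.

```python
import numpy as np

def region_word_match_scores(region_feats, noun_embeds):
    """Estimate match scores between candidate nouns and image regions.
    Assumed form: cosine similarity, softmax-normalized over regions.
    region_feats: (R, D) array, noun_embeds: (N, D) array -> (N, R)."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    n = noun_embeds / np.linalg.norm(noun_embeds, axis=1, keepdims=True)
    sim = n @ r.T                                    # (N, R) cosine similarities
    e = np.exp(sim - sim.max(axis=1, keepdims=True)) # stable softmax over regions
    return e / e.sum(axis=1, keepdims=True)

def grounding_reward(attn_weights, match_scores):
    """Toy grounding term rewarding agreement between the captioner's
    attention distribution and the match scores: negative KL divergence
    KL(match || attn) per noun, so higher means better agreement."""
    eps = 1e-8  # avoid log(0) on empty attention cells
    return -np.sum(
        match_scores * np.log((match_scores + eps) / (attn_weights + eps)),
        axis=1,
    )
```

In a sketch like this, the reward would be folded into a reinforcement-style objective so the captioner is pushed toward attending to the regions the matching block associates with each generated noun, without requiring manual word-region labels.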
Source: Applied Intelligence, 2024, vol. 54
Pages: 4300-4318 (19 pages)