Weakly supervised grounded image captioning with semantic matching

Cited: 1
Authors
Du, Sen [1 ]
Zhu, Hong [1 ]
Lin, Guangfeng [2 ]
Liu, Yuanyuan [1 ]
Wang, Dong [1 ]
Shi, Jing [1 ]
Wu, Zhong [1 ]
Affiliations
[1] Xian Univ Technol, Automat & Informat Engn, 5 South Jinhua Rd, Xian 710048, Shaanxi, Peoples R China
[2] Xian Univ Technol, Informat Sci Dept, 5 South Jinhua Rd, Xian 710048, Shaanxi, Peoples R China
Keywords
Image captioning; Visual grounding; Semantic matching; Attention; Transformer; Language; Guidance
DOI
10.1007/s10489-024-05389-y
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual attention has been widely adopted in many tasks, such as image captioning, where it improves not only captioning performance but also the rationality of the generated captions. Rationality here means keeping attention on the correct image regions while generating the corresponding words or phrases, which is critical for alleviating object hallucination. Recently, many studies have sought to improve grounding accuracy by linking generated object words or phrases to the appropriate image regions. However, collecting word-region alignments is expensive and such annotations are limited, and the generated object words may not even appear in the annotated sentences. To address this challenge, we propose a weakly supervised grounded image captioning method. Specifically, we design a region-word matching block that estimates match scores between candidate nouns and all image regions. Compared with manual annotations, these match scores may contain mistakes. To make the captioning model tolerant of such mistakes, we design a reinforcement loss that takes both attention weights and match scores into account, allowing the model to generate more accurate and better-grounded sentences. Experimental results on two commonly used benchmark datasets (MSCOCO, Flickr30k) demonstrate the superiority of the proposed blocks, and extensive ablation studies validate their effectiveness and robustness. Last but not least, our blocks can be plugged into a variety of captioning models and require neither additional labels nor extra time at inference.
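The abstract describes two components: a region-word matching block that scores candidate nouns against image regions, and a loss term coupling the model's attention weights with those (possibly noisy) match scores. The following is a minimal illustrative sketch of that idea only, not the authors' implementation: it assumes cosine similarity with a softmax over regions for the matching block, and a negative KL-divergence term as a stand-in grounding reward; the function names `region_word_match_scores` and `grounding_reward` are hypothetical.

```python
import numpy as np

def region_word_match_scores(region_feats, noun_embeds):
    """Estimate match scores between candidate nouns and image regions.
    Assumed form: cosine similarity, softmax-normalized over regions.
    region_feats: (R, D) array, noun_embeds: (N, D) array -> (N, R)."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    n = noun_embeds / np.linalg.norm(noun_embeds, axis=1, keepdims=True)
    sim = n @ r.T                                    # (N, R) cosine similarities
    e = np.exp(sim - sim.max(axis=1, keepdims=True)) # stable softmax over regions
    return e / e.sum(axis=1, keepdims=True)

def grounding_reward(attn_weights, match_scores):
    """Toy grounding term rewarding agreement between the captioner's
    attention distribution and the match scores: negative KL divergence
    KL(match || attn) per noun, so higher means better agreement."""
    eps = 1e-8  # avoid log(0) on empty attention cells
    return -np.sum(
        match_scores * np.log((match_scores + eps) / (attn_weights + eps)),
        axis=1,
    )
```

In a sketch like this, the reward would be folded into a reinforcement-style objective so the captioner is pushed toward attending to the regions the matching block associates with each generated noun, without requiring manual word-region labels.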
Source: Applied Intelligence, 2024, vol. 54
Pages: 4300-4318 (19 pages)