Weakly-Supervised Generation and Grounding of Visual Descriptions with Conditional Generative Models

Cited by: 3
Authors
Mavroudi, Effrosyni [1 ]
Vidal, Rene [1 ]
Affiliation
[1] Johns Hopkins Univ, Math Inst Data Sci, Dept Biomed Engn, Baltimore, MD 21218 USA
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022
DOI
10.1109/CVPR52688.2022.01510
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Given weak supervision from image- or video-caption pairs, we address the problem of grounding (localizing) each object word of a ground-truth or generated sentence describing a visual input. Recent weakly-supervised approaches leverage region proposals and ground words based on the region attention coefficients of captioning models. To predict each next word in the sentence, they attend over regions using a summary of the previous words as a query, and then ground the word by selecting the most attended regions. However, this leads to sub-optimal grounding, since attention coefficients are computed without taking into account the word that needs to be localized. To address this shortcoming, we propose a novel Grounded Visual Description Conditional Variational Autoencoder (GVD-CVAE) and leverage its latent variables for grounding. In particular, we introduce a discrete random variable that models each word-to-region alignment, and learn its approximate posterior distribution given the full sentence. Experiments on challenging image and video datasets (Flickr30k Entities, YouCook2, ActivityNet Entities) validate the effectiveness of our conditional generative model, showing that it can substantially outperform soft-attention-based baselines in grounding.
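To make the contrast described in the abstract concrete, below is a minimal, purely illustrative Python/PyTorch sketch (not the authors' GVD-CVAE implementation). It compares grounding from a soft-attention distribution whose query summarizes only the previously generated words with grounding from a categorical distribution over regions that also conditions on the word being localized, a stand-in for the approximate word-to-region alignment posterior mentioned above. All tensor sizes, feature values, and variable names are assumptions invented for this example.

import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_regions, dim = 8, 64                 # toy number of region proposals and feature size (assumed)
regions = torch.randn(num_regions, dim)  # region-proposal features (e.g., from an object detector)
prev_summary = torch.randn(dim)          # hidden summary of previously generated words
word_embed = torch.randn(dim)            # embedding of the object word we want to ground

# (a) Soft-attention baseline: the query summarizes only the PREVIOUS words,
#     so the most attended region need not correspond to the word being grounded.
attn_logits = regions @ prev_summary / dim ** 0.5
attention = F.softmax(attn_logits, dim=0)
grounding_attention = int(attention.argmax())

# (b) Word-conditioned alignment: a categorical distribution over regions that
#     also sees the word itself, mimicking an approximate posterior over a
#     discrete word-to-region alignment variable.
align_logits = regions @ (prev_summary + word_embed) / dim ** 0.5
alignment = F.softmax(align_logits, dim=0)
grounding_alignment = int(alignment.argmax())

print("region chosen by soft attention        :", grounding_attention)
print("region chosen by word-conditioned dist.:", grounding_alignment)

In the paper the alignment distribution is learned as an approximate posterior conditioned on the full sentence; the random features above only illustrate why conditioning on the word to be localized can change which region is selected.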
Pages: 15523-15533
Page count: 11