Weakly-Supervised Generation and Grounding of Visual Descriptions with Conditional Generative Models

Cited by: 3
Authors
Mavroudi, Effrosyni [1]
Vidal, Rene [1]
Affiliations
[1] Johns Hopkins Univ, Math Inst Data Sci, Dept Biomed Engn, Baltimore, MD 21218 USA
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022) | 2022
DOI
10.1109/CVPR52688.2022.01510
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Given weak supervision from image- or video-caption pairs, we address the problem of grounding (localizing) each object word of a ground-truth or generated sentence describing a visual input. Recent weakly-supervised approaches leverage region proposals and ground words based on the region attention coefficients of captioning models. To predict each next word in the sentence, they attend over regions using a summary of the previous words as a query, and then ground the word by selecting the most attended regions. However, this leads to sub-optimal grounding, since the attention coefficients are computed without taking into account the word that needs to be localized. To address this shortcoming, we propose a novel Grounded Visual Description Conditional Variational Autoencoder (GVD-CVAE) and leverage its latent variables for grounding. In particular, we introduce a discrete random variable that models each word-to-region alignment, and learn its approximate posterior distribution given the full sentence. Experiments on challenging image and video datasets (Flickr30k Entities, YouCook2, ActivityNet Entities) validate the effectiveness of our conditional generative model, showing that it can substantially outperform soft-attention-based baselines in grounding.
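The contrast drawn in the abstract can be illustrated with a toy sketch. This is not the GVD-CVAE and not the authors' code: the dot-product scoring, the additive query, the function names, and the two-dimensional feature vectors are all assumptions chosen for illustration. It shows only why a distribution over regions computed from the previous words alone can select a different region than one that also sees the word being localized.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def attention_grounding(region_feats, history_query):
    """Baseline-style grounding: attend over regions using only a summary
    of the previous words, then pick the most attended region."""
    weights = softmax([dot(r, history_query) for r in region_feats])
    return argmax(weights), weights


def word_aware_grounding(region_feats, history_query, word_embedding):
    """Posterior-style grounding (hypothetical form): the alignment
    distribution also conditions on the word that must be localized."""
    query = [h + w for h, w in zip(history_query, word_embedding)]
    weights = softmax([dot(r, query) for r in region_feats])
    return argmax(weights), weights


# Toy data (assumed): region 0 looks like a person, region 1 like a dog.
regions = [[1.0, 0.0], [0.0, 1.0]]
history = [0.5, 0.0]    # summary of the previous words leans toward region 0
word_dog = [0.0, 2.0]   # embedding of the word to ground, "dog"

baseline_region, baseline_weights = attention_grounding(regions, history)
aware_region, aware_weights = word_aware_grounding(regions, history, word_dog)
print("history-only attention picks region", baseline_region)
print("word-aware alignment picks region", aware_region)
```

In this toy setting the history-only query favors the region that best matches the preceding context, while the word-aware query shifts the distribution toward the region that matches the word itself. The paper's model learns such a distribution as the approximate posterior of a discrete latent variable; the sketch above hard-codes it.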
Pages: 15523-15533
Number of pages: 11