Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations

Cited by: 108
Authors
Wu, Hao [1 ,3 ,4 ,6 ]
Mao, Jiayuan [5 ,6 ]
Zhang, Yufeng [6 ]
Jiang, Yuning [2 ,6 ]
Li, Lei [6 ]
Sun, Weiwei [1 ,3 ,4 ]
Ma, Wei-Ying [6 ]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai, Peoples R China
[2] Fudan Univ, Sch Econ, Shanghai, Peoples R China
[3] Fudan Univ, Syst & Shanghai Key Lab Data Sci, Shanghai, Peoples R China
[4] Tsinghua Univ, Shanghai Inst Intelligent Elect & Syst, Beijing, Peoples R China
[5] Tsinghua Univ, Inst Interdisciplinary Informat Sci, ITCS, Beijing, Peoples R China
[6] Bytedance AI Lab, Beijing, Peoples R China
Source
2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019) | 2019
Funding
National Natural Science Foundation of China
DOI
10.1109/CVPR.2019.00677
CLC number
TP18 [Artificial intelligence theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
We propose the Unified Visual-Semantic Embeddings (Unified VSE) for learning a joint space of visual representation and textual semantics. The model unifies the embeddings of concepts at different levels: objects, attributes, relations, and full scenes. We view the sentential semantics as a combination of different semantic components such as objects and relations; their embeddings are aligned with different image regions. A contrastive learning approach is proposed for the effective learning of this fine-grained alignment from only image-caption pairs. We also present a simple yet effective approach that enforces the coverage of caption embeddings on the semantic components that appear in the sentence. We demonstrate that the Unified VSE outperforms baselines on cross-modal retrieval tasks; the enforcement of the semantic coverage improves the model's robustness in defending text-domain adversarial attacks. Moreover, our model empowers the use of visual cues to accurately resolve word dependencies in novel sentences.
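The contrastive learning described in the abstract is, at its core, a margin-based ranking objective over matched and mismatched image-caption pairs in a shared embedding space. The following is a minimal, hypothetical PyTorch sketch of such a loss; it illustrates the general idea only and is not the authors' implementation, which additionally aligns object, attribute, and relation components with image regions. The function name, margin value, and use of in-batch negatives are assumptions.

import torch
import torch.nn.functional as F

def contrastive_ranking_loss(img_emb, txt_emb, margin=0.2):
    # img_emb, txt_emb: (batch, dim) tensors; row i of each is a matched
    # image-caption pair. All other pairings in the batch serve as negatives.
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    scores = img_emb @ txt_emb.t()        # cosine similarities, (batch, batch)
    pos = scores.diag().view(-1, 1)       # matched-pair scores

    # Hinge terms: each mismatched pair should score lower than its matched
    # pair by at least `margin`, in both retrieval directions.
    cost_i2t = (margin + scores - pos).clamp(min=0)      # wrong captions per image
    cost_t2i = (margin + scores - pos.t()).clamp(min=0)  # wrong images per caption

    # Do not penalize the matched pairs themselves.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0)
    cost_t2i = cost_t2i.masked_fill(mask, 0)
    return cost_i2t.sum() + cost_t2i.sum()

In practice, img_emb would come from a visual encoder over images or regions and txt_emb from a text encoder over captions or their semantic components; this sketch covers only the pairwise ranking part of the objective.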
Pages: 6602-6611 (10 pages)