Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding

Cited by: 80
Authors
Liu, Xuejing [1 ,2 ]
Li, Liang [1 ]
Wang, Shuhui [1 ]
Zha, Zheng-Jun [3 ]
Meng, Dechao [1 ,2 ]
Huang, Qingming [1 ,2 ,4 ]
Affiliations
[1] Chinese Acad Sci, Inst Comput Tech, Key Lab Intell Info Proc, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Univ Sci & Technol China, Hefei, Peoples R China
[4] Peng Cheng Lab, Shenzhen, Peoples R China
Source
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019) | 2019
Funding
National Natural Science Foundation of China
Keywords
DOI
10.1109/ICCV.2019.00270
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Weakly supervised referring expression grounding aims to localize the referential object in an image according to a linguistic query, where the mapping between the referential object and the query is unknown during training. To address this problem, we propose a novel end-to-end adaptive reconstruction network (ARN). It builds the correspondence between image region proposals and the query in an adaptive manner through adaptive grounding and collaborative reconstruction. Specifically, we first extract subject, location, and context features to represent the proposals and the query, respectively. Then, we design an adaptive grounding module that computes the matching score between each proposal and the query with a hierarchical attention model. Finally, based on the attention scores and proposal features, we reconstruct the input query with a collaborative loss that combines a language reconstruction loss, an adaptive reconstruction loss, and an attribute classification loss. This adaptive mechanism helps our model alleviate the variance across different referring expressions. Experiments on four large-scale datasets show that ARN outperforms existing state-of-the-art methods by a large margin. Qualitative results demonstrate that the proposed ARN better handles situations where multiple objects of a particular category are situated together.
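To make the pipeline described in the abstract concrete, below is a minimal PyTorch-style sketch of the two ingredients it names: scoring each region proposal against the query and combining the three losses into the collaborative objective. All dimensions, module and function names (AdaptiveGroundingSketch, collaborative_loss), and loss weights are illustrative assumptions; this is not the authors' implementation.

# Minimal sketch, assuming PyTorch; names and dimensions are illustrative
# and not taken from the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveGroundingSketch(nn.Module):
    """Score each region proposal against the query (hypothetical layout)."""

    def __init__(self, proposal_dim=1024, query_dim=512, hidden_dim=512):
        super().__init__()
        self.proposal_proj = nn.Linear(proposal_dim, hidden_dim)
        self.query_proj = nn.Linear(query_dim, hidden_dim)
        self.score_head = nn.Linear(hidden_dim, 1)

    def forward(self, proposal_feats, query_feat):
        # proposal_feats: (num_proposals, proposal_dim); query_feat: (query_dim,)
        fused = torch.tanh(self.proposal_proj(proposal_feats)
                           + self.query_proj(query_feat))
        scores = self.score_head(fused).squeeze(-1)   # (num_proposals,)
        return F.softmax(scores, dim=0)               # attention over proposals


def collaborative_loss(lang_rec_loss, adapt_rec_loss, attr_cls_loss,
                       w_lang=1.0, w_adapt=1.0, w_attr=1.0):
    """Weighted sum of the three losses named in the abstract.

    The weights are placeholders; the paper balances the terms its own way.
    """
    return (w_lang * lang_rec_loss
            + w_adapt * adapt_rec_loss
            + w_attr * attr_cls_loss)


# Toy usage: 5 proposals with random features.
if __name__ == "__main__":
    grounder = AdaptiveGroundingSketch()
    proposals = torch.randn(5, 1024)
    query = torch.randn(512)
    attn = grounder(proposals, query)                 # sums to 1 over proposals
    print(attn)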
Pages: 2611-2620
Page count: 10