Language-Aware Fine-Grained Object Representation for Referring Expression Comprehension

Cited by: 30
Authors
Qiu, Heqian [1]
Li, Hongliang [1]
Wu, Qingbo [1]
Meng, Fanman [1]
Shi, Hengcan [1]
Zhao, Taijin [1]
Ngan, King Ngi [1]
Affiliations
[1] University of Electronic Science and Technology of China, Chengdu, People's Republic of China
Source
MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA | 2020
Funding
National Natural Science Foundation of China
Keywords
referring expression comprehension; language-aware fine-grained object representations; language-aware deformable convolution model (LDC); bidirectional interaction model (BIM); hierarchical fine-grained representation network (HFRN);
DOI
10.1145/3394171.3413850
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Referring expression comprehension aims to accurately locate an object described by a language expression, which requires precise language-aware visual object representations. However, existing methods usually use rectangular object representations, such as object proposal regions and grid regions. These ignore fine-grained object information such as shapes and poses, which are often described in language expressions and are important for localizing objects. Additionally, rectangular object regions usually contain background content and irrelevant foreground features, which also degrade localization performance. To address these problems, we propose a language-aware deformable convolution model (LDC) to learn language-aware fine-grained object representations. Rather than extracting rectangular object representations, LDC adaptively samples a set of key points based on the image and language to represent objects. This type of object representation can capture more fine-grained object information (e.g., shapes and poses) and suppress noise in accordance with the language, and thus boosts object localization performance. Based on the language-aware fine-grained object representation, we next design a bidirectional interaction model (BIM) that leverages a modified co-attention mechanism to build cross-modal bidirectional interactions that further improve the language and object representations. Furthermore, we propose a hierarchical fine-grained representation network (HFRN) to learn language-aware fine-grained object representations and cross-modal bidirectional interactions at the local word level and the global sentence level, respectively. Our proposed method outperforms state-of-the-art methods on the RefCOCO, RefCOCO+ and RefCOCOg datasets.
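The key-point sampling idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' LDC implementation: the offset predictor (a single linear map `W_off` followed by `tanh`), the 3x3 base grid, the `max_offset` bound, and all function names are assumptions made only for illustration. The sketch shows how sampling locations around an object position can be shifted by offsets computed from both the visual feature and a language vector, and how features are then read at those fractional locations by bilinear interpolation.

```python
import numpy as np


def bilinear_sample(feat, ys, xs):
    """Sample feat (H, W, C) at fractional coordinates ys, xs (each of shape (K,))."""
    H, W, _ = feat.shape
    # Keep sampling locations inside the feature map.
    ys = np.clip(ys, 0, H - 1)
    xs = np.clip(xs, 0, W - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, H - 1)
    x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[:, None]
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bottom = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bottom  # shape (K, C)


def language_aware_keypoints(feat, center, lang, W_off, max_offset=3.0):
    """Return 9 language-conditioned sampling points and the features read there.

    feat:   (H, W, C) visual feature map
    center: (row, col) integer position of the candidate object
    lang:   (D,) language embedding (hypothetical; e.g. a sentence vector)
    W_off:  (C + D, 18) hypothetical linear offset predictor
    """
    cy, cx = center
    # Offsets depend on both the visual feature at the center and the language.
    joint = np.concatenate([feat[cy, cx], lang])
    offsets = max_offset * np.tanh(joint @ W_off).reshape(9, 2)
    # Regular 3x3 grid (the rectangular baseline) that the offsets deform.
    base = np.array([(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)], dtype=float)
    points = base + offsets + np.array([cy, cx], dtype=float)
    sampled = bilinear_sample(feat, points[:, 0], points[:, 1])
    return points, sampled


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.normal(size=(8, 8, 4))
    W_off = 0.1 * rng.normal(size=(4 + 5, 18))
    points, sampled = language_aware_keypoints(feat, (4, 4), rng.normal(size=5), W_off)
    print(points.shape, sampled.shape)
```

With `W_off` set to zeros the points fall back to the regular 3x3 grid, which corresponds to the fixed rectangular sampling that the abstract argues against; a different `lang` vector moves the points for the same image location.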
Pages: 4171-4180
Number of pages: 10