Visual Grounding in Remote Sensing Images

Cited by: 37

Authors:

Sun, Yuxi [1]

Feng, Shanshan [1]

Li, Xutao [1]

Ye, Yunming [1]

Kang, Jian [2]

Huang, Xu [1]

Affiliations:

[1] Harbin Institute of Technology, Shenzhen, People's Republic of China

[2] Soochow University, Suzhou, People's Republic of China

Source:

PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022

Funding:

National Natural Science Foundation of China;

Keywords:

dataset; object retrieval; visual grounding; remote sensing; referring expression;

DOI:

10.1145/3503161.3548316

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Retrieving ground objects from large-scale remote sensing images is important for many applications. We introduce the problem of visual grounding in remote sensing images. Visual grounding aims to locate particular objects in an image, in the form of a bounding box or segmentation mask, given a natural language expression. The task already exists in the computer vision community, but existing benchmark datasets and methods focus mainly on natural images rather than remote sensing images. Compared with natural images, remote sensing images contain large-scale scenes and the geographical spatial information of ground objects (e.g., longitude and latitude), and existing methods cannot deal with these challenges. In this paper, we collect a new visual grounding dataset, called RSVG, and design a new method, namely GeoVG. The proposed method consists of a language encoder, an image encoder, and a fusion module. The language encoder learns numerical geospatial relations and represents a complex expression as a geospatial relation graph. The image encoder learns large-scale remote sensing scenes with adaptive region attention. The fusion module fuses the text and image features for visual grounding. We evaluate the proposed method by comparing it with state-of-the-art methods on RSVG. Experiments show that our method outperforms previous methods on the proposed dataset. https://sunyuxi.github.io/publication/GeoVG

Pages: 9
