Visual Grounding in Remote Sensing Images

Cited: 23
Authors
Sun, Yuxi [1 ]
Feng, Shanshan [1 ]
Li, Xutao [1 ]
Ye, Yunming [1 ]
Kang, Jian [2 ]
Huang, Xu [1 ]
Affiliations
[1] Harbin Inst Technol, Shenzhen, Peoples R China
[2] Soochow Univ, Suzhou, Peoples R China
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Funding
National Natural Science Foundation of China;
Keywords
dataset; object retrieval; visual grounding; remote sensing; referring expression;
DOI
10.1145/3503161.3548316
Chinese Library Classification
TP39 [Computer Applications];
Discipline Classification Code
081203 ; 0835 ;
Abstract
Ground object retrieval from large-scale remote sensing images is important for many applications. We present a novel problem of visual grounding in remote sensing images. Visual grounding aims to locate particular objects (as a bounding box or segmentation mask) in an image given a natural language expression. The task already exists in the computer vision community, but existing benchmark datasets and methods mainly focus on natural images rather than remote sensing images. Compared with natural images, remote sensing images contain large-scale scenes and the geographical spatial information of ground objects (e.g., longitude and latitude), and existing methods cannot handle these challenges. In this paper, we collect a new visual grounding dataset, called RSVG, and design a new method, namely GeoVG. The proposed method consists of a language encoder, an image encoder, and a fusion module. The language encoder learns numerical geospatial relations and represents a complex expression as a geospatial relation graph. The image encoder learns large-scale remote sensing scenes with adaptive region attention. The fusion module fuses the text and image features for visual grounding. We evaluate the proposed method against state-of-the-art methods on RSVG. Experiments show that our method outperforms previous methods on the proposed dataset. https://sunyuxi.github.io/publication/GeoVG
Pages: 9
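As a rough illustration of the three-module pipeline the abstract describes (a language encoder that turns the expression into a geospatial relation graph, an image encoder that produces region features, and a fusion module that selects the target box), here is a minimal structural sketch. All names, the toy relation parsing, and the matching logic are hypothetical placeholders, not the authors' GeoVG implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical sketch of the three-module pipeline from the abstract.
# Nothing here reflects the real GeoVG code; it only shows the data flow:
# expression -> relation graph, image -> region features, fusion -> box.

@dataclass
class RelationGraph:
    nodes: List[str]                   # object mentions in the expression
    edges: List[Tuple[int, int, str]]  # (src, dst, spatial relation)

def language_encoder(expression: str) -> RelationGraph:
    """Toy stand-in: splits on 'left of' to form a two-node relation graph."""
    if " left of " in expression:
        a, b = expression.split(" left of ", 1)
        return RelationGraph(nodes=[a.strip(), b.strip()],
                             edges=[(0, 1, "left_of")])
    return RelationGraph(nodes=[expression.strip()], edges=[])

def image_encoder(regions: List[Tuple[str, Tuple[int, int, int, int]]]):
    """Stand-in for adaptive region attention: passes labeled boxes through."""
    return regions

def fusion(graph: RelationGraph,
           regions) -> Tuple[int, int, int, int]:
    """Returns the box of the region whose label matches the graph's head node."""
    target = graph.nodes[0]
    for label, box in regions:
        if label in target:
            return box
    return regions[0][1]  # fallback: first region

# Toy scene: two candidate regions with (x1, y1, x2, y2) boxes.
regions = [("airplane", (10, 10, 50, 40)), ("runway", (0, 60, 200, 90))]
box = fusion(language_encoder("airplane left of runway"),
             image_encoder(regions))
print(box)  # (10, 10, 50, 40)
```

The sketch only fixes the interfaces between the three modules; in the paper each stage is a learned network rather than string matching.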