Zero-shot urban function inference with street view images through prompting a pretrained vision-language model

Cited by: 10
Authors
Huang, Weiming [1]
Wang, Jing [2]
Cong, Gao [1]
Affiliations
[1] Nanyang Technological University, School of Computer Science and Engineering, Singapore
[2] Singapore-ETH Centre, Future Cities Laboratory, Singapore
Funding
National Research Foundation, Singapore
Keywords
Urban land use; prompt engineering; CLIP; foundation model; street view image
DOI
10.1080/13658816.2024.2347322
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Inferring urban functions from street view images (SVIs) has gained tremendous momentum. The recent rise of large-scale pretrained vision-language models offers a way to address long-standing challenges in this area, such as the heavy reliance on labeled samples and computing resources. In this paper, we present a novel prompting framework that enables the pretrained vision-language model CLIP to infer fine-grained urban functions from SVIs in a zero-shot manner, that is, without labeled samples or model training. The prompting framework, UrbanCLIP, comprises an urban taxonomy and several urban function prompt templates, which (1) bridge abstract urban function categories and the concrete urban object types that CLIP readily understands, and (2) mitigate interference from irrelevant objects in SVIs, such as street-side trees and vehicles. We conduct extensive experiments to verify the effectiveness of UrbanCLIP. The results indicate that the zero-shot UrbanCLIP substantially outperforms several competitive supervised baselines, e.g. a fine-tuned ResNet, and its advantages become more prominent in cross-city transfer tests. In addition, UrbanCLIP's zero-shot performance is considerably better than that of vanilla CLIP. Overall, UrbanCLIP is a simple yet effective framework for urban function inference, and it showcases the potential of foundation models for geospatial applications.
Pages: 1414-1442
Page count: 29
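
The abstract describes prompting CLIP with an urban taxonomy that maps abstract function categories to concrete, visually recognizable object types, then classifying SVIs in a zero-shot manner. The record contains no code; the following is a minimal sketch of that general scheme using the Hugging Face transformers CLIP API. The checkpoint, taxonomy entries, prompt template, and max-pooling aggregation are illustrative assumptions, not UrbanCLIP's actual design.

# Minimal sketch of zero-shot urban function inference with CLIP.
# Requires: pip install torch transformers pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Hypothetical taxonomy: each abstract function category is grounded in
# concrete object types that CLIP recognizes well. These entries and the
# template below are placeholders, not the paper's actual prompts.
TAXONOMY = {
    "residential": ["apartment building", "detached house"],
    "commercial": ["shop front", "restaurant", "office building"],
    "industrial": ["factory", "warehouse"],
}
TEMPLATE = "a street view photo of a {}"

# Expand the taxonomy into one text prompt per object type, remembering
# which function category each prompt belongs to.
prompts, categories = [], []
for function, objects in TAXONOMY.items():
    for obj in objects:
        prompts.append(TEMPLATE.format(obj))
        categories.append(function)

image = Image.open("street_view.jpg")  # any street view image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    # Image-text similarity over all prompts, normalized to probabilities.
    probs = model(**inputs).logits_per_image.softmax(dim=-1).squeeze(0)

# Pool object-level scores back up to function categories (max-pooling here;
# the paper's aggregation scheme may differ).
scores = {}
for cat, p in zip(categories, probs.tolist()):
    scores[cat] = max(scores.get(cat, 0.0), p)
print(max(scores, key=scores.get))

Grounding prompts in object types rather than function names is the key move the abstract describes: CLIP has seen far more captions mentioning "warehouse" than "industrial land use", so the taxonomy translates the target labels into CLIP's vocabulary.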