Evaluating large language models on geospatial tasks: a multiple geospatial task benchmarking study

Cited by: 0

Authors:

Xu, Liuchang [1,2,5]

Zhao, Shuo [1]

Lin, Qingming [1]

Chen, Luyao [1]

Luo, Qianqian [1]

Wu, Sensen [2]

Ye, Xinyue [3,4]

Feng, Hailin [1]

Du, Zhenhong [2]

Affiliations:

[1] Zhejiang Agr & Forestry Univ, Sch Math & Comp Sci, Hangzhou, Peoples R China

[2] Zhejiang Univ, Sch Earth Sci, Hangzhou 310058, Peoples R China

[3] Texas A&M Univ, Dept Landscape Architecture & Urban Planning, College Stn, TX USA

[4] Texas A&M Univ, Ctr Geospatial Sci Applicat & Technol, College Stn, TX USA

[5] Sunyard Technol Co Ltd, Financial Big Data Res Inst, Hangzhou, Peoples R China

Source:

INTERNATIONAL JOURNAL OF DIGITAL EARTH | 2025 / Vol. 18 / Issue 01

Funding:

National Natural Science Foundation of China;

Keywords:

Large language models; ChatGPT; benchmarking; spatial reasoning; prompt engineering;

DOI:

10.1080/17538947.2025.2480268

Chinese Library Classification (CLC) number:

P9 [Physical Geography];

Discipline classification code:

0705 ; 070501 ;

Abstract:

The emergence of large language models like ChatGPT and Gemini has highlighted the need to assess their diverse capabilities. However, their performance on geospatial tasks remains underexplored. This study introduces a novel multi-task spatial evaluation dataset to address this gap, covering twelve task types, including spatial understanding and route planning, with verified answers. We evaluated several models, including OpenAI's gpt-3.5-turbo, gpt-4-turbo, gpt-4o, ZhipuAI's glm-4, Anthropic's claude-3-sonnet-20240229, and MoonShot's moonshot-v1-8k, using a two-phase testing approach: zero-shot testing followed by difficulty-based categorization and prompt tuning. Results show that gpt-4o had the highest overall accuracy in the first phase at 71.3%. Though moonshot-v1-8k performed slightly worse overall, it outperformed gpt-4o in place name recognition tasks. The study also highlights the impact of prompt strategies on performance, such as the Chain-of-Thought strategy, which boosted gpt-4o's accuracy in route planning from 12.4% to 87.5%, and a one-shot strategy that raised moonshot-v1-8k's accuracy in mapping tasks from 10.1% to 76.3%.


Pages: 32
