DriveGPT4: Interpretable End-to-End Autonomous Driving Via Large Language Model

Cited by: 35
Authors
Xu, Zhenhua [1 ]
Zhang, Yujia [2 ]
Xie, Enze [3 ]
Zhao, Zhen [4 ]
Guo, Yong [3 ]
Wong, Kwan-Yee K. [1 ]
Li, Zhenguo [3 ]
Zhao, Hengshuang [1 ]
Affiliations
[1] Univ Hong Kong, Hong Kong, Peoples R China
[2] Zhejiang Univ, Hangzhou 310027, Peoples R China
[3] Huawei Noah's Ark Lab, Montreal, PQ H3N 1X9, Canada
[4] Univ Sydney, Camperdown, NSW 2050, Australia
Funding
National Natural Science Foundation of China;
Keywords
Autonomous vehicles; Videos; Chatbots; Visualization; Cognition; Turning; Tuning; Autonomous driving; large language model;
DOI
10.1109/LRA.2024.3440097
Chinese Library Classification
TP24 [Robotics];
Discipline Code
080202 ; 1405 ;
Abstract
Multimodal large language models (MLLMs) have emerged as a prominent area of interest within the research community, given their proficiency in handling and reasoning with non-textual data, including images and videos. This study seeks to extend the application of MLLMs to the realm of autonomous driving by introducing DriveGPT4, a novel interpretable end-to-end autonomous driving system based on LLMs. Capable of processing multi-frame video inputs and textual queries, DriveGPT4 facilitates the interpretation of vehicle actions, offers pertinent reasoning, and effectively addresses a diverse range of questions posed by users. Furthermore, DriveGPT4 predicts low-level vehicle control signals in an end-to-end fashion. These advanced capabilities are achieved through the utilization of a bespoke visual instruction tuning dataset, specifically tailored for autonomous driving applications, in conjunction with a mix-finetuning training strategy. DriveGPT4 represents the pioneering effort to leverage LLMs for the development of an interpretable end-to-end autonomous driving solution. Evaluations conducted on the BDD-X dataset showcase the superior qualitative and quantitative performance of DriveGPT4. Additionally, the fine-tuning of domain-specific data enables DriveGPT4 to yield close or even improved results in terms of autonomous driving grounding when contrasted with GPT4-V.
Pages: 8186-8193
Page count: 8
References
53 entries in total
[1]   Explainable Artificial Intelligence for Autonomous Driving: A Comprehensive Overview and Field Guide for Future Research Directions [J].
Atakishiyev, Shahin ;
Salameh, Mohammad ;
Yao, Hengshuai ;
Goebel, Randy .
IEEE ACCESS, 2024, 12 :101603-101625
[2]   Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [J].
Bain, Max ;
Nagrani, Arsha ;
Varol, Gul ;
Zisserman, Andrew .
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, :1708-1718
[3]  
Bojarski M, 2016, arXiv, DOI arXiv:1604.07316
[4]  
Brohan A, 2023, arXiv, DOI arXiv:2307.15818
[5]   nuScenes: A multimodal dataset for autonomous driving [J].
Caesar, Holger ;
Bankiti, Varun ;
Lang, Alex H. ;
Vora, Sourabh ;
Liong, Venice Erin ;
Xu, Qiang ;
Krishnan, Anush ;
Pan, Yu ;
Baldan, Giancarlo ;
Beijbom, Oscar .
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020, :11618-11628
[6]  
OpenAI, 2023, ChatGPT, About us
[7]   End-to-End Autonomous Driving: Challenges and Frontiers [J].
Chen, Li ;
Wu, Penghao ;
Chitta, Kashyap ;
Jaeger, Bernhard ;
Geiger, Andreas ;
Li, Hongyang .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (12) :10164-10183
[8]  
Chowdhery A, 2023, J MACH LEARN RES, V24
[9]   Talk2Car: Taking Control of Your Self-Driving Car [J].
Deruyttere, Thierry ;
Vandenhende, Simon ;
Grujicic, Dusan ;
Van Gool, Luc ;
Moens, Marie-Francine .
2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, :2088-2098
[10]  
Devlin J, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P4171