GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models

Cited: 30
Authors
Liao, Haicheng [1 ,2 ]
Shen, Huanming [3 ]
Li, Zhenning [1 ,2 ,4 ]
Wang, Chengyue [1 ,4 ]
Li, Guofa [5 ]
Bie, Yiming [6 ]
Xu, Chengzhong [1 ,2 ]
Affiliations
[1] Univ Macau, State Key Lab Internet Things Smart City, Macau 999078, Peoples R China
[2] Univ Macau, Dept Comp & Informat Sci, Macau 999078, Peoples R China
[3] Univ Elect Sci & Technol China, Dept Informat & Software Engn, Chengdu 610000, Peoples R China
[4] Univ Macau, Dept Civil & Environm Engn, Macau 999078, Peoples R China
[5] Chongqing Univ, Coll Mech & Vehicle Engn, Chongqing 400030, Peoples R China
[6] Jilin Univ, Sch Transportat, Changchun 130000, Peoples R China
Source
COMMUNICATIONS IN TRANSPORTATION RESEARCH | 2024, Vol. 4
Keywords
Autonomous driving; Visual grounding; Cross-modal attention; Large language models; Human-machine interaction;
DOI
10.1016/j.commtr.2023.100116
CLC Classification
U [Transportation];
Discipline Codes
08 ; 0823 ;
Abstract
In the field of autonomous vehicles (AVs), accurately discerning commander intent and executing linguistic commands within a visual context presents a significant challenge. This paper introduces an encoder-decoder framework developed to address visual grounding in AVs. Our Context-Aware Visual Grounding (CAVG) model integrates five core encoders (Text, Emotion, Image, Context, and Cross-Modal) with a multimodal decoder. This integration enables CAVG to capture contextual semantics and learn human emotional features, augmented by state-of-the-art Large Language Models (LLMs), including GPT-4. The architecture is reinforced by multi-head cross-modal attention mechanisms and a Region-Specific Dynamic (RSD) layer for attention modulation. This design enables the model to efficiently process and interpret diverse cross-modal inputs, yielding a comprehensive understanding of the correlation between verbal commands and the corresponding visual scenes. Empirical evaluations on the Talk2Car dataset, a real-world benchmark, demonstrate that CAVG sets new standards in prediction accuracy and operational efficiency. Notably, the model performs well even with limited training data, ranging from 50% to 75% of the full dataset, highlighting its effectiveness and potential for deployment in practical AV applications. Moreover, CAVG has shown remarkable robustness and adaptability in challenging scenarios, including long-text command interpretation, low-light conditions, ambiguous command contexts, inclement weather, and densely populated urban environments.
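The abstract's central mechanism, multi-head cross-modal attention, lets command tokens (queries) attend over image-region features (keys/values) so that each word is grounded in the most relevant parts of the scene. The sketch below is a minimal, self-contained NumPy illustration of that generic mechanism, not the paper's actual implementation: the projection matrices are randomly initialized stand-ins for learned weights, and all shapes and names are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text_feats, image_feats, num_heads=4, seed=0):
    """Text tokens (queries) attend over image regions (keys/values).

    text_feats:  (T, d) command-token embeddings
    image_feats: (R, d) image-region embeddings
    Returns visually grounded text features of shape (T, d).
    """
    T, d = text_feats.shape
    R, _ = image_feats.shape
    assert d % num_heads == 0, "model dim must divide evenly across heads"
    dh = d // num_heads
    rng = np.random.default_rng(seed)
    # Random projections stand in for learned weight matrices.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    # Project, then split the feature dim into heads: (heads, tokens, dh).
    Q = (text_feats @ Wq).reshape(T, num_heads, dh).transpose(1, 0, 2)
    K = (image_feats @ Wk).reshape(R, num_heads, dh).transpose(1, 0, 2)
    V = (image_feats @ Wv).reshape(R, num_heads, dh).transpose(1, 0, 2)
    # Scaled dot-product attention per head: (heads, T, R) weights.
    attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(dh))
    out = attn @ V  # (heads, T, dh)
    # Concatenate heads back into a single (T, d) representation.
    return out.transpose(1, 0, 2).reshape(T, d)

# Example: 5 command tokens attend over 8 image regions, d = 16.
fused = cross_modal_attention(np.ones((5, 16)), np.ones((8, 16)))
```

Each head computes its own attention distribution over the image regions, so different heads can specialize (e.g., spatial position vs. object appearance) before their outputs are concatenated.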
Pages: 19