GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models

Cited: 30
Authors
Liao, Haicheng [1 ,2 ]
Shen, Huanming [3 ]
Li, Zhenning [1 ,2 ,4 ]
Wang, Chengyue [1 ,4 ]
Li, Guofa [5 ]
Bie, Yiming [6 ]
Xu, Chengzhong [1 ,2 ]
Affiliations
[1] Univ Macau, State Key Lab Internet Things Smart City, Macau 999078, Peoples R China
[2] Univ Macau, Dept Comp & Informat Sci, Macau 999078, Peoples R China
[3] Univ Elect Sci & Technol China, Dept Informat & Software Engn, Chengdu 610000, Peoples R China
[4] Univ Macau, Dept Civil & Environm Engn, Macau 999078, Peoples R China
[5] Chongqing Univ, Coll Mech & Vehicle Engn, Chongqing 400030, Peoples R China
[6] Jilin Univ, Sch Transportat, Changchun 130000, Peoples R China
Source
COMMUNICATIONS IN TRANSPORTATION RESEARCH | 2024, Vol. 4
Keywords
Autonomous driving; Visual grounding; Cross-modal attention; Large language models; Human-machine interaction;
DOI
10.1016/j.commtr.2023.100116
CLC Classification
U [Transportation];
Discipline Codes
08 ; 0823 ;
Abstract
In the field of autonomous vehicles (AVs), accurately discerning commander intent and executing linguistic commands within a visual context presents a significant challenge. This paper introduces an encoder-decoder framework developed to address visual grounding in AVs. Our Context-Aware Visual Grounding (CAVG) model integrates five core encoders (Text, Emotion, Image, Context, and Cross-Modal) with a multimodal decoder. This integration enables CAVG to capture contextual semantics and learn human emotional features, augmented by state-of-the-art Large Language Models (LLMs), including GPT-4. The architecture is reinforced by multi-head cross-modal attention mechanisms and a Region-Specific Dynamic (RSD) layer for attention modulation. This design enables the model to efficiently process and interpret diverse cross-modal inputs, yielding a comprehensive understanding of the correlation between verbal commands and the corresponding visual scenes. Empirical evaluations on the Talk2Car dataset, a real-world benchmark, demonstrate that CAVG sets new standards in prediction accuracy and operational efficiency. Notably, the model performs well even with limited training data, ranging from 50% to 75% of the full dataset, highlighting its effectiveness and potential for deployment in practical AV applications. Moreover, CAVG has shown remarkable robustness and adaptability in challenging scenarios, including long-text command interpretation, low-light conditions, ambiguous command contexts, inclement weather, and densely populated urban environments.
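The abstract's central mechanism, multi-head cross-modal attention, lets command tokens (queries) attend over image-region features (keys/values) so that each word is grounded in the most relevant parts of the scene. The sketch below is a minimal, self-contained NumPy illustration of that generic mechanism, not the paper's actual implementation: the projection matrices are randomly initialized stand-ins for learned weights, and all shapes and names are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text_feats, image_feats, num_heads=4, seed=0):
    """Text tokens (queries) attend over image regions (keys/values).

    text_feats:  (T, d) command-token embeddings
    image_feats: (R, d) image-region embeddings
    Returns visually grounded text features of shape (T, d).
    """
    T, d = text_feats.shape
    R, _ = image_feats.shape
    assert d % num_heads == 0, "model dim must divide evenly across heads"
    dh = d // num_heads
    rng = np.random.default_rng(seed)
    # Random projections stand in for learned weight matrices.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    # Project, then split the feature dim into heads: (heads, tokens, dh).
    Q = (text_feats @ Wq).reshape(T, num_heads, dh).transpose(1, 0, 2)
    K = (image_feats @ Wk).reshape(R, num_heads, dh).transpose(1, 0, 2)
    V = (image_feats @ Wv).reshape(R, num_heads, dh).transpose(1, 0, 2)
    # Scaled dot-product attention per head: (heads, T, R) weights.
    attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(dh))
    out = attn @ V  # (heads, T, dh)
    # Concatenate heads back into a single (T, d) representation.
    return out.transpose(1, 0, 2).reshape(T, d)

# Example: 5 command tokens attend over 8 image regions, d = 16.
fused = cross_modal_attention(np.ones((5, 16)), np.ones((8, 16)))
```

Each head computes its own attention distribution over the image regions, so different heads can specialize (e.g., spatial position vs. object appearance) before their outputs are concatenated.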
Pages: 19