Paper: Exploring Contextual Representation and Multi-modality for End-to-end Autonomous Driving

Cited by: 2
Authors
Azam, Shoaib [1 ,2 ]
Munir, Farzeen [1 ,2 ]
Kyrki, Ville [1 ,2 ]
Kucner, Tomasz Piotr [1 ,2 ]
Jeon, Moongu [3 ]
Pedrycz, Witold [4 ,5 ,6 ]
Affiliations
[1] Aalto Univ, Dept Elect Engn & Automat, Espoo, Finland
[2] Finnish Ctr Artificial Intelligence, Espoo, Finland
[3] Gwangju Inst Sci & Technol, Sch Elect Engn & Comp Sci, Gwangju 61005, South Korea
[4] Univ Alberta, Dept Elect & Comp Engn, Edmonton, AB T6R 2V4, Canada
[5] King Abdulaziz Univ, Fac Engn, Dept Elect & Comp Engn, Jeddah 21589, Saudi Arabia
[6] Polish Acad Sci, Syst Res Inst, PL-01447 Warsaw, Poland
Funding
Academy of Finland;
Keywords
Vision-centric autonomous driving; Attention; Contextual representation; Imitation learning; Vision transformer;
DOI
10.1016/j.engappai.2024.108767
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Learning contextual and spatial environmental representations enhances an autonomous vehicle's hazard anticipation and decision-making in complex scenarios. Recent perception systems improve spatial understanding through sensor fusion but often lack global environmental context. Humans, when driving, naturally employ neural maps that integrate factors such as historical data, situational subtleties, and behavioral predictions of other road users to form a rich contextual understanding of their surroundings. This neural-map-based comprehension is integral to making informed decisions on the road. In contrast, even with their significant advancements, autonomous systems have yet to fully harness this depth of human-like contextual understanding. Motivated by this, our work draws inspiration from human driving patterns and seeks to formalize the sensor fusion approach within an end-to-end autonomous driving framework. We introduce a framework that integrates three cameras (left, right, and center) to emulate the human field of view, coupled with top-down bird's-eye-view semantic data to enrich contextual representation. The sensor data is fused and encoded using a self-attention mechanism, leading to an auto-regressive waypoint prediction module. We treat feature representation as a sequential problem, employing a vision transformer to distill the contextual interplay between sensor modalities. The efficacy of the proposed method is evaluated experimentally in both open- and closed-loop settings. Our method achieves a displacement error of 0.67 m in the open-loop setting, surpassing current methods by 6.9% on the nuScenes dataset. In closed-loop evaluations on CARLA's Town05 Long and Longest6 benchmarks, the proposed method improves driving performance and route completion, and reduces infractions.
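The abstract describes a three-stage pipeline: tokenize the three camera views plus the bird's-eye-view semantic map, fuse the resulting token sequence with self-attention, and decode waypoints auto-regressively. The following is a minimal PyTorch sketch of that data flow; all module names, dimensions, the single-channel BEV input, and the GRU-based decoder are hypothetical choices for illustration, not the architecture published in the paper.

```python
# Minimal sketch of the flow described in the abstract: tokenize three
# camera views and a BEV semantic map, fuse the tokens with
# self-attention, and roll out waypoints auto-regressively.
# All names, sizes, the 1-channel BEV input, and the GRU decoder are
# hypothetical, not the authors' architecture.
import torch
import torch.nn as nn


class FusionWaypointSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4, n_waypoints=4):
        super().__init__()
        # Patchify each modality into d_model-dimensional tokens.
        self.cam_stem = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.bev_stem = nn.Conv2d(1, d_model, kernel_size=16, stride=16)
        # Self-attention over the concatenated multi-modal token sequence,
        # treating feature representation as a sequential problem.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Auto-regressive head: each step consumes the previous waypoint.
        self.gru = nn.GRUCell(input_size=2, hidden_size=d_model)
        self.to_offset = nn.Linear(d_model, 2)
        self.n_waypoints = n_waypoints

    def _tokens(self, stem, x):
        # (B, C, H, W) -> (B, num_patches, d_model)
        return stem(x).flatten(2).transpose(1, 2)

    def forward(self, left, center, right, bev):
        # Fuse the camera views and BEV semantics as one token sequence.
        seq = torch.cat(
            [self._tokens(self.cam_stem, v) for v in (left, center, right)]
            + [self._tokens(self.bev_stem, bev)], dim=1)
        ctx = self.encoder(seq).mean(dim=1)   # pooled contextual feature
        wp = seq.new_zeros(seq.size(0), 2)    # start at the ego position
        hidden, waypoints = ctx, []
        for _ in range(self.n_waypoints):     # auto-regressive rollout
            hidden = self.gru(wp, hidden)
            wp = wp + self.to_offset(hidden)  # predict a displacement
            waypoints.append(wp)
        return torch.stack(waypoints, dim=1)  # (B, n_waypoints, 2)


# Shape check with random inputs (batch of 2, 256x256 inputs):
model = FusionWaypointSketch()
cams = [torch.randn(2, 3, 256, 256) for _ in range(3)]
bev = torch.randn(2, 1, 256, 256)
print(model(*cams, bev).shape)  # torch.Size([2, 4, 2])
```

A full system would add positional embeddings and pretrained per-modality backbones; the sketch only shows how self-attention fusion and auto-regressive waypoint decoding fit together.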
Pages: 13