Paper: Exploring Contextual Representation and Multi-modality for End-to-end Autonomous Driving

Cited by: 1
Authors
Azam, Shoaib [1 ,2 ]
Munir, Farzeen [1 ,2 ]
Kyrki, Ville [1 ,2 ]
Kucner, Tomasz Piotr [1 ,2 ]
Jeon, Moongu [3 ]
Pedrycz, Witold [4 ,5 ,6 ]
Affiliations
[1] Aalto Univ, Dept Elect Engn & Automat, Espoo, Finland
[2] Finnish Ctr Artificial Intelligence, Espoo, Finland
[3] Gwangju Inst Sci & Technol, Sch Elect Engn & Comp Sci, Gwangju 61005, South Korea
[4] Univ Alberta, Dept Elect & Comp Engn, Edmonton, AB T6R 2V4, Canada
[5] King Abdulaziz Univ, Fac Engn, Dept Elect & Comp Engn, Jeddah 21589, Saudi Arabia
[6] Polish Acad Sci, Syst Res Inst, PL-01447 Warsaw, Poland
Funding
Academy of Finland
Keywords
Vision-centric autonomous driving; Attention; Contextual representation; Imitation learning; Vision transformer
DOI
10.1016/j.engappai.2024.108767
Chinese Library Classification (CLC)
TP [automation technology, computer technology]
Subject Classification Code
0812
Abstract
Learning contextual and spatial environmental representations enhances an autonomous vehicle's hazard anticipation and decision-making in complex scenarios. Recent perception systems enhance spatial understanding with sensor fusion but often lack global environmental context. Humans, when driving, naturally employ neural maps that integrate various factors such as historical data, situational subtleties, and behavioral predictions of other road users to form a rich contextual understanding of their surroundings. This neural-map-based comprehension is integral to making informed decisions on the road. In contrast, even with their significant advancements, autonomous systems have yet to fully harness this depth of human-like contextual understanding. Motivated by this, our work draws inspiration from human driving patterns and seeks to formalize the sensor fusion approach within an end-to-end autonomous driving framework. We introduce a framework that integrates three cameras (left, right, and center) to emulate the human field of view, coupled with top-down bird's-eye-view semantic data to enhance contextual representation. The sensor data is fused and encoded using a self-attention mechanism, leading to an auto-regressive waypoint prediction module. We treat feature representation as a sequential problem, employing a vision transformer to distill the contextual interplay between sensor modalities. The efficacy of the proposed method is experimentally evaluated in both open- and closed-loop settings. Our method achieves a displacement error of 0.67 m in the open-loop setting, surpassing current methods by 6.9% on the nuScenes dataset. In closed-loop evaluations on CARLA's Town05 Long and Longest6 benchmarks, the proposed method improves driving performance and route completion, and reduces infractions.
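For intuition, below is a minimal PyTorch sketch of the fusion scheme the abstract describes: features from the three camera views and the BEV semantic map are flattened into one token sequence, fused with transformer self-attention, and decoded auto-regressively into waypoints. The convolutional stems, the GRU-based decoder, and all layer sizes and names here are illustrative assumptions, not the authors' implementation (the paper itself uses a vision transformer over the fused modalities).

```python
import torch
import torch.nn as nn

class FusionWaypointNet(nn.Module):
    """Sketch: multi-view + BEV token fusion with auto-regressive waypoints."""

    def __init__(self, d_model=256, n_heads=8, n_layers=4, n_waypoints=4):
        super().__init__()
        # Per-modality patch-embedding stems (assumed): one shared stem for
        # the three RGB views, one for the single-channel BEV semantic map.
        self.rgb_stem = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=16, stride=16), nn.ReLU())
        self.bev_stem = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=16, stride=16), nn.ReLU())
        enc_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Auto-regressive decoder: previous (x, y) waypoint in, hidden out.
        self.decoder = nn.GRUCell(input_size=2, hidden_size=d_model)
        self.head = nn.Linear(d_model, 2)  # per-step (x, y) displacement
        self.n_waypoints = n_waypoints

    @staticmethod
    def tokens(feat):
        # (B, C, H, W) -> (B, H*W, C) token sequence
        return feat.flatten(2).transpose(1, 2)

    def forward(self, left, center, right, bev):
        # One joint sequence so self-attention can model cross-view and
        # camera<->BEV context in a single pass.
        seq = torch.cat(
            [self.tokens(self.rgb_stem(v)) for v in (left, center, right)]
            + [self.tokens(self.bev_stem(bev))], dim=1)
        fused = self.encoder(seq).mean(dim=1)  # global context vector
        # Auto-regressive rollout: feed each predicted waypoint back in.
        wp = torch.zeros(fused.size(0), 2, device=fused.device)
        h, out = fused, []
        for _ in range(self.n_waypoints):
            h = self.decoder(wp, h)
            wp = wp + self.head(h)  # cumulative displacement from origin
            out.append(wp)
        return torch.stack(out, dim=1)  # (B, n_waypoints, 2)

if __name__ == "__main__":
    # Hypothetical shapes: three 224x224 RGB views plus one BEV map.
    model = FusionWaypointNet()
    views = [torch.randn(2, 3, 224, 224) for _ in range(3)]
    bev = torch.randn(2, 1, 224, 224)
    print(model(*views, bev).shape)  # torch.Size([2, 4, 2])
```

Concatenating all modality tokens before self-attention, rather than fusing views pairwise, lets every camera patch attend to BEV context directly, which matches the abstract's framing of feature fusion as a sequential problem.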
Pages: 13