Pedestrian Vision Language Model for Intentions Prediction

Cited by: 1
Authors
Munir, Farzeen [1 ,2 ]
Azam, Shoaib [1 ,2 ]
Mihaylova, Tsvetomila [1 ]
Kyrki, Ville [1 ,2 ]
Kucner, Tomasz Piotr [1 ,2 ]
Affiliations
[1] Aalto Univ, Dept Elect Engn & Automat, Espoo 02150, Finland
[2] Aalto Univ, Finnish Ctr Artificial Intelligence, Espoo 02150, Finland
Source
IEEE OPEN JOURNAL OF INTELLIGENT TRANSPORTATION SYSTEMS | 2025, Vol. 6
Keywords
Pedestrians; Predictive models; Autonomous vehicles; Visualization; Optical flow; Transformers; Trajectory; Linguistics; Large language models; Intelligent transportation systems; Pedestrian intention prediction; vision-language models (VLMs); prompt generation;
DOI
10.1109/OJITS.2025.3554387
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Effective modeling of human behavior is crucial for the safe and reliable coexistence of humans and autonomous vehicles. Traditional deep learning methods have limitations in capturing the complexities of pedestrian behavior, often relying on simplistic representations or indirect inference from visual cues, which hinders their explainability. To address this gap, we introduce PedVLM, a vision-language model that leverages multiple modalities (RGB images, optical flow, and text) to predict pedestrian intentions and provide explainability for pedestrian behavior. PedVLM comprises a CLIP-based vision encoder and a text-to-text transfer transformer (T5) language model, which together extract and combine visual and text embeddings to predict pedestrian actions and enhance explainability. Furthermore, to complement PedVLM and facilitate further research, we publicly release the corresponding dataset, PedPrompt, which provides prompts in a question-answer (QA) template for pedestrian intention prediction. Evaluated on the PedPrompt, JAAD, and PIE datasets, PedVLM demonstrates its efficacy compared to state-of-the-art methods. The dataset and code will be made available at https://github.com/munirfarzeen/Ped_VLM.
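The abstract describes fusing CLIP-encoded visual embeddings (RGB and optical flow) with T5 text embeddings. A minimal NumPy sketch of such a fusion step might look like the following; all dimensions, the random stand-in features, the projection matrices, and the prefix-style concatenation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumed, not from the paper): CLIP ViT-B/32
# image features are 512-d, T5-base embeddings are 768-d, and the QA
# prompt is tokenized into 8 text tokens.
D_VISION, D_MODEL, N_TEXT = 512, 768, 8

# Stand-ins for CLIP features of the RGB frame and the optical-flow
# field, and for the T5 embeddings of the question prompt.
rgb_emb = rng.normal(size=(1, D_VISION))
flow_emb = rng.normal(size=(1, D_VISION))
text_emb = rng.normal(size=(N_TEXT, D_MODEL))

# Learned projections mapping each visual modality into the language
# model's embedding space (randomly initialized for this sketch).
W_rgb = rng.normal(size=(D_VISION, D_MODEL)) / np.sqrt(D_VISION)
W_flow = rng.normal(size=(D_VISION, D_MODEL)) / np.sqrt(D_VISION)

# Fuse by prepending the projected visual tokens to the text tokens,
# yielding one multimodal sequence for the language model to attend over.
fused = np.concatenate([rgb_emb @ W_rgb, flow_emb @ W_flow, text_emb], axis=0)
print(fused.shape)  # (10, 768): 2 visual tokens + 8 text tokens
```

In a real system the two projection matrices would be trained jointly with the language model, and the fused sequence would be passed to the T5 encoder rather than printed.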
Pages: 393-406
Page count: 14