Traffic Scenario Understanding and Video Captioning via Guidance Attention Captioning Network

Cited by: 0
Authors
Liu, Chunsheng [1 ]
Zhang, Xiao [1 ]
Chang, Faliang [1 ]
Li, Shuang [2 ]
Hao, Penghui [1 ]
Lu, Yansha [1 ]
Wang, Yinhai [3 ]
Affiliations
[1] Shandong Univ, Sch Control Sci & Engn, Jinan 250061, Peoples R China
[2] Qilu Univ Technol, Shandong Acad Sci, Sch Informat & Automat Engn, Jinan 250353, Peoples R China
[3] Univ Washington, Dept Civil & Environm Engn, Seattle, WA 98195 USA
Funding
National Natural Science Foundation of China;
Keywords
Traffic scenario understanding; video captioning; guidance captioning; attention mechanism;
DOI
10.1109/TITS.2023.3323085
CLC Classification
TU [Building Science];
Subject Classification Code
0813;
Abstract
Describing a traffic scenario from the driver's perspective is a challenging task for an Advanced Driving Assistance System (ADAS), involving sub-tasks such as detection, tracking, and segmentation. Previous methods mainly focus on independent sub-tasks and struggle to describe incidents comprehensively. In this study, the problem is treated, for the first time, as a video captioning task, and a Guidance Attention Captioning Network (GAC-Network) is proposed to describe incidents in a single concise sentence. In GAC-Network, an Attention-based Encoder-Decoder Net (AED-Net) serves as the main network; its temporal-spatial attention mechanisms make it possible to effectively reject unimportant traffic behaviors and redundant background. To handle varied driving scenarios, Spatio-Temporal Layer Normalization is used to improve generalization. To generate captions for driving incidents, a novel Guidance Module is proposed that pushes the encoder-decoder model to generate words that relate better to the past and future words in a caption. Because there is no public dataset for captioning driving scenarios, the Traffic Video Captioning (TVC) dataset is released for this task. Experimental results show that the proposed method can fulfill the captioning task for complex driving scenarios and outperforms the comparison methods by at least 2.5%, 1.8%, 3.6%, and 13.1% on BLEU_1, METEOR, ROUGE_L, and CIDEr, respectively.
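The paper itself does not publish code here; as a rough illustration of how a temporal-attention encoder-decoder video captioner of the kind the abstract describes is typically wired, a minimal sketch follows. All module names, dimensions, and the additive-attention formulation are assumptions for exposition, not the paper's AED-Net, Guidance Module, or Spatio-Temporal Layer Normalization.

```python
# Illustrative sketch only: a generic temporal-attention encoder-decoder
# captioner, NOT the paper's AED-Net. Names and dimensions are assumptions.
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Additive (Bahdanau-style) attention over per-frame features."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, T, feat_dim) frame features; hidden: (B, hidden_dim)
        e = self.score(torch.tanh(self.w_feat(feats)
                                  + self.w_hidden(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)        # (B, T, 1) frame weights
        context = (alpha * feats).sum(dim=1)   # (B, feat_dim) weighted summary
        return context, alpha


class CaptionDecoder(nn.Module):
    """GRU decoder that re-attends over frames before emitting each word."""

    def __init__(self, vocab_size: int, feat_dim: int = 512,
                 hidden_dim: int = 512, embed_dim: int = 300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = TemporalAttention(feat_dim, hidden_dim)
        self.gru = nn.GRUCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, tokens):
        # feats: (B, T, feat_dim); tokens: (B, L) captions (teacher forcing)
        B, L = tokens.shape
        h = feats.new_zeros(B, self.gru.hidden_size)
        logits = []
        for t in range(L):
            context, _ = self.attn(feats, h)   # attend at every decode step
            x = torch.cat([self.embed(tokens[:, t]), context], dim=-1)
            h = self.gru(x, h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)      # (B, L, vocab_size)


# Smoke test with random frame features and token ids.
if __name__ == "__main__":
    decoder = CaptionDecoder(vocab_size=1000)
    feats = torch.randn(2, 20, 512)            # 2 clips, 20 frames each
    tokens = torch.randint(0, 1000, (2, 12))   # 12-word captions
    print(decoder(feats, tokens).shape)        # torch.Size([2, 12, 1000])
```

The re-attention at every decoding step is what lets such a model down-weight redundant background frames per word, which is the role the abstract attributes to AED-Net's temporal-spatial attention.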
Pages: 3615-3627
Page count: 13