MFVC: Urban Traffic Scene Video Caption Based on Multimodal Fusion

Cited by: 2
Authors
Li, Mingxing [1, 2]
Zhang, Hao [1, 2]
Xu, Cheng [1, 2]
Yan, Chenyang [1, 2]
Liu, Hongzhe [1, 2]
Li, Xuewei [1, 2]
Affiliations
[1] Beijing Union Univ, Beijing Key Lab Informat Serv Engn, Beijing 100101, Peoples R China
[2] Beijing Union Univ, Inst Brain & Cognit Sci, Beijing 100101, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
video caption; traffic scene; multimodal fusion; attention
DOI
10.3390/electronics11192999
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
As electronic hardware has advanced, intelligent vehicles have become able to run increasingly complex artificial intelligence algorithms, video captioning among them. However, current video captioning algorithms rely on visual information alone when applied to urban traffic scenes, and so fail to generate accurate captions for complex scenes. Transformer-based multimodal fusion is one solution to this problem, but existing fusion algorithms suffer from low fusion performance and high computational complexity. To address these issues, we propose MFVC (Multimodal Fusion for Video Caption), a new Transformer-based video captioning model. We introduce audio modality data to give the caption generation model more information to draw on, and an attention bottleneck module to improve model performance at lower computational cost. Experiments are conducted on the public datasets MSR-VTT and MSVD. To verify the model on urban traffic scenes, experiments are also carried out on BUUISE, a self-built traffic caption dataset, and the evaluation metrics confirm the model's effectiveness. The model achieves good results on both the public datasets and the urban traffic dataset, and has strong application prospects in the intelligent driving industry.
Pages: 12