MFVC: Urban Traffic Scene Video Caption Based on Multimodal Fusion

Cited by: 2

Authors:

Li, Mingxing [1,2]

Zhang, Hao [1,2]

Xu, Cheng [1,2]

Yan, Chenyang [1,2]

Liu, Hongzhe [1,2]

Li, Xuewei [1,2]

Affiliations:

[1] Beijing Union Univ, Beijing Key Lab Informat Serv Engn, Beijing 100101, Peoples R China

[2] Beijing Union Univ, Inst Brain & Cognit Sci, Beijing 100101, Peoples R China

Source:

ELECTRONICS | 2022 / Vol. 11 / Issue 19

Funding:

National Natural Science Foundation of China;

Keywords:

video caption; traffic scene; multimodal fusion; attention;

DOI:

10.3390/electronics11192999

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

With advances in electronic technology, intelligent cars can run increasingly complex artificial intelligence algorithms, and video captioning is one of them. However, when applied to urban traffic scenes, current video caption algorithms consider only visual information, so they fail to generate accurate captions for complex scenes. Transformer-based multimodal fusion is one solution to this problem, but existing algorithms suffer from low fusion performance and high computational complexity. To address these issues, we propose a new Transformer-based video caption model, MFVC (Multimodal Fusion for Video Caption). We introduce audio modal data to increase the information available to the caption generation model, and an attention bottleneck module to improve model performance at lower computational cost. Experiments are conducted on the public datasets MSR-VTT and MSVD. In addition, to verify the model's effect on urban traffic scenes, experiments are carried out on the self-built traffic caption dataset BUUISE, and the evaluation metrics confirm the model's effectiveness. The model achieves good results on both the public datasets and the urban traffic dataset, and has strong application prospects in the intelligent driving industry.


Pages: 12
