MIVCN: Multimodal interaction video captioning network based on semantic association graph

Cited: 4
Authors
Wang, Ying [1 ]
Huang, Guoheng [1 ]
Lin, Yuming [1 ]
Yuan, Haoliang [2 ]
Pun, Chi-Man [3 ]
Ling, Wing-Kuen [4 ]
Cheng, Lianglun [1 ]
Affiliations
[1] Guangdong Univ Technol, Sch Comp, Guangzhou 510006, Peoples R China
[2] Guangdong Univ Technol, Sch Automat, Guangzhou 510006, Peoples R China
[3] Univ Macau, Dept Comp & Informat Sci, Taipa 999078, Macao, Peoples R China
[4] Guangdong Univ Technol, Sch Informat Engn, Guangzhou 510006, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video captioning; Graph convolutional network; Gated recurrent unit; Attention mechanism; Long-short term memory; Multimodal fusion;
D O I
10.1007/s10489-021-02612-y
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In computer vision, generating natural-language captions from input videos is a challenging task. To handle it, videos are usually treated as feature sequences and fed into a Long Short-Term Memory (LSTM) network to generate natural language. To obtain a richer and more detailed representation of video content, a Multimodal Interaction Video Captioning Network based on a Semantic Association Graph (MIVCN) is developed for this task. The network consists of two modules: the Semantic Association Graph Module (SAGM) and the Multimodal Attention Constraint Module (MACM). First, because existing methods lack semantic interdependence, they often produce illogical sentence structures. We therefore propose the SAGM, based on information association, which enables the network to strengthen the connections between logically related words and weaken the relations between logically unrelated ones. Second, the features of each modality need to attend to different information, and the captured multimodal features are highly informative but redundant. Based on this observation, we propose the MACM, an LSTM-based module that captures complementary visual features and filters out redundant ones. The MACM integrates multimodal features into the LSTM and makes the network screen and focus on informative features. Through the association of semantic attributes and the interaction of multimodal features, the network captures semantically interdependent contextual information and visually complementary information, so the informative representations in videos can be better used for caption generation. The proposed MIVCN achieves the best caption-generation performance on MSVD: 56.8%, 36.4%, and 79.1% on the BLEU@4, METEOR, and ROUGE-L evaluation metrics, respectively. Superior BLEU@4, METEOR, and ROUGE-L results compared with state-of-the-art methods are also reported on MSR-VTT.
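This record does not specify the SAGM at code level, but the keywords ("Graph convolutional network") indicate it builds on GCN-style propagation over a graph of semantic attributes. Below is a minimal, hypothetical NumPy sketch of that generic mechanism: embeddings of attributes linked in the association graph reinforce each other through normalized adjacency propagation, while unconnected attributes remain isolated. The adjacency matrix, embedding sizes, and weight matrix here are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def normalize_adjacency(A):
    # Standard GCN renormalization: A_hat = D^{-1/2} (A + I) D^{-1/2},
    # where the self-loop (A + I) keeps each node's own features.
    A = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_propagate(H, A, W):
    # One propagation step with ReLU: embeddings of associated
    # attributes are mixed, strengthening logically related words.
    return np.maximum(0.0, normalize_adjacency(A) @ H @ W)

# Toy association graph over 4 hypothetical semantic attributes,
# e.g. {"man", "riding", "bike", "dog"}: the first three are mutually
# associated, while "dog" has no associations (stays isolated).
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 0, 0]], dtype=float)
H = rng.standard_normal((4, 8))  # attribute embeddings (illustrative dim)
W = rng.standard_normal((8, 8))  # weight matrix (random; learned in practice)

H_out = graph_propagate(H, A, W)
print(H_out.shape)  # one propagated embedding per attribute
```

In a trained model `W` would be learned end-to-end and the propagated attribute embeddings fed to the caption decoder; this sketch only shows how graph structure lets related attributes share information.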
Pages: 5241-5260
Number of pages: 20
Cited References
47 records in total
[1]   Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning [J].
Aafaq, Nayyer ;
Akhtar, Naveed ;
Liu, Wei ;
Gilani, Syed Zulqarnain ;
Mian, Ajmal .
2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, :12479-12488
[2]   A full migration BBO algorithm with enhanced population quality bounds for multimodal biomedical image registration [J].
Chen, Yilin ;
He, Fazhi ;
Li, Haoran ;
Zhang, Dejun ;
Wu, Yiqi .
APPLIED SOFT COMPUTING, 2020, 93
[3]   Effect of rosuvastatin on progression of carotid intima-media thickness in low-risk individuals with subclinical atherosclerosis - The METEOR trial [J].
Crouse, John R., III ;
Raichlen, Joel S. ;
Riley, Ward A. ;
Evans, Gregory W. ;
Palmer, Mike K. ;
O'Leary, Daniel H. ;
Grobbee, Diederick E. ;
Bots, Michiel L. .
JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION, 2007, 297 (12) :1344-1353
[4]   Histograms of oriented gradients for human detection [J].
Dalal, N ;
Triggs, B .
2005 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOL 1, PROCEEDINGS, 2005, :886-893
[5]   A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching [J].
Das, Pradipto ;
Xu, Chenliang ;
Doell, Richard F. ;
Corso, Jason J. .
2013 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2013, :2634-2641
[6]  
Freitag Markus, 2017, P 1 WORKSHOP NEURAL, P56
[7]   Semantic Compositional Networks for Visual Captioning [J].
Gan, Zhe ;
Gan, Chuang ;
He, Xiaodong ;
Pu, Yunchen ;
Tran, Kenneth ;
Gao, Jianfeng ;
Carin, Lawrence ;
Deng, Li .
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, :1141-1150
[8]   Fused GRU with semantic-temporal attention for video captioning [J].
Gao, Lianli ;
Wang, Xuanhan ;
Song, Jingkuan ;
Liu, Yang .
NEUROCOMPUTING, 2020, 395 :222-228
[9]   Video Captioning With Attention-Based LSTM and Semantic Consistency [J].
Gao, Lianli ;
Guo, Zhao ;
Zhang, Hanwang ;
Xu, Xing ;
Shen, Heng Tao .
IEEE TRANSACTIONS ON MULTIMEDIA, 2017, 19 (09) :2045-2055
[10]  
Hemalatha M, 2020, IEEE WINT CONF APPL, P1576, DOI [10.1109/wacv45572.2020.9093344, 10.1109/WACV45572.2020.9093344]