Incorporating the Graph Representation of Video and Text into Video Captioning

Cited by: 0
Authors
Lu, Min [1 ,2 ]
Li, Yuan [1 ]
Affiliations
[1] Civil Aviat Univ China, Sch Comp Sci & Technol, Tianjin, Peoples R China
[2] CAAC, Key Lab Smart Airport Theory & Syst, Tianjin, Peoples R China
Source
2022 IEEE 34TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, ICTAI | 2022
Keywords
video captioning; graph representation; semantic feature;
DOI
10.1109/ICTAI56018.2022.00065
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video captioning translates video content into textual descriptions. In the encoding phase, existing approaches encode the irrelevant background and uncorrelated visual objects into the visual features, which causes a semantic aberration between the visual features and the expected textual caption. In the decoding phase, word-by-word prediction infers the next word only from the previously generated caption, and this local text context is insufficient for word prediction. To tackle these two issues, the representations of video and text are derived from convolutions on two graphs. Convolution on the video graph distills the visual features by filtering out the irrelevant background and uncorrelated salient objects; the key issue here is to identify similar videos according to the video semantic feature. The word graph is constructed to incorporate the global neighborhood among words into the word representation; this global neighborhood serves as a global text context that compensates for the local text context. Results on two benchmark datasets show the advantage of the proposed method, and experimental analysis verifies the effectiveness of the proposed modules.
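The abstract describes convolutions over a video-similarity graph (and, analogously, a word graph) to smooth noisy features across semantically related neighbors. The following is a minimal sketch of that general idea, assuming a cosine-similarity video graph and a standard symmetrically normalized graph convolution; the layer name GraphConv, the edge threshold, and the feature dimensions are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def normalized_adjacency(sim: torch.Tensor, threshold: float = 0.2) -> torch.Tensor:
    """Build D^{-1/2}(A + I)D^{-1/2} from a pairwise similarity matrix,
    keeping only edges whose similarity exceeds `threshold` (illustrative value)."""
    adj = (sim > threshold).float()
    adj = adj + torch.eye(adj.size(0))          # add self-loops
    d_inv_sqrt = adj.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)


class GraphConv(nn.Module):
    """One graph-convolution layer: H' = ReLU(A_hat @ H @ W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, feats: torch.Tensor, adj_hat: torch.Tensor) -> torch.Tensor:
        return F.relu(self.linear(adj_hat @ feats))


if __name__ == "__main__":
    # Toy "video graph": 8 videos with 512-d semantic features (random here).
    # Edges connect semantically similar videos, so the convolution averages
    # each video's feature with its neighbors, attenuating background/object
    # noise that the neighborhood does not share.
    video_feats = torch.randn(8, 512)
    sim = F.cosine_similarity(video_feats.unsqueeze(1), video_feats.unsqueeze(0), dim=-1)
    adj_hat = normalized_adjacency(sim)
    distilled = GraphConv(512, 512)(video_feats, adj_hat)
    print(distilled.shape)  # torch.Size([8, 512])
```

A word-graph variant would follow the same pattern, replacing the cosine-similarity adjacency with one built from word co-occurrence so that each word representation absorbs its global neighborhood.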
Pages: 396-401
Page count: 6