Spatio-Temporal Graph for Video Captioning with Knowledge Distillation

Citations: 204
Authors
Pan, Boxiao [1]
Cai, Haoye [1]
Huang, De-An [1]
Lee, Kuan-Hui [2]
Gaidon, Adrien [2]
Adeli, Ehsan [1]
Niebles, Juan Carlos [1]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Toyota Res Inst, Cambridge, MA USA
Source
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020) | 2020
Keywords
DOI
10.1109/CVPR42600.2020.01088
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video captioning is a challenging task that requires a deep understanding of visual scenes. State-of-the-art methods generate captions using either scene-level or object-level information but without explicitly modeling object interactions. Thus, they often fail to make visually grounded predictions, and are sensitive to spurious correlations. In this paper, we propose a novel spatio-temporal graph model for video captioning that exploits object interactions in space and time. Our model builds interpretable links and is able to provide explicit visual grounding. To avoid unstable performance caused by the variable number of objects, we further propose an object-aware knowledge distillation mechanism, in which local object information is used to regularize global scene features. We demonstrate the efficacy of our approach through extensive experiments on two benchmarks, showing our approach yields competitive performance with interpretable predictions.
Pages: 10867-10876
Page count: 10