Spatio-Temporal Graph for Video Captioning with Knowledge Distillation

Cited by: 204

Authors:

Pan, Boxiao [1]

Cai, Haoye [1]

Huang, De-An [1]

Lee, Kuan-Hui [2]

Gaidon, Adrien [2]

Adeli, Ehsan [1]

Niebles, Juan Carlos [1]

Affiliations:

[1] Stanford Univ, Stanford, CA 94305 USA

[2] Toyota Res Inst, Cambridge, MA USA

Source:

2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020) | 2020

Keywords:

DOI:

10.1109/CVPR42600.2020.01088

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video captioning is a challenging task that requires a deep understanding of visual scenes. State-of-the-art methods generate captions using either scene-level or object-level information but without explicitly modeling object interactions. Thus, they often fail to make visually grounded predictions, and are sensitive to spurious correlations. In this paper, we propose a novel spatio-temporal graph model for video captioning that exploits object interactions in space and time. Our model builds interpretable links and is able to provide explicit visual grounding. To avoid unstable performance caused by the variable number of objects, we further propose an object-aware knowledge distillation mechanism, in which local object information is used to regularize global scene features. We demonstrate the efficacy of our approach through extensive experiments on two benchmarks, showing our approach yields competitive performance with interpretable predictions.

Citation

Pages: 10867-10876

Page count: 10

52 references in total

