An Efficient Framework for Dense Video Captioning

Cited by: 0
Authors
Suin, Maitreya [1 ]
Rajagopalan, A. N. [1 ]
Affiliation
[1] Indian Inst Technol Madras, Madras, Tamil Nadu, India
Source
Thirty-Fourth AAAI Conference on Artificial Intelligence, the Thirty-Second Innovative Applications of Artificial Intelligence Conference and the Tenth AAAI Symposium on Educational Advances in Artificial Intelligence | 2020 / Vol. 34
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Dense video captioning is an extremely challenging task, since an accurate and faithful description of events in a video requires holistic knowledge of the video contents as well as contextual reasoning about individual events. Most existing approaches handle this problem by first proposing event boundaries from a video and then generating captions for a subset of the proposals. Producing dense temporal annotations and corresponding captions for long videos can be dramatically resource-consuming. In this paper, we focus on the task of generating a dense description of temporally untrimmed videos and aim to significantly reduce the computational cost by processing fewer frames while maintaining accuracy. Existing video captioning methods sample frames at a predefined frequency over the entire video or use all the frames. Instead, we propose a deep reinforcement learning-based approach that enables an agent to describe multiple events in a video by watching only a portion of the frames. The agent needs to watch more frames when it is processing an informative part of the video, and to skip frames when there is redundancy. The agent is trained using an actor-critic algorithm, where the actor determines the frames to be watched from a video and the critic assesses the optimality of the decisions taken by the actor. Such efficient frame selection simplifies the event proposal task considerably and has the added effect of reducing the occurrence of unwanted proposals. The encoded state representation of the frame-selection agent is further utilized to guide the event proposal and caption generation tasks. We also leverage the idea of knowledge distillation to improve accuracy. We conduct extensive evaluations on the ActivityNet Captions dataset to validate our method.
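The abstract describes an actor-critic frame-selection agent: the actor decides, frame by frame, whether to watch or skip, and the critic estimates the value of the current state as a baseline for the policy update. The sketch below is a minimal illustration of that general idea, not the authors' implementation; the class name FrameSelectionAgent, the network sizes, and the reward definition (a caption-quality placeholder minus a per-frame cost) are assumptions made purely for illustration.

```python
# Minimal, illustrative actor-critic frame-selection sketch (assumptions only;
# not taken from the paper). The actor emits a watch/skip probability per frame,
# the critic predicts a state value used as a baseline in the policy gradient.
import torch
import torch.nn as nn
from torch.distributions import Bernoulli

class FrameSelectionAgent(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.encoder = nn.GRUCell(feat_dim, hidden_dim)  # running state over watched frames
        self.actor = nn.Linear(hidden_dim, 1)            # probability of watching the next frame
        self.critic = nn.Linear(hidden_dim, 1)           # state-value estimate

    def forward(self, frame_feat, state):
        state = self.encoder(frame_feat, state)
        watch_prob = torch.sigmoid(self.actor(state))
        value = self.critic(state)
        return watch_prob, value, state

# Toy rollout over T frame features (e.g., CNN features of sampled frames).
T, feat_dim, hidden_dim = 20, 2048, 512
agent = FrameSelectionAgent(feat_dim, hidden_dim)
frames = torch.randn(T, 1, feat_dim)                     # placeholder frame features
state = torch.zeros(1, hidden_dim)

log_probs, values, decisions = [], [], []
for t in range(T):
    watch_prob, value, state = agent(frames[t], state)
    dist = Bernoulli(watch_prob)
    action = dist.sample()                               # 1 = watch this frame, 0 = skip
    log_probs.append(dist.log_prob(action))
    values.append(value)
    decisions.append(action.item())

# Hypothetical reward: a fixed caption-quality proxy minus a cost per watched
# frame; the paper's actual reward design may differ.
reward = torch.tensor(1.0) - 0.01 * sum(decisions)
returns = reward * torch.ones(T, 1, 1)

# Standard actor-critic losses: REINFORCE with the critic as a baseline.
values = torch.stack(values)
log_probs = torch.stack(log_probs)
advantage = returns - values.detach()
actor_loss = -(log_probs * advantage).mean()
critic_loss = (returns - values).pow(2).mean()
(actor_loss + critic_loss).backward()
```

In a full system, the reward would presumably come from the quality of the captions produced from the watched frames (e.g., a captioning metric) rather than the constant placeholder used here.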
Pages: 12039-12046
Number of pages: 8