Video captioning aims to produce textual descriptions for a given video. Benefiting from the self-attention mechanism's ability to capture long-range dependencies between video patches and language tokens, fully Transformer-based models have recently achieved promising performance. However, because video carries continuous temporal information, a large amount of its visual content is redundant or unimportant, and indiscriminately processing all video patches results in expensive computation and inefficient use of resources. To tackle this issue, we propose Ada-SwinBERT, a novel approach that adaptively selects salient video tokens to balance efficiency and performance for video captioning. Moreover, we devise a training strategy with online self-distillation to compensate for the information loss caused by discarding video tokens; the video-text alignment knowledge distilled from the teacher leads to a more robust training process. By hierarchically pruning 78.1% of the input tokens, our approach reduces FLOPs by 62.0% compared with the base model while achieving performance competitive with state-of-the-art methods.
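To give a rough sense of the adaptive token selection described above, the sketch below scores video tokens with a lightweight MLP and keeps only a fixed fraction of the highest-scoring ones. This is a minimal illustration under our own assumptions: the module name `TokenSelector`, the scoring network, and the `keep_ratio` value are illustrative placeholders, not the paper's actual design, and a real implementation would need a differentiable selection scheme (e.g., Gumbel-softmax) and would stack several such selectors between Transformer stages to prune hierarchically.

```python
import torch
import torch.nn as nn


class TokenSelector(nn.Module):
    """Keeps only the top-k most salient video tokens (illustrative sketch)."""

    def __init__(self, dim: int, keep_ratio: float = 0.22):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Lightweight MLP that predicts one saliency score per token.
        self.score_net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 1),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        scores = self.score_net(tokens).squeeze(-1)        # (batch, num_tokens)
        num_keep = max(1, int(tokens.size(1) * self.keep_ratio))
        keep_idx = scores.topk(num_keep, dim=1).indices    # (batch, num_keep)
        keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        # Gather the selected tokens; the rest are discarded.
        return tokens.gather(dim=1, index=keep_idx)        # (batch, num_keep, dim)


if __name__ == "__main__":
    x = torch.randn(2, 1568, 768)   # e.g., patch tokens from a video backbone
    selector = TokenSelector(dim=768, keep_ratio=0.22)
    print(selector(x).shape)        # torch.Size([2, 344, 768])
```

Keeping roughly 22% of the tokens mirrors the reported 78.1% pruning rate; the downstream caption decoder then attends only to the retained tokens, which is where the FLOPs savings come from.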