Ada-SwinBERT: Adaptive Token Selection for Efficient Video Captioning with Online Self-Distillation

Cited by: 0
Authors
Cao, Qianwen [1 ]
Huang, Heyan [1 ]
Liao, Minpeng
Mao, Xianling [1 ]
Affiliations
[1] Beijing Inst Technol, Sch Comp, Beijing, Peoples R China
Source
2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME | 2023
Funding
National Natural Science Foundation of China;
Keywords
video captioning; efficient multimodal transformer; token pruning; self-distillation;
DOI
10.1109/ICME55011.2023.00010
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Video captioning aims to produce textual descriptions for a given video. Benefiting from the self-attention mechanism's ability to capture long-distance dependencies between video patches and language sentences, fully Transformer-based models have recently achieved promising performance. However, because video carries continuous temporal information, it contains a large amount of redundant and unimportant visual content, and indiscriminate use of all video patches results in expensive computation and inefficient use of resources. To tackle this issue, we propose Ada-SwinBERT, a novel approach that adaptively selects salient video tokens to balance efficiency and performance for video captioning. Moreover, we devise a training strategy with online self-distillation to make up for the information loss caused by discarding video tokens; video-text alignment knowledge distilled from the teacher leads to a robust training process. By hierarchically pruning 78.1% of the input tokens, our approach reduces FLOPs by 62.0% compared with the base model while achieving performance competitive with SOTA methods.
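The core idea in the abstract, hierarchical pruning that keeps only salient video tokens at each stage, can be illustrated with a minimal sketch. This is not the authors' implementation: the saliency score (here, embedding L2 norm), the per-stage keep ratios, and the function names are hypothetical placeholders; the paper's actual scorer is learned. Note that three stages each keeping 60% of tokens retain about 21.6%, close to the 78.1% pruning reported.

```python
import numpy as np

def prune_tokens(tokens: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Keep the `keep_ratio` fraction of tokens with the largest saliency.

    tokens: (num_tokens, dim) array of video patch embeddings.
    The L2-norm score below is a stand-in for a learned saliency module.
    """
    scores = np.linalg.norm(tokens, axis=1)           # stand-in saliency score
    k = max(1, int(round(len(tokens) * keep_ratio)))  # number of tokens to keep
    keep = np.argsort(scores)[-k:]                    # indices of top-k tokens
    return tokens[np.sort(keep)]                      # preserve temporal order

def hierarchical_prune(tokens: np.ndarray, ratios=(0.6, 0.6, 0.6)) -> np.ndarray:
    """Apply pruning stage by stage, as in hierarchical token selection."""
    for r in ratios:
        tokens = prune_tokens(tokens, r)
    return tokens
```

Because self-attention cost grows quadratically with token count, pruning to ~22% of the tokens is what yields the large FLOPs reduction; the online self-distillation loss then compensates for the information discarded along the way.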
Pages: 7-12
Page count: 6