Video captioning aims to produce textual descriptions for a given video. Benefiting from the self-attention mechanism's ability to capture long-range dependencies between video patches and language tokens, fully Transformer-based models have recently achieved promising performance. However, because video carries continuous temporal information, a large amount of its visual content is redundant or unimportant, and indiscriminately processing all video patches results in expensive computation and inefficient use of resources. To tackle this issue, we propose Ada-SwinBERT, a novel approach that adaptively selects salient video tokens to balance efficiency and performance for video captioning. Moreover, we devise a training strategy with online self-distillation to compensate for the information loss caused by discarding video tokens; the video-text alignment knowledge distilled from the teacher leads to a more robust training process. By hierarchically pruning 78.1% of the input tokens, our approach reduces FLOPs by 62.0% compared with the base model while achieving performance competitive with state-of-the-art methods.
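To give a rough sense of the adaptive token selection described above, the sketch below scores video tokens with a lightweight MLP and keeps only a fixed fraction of the highest-scoring ones. This is a minimal illustration under our own assumptions: the module name `TokenSelector`, the scoring network, and the `keep_ratio` value are illustrative placeholders, not the paper's actual design, and a real implementation would need a differentiable selection scheme (e.g., Gumbel-softmax) and would stack several such selectors between Transformer stages to prune hierarchically.

```python
import torch
import torch.nn as nn


class TokenSelector(nn.Module):
    """Keeps only the top-k most salient video tokens (illustrative sketch)."""

    def __init__(self, dim: int, keep_ratio: float = 0.22):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Lightweight MLP that predicts one saliency score per token.
        self.score_net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, 1),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        scores = self.score_net(tokens).squeeze(-1)        # (batch, num_tokens)
        num_keep = max(1, int(tokens.size(1) * self.keep_ratio))
        keep_idx = scores.topk(num_keep, dim=1).indices    # (batch, num_keep)
        keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        # Gather the selected tokens; the rest are discarded.
        return tokens.gather(dim=1, index=keep_idx)        # (batch, num_keep, dim)


if __name__ == "__main__":
    x = torch.randn(2, 1568, 768)   # e.g., patch tokens from a video backbone
    selector = TokenSelector(dim=768, keep_ratio=0.22)
    print(selector(x).shape)        # torch.Size([2, 344, 768])
```

Keeping roughly 22% of the tokens mirrors the reported 78.1% pruning rate; the downstream caption decoder then attends only to the retained tokens, which is where the FLOPs savings come from.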