SViTT: Temporal Learning of Sparse Video-Text Transformers

Cited by: 11
Authors
Li, Yi [1]
Min, Kyle [2]
Tripathi, Subarna [2]
Vasconcelos, Nuno [1]
Affiliations
[1] Univ Calif San Diego, La Jolla, CA 92093 USA
[2] Intel Labs, Santa Clara, CA USA
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
DOI
10.1109/CVPR52729.2023.01814
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Do video-text transformers learn to model temporal relationships across frames? Despite their immense capacity and the abundance of multimodal training data, recent work has revealed a strong tendency of video-text models towards frame-based spatial representations, while temporal reasoning remains largely unsolved. In this work, we identify several key challenges in temporal learning of video-text transformers: the spatiotemporal trade-off from limited network size; the curse of dimensionality for multi-frame modeling; and the diminishing returns of semantic information from extending clip length. Guided by these findings, we propose SViTT, a sparse video-text architecture that performs multi-frame reasoning with significantly lower cost than naive transformers with dense attention. Analogous to graph-based networks, SViTT employs two forms of sparsity: edge sparsity, which limits the query-key communication between tokens in self-attention, and node sparsity, which discards uninformative visual tokens. Trained with a curriculum that increases model sparsity with clip length, SViTT outperforms dense transformer baselines on multiple video-text retrieval and question answering benchmarks, at a fraction of the computational cost. Project page: http://svcl.ucsd.edu/projects/svitt.
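The two sparsity mechanisms described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes a boolean edge mask restricting query-key pairs (edge sparsity) and a per-token saliency score used to keep only the top-k visual tokens (node sparsity), with single-head attention in NumPy for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention(q, k, v, edge_mask):
    """Edge sparsity: scores for disallowed query-key pairs are set to -inf,
    so each query attends only to the keys permitted by edge_mask.
    edge_mask must have at least one True entry per query row."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)          # (num_q, num_k) dense scores
    scores = np.where(edge_mask, scores, -np.inf)
    return softmax(scores, axis=-1) @ v    # masked pairs get zero weight

def prune_nodes(tokens, saliency, keep):
    """Node sparsity: retain only the `keep` most salient tokens
    (order preserved), discarding uninformative ones."""
    idx = np.argsort(saliency)[::-1][:keep]
    return tokens[np.sort(idx)]
```

In this sketch, a local-window edge mask already reduces attention cost from quadratic toward linear in the token count, and pruning tokens before later layers shrinks the sequence the remaining layers must process.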
Pages: 18919-18929
Page count: 11