SViTT: Temporal Learning of Sparse Video-Text Transformers

Cited by: 11
Authors
Li, Yi [1]
Min, Kyle [2]
Tripathi, Subarna [2]
Vasconcelos, Nuno [1]
Affiliations
[1] Univ Calif San Diego, La Jolla, CA 92093 USA
[2] Intel Labs, Santa Clara, CA USA
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
DOI
10.1109/CVPR52729.2023.01814
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Do video-text transformers learn to model temporal relationships across frames? Despite their immense capacity and the abundance of multimodal training data, recent work has revealed a strong tendency of video-text models towards frame-based spatial representations, while temporal reasoning remains largely unsolved. In this work, we identify several key challenges in temporal learning of video-text transformers: the spatiotemporal trade-off from limited network size; the curse of dimensionality for multi-frame modeling; and the diminishing returns of semantic information from extending clip length. Guided by these findings, we propose SViTT, a sparse video-text architecture that performs multi-frame reasoning with significantly lower cost than naive transformers with dense attention. Analogous to graph-based networks, SViTT employs two forms of sparsity: edge sparsity, which limits the query-key communication between tokens in self-attention, and node sparsity, which discards uninformative visual tokens. Trained with a curriculum that increases model sparsity with clip length, SViTT outperforms dense transformer baselines on multiple video-text retrieval and question answering benchmarks, at a fraction of the computational cost. Project page: http://svcl.ucsd.edu/projects/svitt.
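The two sparsity mechanisms described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes a boolean edge mask restricting query-key pairs (edge sparsity) and a per-token saliency score used to keep only the top-k visual tokens (node sparsity), with single-head attention in NumPy for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention(q, k, v, edge_mask):
    """Edge sparsity: scores for disallowed query-key pairs are set to -inf,
    so each query attends only to the keys permitted by edge_mask.
    edge_mask must have at least one True entry per query row."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)          # (num_q, num_k) dense scores
    scores = np.where(edge_mask, scores, -np.inf)
    return softmax(scores, axis=-1) @ v    # masked pairs get zero weight

def prune_nodes(tokens, saliency, keep):
    """Node sparsity: retain only the `keep` most salient tokens
    (order preserved), discarding uninformative ones."""
    idx = np.argsort(saliency)[::-1][:keep]
    return tokens[np.sort(idx)]
```

In this sketch, a local-window edge mask already reduces attention cost from quadratic toward linear in the token count, and pruning tokens before later layers shrinks the sequence the remaining layers must process.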
Pages: 18919-18929
Page count: 11