VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

Cited by: 0
Authors
Xu, Hu [1 ]
Ghosh, Gargi [1 ]
Huang, Po-Yao [1 ,2 ]
Arora, Prahal [1 ]
Aminzadeh, Masoumeh [1 ]
Feichtenhofer, Christoph [1 ]
Metze, Florian [1 ]
Zettlemoyer, Luke [1 ]
Affiliations
[1] Facebook AI, Menlo Pk, CA 94205 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA USA
Source
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021 | 2021
Keywords: (none listed)
DOI: not available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training methods are task-specific: they adopt either a single cross-modal encoder that requires both modalities, limiting their use for retrieval-style end tasks, or more complex multitask learning with two unimodal encoders, limiting early cross-modal fusion. We instead introduce new pre-training masking schemes that better mix across modalities (e.g. by forcing masks for text to predict the closest video embeddings) while also maintaining separability (e.g. unimodal predictions are sometimes required, without using all the input). Experimental results show strong performance across a wider range of tasks than any previous method, often outperforming task-specific pre-training.(1)
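The cross-modal masking idea mentioned in the abstract — a masked text position whose prediction target is the closest video embedding — can be sketched as follows. This is a hypothetical illustration only: the function name, the cosine-similarity choice, and the shapes are assumptions, not the paper's actual implementation.

```python
import numpy as np

def masked_text_to_video_targets(text_emb, video_emb, mask_idx):
    """For each masked text position, return the index of the closest
    video-frame embedding (by cosine similarity), which would serve as
    that mask's prediction target. Hypothetical sketch, not VLM's code."""
    t = text_emb[mask_idx]                                   # (m, d) masked text states
    t = t / np.linalg.norm(t, axis=1, keepdims=True)         # unit-normalize text
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sim = t @ v.T                                            # (m, n_frames) cosine sims
    return sim.argmax(axis=1)                                # nearest frame per mask

# Toy usage: 6 text tokens and 4 video-frame embeddings, dim 8.
rng = np.random.default_rng(0)
text = rng.normal(size=(6, 8))
video = rng.normal(size=(4, 8))
targets = masked_text_to_video_targets(text, video, [1, 3])
```

In training, a loss would then push each masked text state toward its assigned video embedding, mixing the modalities; the unimodal masking schemes the abstract also mentions would instead predict within a single modality.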
Pages: 4227-4239 (13 pages)
Related papers (40 in total)
  • [11] Korbar Bruno, 2020, ARXIV200607203
  • [12] Lewis M, 2019, P 58 ANN M ASS COMP, DOI 10.18653/V1/2020
  • [13] Li Gen, 2020, AAAI CONF ARTIF INTE, P11336
  • [14] Li LJ, 2020, PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), P2046
  • [15] Liu Yinhan, 2019, ARXIV190711692 (RoBERTa: A Robustly Optimized BERT Pretraining Approach)
  • [16] Lu JS, 2019, ADV NEUR IN, V32
  • [17] Luo Huaishao, 2020, ARXIV200206353
  • [18] Miech Antoine, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Proceedings, P9876, DOI 10.1109/CVPR42600.2020.00990
  • [19] Miech Antoine, 2019, 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), P2630 (HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips)
  • [20] Patrick Mandela, 2021, INT C LEARN REPR