VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

Cited by: 0
Authors
Xu, Hu [1 ]
Ghosh, Gargi [1 ]
Huang, Po-Yao [1 ,2 ]
Arora, Prahal [1 ]
Aminzadeh, Masoumeh [1 ]
Feichtenhofer, Christoph [1 ]
Metze, Florian [1 ]
Zettlemoyer, Luke [1 ]
Affiliations
[1] Facebook AI, Menlo Pk, CA 94205 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA USA
Source
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021 | 2021
Keywords: (none listed)
DOI: not available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training methods are task-specific: they adopt either a single cross-modal encoder that requires both modalities, limiting their use for retrieval-style end tasks, or more complex multitask learning with two unimodal encoders, limiting early cross-modal fusion. We instead introduce new pre-training masking schemes that better mix across modalities (e.g. by forcing masks for text to predict the closest video embeddings) while also maintaining separability (e.g. unimodal predictions are sometimes required, without using all the input). Experimental results show strong performance across a wider range of tasks than any previous method, often outperforming task-specific pre-training.(1)
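The cross-modal masking idea mentioned in the abstract — a masked text position whose prediction target is the closest video embedding — can be sketched as follows. This is a hypothetical illustration only: the function name, the cosine-similarity choice, and the shapes are assumptions, not the paper's actual implementation.

```python
import numpy as np

def masked_text_to_video_targets(text_emb, video_emb, mask_idx):
    """For each masked text position, return the index of the closest
    video-frame embedding (by cosine similarity), which would serve as
    that mask's prediction target. Hypothetical sketch, not VLM's code."""
    t = text_emb[mask_idx]                                   # (m, d) masked text states
    t = t / np.linalg.norm(t, axis=1, keepdims=True)         # unit-normalize text
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sim = t @ v.T                                            # (m, n_frames) cosine sims
    return sim.argmax(axis=1)                                # nearest frame per mask

# Toy usage: 6 text tokens and 4 video-frame embeddings, dim 8.
rng = np.random.default_rng(0)
text = rng.normal(size=(6, 8))
video = rng.normal(size=(4, 8))
targets = masked_text_to_video_targets(text, video, [1, 3])
```

In training, a loss would then push each masked text state toward its assigned video embedding, mixing the modalities; the unimodal masking schemes the abstract also mentions would instead predict within a single modality.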
Pages: 4227-4239 (13 pages)
Related papers (40 in total)
  • [11] Korbar Bruno, 2020, ARXIV200607203
  • [12] Lewis M, 2019, P 58 ANN M ASS COMP, DOI 10.18653/V1/2020
  • [13] Li Gen, 2020, AAAI CONF ARTIF INTE, P11336
  • [14] Li LJ, 2020, PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), P2046
  • [15] Liu Yinhan, 2019, ARXIV190711692 (RoBERTa: A Robustly Optimized BERT Pretraining Approach)
  • [16] Lu JS, 2019, ADV NEUR IN, V32
  • [17] Luo Huaishao, 2020, ARXIV200206353
  • [18] Miech Antoine, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Proceedings, P9876, DOI 10.1109/CVPR42600.2020.00990
  • [19] Miech Antoine, 2019, 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), P2630 (HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips)
  • [20] Patrick Mandela, 2021, INT C LEARN REPR