VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

Cited by: 0
Authors
Xu, Hu [1]
Ghosh, Gargi [1]
Huang, Po-Yao [1, 2]
Arora, Prahal [1]
Aminzadeh, Masoumeh [1]
Feichtenhofer, Christoph [1]
Metze, Florian [1]
Zettlemoyer, Luke [1]
Affiliations
[1] Facebook AI, Menlo Pk, CA 94205 USA
[2] Carnegie Mellon Univ, Pittsburgh, PA USA
Source
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 | 2021
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training approaches are task-specific: they adopt either a single cross-modal encoder that requires both modalities, which limits their use for retrieval-style end tasks, or more complex multitask learning with two unimodal encoders, which limits early cross-modal fusion. We instead introduce new pre-training masking schemes that better mix across modalities (e.g., by forcing masks for text to predict the closest video embeddings) while also maintaining separability (e.g., unimodal predictions are sometimes required, without using all the input). Experimental results show strong performance across a wider range of tasks than any previous method, often outperforming task-specific pre-training.
Pages: 4227-4239
Page count: 13