Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions

Cited by: 43
Authors
Xue, Hongwei [1 ]
Hang, Tiankai [1 ]
Zeng, Yanhong [1 ]
Sun, Yuchong [1 ]
Liu, Bei [1 ]
Yang, Huan [1 ]
Fu, Jianlong [1 ]
Guo, Baining [1 ]
Affiliations
[1] Microsoft Research Asia, Beijing, People's Republic of China
Source
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022) | 2022
Keywords
DOI
10.1109/CVPR52688.2022.00498
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
We study joint video and language (VL) pre-training to enable cross-modality learning and benefit a wide range of downstream VL tasks. Existing works either extract low-quality video features or learn limited text embeddings, while neglecting that high-resolution videos and diversified semantics can significantly improve cross-modality learning. In this paper, we propose a novel High-resolution and Diversified VIdeo-LAnguage pre-training model (HD-VILA) for many visual tasks. In particular, we collect a large dataset with two distinct properties: 1) the first high-resolution dataset, including 371.5k hours of 720p videos, and 2) the most diversified dataset, covering 15 popular YouTube categories. To enable VL pre-training, we jointly optimize the HD-VILA model with a hybrid Transformer that learns rich spatiotemporal features and a multimodal Transformer that enforces interactions between the learned video features and diversified texts. Our pre-training model achieves new state-of-the-art results on 10 VL understanding tasks and 2 novel text-to-visual generation tasks. For example, we outperform SOTA models with relative increases of 40.4% in R@1 on the zero-shot MSR-VTT text-to-video retrieval task and 55.4% on the high-resolution LSMDC dataset. The learned VL embedding is also effective in generating visually pleasing and semantically relevant results in text-to-visual editing and super-resolution tasks.
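The two-Transformer design described in the abstract can be illustrated with a minimal sketch. The PyTorch snippet below is not the authors' implementation; the module names (HybridVideoEncoder, MultimodalFusion) and all layer sizes are illustrative assumptions. It only shows the overall pattern: a temporal Transformer over per-frame features, followed by a multimodal Transformer that applies self-attention over the concatenated video and text tokens.

```python
# Minimal sketch of the two-Transformer pattern the abstract describes.
# Not the authors' code; names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class HybridVideoEncoder(nn.Module):
    """Encodes per-frame features into spatiotemporal video tokens."""

    def __init__(self, dim=768, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_features):
        # frame_features: (batch, num_frames, dim), e.g. from a 2D CNN or ViT
        return self.temporal_encoder(frame_features)


class MultimodalFusion(nn.Module):
    """Cross-modal Transformer over concatenated video and text tokens."""

    def __init__(self, dim=768, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, video_tokens, text_tokens):
        # Concatenate along the sequence dimension so self-attention can
        # model video-text interactions.
        joint = torch.cat([video_tokens, text_tokens], dim=1)
        return self.fusion(joint)


if __name__ == "__main__":
    video = torch.randn(2, 16, 768)   # 2 clips, 16 frame tokens each
    text = torch.randn(2, 20, 768)    # 2 captions, 20 text tokens each
    video_tokens = HybridVideoEncoder()(video)
    fused = MultimodalFusion()(video_tokens, text)
    print(fused.shape)  # torch.Size([2, 36, 768])
```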
Pages: 5026-5035
Page count: 10