CNVid-3.5M: Build, Filter, and Pre-train the Large-scale Public Chinese Video-text Dataset

Cited by: 4

Authors:

Gan, Tian [1]

Wang, Qing [2]

Dong, Xingning [1]

Ren, Xiangyuan [2]

Nie, Liqiang [3]

Guo, Qingpei [2]

Affiliations:

[1] Shandong University, Jinan, People's Republic of China

[2] Ant Group, Hangzhou, People's Republic of China

[3] Harbin Institute of Technology (Shenzhen), Shenzhen, People's Republic of China

Source:

2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023

Funding:

National Natural Science Foundation of China

DOI:

10.1109/CVPR52729.2023.01423

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Owing to well-designed large-scale video-text datasets, recent years have witnessed tremendous progress in video-text pre-training. However, existing large-scale video-text datasets are mostly English-only. Although some methods study Chinese video-text pre-training, they pre-train their models on private datasets whose videos and text are unavailable. This lack of large-scale public datasets and benchmarks in Chinese hampers research on, and downstream applications of, Chinese video-text pre-training. To this end, we release and benchmark CNVid-3.5M, a large-scale public cross-modal dataset containing over 3.5M Chinese video-text pairs. We summarize our contributions with three verbs, "Build", "Filter", and "Pre-train": 1) To build a public Chinese video-text dataset, we collect over 4.5M videos from Chinese websites. 2) To improve data quality, we propose a novel method to filter out 1M weakly-paired videos, resulting in the CNVid-3.5M dataset. 3) We benchmark CNVid-3.5M with three mainstream pixel-level pre-training architectures. Finally, we propose the Hard Sample Curriculum Learning strategy to improve pre-training performance. To the best of our knowledge, CNVid-3.5M is the largest public video-text dataset in Chinese, and we provide the first pixel-level benchmarks for Chinese video-text pre-training. The dataset, codebase, and pre-trained models are available at https://github.com/CNVid/CNVid-3.5M.

Pages: 14815-14824

Page count: 10
