CNVid-3.5M: Build, Filter, and Pre-train the Large-scale Public Chinese Video-text Dataset

Cited by: 4
Authors
Gan, Tian [1]
Wang, Qing [2]
Dong, Xingning [1]
Ren, Xiangyuan [2]
Nie, Liqiang [3]
Guo, Qingpei [2]
Affiliations
[1] Shandong Univ, Jinan, Peoples R China
[2] Ant Grp, Hangzhou, Peoples R China
[3] Harbin Inst Technol Shenzhen, Shenzhen, Peoples R China
Source
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) | 2023
Funding
National Natural Science Foundation of China
DOI
10.1109/CVPR52729.2023.01423
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Owing to well-designed large-scale video-text datasets, recent years have witnessed tremendous progress in video-text pre-training. However, existing large-scale video-text datasets are mostly English-only. Although some methods study Chinese video-text pre-training, they pre-train their models on private datasets whose videos and text are unavailable. This lack of large-scale public datasets and benchmarks in Chinese hampers the research and downstream applications of Chinese video-text pre-training. To this end, we release and benchmark CNVid-3.5M, a large-scale public cross-modal dataset containing over 3.5M Chinese video-text pairs. We summarize our contributions with three verbs, i.e., "Build", "Filter", and "Pre-train": 1) To build a public Chinese video-text dataset, we collect over 4.5M videos from Chinese websites. 2) To improve the data quality, we propose a novel method to filter out 1M weakly-paired videos, resulting in the CNVid-3.5M dataset. 3) We benchmark CNVid-3.5M with three mainstream pixel-level pre-training architectures. Finally, we propose the Hard Sample Curriculum Learning strategy to improve pre-training performance. To the best of our knowledge, CNVid-3.5M is the largest public video-text dataset in Chinese, and we provide the first pixel-level benchmarks for Chinese video-text pre-training. The dataset, codebase, and pre-trained models are available at https://github.com/CNVid/CNVid-3.5M.
Pages: 14815-14824
Number of pages: 10