Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos

Cited by: 7
Authors
Dong, Sixun [1 ]
Hu, Huazhang [1 ]
Lian, Dongze [2 ]
Luo, Weixin [3 ]
Qian, Yicheng [1 ]
Gao, Shenghua [1 ,4 ,5 ]
Affiliations
[1] ShanghaiTech Univ, Shanghai, Peoples R China
[2] Natl Univ Singapore, Singapore, Singapore
[3] Meituan, Beijing, Peoples R China
[4] Shanghai Engn Res Ctr Intelligent Vis & Imaging, Shanghai, Peoples R China
[5] Shanghai Engn Res Ctr Energy Efficient & Custom I, Shanghai, Peoples R China
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
Funding
National Key R&D Program of China
DOI
10.1109/CVPR52729.2023.00241
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Sequential video understanding, an emerging video understanding task, has attracted considerable research attention because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding, where accurate timestamp-level text-video alignment is not provided. We solve this task by borrowing ideas from CLIP. Specifically, we use a transformer to aggregate frame-level features for the video representation and a pre-trained text encoder to encode the texts corresponding to each action and to the whole video, respectively. To model the correspondence between text and video, we propose a multiple-granularity loss, where a video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces matching between each action and its description. As frame-sentence correspondence is not available, we exploit the fact that video actions happen sequentially in the temporal domain to generate pseudo frame-sentence correspondences and supervise network training with these pseudo labels. Extensive experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin, validating the effectiveness of the proposed approach. Code is available at https://github.com/svip-lab/WeakSVR.
Pages: 2437-2447
Page count: 11
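
For illustration only, the following minimal PyTorch sketch shows one way the abstract's multiple-granularity objective could be realized. The uniform frame-to-sentence splitting rule, the function names (info_nce, pseudo_alignment, multiple_granularity_loss), and the hyperparameters (temperature, lam) are assumptions for this sketch, not the authors' released implementation; see the repository linked in the abstract for that.

    import torch
    import torch.nn.functional as F

    def info_nce(a, b, temperature=0.07):
        """Symmetric InfoNCE between two embedding sets whose i-th rows are positives."""
        a = F.normalize(a, dim=-1)
        b = F.normalize(b, dim=-1)
        logits = a @ b.t() / temperature                    # (N, N) cosine similarities
        targets = torch.arange(a.size(0), device=a.device)  # diagonal entries are positives
        # Contrast in both directions, as in CLIP.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    def pseudo_alignment(num_frames, num_sentences):
        """Assign each frame a sentence index by splitting the timeline uniformly.

        Frame t of T maps to sentence floor(t * S / T); this uniform rule is an
        assumed stand-in for the paper's sequentiality-based pseudo labels.
        """
        t = torch.arange(num_frames)
        return (t * num_sentences) // num_frames            # (T,) long tensor

    def multiple_granularity_loss(video_emb, paragraph_emb, frame_emb, sentence_emb,
                                  temperature=0.07, lam=1.0):
        """video_emb, paragraph_emb: (B, D) batch of video / script embeddings.
        frame_emb: (T, D) frames of one video; sentence_emb: (S, D) its sentences.
        """
        # Coarse granularity: whole video <-> complete script, contrasted over the batch.
        loss_vp = info_nce(video_emb, paragraph_emb, temperature)

        # Fine granularity: each frame classifies its pseudo-assigned sentence
        # among all S sentences of the same video.
        assign = pseudo_alignment(frame_emb.size(0),
                                  sentence_emb.size(0)).to(frame_emb.device)
        sims = (F.normalize(frame_emb, dim=-1)
                @ F.normalize(sentence_emb, dim=-1).t()) / temperature  # (T, S) logits
        loss_fs = F.cross_entropy(sims, assign)

        return loss_vp + lam * loss_fs

In this sketch, the coarse term is a standard CLIP-style symmetric InfoNCE over a batch of video-script pairs, while the fine term turns the pseudo alignment into a per-frame classification over the video's own sentences, which is one simple way to exploit the sequential ordering of actions that the abstract describes.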
相关论文
共 63 条
  • [31] Action Shuffle Alternating Learning for Unsupervised Action Segmentation
    Li, Jun
    Todorovic, Sinisa
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 12623 - 12631
  • [32] Li M., 2022, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, P16420
  • [33] Analysis of output coupling characteristics among multiple photovoltaic power stations based on correlation coefficient
    Li, Qingsheng
    Zhang, Yu
    Liu, Wenxia
    Li, Zhen
    Chen, Julong
    Hu, Jiang
    Liang, Shuai
    [J]. ENERGY REPORTS, 2022, 8 : 908 - 915
  • [34] RESOUND: Towards Action Recognition Without Representation Bias
    Li, Yingwei
    Li, Yi
    Vasconcelos, Nuno
    [J]. COMPUTER VISION - ECCV 2018, PT VI, 2018, 11210 : 520 - 535
  • [35] Cross-modal Representation Learning for Zero-shot Action Recognition
    Lin, Chung-Ching
    Lin, Kevin
    Wang, Lijuan
    Liu, Zicheng
    Li, Linjie
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 19946 - 19956
  • [36] Learning To Recognize Procedural Activities with Distant Supervision
    Lin, Xudong
    Petroni, Fabio
    Bertasius, Gedas
    Rohrbach, Marcus
    Chang, Shih-Fu
    Torresani, Lorenzo
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 13843 - 13853
  • [37] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
    Liu, Ze
    Lin, Yutong
    Cao, Yue
    Hu, Han
    Wei, Yixuan
    Zhang, Zheng
    Lin, Stephen
    Guo, Baining
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 9992 - 10002
  • [38] Lu ZJ, 2022, P IEEE CVF C COMP VI, P19903
  • [39] CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning
    Luo, Huaishao
    Ji, Lei
    Zhong, Ming
    Chen, Yang
    Lei, Wen
    Duan, Nan
    Li, Tianrui
    [J]. NEUROCOMPUTING, 2022, 508 : 293 - 304
  • [40] MengmengWang Jiazheng Xing, 2021, ARXIV210908472