Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos

Cited by: 7
Authors
Dong, Sixun [1 ]
Hu, Huazhang [1 ]
Lian, Dongze [2 ]
Luo, Weixin [3 ]
Qian, Yicheng [1 ]
Gao, Shenghua [1 ,4 ,5 ]
Affiliations
[1] ShanghaiTech Univ, Shanghai, Peoples R China
[2] Natl Univ Singapore, Singapore, Singapore
[3] Meituan, Beijing, Peoples R China
[4] Shanghai Engn Res Ctr Intelligent Vis & Imaging, Shanghai, Peoples R China
[5] Shanghai Engn Res Ctr Energy Efficient & Custom I, Shanghai, Peoples R China
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR | 2023
Funding
National Key R&D Program of China
DOI
10.1109/CVPR52729.2023.00241
CLC classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Sequential video understanding, as an emerging video understanding task, has attracted considerable attention from researchers because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding, where accurate timestamp-level text-video alignment is not provided. We solve this task by borrowing ideas from CLIP. Specifically, we use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and to the whole video, respectively. To model the correspondence between text and video, we propose a multiple-granularity loss, where a video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces matching between each action and its description. As frame-sentence correspondence is not available, we propose to exploit the fact that video actions happen sequentially in the temporal domain to generate pseudo frame-sentence correspondences and supervise network training with these pseudo labels. Extensive experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin, which validates the effectiveness of our proposed approach. Code is available at https://github.com/svip-lab/WeakSVR.
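The building block behind both granularities of the proposed loss is a CLIP-style symmetric contrastive (InfoNCE) objective over matched embedding pairs. Below is a minimal numpy sketch of that generic building block, assuming a batch where row i of the video embeddings is matched with row i of the text embeddings; function and variable names are illustrative, not the authors' released code.

```python
import numpy as np

def clip_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (video, text) pairs.

    video_emb, text_emb: (B, D) arrays; row i of each is a matched pair.
    Returns the mean of the video->text and text->video cross-entropies.
    """
    # L2-normalize so dot products become cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature  # (B, B) similarity matrix

    def cross_entropy(l):
        # The positive for row i sits on the diagonal (column i).
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Symmetric: match videos to texts and texts to videos.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In the paper's setting this form would apply once at the video-paragraph level and once at the frame-sentence level, with the latter's positives given by the pseudo correspondences derived from the sequential ordering of actions.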
Pages: 2437-2447
Page count: 11