Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos

Cited by: 7
Authors
Dong, Sixun [1 ]
Hu, Huazhang [1 ]
Lian, Dongze [2 ]
Luo, Weixin [3 ]
Qian, Yicheng [1 ]
Gao, Shenghua [1 ,4 ,5 ]
Affiliations
[1] ShanghaiTech Univ, Shanghai, Peoples R China
[2] Natl Univ Singapore, Singapore, Singapore
[3] Meituan, Beijing, Peoples R China
[4] Shanghai Engn Res Ctr Intelligent Vis & Imaging, Shanghai, Peoples R China
[5] Shanghai Engn Res Ctr Energy Efficient & Custom I, Shanghai, Peoples R China
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR | 2023
Funding
National Key R&D Program of China
DOI
10.1109/CVPR52729.2023.00241
CLC classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Sequential video understanding, as an emerging video understanding task, has attracted considerable attention from researchers because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding, where accurate timestamp-level text-video alignment is not provided. We solve this task by borrowing ideas from CLIP. Specifically, we use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and to the whole video, respectively. To model the correspondence between text and video, we propose a multiple-granularity loss, where a video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces matching between each action and its description. As frame-sentence correspondence is not available, we propose to exploit the fact that video actions happen sequentially in the temporal domain to generate pseudo frame-sentence correspondences and supervise network training with these pseudo labels. Extensive experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin, which validates the effectiveness of our proposed approach. Code is available at https://github.com/svip-lab/WeakSVR.
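The building block behind both granularities of the proposed loss is a CLIP-style symmetric contrastive (InfoNCE) objective over matched embedding pairs. Below is a minimal numpy sketch of that generic building block, assuming a batch where row i of the video embeddings is matched with row i of the text embeddings; function and variable names are illustrative, not the authors' released code.

```python
import numpy as np

def clip_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (video, text) pairs.

    video_emb, text_emb: (B, D) arrays; row i of each is a matched pair.
    Returns the mean of the video->text and text->video cross-entropies.
    """
    # L2-normalize so dot products become cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature  # (B, B) similarity matrix

    def cross_entropy(l):
        # The positive for row i sits on the diagonal (column i).
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Symmetric: match videos to texts and texts to videos.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In the paper's setting this form would apply once at the video-paragraph level and once at the frame-sentence level, with the latter's positives given by the pseudo correspondences derived from the sequential ordering of actions.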
Pages: 2437-2447
Page count: 11