Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos

Cited by: 7
Authors
Dong, Sixun [1 ]
Hu, Huazhang [1 ]
Lian, Dongze [2 ]
Luo, Weixin [3 ]
Qian, Yicheng [1 ]
Gao, Shenghua [1 ,4 ,5 ]
Affiliations
[1] ShanghaiTech Univ, Shanghai, Peoples R China
[2] Natl Univ Singapore, Singapore, Singapore
[3] Meituan, Beijing, Peoples R China
[4] Shanghai Engn Res Ctr Intelligent Vis & Imaging, Shanghai, Peoples R China
[5] Shanghai Engn Res Ctr Energy Efficient & Custom I, Shanghai, Peoples R China
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
Funding
National Key R&D Program of China
DOI
10.1109/CVPR52729.2023.00241
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Sequential video understanding, an emerging video understanding task, has attracted considerable research attention because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding, where accurate timestamp-level text-video alignment is not provided. We solve this task by borrowing ideas from CLIP. Specifically, we use a transformer to aggregate frame-level features for the video representation and a pre-trained text encoder to encode the texts corresponding to each action and to the whole video, respectively. To model the correspondence between text and video, we propose a multiple-granularity loss, where a video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces matching between each action and its description. As frame-sentence correspondence is not available, we exploit the fact that video actions happen sequentially in the temporal domain to generate pseudo frame-sentence correspondences and supervise network training with these pseudo labels. Extensive experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin, validating the effectiveness of the proposed approach. Code is available at https://github.com/svip-lab/WeakSVR.
Pages: 2437-2447
Page count: 11
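
For illustration only, the following minimal PyTorch sketch shows one way the abstract's multiple-granularity objective could be realized. The uniform frame-to-sentence splitting rule, the function names (info_nce, pseudo_alignment, multiple_granularity_loss), and the hyperparameters (temperature, lam) are assumptions for this sketch, not the authors' released implementation; see the repository linked in the abstract for that.

    import torch
    import torch.nn.functional as F

    def info_nce(a, b, temperature=0.07):
        """Symmetric InfoNCE between two embedding sets whose i-th rows are positives."""
        a = F.normalize(a, dim=-1)
        b = F.normalize(b, dim=-1)
        logits = a @ b.t() / temperature                    # (N, N) cosine similarities
        targets = torch.arange(a.size(0), device=a.device)  # diagonal entries are positives
        # Contrast in both directions, as in CLIP.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    def pseudo_alignment(num_frames, num_sentences):
        """Assign each frame a sentence index by splitting the timeline uniformly.

        Frame t of T maps to sentence floor(t * S / T); this uniform rule is an
        assumed stand-in for the paper's sequentiality-based pseudo labels.
        """
        t = torch.arange(num_frames)
        return (t * num_sentences) // num_frames            # (T,) long tensor

    def multiple_granularity_loss(video_emb, paragraph_emb, frame_emb, sentence_emb,
                                  temperature=0.07, lam=1.0):
        """video_emb, paragraph_emb: (B, D) batch of video / script embeddings.
        frame_emb: (T, D) frames of one video; sentence_emb: (S, D) its sentences.
        """
        # Coarse granularity: whole video <-> complete script, contrasted over the batch.
        loss_vp = info_nce(video_emb, paragraph_emb, temperature)

        # Fine granularity: each frame classifies its pseudo-assigned sentence
        # among all S sentences of the same video.
        assign = pseudo_alignment(frame_emb.size(0),
                                  sentence_emb.size(0)).to(frame_emb.device)
        sims = (F.normalize(frame_emb, dim=-1)
                @ F.normalize(sentence_emb, dim=-1).t()) / temperature  # (T, S) logits
        loss_fs = F.cross_entropy(sims, assign)

        return loss_vp + lam * loss_fs

In this sketch, the coarse term is a standard CLIP-style symmetric InfoNCE over a batch of video-script pairs, while the fine term turns the pseudo alignment into a per-frame classification over the video's own sentences, which is one simple way to exploit the sequential ordering of actions that the abstract describes.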
相关论文
共 63 条
  • [31] Action Shuffle Alternating Learning for Unsupervised Action Segmentation
    Li, Jun
    Todorovic, Sinisa
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 12623 - 12631
  • [32] Li M., 2022, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, P16420
  • [33] Analysis of output coupling characteristics among multiple photovoltaic power stations based on correlation coefficient
    Li, Qingsheng
    Zhang, Yu
    Liu, Wenxia
    Li, Zhen
    Chen, Julong
    Hu, Jiang
    Liang, Shuai
    [J]. ENERGY REPORTS, 2022, 8 : 908 - 915
  • [34] RESOUND: Towards Action Recognition Without Representation Bias
    Li, Yingwei
    Li, Yi
    Vasconcelos, Nuno
    [J]. COMPUTER VISION - ECCV 2018, PT VI, 2018, 11210 : 520 - 535
  • [35] Cross-modal Representation Learning for Zero-shot Action Recognition
    Lin, Chung-Ching
    Lin, Kevin
    Wang, Lijuan
    Liu, Zicheng
    Li, Linjie
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 19946 - 19956
  • [36] Learning To Recognize Procedural Activities with Distant Supervision
    Lin, Xudong
    Petroni, Fabio
    Bertasius, Gedas
    Rohrbach, Marcus
    Chang, Shih-Fu
    Torresani, Lorenzo
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 13843 - 13853
  • [37] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
    Liu, Ze
    Lin, Yutong
    Cao, Yue
    Hu, Han
    Wei, Yixuan
    Zhang, Zheng
    Lin, Stephen
    Guo, Baining
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 9992 - 10002
  • [38] Lu ZJ, 2022, P IEEE CVF C COMP VI, P19903
  • [39] CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning
    Luo, Huaishao
    Ji, Lei
    Zhong, Ming
    Chen, Yang
    Lei, Wen
    Duan, Nan
    Li, Tianrui
    [J]. NEUROCOMPUTING, 2022, 508 : 293 - 304
  • [40] MengmengWang Jiazheng Xing, 2021, ARXIV210908472