Video Contrastive Learning with Global Context

Cited by: 37
Authors
Kuang, Haofei [1 ,3 ]
Zhu, Yi [2 ]
Zhang, Zhi [2 ]
Li, Xinyu [2 ]
Tighe, Joseph [2 ]
Schwertfeger, Soeren [3 ]
Stachniss, Cyrill [1 ]
Li, Mu [2 ]
Affiliations
[1] Univ Bonn, Bonn, Germany
[2] Amazon Web Serv, Seattle, WA USA
[3] ShanghaiTech Univ, Shanghai, Peoples R China
Source
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021) | 2021
DOI
10.1109/ICCVW54120.2021.00358
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Contrastive learning has revolutionized self-supervised image representation learning and has recently been adapted to the video domain. One of the greatest advantages of contrastive learning is that it allows us to flexibly define powerful loss objectives, as long as we can find a reasonable way to formulate positive and negative samples to contrast. However, existing approaches rely heavily on short-range spatiotemporal salience to form clip-level contrastive signals, thus limiting their ability to use global context. In this paper, we propose a new video-level contrastive learning method that uses segments to formulate positive pairs. Our formulation is able to capture the global context of a video and is thus robust to temporal content change. We also incorporate a temporal order regularization term to enforce the inherent sequential structure of videos. Extensive experiments show that our video-level contrastive learning framework (VCLR) outperforms previous state-of-the-art methods on five video datasets for downstream action classification, action localization, and video retrieval.
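The segment-based positive-pair formulation described in the abstract can be sketched roughly as follows. This is a minimal illustration under assumed details, not the authors' implementation: the function names are hypothetical, and the idea shown is simply that two independent segment-wise frame samplings of the same video each span the full duration, so the resulting pair shares video-level (global) context rather than a short clip.

```python
import random

def sample_segment_view(num_frames, num_segments, rng=random):
    """Sample one frame index from each of `num_segments` equal temporal
    segments, so the resulting 'clip' spans the whole video."""
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start + 1, int((k + 1) * seg_len))
        indices.append(rng.randrange(start, end))
    return indices

def positive_pair(num_frames, num_segments=8, seed=None):
    """Two independent segment-wise samplings of the same video form a
    video-level positive pair: each view covers the full duration."""
    rng = random.Random(seed)
    return (sample_segment_view(num_frames, num_segments, rng),
            sample_segment_view(num_frames, num_segments, rng))
```

Because each view draws one frame per segment, the sampled indices are temporally ordered by construction, which is also the property a temporal order regularization term would exploit.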
Pages: 3188-3197 (10 pages)