Learning Unsupervised Visual Representations using 3D Convolutional Autoencoder with Temporal Contrastive Modeling for Video Retrieval

Cited by: 2

Authors:

Kumar, Vidit [1]

Tripathi, Vikas [1]

Pant, Bhaskar [1]

Affiliation:

[1] Graphic Era Deemed to be University, Department of Computer Science and Engineering, Dehradun, Uttarakhand, India

Source:

INTERNATIONAL JOURNAL OF MATHEMATICAL ENGINEERING AND MANAGEMENT SCIENCES | 2022, Vol. 7, No. 2

Keywords:

Contrastive learning; Convolutional autoencoder; Content-based search; Deep learning; Video retrieval; Future prediction; Unsupervised learning; PREDICTION;

DOI:

10.33889/IJMEMS.2022.7.2.018

Chinese Library Classification (CLC) number:

T [Industrial Technology]

Discipline classification code:

08

Abstract:

The rapid growth of tag-free user-generated videos (on the Internet), recorded surgical videos, and surveillance videos has created a need for effective content-based video retrieval systems. Earlier video representation methods were based on hand-crafted features and rarely performed well on video retrieval tasks. Subsequently, deep learning methods demonstrated their effectiveness in both image and video-related tasks, but at the cost of creating massive labeled datasets. An economical alternative is to use freely available unlabeled web videos for representation learning. Most recently developed methods of this kind solve a single pretext task using a 2D or 3D convolutional network. In contrast, this paper designs and studies a 3D convolutional autoencoder (3D-CAE) for video representation learning, since it does not require labels. Further, this paper proposes a new unsupervised video feature learning method based on joint learning of past and future prediction using a 3D-CAE with temporal contrastive learning. The experiments are conducted on the UCF-101 and HMDB-51 datasets, where the proposed approach achieves better retrieval performance than state-of-the-art methods. In the ablation study, the action recognition task is performed by fine-tuning the unsupervised pre-trained model, which outperforms other methods and further confirms the superiority of the proposed method in learning underlying features. Such an unsupervised representation learning approach could also benefit the medical domain, where it is expensive to create large labeled datasets.
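
Illustrative note: the abstract does not state the exact form of the temporal contrastive objective, so the following is only a minimal sketch of a generic InfoNCE-style loss, assuming that the embedding of a clip is pulled toward the embedding of its predicted past or future clip (the positive) and pushed away from embeddings of clips from other videos (the negatives). The function names, the temperature value, and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v] if norm > 0 else list(v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for a single anchor embedding.

    Assumption (not from the paper): similarity is cosine similarity
    divided by a temperature, and the loss is the negative log of the
    softmax probability assigned to the positive candidate.
    """
    a = l2_normalize(anchor)
    candidates = [positive] + list(negatives)
    logits = [dot(a, l2_normalize(c)) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


# Toy embeddings: the anchor is close to its temporal neighbour and
# far from clips taken from other videos.
present_clip = [1.0, 0.0]
future_clip = [0.9, 0.1]
other_videos = [[0.0, 1.0], [-1.0, 0.0]]
loss = info_nce(present_clip, future_clip, other_videos)
```

In a setup like the one the abstract describes, the embeddings would come from the encoder of the 3D convolutional autoencoder, and a loss of this kind would be combined with the reconstruction or prediction loss; how the paper weights or combines these terms is not specified here.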


Pages: 272-287

Page count: 16

