SWIN transformer based contrastive self-supervised learning for animal detection and classification

Cited by: 14
Authors
Agilandeeswari, L. [1 ]
Meena, S. Divya [2 ]
Affiliations
[1] VIT, Sch Informat Technol & Engn, Vellore, TN, India
[2] VIT, Sch Comp Sci & Engn, Amaravathi, Andhra Pradesh, India
Keywords
Image classification; Swin transformer; Contrastive self-supervised learning; Clustering; Mutual information
DOI
10.1007/s11042-022-13629-x
CLC Number
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Image classification, a subdomain of computer vision, categorizes images into semantic classes. The advent of handheld devices and image sensors has made huge amounts of unlabeled data available. Supervised learning is unsuitable for categorizing such images because it requires labels, while unsupervised clustering is also unreliable, since its accuracy suffers when the data are not labeled in advance. Self-supervised learning techniques can overcome this problem. In this work, we present a novel Swin Transformer based Contrastive Self-Supervised Learning (Swin-TCSSL) framework, in which a paired sample is formed by transforming the given input image and passed to a Swin-T transformer that produces a feature vector for each view. Maximizing the mutual information between these feature vectors yields robust clusters, and the cluster labels are propagated back to the Swin Transformer block until appropriate clusters are obtained. Contrastive learning then follows, finally producing the classified output. The experimental results show that the proposed system is invariant to occlusion, viewpoint variation, and illumination effects. The proposed Swin-TCSSL achieves state-of-the-art results on five benchmark datasets, namely CIFAR-10, Snapshot Serengeti, Stanford Dogs, Animals with Attributes, and ImageNet. As evident from the rigorous experiments, Swin-TCSSL sets a new state of the art with an average accuracy of 97.63%, which is higher than that of competing systems.
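To make the pipeline concrete, the sketch below shows a minimal contrastive step of the kind the abstract describes: two augmented views of each unlabeled image are encoded by a Swin-T backbone and their embeddings pulled together by a contrastive loss. It assumes PyTorch, torchvision, and the timm Swin-T model; the projection-head dimensions, augmentation recipe, and SimCLR-style NT-Xent loss are standard illustrative choices, not the authors' exact implementation, and the mutual-information clustering and pseudo-label propagation stages of Swin-TCSSL are only indicated in comments.

import torch
import torch.nn.functional as F
import timm
from torchvision import transforms

# Two random views of the same image form the positive pair.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])

# Swin-T backbone; num_classes=0 makes timm return pooled features
# (768-dimensional for swin_tiny_patch4_window7_224).
backbone = timm.create_model("swin_tiny_patch4_window7_224",
                             pretrained=False, num_classes=0)

# Hypothetical projection head (dimensions are illustrative).
projector = torch.nn.Sequential(
    torch.nn.Linear(768, 256), torch.nn.ReLU(), torch.nn.Linear(256, 128))

def nt_xent_loss(z1, z2, temperature=0.5):
    # Standard NT-Xent loss over a batch of N embedding pairs.
    z = F.normalize(torch.cat([z1, z2]), dim=1)       # (2N, d)
    sim = z @ z.t() / temperature                     # cosine similarities
    n = z1.size(0)
    eye = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float("-inf"))         # exclude self-pairs
    # The positive for view i is its augmented twin at index (i + n) mod 2n.
    targets = torch.cat([torch.arange(n, 2 * n),
                         torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# One illustrative step on a batch of unlabeled PIL images `pil_batch`;
# the mutual-information clustering that produces pseudo-labels for the
# Swin block is omitted here.
# v1 = torch.stack([augment(im) for im in pil_batch])
# v2 = torch.stack([augment(im) for im in pil_batch])
# loss = nt_xent_loss(projector(backbone(v1)), projector(backbone(v2)))
# loss.backward()

In a sketch like this, the backbone and projector would be trained jointly on the contrastive objective, with the clustering stage periodically refining pseudo-labels from the learned features.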
Pages: 10445-10470
Page count: 26