ViT-PGC: vision transformer for pedestrian gender classification on small-size dataset

Cited by: 3

Authors:

Abbas, Farhat [1]

Yasmin, Mussarat [1]

Fayyaz, Muhammad [2]

Asim, Usman [3]

Affiliations:

[1] COMSATS Univ Islamabad, Dept Comp Sci, Wah Campus, WahCantt 47040, Pakistan

[2] FAST Natl Univ Comp & Emerging Sci NUCES, Dept Comp Sci, Chiniot Faisalabad Campus, Chiniot, Punjab, Pakistan

[3] DeltaX, 3F,24,Namdaemun Ro 9 Gil, Seoul, South Korea

Source:

PATTERN ANALYSIS AND APPLICATIONS | 2023, Vol. 26, No. 4

Keywords:

Vision transformer; LSA and SPT; Deep CNN models; SS datasets; Pedestrian gender classification; CONVOLUTIONAL NEURAL-NETWORK; RECOGNITION;

DOI:

10.1007/s10044-023-01196-2

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Pedestrian gender classification (PGC) is a key task in full-body pedestrian image analysis and has become important in applications such as content-based image retrieval, visual surveillance, smart cities, and demographic data collection. Over the last decade, convolutional neural networks (CNNs) have shown great potential and have become a reliable choice for vision tasks such as object classification, recognition, and detection. However, CNNs have a limited local receptive field, which prevents them from learning global context. A vision transformer (ViT), in contrast, uses a self-attention mechanism to attend to different patches of an input image, which makes it a strong alternative to CNNs. In this work, a ViT model built on two generic and effective modules, locality self-attention (LSA) and shifted patch tokenization (SPT), is explored for the PGC task. With these modules, the ViT can learn from scratch even on small-size (SS) datasets and overcomes its lack of locality inductive bias. In extensive experiments, the proposed ViT model produced better overall and mean accuracies than state-of-the-art (SOTA) PGC methods.
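The abstract names LSA and SPT without describing them. The sketch below illustrates both as they are commonly described in the literature on ViTs for small datasets: LSA masks each token's attention to itself and divides the scores by a temperature (learnable in a real model), and SPT stacks the image with diagonally shifted copies before it is cut into patches. This is an illustration under those assumptions, not the authors' implementation; the function names `lsa_attention_weights` and `shifted_stack`, the fixed temperature, and the zero padding are all hypothetical choices.

```python
import math


def lsa_attention_weights(queries, keys, temperature):
    """Attention weights with diagonal masking and a temperature (assumed LSA form).

    queries, keys: lists of equal-length float vectors, one per token.
    temperature: positive float; learnable in a real model, fixed here.
    """
    n = len(queries)
    if n < 2:
        raise ValueError("diagonal masking needs at least two tokens")
    weights = []
    for i in range(n):
        # Dot-product score of token i against every token, scaled by temperature.
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / temperature
            for j in range(n)
        ]
        # Diagonal masking: token i may not attend to itself.
        scores[i] = -math.inf
        # Numerically stable softmax over the remaining tokens.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def shifted_stack(image, offset):
    """Stack a 2D image with four diagonally shifted copies (assumed SPT form).

    image: list of rows of floats. Returns, per pixel, a list of 5 channel
    values: the original plus four diagonal shifts, zero-padded at the borders.
    """
    height, width = len(image), len(image[0])
    shifts = [(0, 0), (-offset, -offset), (-offset, offset),
              (offset, -offset), (offset, offset)]

    def pixel(y, x):
        # Out-of-range reads fall back to zero padding.
        return image[y][x] if 0 <= y < height and 0 <= x < width else 0.0

    return [
        [[pixel(y - dy, x - dx) for dy, dx in shifts] for x in range(width)]
        for y in range(height)
    ]
```

In a full model, the stacked channels from `shifted_stack` would be split into patches and linearly projected into tokens, and the weights from `lsa_attention_weights` would multiply the value vectors. Both steps are omitted here to keep the sketch short.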

Pages: 1805-1819

Page count: 15

