STDP-Net: Improved Pedestrian Attribute Recognition Using Swin Transformer and Semantic Self-Attention

Cited by: 4

Authors:

Lee, Geonu [1]

Cho, Jungchan [1]

Affiliation:

[1] Gachon Univ, Coll Informat Technol, Seongnam Si 13120, Gyeonggi Do, South Korea

Source:

IEEE Access | 2022 / Vol. 10

Keywords:

Transformers; Semantics; Decoding; Head; Convolution; Task analysis; Image recognition; Deep learning; pedestrian attribute recognition; self-attention; transformer

DOI:

10.1109/ACCESS.2022.3196650

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

The image region that must be attended to in order to recognize a specific pedestrian attribute often depends on the state of the pedestrian within the image. In addition, various pedestrian attributes are closely related to each other. For example, the "Boots" and "ShortSkirt" attributes are related to the "Female" attribute. For these reasons, we propose a novel encoder-decoder network for pedestrian attribute recognition, called Swin Transformer and Decoder for Pedestrian attribute recognition Network (STDP-Net). First, we use a Swin Transformer, which relies on self-attention, as the encoder. Unlike conventional convolution-based methods, this allows the proposed method to model the relative relationships between spatial regions of the image, which enables accurate attribute recognition even when the pedestrian image is misaligned. Second, we add a transformer decoder with learnable attribute queries on top of the encoder to model the semantic relationships among the attributes. The decoder captures these relationships through self-attention over the attribute queries. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on six pedestrian attribute recognition datasets. In addition, misalignment experiments on the PETA, PA100K, and RAP datasets show the superiority of the encoder-decoder structure over other state-of-the-art methods.
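The decoder idea in the abstract (learnable attribute queries that attend to each other and to the encoder output) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a DETR-style ordering (self-attention among queries, then cross-attention to encoder tokens), omits learned projections, layer normalization, and feed-forward layers, and uses random numbers in place of trained parameters and of real Swin Transformer features. All names (`attend`, `decode_attributes`, `head_weights`) are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention: each query becomes a weighted mix of values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def add(xs, ys):
    """Element-wise residual connection between two lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def decode_attributes(patch_tokens, attr_queries, head_weights):
    """One simplified decoder layer followed by a per-attribute sigmoid head."""
    # Self-attention among attribute queries: lets related attributes
    # (e.g. "Boots" and "Female") exchange information.
    q = add(attr_queries, attend(attr_queries, attr_queries, attr_queries))
    # Cross-attention: each attribute query gathers evidence from the
    # spatial tokens produced by the encoder.
    q = add(q, attend(q, patch_tokens, patch_tokens))
    # One independent binary prediction per attribute (multi-label output).
    logits = [sum(a * b for a, b in zip(qi, wi)) for qi, wi in zip(q, head_weights)]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


# Assumed toy sizes; random values stand in for encoder output and trained weights.
rng = random.Random(0)
dim, num_patches, num_attrs = 8, 49, 3
patch_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
attr_queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_attrs)]
head_weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_attrs)]

probs = decode_attributes(patch_tokens, attr_queries, head_weights)
```

Here `probs` holds one probability per attribute query. In a trained model each query would correspond to a named attribute and the probabilities would be thresholded to produce the multi-label prediction.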


Pages: 82656-82667

Page count: 12
