POAR: Towards Open Vocabulary Pedestrian Attribute Recognition

Cited by: 1
Authors
Zhang, Yue [1, 2]
Wang, Suchen [3]
Kan, Shichao [4]
Weng, Zhenyu [3]
Cen, Yigang [5, 6]
Tan, Yap-peng [3]
Affiliations
[1] Henan Normal Univ, Coll Comp & Informat Engn, Xinxiang, Henan, Peoples R China
[2] Henan Normal Univ, Key Lab Artificial Intelligence & Personalized Le, Xinxiang, Peoples R China
[3] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore, Singapore
[4] Cent South Univ, Sch Comp Sci & Engn, Changsha, Peoples R China
[5] Beijing Jiaotong Univ, Inst Informat Sci, Beijing, Peoples R China
[6] Beijing Jiaotong Univ, Beijing Key Lab Adv Informat Sci & Network Techno, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China;
Keywords
Pedestrian attribute recognition; CLIP; Open-attribute recognition; NETWORK;
DOI
10.1145/3581783.3611719
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Pedestrian attribute recognition (PAR) aims to predict the attributes of a target pedestrian. Recent methods often address the PAR problem by training a multi-label classifier with predefined attribute classes, but they can hardly exhaust all possible pedestrian attributes in the real world. To tackle this problem, we propose a novel Pedestrian Open-Attribute Recognition (POAR) approach by formulating the problem as a task of image-text search. Our approach employs a Transformer-based Encoder with a Masking Strategy (TEMS) to focus on the attributes of specific pedestrian parts (e.g., head, upper body, lower body, feet), and introduces a set of attribute tokens to encode the corresponding attributes into visual embeddings. Each attribute category is described as a natural language sentence and encoded by the text encoder. Then, we compute the similarity between the visual and text embeddings to find the best attribute descriptions for the input images. To handle multiple attributes of a single pedestrian, we propose a Many-To-Many Contrastive (MTMC) loss with masked tokens. In addition, we propose a Grouped Knowledge Distillation (GKD) method to minimize the disparity between visual embeddings and unseen attribute text embeddings. We evaluate our proposed method on three PAR datasets with an open-attribute setting. The results demonstrate the effectiveness of our method as a strong baseline for the POAR task. Our code is available at https://github.com/IvyYZ/POAR.
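The image-text search step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the three-dimensional toy vectors, and the example sentences are all hypothetical stand-ins for the outputs of the visual and text encoders, and cosine similarity is assumed as the similarity measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_attributes(visual_embedding, text_embeddings):
    """Return (description, score) pairs sorted by descending similarity."""
    scored = [
        (description, cosine(visual_embedding, embedding))
        for description, embedding in text_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embedding for one pedestrian part (here: the upper body).
upper_body_embedding = [0.9, 0.1, 0.2]

# Hypothetical text embeddings, one per attribute sentence.
text_embeddings = {
    "a pedestrian wearing a long-sleeved shirt": [1.0, 0.0, 0.1],
    "a pedestrian wearing a short-sleeved shirt": [0.1, 1.0, 0.0],
    "a pedestrian wearing a coat": [0.0, 0.2, 1.0],
}

ranking = rank_attributes(upper_body_embedding, text_embeddings)
best_description, best_score = ranking[0]
print(best_description, round(best_score, 3))
```

Because attributes are matched as sentences rather than as fixed classifier outputs, an attribute not seen in training could in principle be queried by adding its sentence embedding to `text_embeddings`; how well that works in practice depends on the trained encoders, which this sketch does not model.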
Pages: 655-665
Page count: 11