NAS-PED: Neural Architecture Search for Pedestrian Detection

Cited by: 4
Authors
Tang, Yi [1 ]
Liu, Min [1 ]
Li, Baopu [2 ]
Wang, Yaonan [1 ]
Ouyang, Wanli [3 ]
Affiliations
[1] Hunan Univ, Coll Elect & Informat Engn, Natl Engn Res Ctr Robot Visual Percept & Control T, Changsha 410082, Hunan, Peoples R China
[2] Baidu USA, Sunnyvale, CA 94087 USA
[3] Shanghai AI Lab, Shanghai 201201, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Pedestrians; Feature extraction; Transformers; Computer architecture; Detectors; Computer vision; Search problems; Linear programming; Convolutional neural networks; Training; Pedestrian detection; human-centric computer vision; neural architecture search; information bottleneck; attention;
DOI
10.1109/TPAMI.2024.3507918
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Pedestrian detection currently suffers from two issues in crowded scenes: occlusion and dense boundary prediction, making it still challenging in complex real-world scenarios. In recent years, Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have shown their respective strengths in addressing these issues: ViTs capture global feature dependencies to infer occluded parts, while CNNs make accurate dense predictions from local detailed features. Nevertheless, limited by their narrow receptive field, CNNs fail to infer occluded parts, while ViTs tend to ignore the local features that are vital for distinguishing different pedestrians in a crowd. It is therefore essential to combine the advantages of CNNs and ViTs for pedestrian detection. However, manually designing a specific CNN-ViT hybrid network requires enormous time and resources for trial and error. To address this issue, we propose the first Neural Architecture Search (NAS) framework specifically designed for pedestrian detection, named NAS-PED, which automatically designs an appropriate CNN-ViT hybrid backbone for the crowded pedestrian detection task. Specifically, we formulate transformers and convolutions with various kernel sizes in the same format, which provides an unconstrained space for diverse hybrid network search. Furthermore, to search for a suitable backbone, we propose an information bottleneck based NAS objective function, which treats the process of NAS as an information extraction process, preserving relevant information and suppressing redundant information from the dense pedestrians in crowded scenes. Extensive experiments on the CrowdHuman, CityPersons and EuroCity Persons datasets demonstrate the effectiveness of the proposed method. Our NAS-PED obtains absolute gains of 4.0% MR-2 and 1.9% AP over the state-of-the-art (SOTA) pedestrian detection framework on the CrowdHuman dataset. For the CityPersons and EuroCity Persons datasets, the searched backbone achieves stable improvement across all three subsets, outperforming some large language-image pre-trained models.
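The abstract invokes the information bottleneck principle but does not spell out the objective. In its standard form (due to Tishby et al.), with input X, learned representation Z, and detection target Y, the objective maximizes label-relevant information while compressing the rest; the sketch below shows this standard formulation, not necessarily the paper's exact NAS objective:

```latex
% Standard information bottleneck objective:
% keep information about the target Y, discard the rest of the input X.
\max_{p(z \mid x)} \; \mathcal{L}_{\mathrm{IB}} = I(Z; Y) \;-\; \beta \, I(X; Z)
```

Here I(·;·) denotes mutual information and β ≥ 0 trades prediction against compression; in a NAS setting, the architecture that induces Z would be searched so this trade-off favors features of dense pedestrians over redundant background information.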
Pages: 1800-1817 (18 pages)