Scale-Insensitive Object Detection via Attention Feature Pyramid Transformer Network

Cited by: 1

Authors:

Li, Lingling [1]

Zheng, Changwen [2]

Mao, Cunli [3]

Deng, Haibo [4]

Jin, Taisong [5]

Affiliations:

[1] Zhengzhou Univ Aeronaut, Sch Intelligent Engn, Zhengzhou 450046, Peoples R China

[2] Chinese Acad Sci, Inst Software, Beijing 100190, Peoples R China

[3] Kunming Univ Sci & Technol, Fac Informat Engn & Automat, Yunnan Key Lab Artificial Intelligence, Kunming 650500, Yunnan, Peoples R China

[4] Beijing Zhonghangzhi Technol Co Ltd, Beijing 100176, Peoples R China

[5] Xiamen Univ, Sch Informat, Xiamen 361005, Peoples R China

Source:

NEURAL PROCESSING LETTERS | 2022 / Vol. 54 / No. 01

Funding:

National Natural Science Foundation of China;

Keywords:

Object detection; Feature pyramid; Attention; Convolutional network;

DOI:

10.1007/s11063-021-10645-0

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

With the progress of deep learning, object detection has attracted great attention in the computer vision community. A key challenge in object detection is that object scale often varies over a large range, which can cause existing detectors to fail in real applications. To address this problem, we propose a novel end-to-end Attention Feature Pyramid Transformer Network (AFPN) framework that learns object detectors from multi-scale feature maps in a transformer encoder-decoder fashion. AFPN learns to aggregate pyramid feature maps with attention mechanisms. Specifically, transformer-based attention blocks scan each spatial location of the feature maps in the same pyramid layer and update it by aggregating information from deep to shallow layers. Furthermore, inter-level feature aggregation and intra-level information attention are repeated to encode multi-scale, self-attentive feature representations. Extensive experiments on the challenging MS COCO object detection dataset demonstrate that the proposed AFPN outperforms its baseline methods, i.e., DETR and Faster R-CNN, and achieves state-of-the-art results.
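The abstract describes updating each spatial location by attending to information passed from deeper to shallower pyramid layers. The following is a minimal sketch of that general idea only, not the authors' implementation: it assumes plain scaled dot-product attention with a residual update, and omits learned projections, multiple heads, positional encodings, and the intra-level attention step. The function names `cross_level_attention` and `top_down_aggregate` are hypothetical, and each pyramid level is represented as a flat list of feature vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_level_attention(shallow, deep):
    """Update every shallow-level location by attending over a deeper level.

    shallow: list of N feature vectors (queries), each of length d
    deep:    list of M feature vectors (used as both keys and values)
    Assumption: no learned query/key/value projections are applied.
    """
    d = len(shallow[0])
    updated = []
    for q in shallow:
        # Scaled dot-product score between this location and each deep location.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in deep]
        weights = softmax(scores)
        # Attention-weighted sum of the deep features.
        context = [sum(w * v[c] for w, v in zip(weights, deep)) for c in range(d)]
        # Residual update of the shallow feature.
        updated.append([x + y for x, y in zip(q, context)])
    return updated


def top_down_aggregate(pyramid):
    """Propagate information from the deepest level to the shallowest.

    pyramid[0] is the shallowest level, pyramid[-1] the deepest.
    """
    out = [None] * len(pyramid)
    out[-1] = [list(v) for v in pyramid[-1]]  # deepest level is left unchanged
    for level in range(len(pyramid) - 2, -1, -1):
        out[level] = cross_level_attention(pyramid[level], out[level + 1])
    return out


# Toy pyramid: 4, 2, and 1 spatial locations with 2-dimensional features.
pyramid = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5]],
]
aggregated = top_down_aggregate(pyramid)
```

Each level keeps its own spatial resolution after aggregation; only the feature values change. A practical detector would flatten real feature maps into such sequences and use learned multi-head attention in place of the raw dot products assumed here.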


Pages: 581-595

Page count: 15
