HA-Transformer: Harmonious aggregation from local to global for object detection

Cited by: 4
Authors
Chen, Yang [1 ]
Chen, Sihan [1 ]
Deng, Yongqiang [2 ]
Wang, Kunfeng [1 ]
Affiliations
[1] Beijing Univ Chem Technol, Coll Informat Sci & Technol, Beijing 100029, Peoples R China
[2] VanJee Technol, Beijing 100193, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Object detection; Transformer; multi-head self-attention; global interaction; transition module;
DOI
10.1016/j.eswa.2023.120539
CLC number
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Recently, the Vision Transformer (ViT), with its global modeling capability, has shown excellent performance on classification tasks, opening a new development direction for a range of vision tasks. However, because of the enormous cost of multi-head self-attention, reducing computational cost while retaining the capability for global interaction remains a major challenge. In this paper, we propose a new architecture that establishes an end-to-end connection from local to global via bridge tokens, so that global interaction is completed at the window level, effectively addressing the quadratic complexity of the transformer. In addition, we consider a hierarchy of information from short distance to long distance, adding a transition module from local to global to make the aggregation of information more harmonious. We name the proposed method HA-Transformer. Experimental results on the COCO dataset show the excellent performance of HA-Transformer for object detection, outperforming several state-of-the-art methods.
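The complexity argument in the abstract can be made concrete: if each of the N tokens attends only within its window of size w, plus one "bridge" token per window, and global interaction happens among bridge tokens only, the cost drops from O(N^2) to O(Nw + (N/w)^2). The following NumPy sketch illustrates that scheme only; it is not the paper's implementation, and the mean-pooled bridge tokens and single untrained attention head are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention (single head, no projections).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def bridge_window_attention(x, window):
    """Window-level local attention plus global mixing via bridge tokens.

    x: (n, d) token matrix; n must be divisible by `window`.
    """
    n, d = x.shape
    assert n % window == 0
    num_win = n // window
    wins = x.reshape(num_win, window, d)
    # One bridge token per window (mean pooling here -- an assumption).
    bridges = wins.mean(axis=1)
    # Global interaction among bridge tokens only: O((n/window)^2).
    bridges = attention(bridges, bridges, bridges)
    out = np.empty_like(wins)
    for i in range(num_win):
        # Local attention: window tokens attend to themselves and their
        # (globally mixed) bridge token, which carries long-range context.
        kv = np.vstack([wins[i], bridges[i : i + 1]])
        out[i] = attention(wins[i], kv, kv)
    return out.reshape(n, d)
```

For N = 16 tokens and window size 4, each token computes attention over only 5 keys rather than 16, while the four bridge tokens provide the end-to-end local-to-global path the abstract describes.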
Pages: 9