HA-Transformer: Harmonious aggregation from local to global for object detection

Cited by: 4
Authors
Chen, Yang [1 ]
Chen, Sihan [1 ]
Deng, Yongqiang [2 ]
Wang, Kunfeng [1 ]
Affiliations
[1] Beijing Univ Chem Technol, Coll Informat Sci & Technol, Beijing 100029, Peoples R China
[2] VanJee Technol, Beijing 100193, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Object detection; Transformer; multi-head self-attention; global interaction; transition module;
DOI
10.1016/j.eswa.2023.120539
CLC number
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Recently, the Vision Transformer (ViT), with its global modeling capability, has shown excellent performance on classification tasks, opening a new development direction for a range of vision tasks. However, because of the enormous cost of multi-head self-attention, reducing computational cost while retaining the capability for global interaction remains a major challenge. In this paper, we propose a new architecture that establishes an end-to-end connection from local to global via bridge tokens, so that global interaction is completed at the window level, effectively addressing the quadratic complexity of the transformer. In addition, we consider a hierarchy of information from short distance to long distance, adding a transition module from local to global to make the aggregation of information more harmonious. We name the proposed method HA-Transformer. Experimental results on the COCO dataset show the excellent performance of HA-Transformer for object detection, outperforming several state-of-the-art methods.
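The complexity argument in the abstract can be made concrete: if each of the N tokens attends only within its window of size w, plus one "bridge" token per window, and global interaction happens among bridge tokens only, the cost drops from O(N^2) to O(Nw + (N/w)^2). The following NumPy sketch illustrates that scheme only; it is not the paper's implementation, and the mean-pooled bridge tokens and single untrained attention head are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention (single head, no projections).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def bridge_window_attention(x, window):
    """Window-level local attention plus global mixing via bridge tokens.

    x: (n, d) token matrix; n must be divisible by `window`.
    """
    n, d = x.shape
    assert n % window == 0
    num_win = n // window
    wins = x.reshape(num_win, window, d)
    # One bridge token per window (mean pooling here -- an assumption).
    bridges = wins.mean(axis=1)
    # Global interaction among bridge tokens only: O((n/window)^2).
    bridges = attention(bridges, bridges, bridges)
    out = np.empty_like(wins)
    for i in range(num_win):
        # Local attention: window tokens attend to themselves and their
        # (globally mixed) bridge token, which carries long-range context.
        kv = np.vstack([wins[i], bridges[i : i + 1]])
        out[i] = attention(wins[i], kv, kv)
    return out.reshape(n, d)
```

For N = 16 tokens and window size 4, each token computes attention over only 5 keys rather than 16, while the four bridge tokens provide the end-to-end local-to-global path the abstract describes.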
Pages: 9