EViT: An Eagle Vision Transformer With Bi-Fovea Self-Attention

Cited by: 1
Authors
Shi, Yulong [1 ]
Sun, Mingwei [1 ]
Wang, Yongshuai [1 ]
Ma, Jiahao [1 ]
Chen, Zengqiang [1 ,2 ]
Affiliations
[1] Nankai Univ, Coll Artificial Intelligence, Tianjin 300350, Peoples R China
[2] Nankai Univ, Key Lab Intelligent Robot Tianjin, Tianjin 300350, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visualization; Computer vision; Transformers; Physiology; Photoreceptors; Computational modeling; Biological information theory; Computational complexity; Sun; Stacking; bi-fovea feedforward network (BFFN); bi-fovea self-attention (BFSA); eagle vision transformers (EViTs)
DOI
10.1109/TCYB.2025.3532282
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Owing to advancements in deep learning technology, vision transformers (ViTs) have demonstrated impressive performance in various computer vision tasks. Nonetheless, ViTs still face some challenges, such as high computational complexity and the absence of desirable inductive biases. To alleviate these issues, the potential advantages of combining eagle vision with ViTs are explored. A bi-fovea visual interaction (BFVI) structure inspired by the unique physiological and visual characteristics of eagle eyes is introduced. Based on this structural design approach, a novel bi-fovea self-attention (BFSA) mechanism and bi-fovea feedforward network (BFFN) are proposed. These components are employed to mimic the hierarchical and parallel information processing scheme of the biological visual cortex, thereby enabling networks to learn the feature representations of targets in a coarse-to-fine manner. Furthermore, a bionic eagle vision (BEV) block is designed as the basic building unit based on the BFSA mechanism and the BFFN. By stacking BEV blocks, a unified and efficient family of pyramid backbone networks called eagle ViTs (EViTs) is developed. Experimental results indicate that the EViTs exhibit highly competitive performance in various computer vision tasks, demonstrating their potential as backbone networks. In terms of computational efficiency and scalability, EViTs show significant advantages over their counterparts. The developed code is available at https://github.com/nkusyl/EViT.
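The coarse-to-fine, two-fovea idea described in the abstract can be illustrated with a minimal single-head NumPy sketch. This is not the paper's actual BFSA formulation: the stride-2 token subsampling, nearest-neighbour upsampling, additive fusion, and all function and weight names below are illustrative assumptions, showing only how a shallow (coarse) attention branch could guide a deep (fine) one.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over tokens x (n, d)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def bfsa_sketch(x, shallow_w, deep_w):
    """Coarse-to-fine sketch: a shallow fovea attends over a subsampled
    token set; its upsampled output guides a deep fovea over all tokens."""
    n = x.shape[0]
    coarse_tokens = x[::2]                         # stride-2 subsampling (assumption)
    coarse = self_attention(coarse_tokens, *shallow_w)
    coarse_up = np.repeat(coarse, 2, axis=0)[:n]   # nearest-neighbour upsample
    return self_attention(x + coarse_up, *deep_w)  # deep fovea refines coarse cue

d = 8
x = rng.standard_normal((6, d))
shallow_w = [rng.standard_normal((d, d)) for _ in range(3)]
deep_w = [rng.standard_normal((d, d)) for _ in range(3)]
y = bfsa_sketch(x, shallow_w, deep_w)
print(y.shape)  # (6, 8)
```

The two branches run on the same input, so the coarse branch costs attention over roughly half the tokens while the fine branch keeps full resolution, mirroring (in spirit) the parallel shallow/deep processing the abstract attributes to eagle vision.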
Pages: 1288-1300
Page count: 13