BinaryFormer: A Hierarchical-Adaptive Binary Vision Transformer (ViT) for Efficient Computing

Cited by: 1
Authors
Wang, Miaohui [1 ,2 ]
Xu, Zhuowei [3 ,4 ]
Zheng, Bin [3 ]
Xie, Wuyuan [5 ]
Affiliations
[1] Shenzhen Univ, State Key Lab Radio Frequency Heterogeneous Integr, Shenzhen 518060, Peoples R China
[2] Shenzhen Univ, Guangdong Key Lab Intelligent Informat Proc, Shenzhen 518060, Peoples R China
[3] Shenzhen Univ, Coll Elect & Informat Engn, Shenzhen 518060, Peoples R China
[4] Creat Life TCL New Technol Ltd, Huizhou 516001, Peoples R China
[5] Shenzhen Univ, Coll Comp Sci & Software Engn, Shenzhen 518060, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Convolution; Quantization (signal); Training; Computational modeling; Informatics; Task analysis; Binary compression; model optimization; Vision Transformer (ViT);
DOI
10.1109/TII.2024.3396520
Chinese Library Classification (CLC)
TP [automation technology, computer technology];
Subject classification code
0812;
Abstract
Vision Transformers (ViTs) have recently demonstrated impressive nonlinear modeling capabilities and achieved state-of-the-art performance in various industrial applications, such as object recognition, anomaly detection, and robot control. However, their practical deployment can be hindered by high storage requirements and computational intensity. To alleviate these challenges, we propose a binary transformer called BinaryFormer, which quantizes the learned weights of the ViT module from 32-bit precision to 1 bit. Furthermore, we propose a hierarchical-adaptive architecture that replaces expensive matrix operations with cheaper addition and bit operations by switching between two attention modes. As a result, BinaryFormer effectively compresses the model size and reduces the computation cost of ViT. Experimental results on the ImageNet-1K benchmark dataset show that BinaryFormer reduces the size of a typical ViT model by an average of 27.7x and converts over 99% of multiplication operations into bit operations while maintaining reasonable accuracy.
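The abstract names two core ideas: binarizing learned 32-bit weights to 1 bit, and trading matrix multiplications for addition and bit operations. The PyTorch sketch below is a rough illustration of how such schemes are commonly built, not the paper's implementation: a standard sign-binarization with a straight-through estimator (STE), plus an XNOR-popcount dot product over packed {-1, +1} vectors. All names here (BinarizeSTE, BinaryLinear, binary_dot) and the per-tensor mean-absolute-value scale are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    """Binarize weights to {-alpha, +alpha} with a straight-through estimator."""

    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        alpha = w.abs().mean()            # per-tensor scale (XNOR-Net-style choice)
        return alpha * torch.sign(w)      # 1-bit weights, rescaled

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # STE: pass gradients through unchanged, clipped where |w| > 1.
        return grad_out * (w.abs() <= 1).to(grad_out.dtype)

class BinaryLinear(nn.Linear):
    """Linear layer that binarizes its full-precision weights on the fly."""

    def forward(self, x):
        return F.linear(x, BinarizeSTE.apply(self.weight), self.bias)

def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two length-n {-1, +1} vectors packed as integers
    (+1 -> bit 1, -1 -> bit 0). XNOR counts matching positions, so
    dot = matches - mismatches = 2 * popcount(XNOR(a, b)) - n."""
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * matches - n

if __name__ == "__main__":
    y = BinaryLinear(4, 2)(torch.randn(1, 4))
    print(y.shape)                        # torch.Size([1, 2])
    # <[+1, -1, +1], [+1, +1, -1]> = -1
    print(binary_dot(0b101, 0b110, n=3))  # -1
```

In a full binary attention kernel the packed-integer trick generalizes from vectors to matrices, which is how multiply-accumulates can be replaced wholesale by XNOR and popcount operations.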
Pages: 10657-10668
Page count: 12