Quantization and pruning optimization method for attention mechanism

Cited by: 0
Authors
He Y. [1,2]
Jiang J. [1,2]
Xu J. [1,2]
Affiliations
[1] College of Computer Science and Technology, National University of Defense Technology, Changsha
[2] National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha
Source
Guofang Keji Daxue Xuebao/Journal of National University of Defense Technology | 2024, Vol. 46, No. 1
Keywords
attention mechanism; natural language processing; pruning; quantization;
DOI
10.11887/j.cn.202401012
Abstract
To address the significant computation and memory overhead of attention-based models, model compression techniques that jointly optimize quantization and pruning were studied. A symmetric linear fixed-point quantization method was proposed for the four activation matrices of the attention mechanism: query, key, value, and probability. In addition, a probability-matrix pruning method and a progressive pruning strategy were proposed to effectively reduce the accuracy loss caused by pruning. Experimental results on different datasets show that, for the typical attention-based model BERT, the proposed optimization achieves 4-bit or 8-bit fixed-point quantization and a sparsity of 0.93 to 0.98 with little or no accuracy loss, which greatly reduces the model computation and lays a solid foundation for accelerating the inference of quantized sparse models. © 2024 National University of Defense Technology. All rights reserved.
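The abstract describes the method only at a high level. The Python sketch below (using NumPy) illustrates the two ideas it names: symmetric linear fixed-point quantization of attention activations and pruning of the attention probability matrix under a progressive sparsity schedule. It is a minimal illustration written for this record, not the authors' implementation; the function names, the 0.95 target sparsity, and the toy matrix sizes are all hypothetical choices.

# Hedged sketch of symmetric linear fixed-point quantization and
# progressive probability-matrix pruning (not the paper's released code).
import numpy as np


def symmetric_quantize(x: np.ndarray, num_bits: int = 8):
    """Symmetric linear fixed-point quantization of an activation matrix.

    Maps x to signed integers in [-(2^(b-1) - 1), 2^(b-1) - 1] using a single
    scale derived from the maximum absolute value (no zero point).
    """
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale  # dequantize with q * scale


def prune_probability(p: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest entries of the attention probability matrix so
    that roughly the given fraction of entries becomes zero."""
    k = int(sparsity * p.size)
    if k == 0:
        return p
    threshold = np.partition(p.ravel(), k - 1)[k - 1]
    return np.where(p <= threshold, 0.0, p)


def progressive_sparsity(step: int, total_steps: int, target: float = 0.95) -> float:
    """Progressive pruning schedule: sparsity grows from 0 toward the target
    over fine-tuning steps instead of being applied all at once."""
    return target * min(1.0, step / total_steps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q_mat, k_mat = rng.standard_normal((8, 64)), rng.standard_normal((8, 64))

    # Quantize query/key activations to 8-bit fixed point (4-bit also possible).
    q_int, q_scale = symmetric_quantize(q_mat, num_bits=8)
    k_int, k_scale = symmetric_quantize(k_mat, num_bits=8)

    # Attention probabilities from dequantized scores, then pruned.
    scores = (q_int * q_scale) @ (k_int * k_scale).T / np.sqrt(64)
    probs = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    sparsity = progressive_sparsity(step=800, total_steps=1000, target=0.95)
    pruned = prune_probability(probs, sparsity)
    print(f"sparsity target: {sparsity:.2f}, zero fraction: {np.mean(pruned == 0):.2f}")

In this sketch the probability matrix is pruned by a global magnitude threshold; the paper's pruning criterion and its progressive schedule may differ in detail.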
Pages: 113-120
Number of pages: 7