QuantMAC: Enhancing Hardware Performance in DNNs With Quantize Enabled Multiply-Accumulate Unit

Cited by: 1
Authors
Ashar, Neha [1 ]
Raut, Gopal [1 ,2 ]
Trivedi, Vasundhara [1 ]
Vishvakarma, Santosh Kumar [1 ]
Kumar, Akash [3 ]
Affiliations
[1] Indian Inst Technol Indore, Dept Elect Engn, Indore 453552, India
[2] Ctr Dev Adv Comp, Bengaluru 560100, India
[3] Tech Univ Dresden, Chair Processor Design, Ctr Adv Elect Dresden, D-01169 Dresden, Germany
Keywords
Computer architecture; Hardware; Artificial neural networks; Throughput; Artificial intelligence; Arithmetic; Convolution; Approximate compute; bit-truncation; CORDIC; deep neural network; hardware accelerator; quantize processing element; DEEP NEURAL-NETWORKS; ACCELERATOR;
DOI
10.1109/ACCESS.2024.3379906
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
In response to the escalating demand for hardware-efficient Deep Neural Network (DNN) architectures, we present a novel quantize-enabled multiply-accumulate (MAC) unit. Our method employs a right shift-and-add computation for the MAC operation, enabling runtime truncation without additional hardware. The architecture makes optimal use of hardware resources, improving throughput while reducing computational complexity through bit-truncation. The core of our methodology is a hardware-efficient MAC computation algorithm that supports both iterative and pipelined implementations, catering to accelerators that prioritize either hardware efficiency or throughput. Additionally, we introduce a processing element (PE) with a bias pre-loading scheme that saves one clock cycle and eliminates the extra resources required by conventional PE implementations. The PE performs quantization-based MAC calculations through an efficient bit-truncation method, removing the need for extra hardware logic. This versatile PE accommodates variable bit precision with a dynamic fraction part within the sfxpt<N,f> representation, meeting model- or layer-specific demands. In software emulation, the proposed approach shows minimal accuracy loss: under 1.6% for LeNet-5 on MNIST and around 4% for ResNet-18 and VGG-16 on CIFAR-10 in the sfxpt<8,5> format, compared to conventional float32-based implementations. Hardware performance results on the Xilinx Virtex-7 board show a 37% reduction in area utilization and a 45% reduction in power consumption compared to the best state-of-the-art MAC architecture. Extending the proposed MAC to a LeNet DNN model yields a 42% reduction in resource requirements and a 27% reduction in delay. This architecture provides notable advantages for resource-efficient, high-throughput edge-AI applications.
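To make the abstract's scheme concrete, the sketch below emulates in Python the two ideas it names: sfxpt<N,f> fixed-point quantization and a right shift-and-add MAC whose iteration count doubles as a runtime bit-truncation knob, with the bias pre-loaded into the accumulator. The details are assumptions inferred from the abstract (the paper's exact algorithm, rounding, and iteration order are not specified here), and the names quantize_sfxpt and right_shift_add_mac are illustrative, not from the paper.

```python
# Minimal software emulation, assuming sfxpt<N,f> denotes a signed
# fixed-point format with N total bits (one sign bit) and f fraction bits.

def quantize_sfxpt(x, n=8, f=5):
    """Quantize a real value to sfxpt<n,f>: scale by 2^f, round, saturate."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    q = int(round(x * (1 << f)))
    return max(lo, min(hi, q))

def right_shift_add_mac(a_q, w_q, acc, n=8, f=5, iters=None):
    """One MAC step computed by right shift-and-add.

    Weight bit j contributes a_q >> (f - j) to the product (both operands
    in sfxpt<n,f>), so processing bits MSB-first means that stopping after
    `iters` iterations drops only the lowest-order partial products --
    runtime bit-truncation with no dedicated truncation hardware.
    """
    if iters is None:
        iters = n - 1                      # full precision by default
    sign = -1 if w_q < 0 else 1
    w = abs(w_q)
    prod = 0
    for step, j in enumerate(range(n - 2, -1, -1)):  # magnitude bits, MSB first
        if step == iters:
            break                          # early exit == coarser quantization
        if (w >> j) & 1:
            shift = f - j
            prod += (a_q >> shift) if shift >= 0 else (a_q << -shift)
    return acc + sign * prod

# Usage: one neuron y = sum(a_i * w_i) + b in sfxpt<8,5>. Initializing the
# accumulator with the quantized bias models the abstract's bias pre-loading
# scheme (no separate bias-add cycle).
a = [quantize_sfxpt(v) for v in (0.75, -0.5, 0.25)]
w = [quantize_sfxpt(v) for v in (0.5, 0.25, -1.0)]
acc = quantize_sfxpt(0.125)               # bias pre-loaded into the accumulator
for a_q, w_q in zip(a, w):
    acc = right_shift_add_mac(a_q, w_q, acc)
print(acc / (1 << 5))                     # 0.125, matching the float computation
```

Under these assumptions, running with a smaller iters (e.g. 3 instead of the full 7) keeps only the most significant weight bits, which is the kind of runtime precision/resource trade the abstract attributes to the PE.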
Pages: 43600 - 43614
Page count: 15
Related Papers
50 records
  • [1] Design of High Performance Multiply-Accumulate Computation Unit
    Ahish, S.
    Kumar, Y. B. N.
    Sharma, Dheeraj
    Vasantha, M. H.
    2015 IEEE INTERNATIONAL ADVANCE COMPUTING CONFERENCE (IACC), 2015, : 915 - 918
  • [2] Design and Performance Analysis of Multiply-Accumulate (MAC) Unit
    SaiKumar, Maroju
    Kumar, D. Ashok
    Samundiswary, P.
    2014 IEEE INTERNATIONAL CONFERENCE ON CIRCUIT, POWER AND COMPUTING TECHNOLOGIES (ICCPCT-2014), 2014, : 1084 - 1089
  • [3] Time-Domain Multiply-Accumulate Unit
    Locatelli, Pedro Sartori
    Colombo, Dalton Martini
    El-Sankary, Kamal
    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2023, 31 (06) : 762 - 775
  • [4] Modified Fused Multiply-Accumulate Chained Unit
    Nasiri, Nasibeh
    Segal, Oren
    Margala, Martin
    2014 IEEE 57TH INTERNATIONAL MIDWEST SYMPOSIUM ON CIRCUITS AND SYSTEMS (MWSCAS), 2014, : 889 - 892
  • [5] Architecture and implementation of a vector/SIMD multiply-accumulate unit
    Danysh, A
    Tan, D
    IEEE TRANSACTIONS ON COMPUTERS, 2005, 54 (03) : 284 - 293
  • [6] Design of a Delay-Insensitive Multiply-Accumulate Unit
    Nielsen, C. D.
    Martin, A. J.
    INTEGRATION-THE VLSI JOURNAL, 1993, 15 (03) : 291 - 311
  • [7] New design of an RSFQ parallel multiply-accumulate unit
    Kataeva, Irina
    Engseth, Henrik
    Kidiyarova-Shevchenko, Anna
    SUPERCONDUCTOR SCIENCE & TECHNOLOGY, 2006, 19 (05): S381 - S386
  • [8] Double Throughput Multiply-Accumulate Unit for FlexCore Processor Enhancements
    Hoang, Tung Thanh
    Sjalander, Magnus
    Larsson-Edefors, Per
    2009 IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL & DISTRIBUTED PROCESSING, VOLS 1-5, 2009, : 2821 - 2827
  • [9] Efficient Hardware Implementation of Convolution Layers Using Multiply-Accumulate Blocks
    Nojehdeh, Mohammadreza Esmali
    Parvin, Sajjad
    Altun, Mustafa
    2021 IEEE COMPUTER SOCIETY ANNUAL SYMPOSIUM ON VLSI (ISVLSI 2021), 2021, : 402 - 405
  • [10] An Approximate Multiply-Accumulate Unit with Low Power and Reduced Area
    Yang, Tongxin
    Sato, Toshinori
    Ukezono, Tomoaki
    2019 IEEE COMPUTER SOCIETY ANNUAL SYMPOSIUM ON VLSI (ISVLSI 2019), 2019, : 386 - 391