QuantMAC: Enhancing Hardware Performance in DNNs With Quantize Enabled Multiply-Accumulate Unit

Cited by: 1
Authors
Ashar, Neha [1 ]
Raut, Gopal [1 ,2 ]
Trivedi, Vasundhara [1 ]
Vishvakarma, Santosh Kumar [1 ]
Kumar, Akash [3 ]
Affiliations
[1] Indian Inst Technol Indore, Dept Elect Engn, Indore 453552, India
[2] Ctr Dev Adv Comp, Bengaluru 560100, India
[3] Tech Univ Dresden, Chair Processor Design, Ctr Adv Elect Dresden, D-01169 Dresden, Germany
Keywords
Computer architecture; Hardware; Artificial neural networks; Throughput; Artificial intelligence; Arithmetic; Convolution; Approximate compute; bit-truncation; CORDIC; deep neural network; hardware accelerator; quantize processing element; DEEP NEURAL-NETWORKS; ACCELERATOR
DOI
10.1109/ACCESS.2024.3379906
CLC Number
TP [Automation Technology; Computer Technology]
Subject Classification Code
0812
Abstract
In response to the escalating demand for hardware-efficient Deep Neural Network (DNN) architectures, we present a novel quantize-enabled multiply-accumulate (MAC) unit. Our methodology employs a right shift-and-add computation for the MAC operation, enabling runtime truncation without additional hardware. The architecture makes efficient use of hardware resources, improving throughput while reducing computational complexity through bit-truncation. Our key contribution is a hardware-efficient MAC computational algorithm that supports both iterative and pipelined implementations, catering to accelerators that prioritize either hardware efficiency or enhanced throughput. Additionally, we introduce a processing element (PE) with a bias pre-loading scheme that saves one clock cycle and eliminates the extra resources required by conventional PE implementations. The PE performs quantization-based MAC calculations through an efficient bit-truncation method, removing the need for extra hardware logic. This versatile PE accommodates variable bit-precision with a dynamic fraction part within the sfxpt<N,f> representation, meeting specific model or layer demands. In software emulation, the proposed approach shows minimal accuracy loss: under 1.6% for LeNet-5 on MNIST and around 4% for ResNet-18 and VGG-16 on CIFAR-10 in the sfxpt<8,5> format, compared with conventional float32-based implementations. Hardware results on a Xilinx Virtex-7 board show a 37% reduction in area utilization and a 45% reduction in power consumption relative to the best state-of-the-art MAC architecture. Extending the proposed MAC to a LeNet DNN model yields a 42% reduction in resource requirements and a 27% reduction in delay. The architecture offers notable advantages for resource-efficient, high-throughput edge-AI applications.
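The mechanism sketched in the abstract (a shift-and-add MAC whose iteration count can be cut short at runtime, operating on sfxpt<N,f> fixed-point words with the bias pre-loaded into the accumulator) can be illustrated with a short software emulation. The following Python sketch is our own illustrative reading, not the authors' RTL or exact algorithm: the names (to_fixed, to_float, shift_add_mac), the MSB-first bit order, and the saturation behavior are assumptions, and the hardware's right-shift-and-add datapath is emulated by an arithmetically equivalent shift-and-rescale.

# Minimal emulation sketch of a quantize-enabled shift-and-add MAC.
# Assumption: sfxpt<N,f> is a signed fixed-point word with N total bits,
# f of them fractional (two's-complement range). Names are illustrative.

def to_fixed(x: float, n: int = 8, f: int = 5) -> int:
    """Quantize a real value to sfxpt<n,f>, saturating at the signed range."""
    scaled = round(x * (1 << f))
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    return max(lo, min(hi, scaled))

def to_float(v: int, f: int = 5) -> float:
    """Interpret an sfxpt<_,f> word as a real value."""
    return v / (1 << f)

def shift_add_mac(w: int, x: int, acc: int,
                  n: int = 8, f: int = 5, trunc: int = 0) -> int:
    """Return acc + w*x, computed bit-serially by shift-and-add.

    Scanning the multiplier MSB-first means the final `trunc` iterations
    carry only the least-significant partial products, so skipping them
    quantizes the product at runtime with no extra truncation logic.
    """
    sign = -1 if (w < 0) != (x < 0) else 1
    w_mag, x_mag = abs(w), abs(x)
    prod = 0
    for i in range(n - 1, trunc - 1, -1):  # stopping early = bit-truncation
        if (w_mag >> i) & 1:
            prod += x_mag << i
    prod >>= f  # rescale: the raw product carries 2f fractional bits
    return acc + sign * prod

# Pre-loading the bias into the accumulator, rather than adding it in a
# separate step, mirrors (in spirit) the bias pre-load scheme above.
acc = to_fixed(0.125)                                    # bias, pre-loaded
acc = shift_add_mac(to_fixed(0.75), to_fixed(-0.5), acc)
print(to_float(acc))                                     # 0.125 - 0.375 = -0.25
# shift_add_mac(..., trunc=4) would keep only the top multiplier bits,
# trading accuracy for fewer iterations, as the abstract describes.

Under these assumptions, an iterative variant reuses one shifter and one adder over N cycles, while a pipelined variant unrolls the loop across stages; the paper's reported area, power, and delay figures are not reproduced by this sketch.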
Pages: 43600-43614
Page count: 15
Related Papers (50 total)
  • [41] FPGA-Based Convolutional Neural Network Accelerator with Resource-Optimized Approximate Multiply-Accumulate Unit
    Cho, Mannhee
    Kim, Youngmin
    ELECTRONICS, 2021, 10 (22)
  • [42] Low-Complexity Precision-Scalable Multiply-Accumulate Unit Architectures for Deep Neural Network Accelerators
    Li, Wenjie
    Hu, Aokun
    Wang, Gang
    Xu, Ningyi
    He, Guanghui
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II-EXPRESS BRIEFS, 2023, 70 (04) : 1610 - 1614
  • [43] Dynamic Fused Multiply-Accumulate Posit Unit With Variable Exponent Size for Low-Precision DSP Applications
    Neves, Nuno
    Tomas, Pedro
    Roma, Nuno
    2020 IEEE WORKSHOP ON SIGNAL PROCESSING SYSTEMS (SIPS), 2020, : 152 - 157
  • [44] Review and Benchmarking of Precision-Scalable Multiply-Accumulate Unit Architectures for Embedded Neural-Network Processing
    Camus, Vincent
    Mei, Linyan
    Enz, Christian
    Verhelst, Marian
    IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, 2019, 9 (04) : 697 - 711
  • [45] Efficient Fixed/Floating-Point Merged Mixed-Precision Multiply-Accumulate Unit for Deep Learning Processors
    Zhang, Hao
    Lee, Hyuk Jae
    Ko, Seok-Bum
    2018 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS), 2018
  • [46] A novel algorithm for signed-digit online multiply-accumulate operation and its purely signed-binary hardware implementation
    Natter, WG
    Nowrouzian, B
    ISCAS 2000: IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS - PROCEEDINGS, VOL V: EMERGING TECHNOLOGIES FOR THE 21ST CENTURY, 2000: 329 - 332
  • [47] A high-performance and low-power 32-bit multiply-accumulate unit with single-instruction-multiple-data (SIMD) feature
    Liao, YY
    Roberts, DB
    IEEE JOURNAL OF SOLID-STATE CIRCUITS, 2002, 37 (07) : 926 - 931
  • [48] High-performance multiply-accumulate unit by integrating binary carry select adder and counter-based modular Wallace tree multiplier for embedding system
    Ponraj, Jeyakumar
    Jeyabharath, R.
    Veena, P.
    Srihari, Tharumar
    INTEGRATION-THE VLSI JOURNAL, 2023, 93
  • [49] RETRACTED: A nano-scale design of a multiply-accumulate unit for digital signal processing based on quantum computing (Retracted Article)
    Ahmadpour, Seyed-Sajad
    Navimipour, Nima Jafari
    Yalcin, Senay
    Bakhshayeshi Avval, Danial
    Ul Ain, Noor
    OPTICAL AND QUANTUM ELECTRONICS, 2024, 56 (01)
  • [50] Sensitivity-Based Error Resilient Techniques With Heterogeneous Multiply-Accumulate Unit for Voltage Scalable Deep Neural Network Accelerators
    Shin, Dongyeob
    Choi, Wonseok
    Park, Jongsun
    Ghosh, Swaroop
    IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, 2019, 9 (03) : 520 - 531