QuantMAC: Enhancing Hardware Performance in DNNs With Quantize Enabled Multiply-Accumulate Unit

Cited by: 1
Authors
Ashar, Neha [1 ]
Raut, Gopal [1 ,2 ]
Trivedi, Vasundhara [1 ]
Vishvakarma, Santosh Kumar [1 ]
Kumar, Akash [3 ]
Affiliations
[1] Indian Inst Technol Indore, Dept Elect Engn, Indore 453552, India
[2] Ctr Dev Adv Comp, Bengaluru 560100, India
[3] Tech Univ Dresden, Chair Processor Design, Ctr Adv Elect Dresden, D-01169 Dresden, Germany
Keywords
Computer architecture; Hardware; Artificial neural networks; Throughput; Artificial intelligence; Arithmetic; Convolution; Approximate compute; bit-truncation; CORDIC; deep neural network; hardware accelerator; quantize processing element; DEEP NEURAL-NETWORKS; ACCELERATOR;
DOI
10.1109/ACCESS.2024.3379906
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
In response to the escalating demand for hardware-efficient Deep Neural Network (DNN) architectures, we present a novel quantize-enabled multiply-accumulate (MAC) unit. Our method employs a right shift-and-add computation for the MAC operation, enabling runtime truncation without additional hardware. The architecture makes optimal use of hardware resources, improving throughput while reducing computational complexity through bit-truncation. The core of our methodology is a hardware-efficient MAC computation algorithm that supports both iterative and pipelined implementations, catering to accelerators that prioritize either hardware efficiency or throughput. Additionally, we introduce a processing element (PE) with a bias pre-loading scheme that saves one clock cycle and eliminates the extra resources required by conventional PE implementations. The PE performs quantization-based MAC calculations through an efficient bit-truncation method, removing the need for extra hardware logic. This versatile PE accommodates variable bit precision with a dynamic fraction part within the sfxpt<N,f> representation, meeting model- or layer-specific demands. In software emulation, the proposed approach shows minimal accuracy loss: under 1.6% for LeNet-5 on MNIST and around 4% for ResNet-18 and VGG-16 on CIFAR-10 in the sfxpt<8,5> format, compared to conventional float32-based implementations. Hardware performance results on the Xilinx Virtex-7 board show a 37% reduction in area utilization and a 45% reduction in power consumption compared to the best state-of-the-art MAC architecture. Extending the proposed MAC to a LeNet DNN model yields a 42% reduction in resource requirements and a 27% reduction in delay. This architecture provides notable advantages for resource-efficient, high-throughput edge-AI applications.
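To make the abstract's scheme concrete, the sketch below emulates in Python the two ideas it names: sfxpt<N,f> fixed-point quantization and a right shift-and-add MAC whose iteration count doubles as a runtime bit-truncation knob, with the bias pre-loaded into the accumulator. The details are assumptions inferred from the abstract (the paper's exact algorithm, rounding, and iteration order are not specified here), and the names quantize_sfxpt and right_shift_add_mac are illustrative, not from the paper.

```python
# Minimal software emulation, assuming sfxpt<N,f> denotes a signed
# fixed-point format with N total bits (one sign bit) and f fraction bits.

def quantize_sfxpt(x, n=8, f=5):
    """Quantize a real value to sfxpt<n,f>: scale by 2^f, round, saturate."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    q = int(round(x * (1 << f)))
    return max(lo, min(hi, q))

def right_shift_add_mac(a_q, w_q, acc, n=8, f=5, iters=None):
    """One MAC step computed by right shift-and-add.

    Weight bit j contributes a_q >> (f - j) to the product (both operands
    in sfxpt<n,f>), so processing bits MSB-first means that stopping after
    `iters` iterations drops only the lowest-order partial products --
    runtime bit-truncation with no dedicated truncation hardware.
    """
    if iters is None:
        iters = n - 1                      # full precision by default
    sign = -1 if w_q < 0 else 1
    w = abs(w_q)
    prod = 0
    for step, j in enumerate(range(n - 2, -1, -1)):  # magnitude bits, MSB first
        if step == iters:
            break                          # early exit == coarser quantization
        if (w >> j) & 1:
            shift = f - j
            prod += (a_q >> shift) if shift >= 0 else (a_q << -shift)
    return acc + sign * prod

# Usage: one neuron y = sum(a_i * w_i) + b in sfxpt<8,5>. Initializing the
# accumulator with the quantized bias models the abstract's bias pre-loading
# scheme (no separate bias-add cycle).
a = [quantize_sfxpt(v) for v in (0.75, -0.5, 0.25)]
w = [quantize_sfxpt(v) for v in (0.5, 0.25, -1.0)]
acc = quantize_sfxpt(0.125)               # bias pre-loaded into the accumulator
for a_q, w_q in zip(a, w):
    acc = right_shift_add_mac(a_q, w_q, acc)
print(acc / (1 << 5))                     # 0.125, matching the float computation
```

Under these assumptions, running with a smaller iters (e.g. 3 instead of the full 7) keeps only the most significant weight bits, which is the kind of runtime precision/resource trade the abstract attributes to the PE.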
Pages: 43600 - 43614
Page count: 15
Related Papers
50 records
  • [1] Design of High Performance Multiply-Accumulate Computation Unit
    Ahish, S.
    Kumar, Y. B. N.
    Sharma, Dheeraj
    Vasantha, M. H.
    2015 IEEE INTERNATIONAL ADVANCE COMPUTING CONFERENCE (IACC), 2015, : 915 - 918
  • [2] Design and Performance Analysis of Multiply-Accumulate (MAC) Unit
    SaiKumar, Maroju
    Kumar, D. Ashok
    Samundiswary, P.
    2014 IEEE INTERNATIONAL CONFERENCE ON CIRCUIT, POWER AND COMPUTING TECHNOLOGIES (ICCPCT-2014), 2014, : 1084 - 1089
  • [3] Time-Domain Multiply-Accumulate Unit
    Locatelli, Pedro Sartori
    Colombo, Dalton Martini
    El-Sankary, Kamal
    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2023, 31 (06) : 762 - 775
  • [4] Modified Fused Multiply-Accumulate Chained Unit
    Nasiri, Nasibeh
    Segal, Oren
    Margala, Martin
    2014 IEEE 57TH INTERNATIONAL MIDWEST SYMPOSIUM ON CIRCUITS AND SYSTEMS (MWSCAS), 2014, : 889 - 892
  • [5] Architecture and implementation of a vector/SIMD multiply-accumulate unit
    Danysh, A
    Tan, D
    IEEE TRANSACTIONS ON COMPUTERS, 2005, 54 (03) : 284 - 293
  • [6] Design of a Delay-Insensitive Multiply-Accumulate Unit
    Nielsen, C. D.
    Martin, A. J.
    INTEGRATION-THE VLSI JOURNAL, 1993, 15 (03) : 291 - 311
  • [7] New design of an RSFQ parallel multiply-accumulate unit
    Kataeva, Irina
    Engseth, Henrik
    Kidiyarova-Shevchenko, Anna
    SUPERCONDUCTOR SCIENCE & TECHNOLOGY, 2006, 19 (05): S381 - S386
  • [8] Double Throughput Multiply-Accumulate Unit for FlexCore Processor Enhancements
    Hoang, Tung Thanh
    Sjalander, Magnus
    Larsson-Edefors, Per
    2009 IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL & DISTRIBUTED PROCESSING, VOLS 1-5, 2009, : 2821 - 2827
  • [9] Efficient Hardware Implementation of Convolution Layers Using Multiply-Accumulate Blocks
    Nojehdeh, Mohammadreza Esmali
    Parvin, Sajjad
    Altun, Mustafa
    2021 IEEE COMPUTER SOCIETY ANNUAL SYMPOSIUM ON VLSI (ISVLSI 2021), 2021, : 402 - 405
  • [10] An Approximate Multiply-Accumulate Unit with Low Power and Reduced Area
    Yang, Tongxin
    Sato, Toshinori
    Ukezono, Tomoaki
    2019 IEEE COMPUTER SOCIETY ANNUAL SYMPOSIUM ON VLSI (ISVLSI 2019), 2019, : 386 - 391