QuantMAC: Enhancing Hardware Performance in DNNs With Quantize Enabled Multiply-Accumulate Unit

Cited by: 1
Authors
Ashar, Neha [1 ]
Raut, Gopal [1 ,2 ]
Trivedi, Vasundhara [1 ]
Vishvakarma, Santosh Kumar [1 ]
Kumar, Akash [3 ]
Affiliations
[1] Indian Inst Technol Indore, Dept Elect Engn, Indore 453552, India
[2] Ctr Dev Adv Comp, Bengaluru 560100, India
[3] Tech Univ Dresden, Chair Processor Design, Ctr Adv Elect Dresden, D-01169 Dresden, Germany
Keywords
Computer architecture; Hardware; Artificial neural networks; Throughput; Artificial intelligence; Arithmetic; Convolution; Approximate compute; bit-truncation; CORDIC; deep neural network; hardware accelerator; quantize processing element; DEEP NEURAL-NETWORKS; ACCELERATOR
DOI
10.1109/ACCESS.2024.3379906
CLC Number
TP [Automation Technology; Computer Technology]
Subject Classification Code
0812
Abstract
In response to the escalating demand for hardware-efficient Deep Neural Network (DNN) architectures, we present a novel quantize-enabled multiply-accumulate (MAC) unit. Our methodology employs a right shift-and-add computation for the MAC operation, enabling runtime truncation without additional hardware. The architecture makes efficient use of hardware resources, improving throughput while reducing computational complexity through bit-truncation. Our key contribution is a hardware-efficient MAC computational algorithm that supports both iterative and pipelined implementations, catering to accelerators that prioritize either hardware efficiency or enhanced throughput. Additionally, we introduce a processing element (PE) with a bias pre-loading scheme that saves one clock cycle and eliminates the extra resources required by conventional PE implementations. The PE performs quantization-based MAC calculations through an efficient bit-truncation method, removing the need for extra hardware logic. This versatile PE accommodates variable bit-precision with a dynamic fraction part within the sfxpt<N,f> representation, meeting specific model or layer demands. In software emulation, the proposed approach shows minimal accuracy loss: under 1.6% for LeNet-5 on MNIST and around 4% for ResNet-18 and VGG-16 on CIFAR-10 in the sfxpt<8,5> format, compared with conventional float32-based implementations. Hardware results on a Xilinx Virtex-7 board show a 37% reduction in area utilization and a 45% reduction in power consumption relative to the best state-of-the-art MAC architecture. Extending the proposed MAC to a LeNet DNN model yields a 42% reduction in resource requirements and a 27% reduction in delay. The architecture offers notable advantages for resource-efficient, high-throughput edge-AI applications.
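The mechanism sketched in the abstract (a shift-and-add MAC whose iteration count can be cut short at runtime, operating on sfxpt<N,f> fixed-point words with the bias pre-loaded into the accumulator) can be illustrated with a short software emulation. The following Python sketch is our own illustrative reading, not the authors' RTL or exact algorithm: the names (to_fixed, to_float, shift_add_mac), the MSB-first bit order, and the saturation behavior are assumptions, and the hardware's right-shift-and-add datapath is emulated by an arithmetically equivalent shift-and-rescale.

# Minimal emulation sketch of a quantize-enabled shift-and-add MAC.
# Assumption: sfxpt<N,f> is a signed fixed-point word with N total bits,
# f of them fractional (two's-complement range). Names are illustrative.

def to_fixed(x: float, n: int = 8, f: int = 5) -> int:
    """Quantize a real value to sfxpt<n,f>, saturating at the signed range."""
    scaled = round(x * (1 << f))
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    return max(lo, min(hi, scaled))

def to_float(v: int, f: int = 5) -> float:
    """Interpret an sfxpt<_,f> word as a real value."""
    return v / (1 << f)

def shift_add_mac(w: int, x: int, acc: int,
                  n: int = 8, f: int = 5, trunc: int = 0) -> int:
    """Return acc + w*x, computed bit-serially by shift-and-add.

    Scanning the multiplier MSB-first means the final `trunc` iterations
    carry only the least-significant partial products, so skipping them
    quantizes the product at runtime with no extra truncation logic.
    """
    sign = -1 if (w < 0) != (x < 0) else 1
    w_mag, x_mag = abs(w), abs(x)
    prod = 0
    for i in range(n - 1, trunc - 1, -1):  # stopping early = bit-truncation
        if (w_mag >> i) & 1:
            prod += x_mag << i
    prod >>= f  # rescale: the raw product carries 2f fractional bits
    return acc + sign * prod

# Pre-loading the bias into the accumulator, rather than adding it in a
# separate step, mirrors (in spirit) the bias pre-load scheme above.
acc = to_fixed(0.125)                                    # bias, pre-loaded
acc = shift_add_mac(to_fixed(0.75), to_fixed(-0.5), acc)
print(to_float(acc))                                     # 0.125 - 0.375 = -0.25
# shift_add_mac(..., trunc=4) would keep only the top multiplier bits,
# trading accuracy for fewer iterations, as the abstract describes.

Under these assumptions, an iterative variant reuses one shifter and one adder over N cycles, while a pipelined variant unrolls the loop across stages; the paper's reported area, power, and delay figures are not reproduced by this sketch.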
Pages: 43600-43614
Page count: 15
Related Papers (50 total)
  • [41] FPGA-Based Convolutional Neural Network Accelerator with Resource-Optimized Approximate Multiply-Accumulate Unit
    Cho, Mannhee
    Kim, Youngmin
    ELECTRONICS, 2021, 10 (22)
  • [42] Low-Complexity Precision-Scalable Multiply-Accumulate Unit Architectures for Deep Neural Network Accelerators
    Li, Wenjie
    Hu, Aokun
    Wang, Gang
    Xu, Ningyi
    He, Guanghui
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II-EXPRESS BRIEFS, 2023, 70 (04) : 1610 - 1614
  • [43] Dynamic Fused Multiply-Accumulate Posit Unit With Variable Exponent Size for Low-Precision DSP Applications
    Neves, Nuno
    Tomas, Pedro
    Roma, Nuno
    2020 IEEE WORKSHOP ON SIGNAL PROCESSING SYSTEMS (SIPS), 2020, : 152 - 157
  • [44] Review and Benchmarking of Precision-Scalable Multiply-Accumulate Unit Architectures for Embedded Neural-Network Processing
    Camus, Vincent
    Mei, Linyan
    Enz, Christian
    Verhelst, Marian
    IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, 2019, 9 (04) : 697 - 711
  • [45] Efficient Fixed/Floating-Point Merged Mixed-Precision Multiply-Accumulate Unit for Deep Learning Processors
    Zhang, Hao
    Lee, Hyuk Jae
    Ko, Seok-Bum
    2018 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS), 2018
  • [46] A novel algorithm for signed-digit online multiply-accumulate operation and its purely signed-binary hardware implementation
    Natter, WG
    Nowrouzian, B
    ISCAS 2000: IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS - PROCEEDINGS, VOL V: EMERGING TECHNOLOGIES FOR THE 21ST CENTURY, 2000: 329 - 332
  • [47] A high-performance and low-power 32-bit multiply-accumulate unit with single-instruction-multiple-data (SIMD) feature
    Liao, YY
    Roberts, DB
    IEEE JOURNAL OF SOLID-STATE CIRCUITS, 2002, 37 (07) : 926 - 931
  • [48] High-performance multiply-accumulate unit by integrating binary carry select adder and counter-based modular Wallace tree multiplier for embedding system
    Ponraj, Jeyakumar
    Jeyabharath, R.
    Veena, P.
    Srihari, Tharumar
    INTEGRATION-THE VLSI JOURNAL, 2023, 93
  • [49] RETRACTED: A nano-scale design of a multiply-accumulate unit for digital signal processing based on quantum computing (Retracted Article)
    Ahmadpour, Seyed-Sajad
    Navimipour, Nima Jafari
    Yalcin, Senay
    Bakhshayeshi Avval, Danial
    Ul Ain, Noor
    OPTICAL AND QUANTUM ELECTRONICS, 2024, 56 (01)
  • [50] Sensitivity-Based Error Resilient Techniques With Heterogeneous Multiply-Accumulate Unit for Voltage Scalable Deep Neural Network Accelerators
    Shin, Dongyeob
    Choi, Wonseok
    Park, Jongsun
    Ghosh, Swaroop
    IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, 2019, 9 (03) : 520 - 531