Accurate Low-Bit Length Floating-Point Arithmetic with Sorting Numbers

Cited by: 0
Authors
Alireza Dehghanpour
Javad Khodamoradi Kordestani
Masoud Dehyadegari
Affiliations
[1] K. N. Toosi University of Technology,Faculty of Computer Engineering
[2] Institute for Research in Fundamental Sciences (IPM),School of Computer Science
Source
Neural Processing Letters, 2023, Vol. 55
Keywords
Deep neural networks; Floating point; Sorting; AlexNet; Convolutional neural networks;
DOI: not available
Abstract
A 32-bit floating-point format is commonly used for developing and training deep neural networks. Number formats optimized for deep learning can deliver large gains in performance and energy efficiency for both training and inference, but training and running inference with low-bit neural networks remains a significant challenge. In this study, we propose a sorting method that preserves accuracy in numerical formats with a small number of bits. We evaluated the method on convolutional neural networks, including AlexNet. With our method, our convolutional neural network reaches the accuracy of the IEEE 32-bit format using only 11 bits, and AlexNet does so using only 10 bits. These results suggest that the sorting method is a promising approach for limited-precision computation.
Pages: 12061-12078 (17 pages)
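
The abstract does not spell out the sorting procedure, but its core idea, ordering operands so that accumulation in a low-bit format loses less information, can be illustrated with a small simulation. The Python sketch below is an assumption-laden illustration rather than the authors' algorithm: the helper `quantize` is hypothetical and models a low-bit floating-point format by rounding to a fixed number of mantissa bits (exponent range and subnormals are ignored), and `low_bit_sum` compares accumulating a sum in descending-magnitude order against ascending-magnitude order.

```python
import numpy as np

def quantize(x, mantissa_bits):
    """Crude stand-in for a low-bit floating-point format: keep roughly
    `mantissa_bits` bits of mantissa; exponent range is not limited."""
    x = float(x)
    if x == 0.0:
        return 0.0
    exp = np.floor(np.log2(abs(x)))          # exponent of x
    scale = 2.0 ** (mantissa_bits - exp)     # shift mantissa into integer range
    return float(np.round(x * scale) / scale)

def low_bit_sum(values, mantissa_bits, ascending):
    """Accumulate `values` one by one, quantizing after every addition.
    `ascending=True` sorts the summands by magnitude, smallest first."""
    vals = sorted(values, key=abs) if ascending else list(values)
    acc = 0.0
    for v in vals:
        acc = quantize(acc + quantize(v, mantissa_bits), mantissa_bits)
    return acc

# Partial harmonic sum: summands span several orders of magnitude,
# which is where accumulation order matters most.
terms = [1.0 / k for k in range(1, 20001)]
exact = sum(terms)

for bits in (8, 10, 12):
    desc = low_bit_sum(terms, bits, ascending=False)  # 1, 1/2, 1/3, ... (largest first)
    asc = low_bit_sum(terms, bits, ascending=True)    # smallest magnitudes first
    print(f"{bits:2d} mantissa bits | error (largest first) = {abs(desc - exact):.4f}"
          f" | error (smallest first) = {abs(asc - exact):.4f}")
```

For summands that span several orders of magnitude, adding the smallest magnitudes first typically keeps the running rounding error lower, which is the kind of effect a sorting step can exploit when arithmetic is restricted to a small number of bits.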