Accurate Low-Bit Length Floating-Point Arithmetic with Sorting Numbers

Cited by: 0
Authors
Alireza Dehghanpour
Javad Khodamoradi Kordestani
Masoud Dehyadegari
Affiliations
[1] K. N. Toosi University of Technology,Faculty of Computer Engineering
[2] Institute for Research in Fundamental Sciences (IPM),School of Computer Science
Source
Neural Processing Letters, 2023, Vol. 55
Keywords
Deep neural networks; Floating point; Sorting; AlexNet; Convolutional neural networks;
DOI: not available
Abstract
A 32-bit floating-point format is commonly used for developing and training deep neural networks. Number formats optimized for deep learning can deliver large gains in performance and energy efficiency for both training and inference, but training and running inference with low-bit neural networks remains a significant challenge. In this study, we propose a sorting method that preserves accuracy in numerical formats with a small number of bits. We evaluated the method on convolutional neural networks, including AlexNet. With our method, our convolutional neural network reaches the accuracy of the IEEE 32-bit format using only 11 bits, and AlexNet does so using only 10 bits. These results suggest that the sorting method is a promising approach for limited-precision computation.
Pages: 12061-12078 (17 pages)
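
The abstract does not spell out the sorting procedure, but its core idea, ordering operands so that accumulation in a low-bit format loses less information, can be illustrated with a small simulation. The Python sketch below is an assumption-laden illustration rather than the authors' algorithm: the helper `quantize` is hypothetical and models a low-bit floating-point format by rounding to a fixed number of mantissa bits (exponent range and subnormals are ignored), and `low_bit_sum` compares accumulating a sum in descending-magnitude order against ascending-magnitude order.

```python
import numpy as np

def quantize(x, mantissa_bits):
    """Crude stand-in for a low-bit floating-point format: keep roughly
    `mantissa_bits` bits of mantissa; exponent range is not limited."""
    x = float(x)
    if x == 0.0:
        return 0.0
    exp = np.floor(np.log2(abs(x)))          # exponent of x
    scale = 2.0 ** (mantissa_bits - exp)     # shift mantissa into integer range
    return float(np.round(x * scale) / scale)

def low_bit_sum(values, mantissa_bits, ascending):
    """Accumulate `values` one by one, quantizing after every addition.
    `ascending=True` sorts the summands by magnitude, smallest first."""
    vals = sorted(values, key=abs) if ascending else list(values)
    acc = 0.0
    for v in vals:
        acc = quantize(acc + quantize(v, mantissa_bits), mantissa_bits)
    return acc

# Partial harmonic sum: summands span several orders of magnitude,
# which is where accumulation order matters most.
terms = [1.0 / k for k in range(1, 20001)]
exact = sum(terms)

for bits in (8, 10, 12):
    desc = low_bit_sum(terms, bits, ascending=False)  # 1, 1/2, 1/3, ... (largest first)
    asc = low_bit_sum(terms, bits, ascending=True)    # smallest magnitudes first
    print(f"{bits:2d} mantissa bits | error (largest first) = {abs(desc - exact):.4f}"
          f" | error (smallest first) = {abs(asc - exact):.4f}")
```

For summands that span several orders of magnitude, adding the smallest magnitudes first typically keeps the running rounding error lower, which is the kind of effect a sorting step can exploit when arithmetic is restricted to a small number of bits.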