Speculative Hardware/Software Co-Designed Floating-Point Multiply-Add Fusion

Cited by: 6
Authors
Lupon, Marc [1]
Gibert, Enric [1]
Magklis, Grigorios [1]
Samudrala, Sridhar [1, 2]
Martinez, Raul [1]
Stavrou, Kyriakos [1]
Ditzel, David R. [1, 2]
Affiliations
[1] Intel Labs, Intel Barcelona Res Ctr, Barcelona, Spain
[2] Intel Corp, Santa Clara, CA 95051 USA
Keywords
Algorithms; Performance; Design; Binary translator; HW/SW co-designed processors; Instruction fusion; FMA; Combined Multiply-Add
DOI
10.1145/2541940.2541978
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
A Fused Multiply-Add (FMA) instruction is currently available in many general-purpose processors. It increases performance by reducing the latency of dependent operations and increases precision by computing the result as an indivisible operation with no intermediate rounding. However, since the arithmetic behavior of a single-rounding FMA operation differs from that of an independent FP multiply followed by an FP add, some algorithms require significant revalidation and rewriting effort to work as expected when they are compiled to operate with FMA, a cost that developers may not be willing to pay. As a result, many legacy applications are not able to use FMA instructions. In this paper we propose a novel HW/SW collaborative technique that efficiently executes workloads with increased utilization of FMA by adding the option to get the same numerical result as separate FP multiply and FP add pairs. In particular, we extended the host ISA of a HW/SW co-designed processor with a new Combined Multiply-Add (CMA) instruction that performs an FMA operation with an intermediate rounding. This new instruction is used by a transparent dynamic translation software layer that uses a speculative instruction-fusion optimization to transform FP multiply and FP add sequences into CMA instructions. The FMA unit has been slightly modified to support both single-rounding and double-rounding fused instructions without increasing their latency and to provide a conservative fall-back path in case of misspeculation. Evaluation on a cycle-accurate timing simulator showed that CMA improved SPECfp performance by 6.3% and reduced executed instructions by 4.7%.
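The numerical difference the abstract refers to can be illustrated with a small software model. This is only a sketch, not the hardware design from the paper: the function names `fused_multiply_add` and `combined_multiply_add` are hypothetical, Python floats are assumed to be IEEE 754 binary64, and special values (infinities, NaNs, overflow) are not handled. The single-rounding result is emulated with exact rational arithmetic followed by one conversion to a float.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Single rounding: form a*b + c exactly, then round once to a double.

    Illustrative model only; finite inputs are assumed.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # Fraction -> float conversion is correctly rounded


def combined_multiply_add(a: float, b: float, c: float) -> float:
    """Intermediate rounding: round a*b to a double, then add c and round again.

    Models the behavior the abstract attributes to CMA, i.e. the same
    numerical result as a separate FP multiply followed by an FP add.
    """
    product = a * b      # first rounding
    return product + c   # second rounding


# Inputs chosen so that the exact product 1 - 2**-60 rounds to 1.0.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

separate = (a * b) + c                      # plain multiply, then add
fused = fused_multiply_add(a, b, c)         # one rounding
combined = combined_multiply_add(a, b, c)   # two roundings

print(separate, fused, combined)
```

With these inputs the separate and combined results are both 0.0, while the single-rounding result is -2**-60. A gap of this kind is why code validated against separate multiply and add instructions cannot simply be recompiled to use FMA.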
Pages: 623 - 638
Page count: 16