Speculative Hardware/Software Co-Designed Floating-Point Multiply-Add Fusion

Cited by: 6

Authors:

Lupon, Marc [1]

Gibert, Enric [1]

Magklis, Grigorios [1]

Samudrala, Sridhar [1,2]

Martinez, Raul [1]

Stavrou, Kyriakos [1]

Ditzel, David R. [1,2]

Affiliations:

[1] Intel Labs, Intel Barcelona Res Ctr, Barcelona, Spain

[2] Intel Corp, Santa Clara, CA 95051 USA

Source:

ACM SIGPLAN Notices | 2014 / Vol. 49 / No. 4

Keywords:

Algorithms; Performance; Design; Binary translator; HW/SW co-designed processors; Instruction fusion; FMA; Combined Multiply-Add

DOI:

10.1145/2541940.2541978

Chinese Library Classification number:

TP31 [Computer software]

Subject classification codes:

081202; 0835

Abstract:

A Fused Multiply-Add (FMA) instruction is currently available in many general-purpose processors. It increases performance by reducing the latency of dependent operations, and it increases precision by computing the result as an indivisible operation with no intermediate rounding. However, since the arithmetic behavior of a single-rounding FMA operation differs from that of an independent FP multiply followed by an FP add, some algorithms require significant revalidation and rewriting effort to work as expected when they are compiled to use FMA, a cost that developers may not be willing to pay. As a result, many legacy applications cannot use FMA instructions. In this paper we propose a novel HW/SW collaborative technique that efficiently executes workloads with increased utilization of FMA by adding the option to get the same numerical result as separate FP multiply and FP add pairs. In particular, we extended the host ISA of a HW/SW co-designed processor with a new Combined Multiply-Add (CMA) instruction that performs an FMA operation with an intermediate rounding. This new instruction is used by a transparent dynamic translation software layer that applies a speculative instruction-fusion optimization to transform FP multiply and FP add sequences into CMA instructions. The FMA unit has been slightly modified to support both single-rounding and double-rounding fused instructions without increasing their latency and to provide a conservative fall-back path in case of misspeculation. Evaluation on a cycle-accurate timing simulator showed that CMA improved SPECfp performance by 6.3% and reduced executed instructions by 4.7%.
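
The numerical difference the abstract refers to can be illustrated with a small sketch. It is not taken from the paper: the function names are hypothetical, the single-rounding FMA is modeled in software with exact rational arithmetic rather than a hardware instruction, and round-to-nearest on finite double-precision inputs is assumed. The two-rounding function corresponds to the result a separate multiply and add pair produces, which is the result the abstract says CMA preserves.

```python
from fractions import Fraction


def separate_multiply_add(a: float, b: float, c: float) -> float:
    """FP multiply followed by FP add: two roundings (hypothetical helper)."""
    product = a * b      # first rounding: the product is rounded to a double
    return product + c   # second rounding: the sum is rounded to a double


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Single-rounding FMA modeled with exact rationals (hypothetical helper).

    Assumes finite inputs; Fraction() rejects infinities and NaNs.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)  # no intermediate rounding
    return float(exact)  # one rounding to the nearest double


# Inputs chosen so that the exact product, 1 - 2**-60, is not representable
# as a double and rounds to 1.0 when the multiply is rounded on its own.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print(separate_multiply_add(a, b, c))
print(fused_multiply_add(a, b, c))
```

With these inputs the separate pair yields 0.0, because the rounded product is exactly 1.0, while the single-rounding model yields -2^-60. Code validated against one behavior can therefore observe a different value under the other, which is the revalidation cost the abstract describes.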

Citation

Pages: 623-638

Number of pages: 16
