Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level

Cited by: 41

Authors:

Gutierrez, Anthony [1]
Beckmann, Bradford M. [1]
Dutu, Alexandru [1]
Gross, Joseph [1]
Kalamatianos, John [1]
Kayiran, Onur [1]
LeBeane, Michael [1]
Poremba, Matthew [1]
Potter, Brandon [1]
Puthoor, Sooraj [1]
Sinclair, Matthew D. [1]
Wyse, Mark [1]
Yin, Jieming [1]
Zhang, Xianwei [1]
Jain, Akshay [2]
Rogers, Timothy G. [2]

Affiliations:

[1] Adv Micro Devices Inc, AMD Res, Sunnyvale, CA 94088 USA

[2] Purdue Univ, Sch Elect & Comp Engn, W Lafayette, IN 47907 USA

Source:

2018 24th IEEE International Symposium on High Performance Computer Architecture (HPCA) | 2018

Keywords:

ABI; GPU; Intermediate Language; Intermediate Representation; ISA; Simulation

DOI:

10.1109/HPCA.2018.00058

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

Modern GPU frameworks use a two-phase compilation approach. Kernels written in a high-level language are initially compiled to an implementation-agnostic intermediate language (IL), then finalized to the machine ISA only when the target GPU hardware is known. Most GPU microarchitecture simulators available to academics execute IL instructions because there is substantially less functional state associated with the instructions, and in some situations, the machine ISA's intellectual property may not be publicly disclosed. In this paper, we demonstrate the pitfalls of evaluating GPUs using this higher-level abstraction, and make the case that several important microarchitecture interactions are only visible when executing lower-level instructions. Our analysis shows that given identical application source code and GPU microarchitecture models, execution behavior will differ significantly depending on the instruction set abstraction. For example, our analysis shows the dynamic instruction count of the machine ISA is nearly 2x that of the IL on average, but contention for vector registers is reduced by 3x due to the optimized resource utilization. In addition, our analysis highlights the deficiencies of using IL to model instruction fetching, control divergence, and value similarity. Finally, we show that simulating IL instructions adds 33% error as compared to the machine ISA when comparing absolute runtimes to real hardware.

Citation

Pages: 608-619

Page count: 12

References: 38 in total

