Evaluating multi-core and many-core architectures through accelerating the three-dimensional Lax-Wendroff correction stencil

Cited by: 8

Authors:

You, Yang [1]

Fu, Haohuan [1,4]

Song, Shuaiwen Leon [2]

Dehnavi, Maryam Mehri [3]

Gan, Lin [1,4]

Huang, Xiaomeng [1,4]

Yang, Guangwen [1,4]

Affiliations:

[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing 100084, Peoples R China

[2] Pacific NW Natl Lab, Performance Anal Lab, Richland, WA 99352 USA

[3] MIT, Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139 USA

[4] Tsinghua Univ, Key Lab Earth Syst Modeling, Minist Educ, Beijing 100084, Peoples R China

Source:

INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS | 2014 / Vol. 28 / Issue 03

Funding:

National Natural Science Foundation of China;

Keywords:

Complex stencil; 3D wave forward modeling; Kepler GPU; Intel Xeon Phi; optimization techniques; performance power analysis; WAVE-PROPAGATION; HIGH-ORDER; GPU; PROCESSORS; POWER; CARDS;

DOI:

10.1177/1094342014524807

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

Wave propagation forward modeling is a widely used computational method in oil and gas exploration. The iterative stencil loops in such problems have broad applications in scientific computing. However, executing such loops can be highly time-consuming, which greatly limits their performance and power efficiency. In this paper, we accelerate the forward-modeling technique on the latest multi-core and many-core architectures such as Intel(R) Sandy Bridge CPUs, NVIDIA Fermi C2070 GPUs, NVIDIA Kepler K20X GPUs, and the Intel(R) Xeon Phi co-processor (MIC). For the GPU platforms, we propose two parallel strategies to explore the performance optimization opportunities for our stencil kernels. For Sandy Bridge CPUs and MIC, we also employ various optimization techniques in order to achieve the best performance. Although our stencil with 114 component variables poses several great challenges for performance optimization, and its low ratio of computation to memory access makes it hard to take full advantage of the evaluated architectures, we manage to achieve performance efficiencies ranging from 4.730% to 20.02% of the theoretical peak. We also conduct a cross-platform performance and power analysis (focusing on the Kepler GPU and MIC); the results could offer insights for users selecting the most suitable accelerators for their targeted applications.

Citation

Pages: 301-318

Page count: 18
