Automatic Generation of High-Order Finite-Difference Code with Temporal Blocking For Extreme-Scale Many-Core Systems

Cited by: 4

Authors:

Tanaka, Hideyuki [1]

Ishihara, Youhei [2]

Sakamoto, Ryo [4]

Nakamura, Takashi [4]

Kimura, Yasuyuki [1]

Nitadori, Keigo [3]

Tsubouchi, Miyuki [3]

Makino, Jun [5]

Affiliations:

[1] ExaScaler Inc, Chiyoda Ku, 2-1 Ogawa Machi, Tokyo 1010052, Japan

[2] Kyoto Univ, Yukawa Inst Theoret Phys, Sakyo Ku, Kyoto 6068502, Japan

[3] RIKEN, R CCS, Chuo Ku, 7-1-26 Minatojima Minami Machi, Kobe, Hyogo 6500047, Japan

[4] PEZY Comp KK, Chiyoda Ku, 1-11 Ogawa Machi, Tokyo 1010052, Japan

[5] Kobe Univ, Nada Ku, 1-1 Rokkodai Cho, Kobe, Hyogo 6578501, Japan

Source:

PROCEEDINGS OF 2018 IEEE/ACM 4TH INTERNATIONAL WORKSHOP ON EXTREME SCALE PROGRAMMING MODELS AND MIDDLEWARE (ESPM2 2018) | 2018

Keywords:

MERIDIONAL FLOW; ROTATION; SCHEME;

DOI:

10.1109/ESPM2.2018.00008

Chinese Library Classification (CLC) number:

TP31 [Computer Software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

In this paper we describe the basic idea, implementation, and achieved performance of Formura, our DSL for stencil computation, on systems based on the PEZY-SC2 many-core processor. From a high-level description of the differential equation and a simple description of the finite-difference stencil, Formura generates the entire simulation code, including MPI parallelization with overlapped communication and calculation, advanced temporal blocking, and parallelization for many-core processors. The achieved performance is 4.78 PF, or 21.5% of the theoretical peak, for an explicit scheme for compressible CFD with fourth-order accuracy in space and third-order accuracy in time. For a slightly modified implementation of the same scheme, the efficiency was lower (17.5%), but the calculation per timestep was 25% faster. Temporal blocking improved the performance by up to 70%. Even though the B/F number (bytes of memory bandwidth per floating-point operation) of PEZY-SC2 is low, around 0.02, we achieved an efficiency comparable to that of highly optimized CFD codes on machines with much higher memory bandwidth, such as the K computer. We have demonstrated that automatic generation of code with temporal blocking is a highly effective way to use very large-scale machines with low memory bandwidth for large-scale CFD calculations.
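As an illustration of the temporal blocking idea mentioned in the abstract, the following is a minimal sketch, not Formura's generated code. It assumes a 1D three-point averaging stencil with periodic boundaries and uses overlapped (halo) tiles, which is only one simple form of temporal blocking; the abstract does not specify the scheme Formura actually uses. The names `step`, `naive`, and `temporally_blocked` are hypothetical.

```python
def step(u):
    """One explicit 3-point averaging step with periodic boundaries."""
    n = len(u)
    return [0.25 * u[(i - 1) % n] + 0.5 * u[i] + 0.25 * u[(i + 1) % n]
            for i in range(n)]


def naive(u, steps):
    """Reference version: sweep the whole array once per time step."""
    for _ in range(steps):
        u = step(u)
    return u


def temporally_blocked(u, steps, tile):
    """Advance `steps` time steps tile by tile (assumed overlapped tiling).

    Each tile is loaded once together with a halo of `steps` cells per side,
    advanced `steps` times in a small local buffer, and written back once.
    """
    n = len(u)
    out = [0.0] * n
    for start in range(0, n, tile):
        end = min(start + tile, n)
        # Load the tile plus its halo; the modulo gives the periodic wrap.
        local = [u[i % n] for i in range(start - steps, end + steps)]
        # Every local step loses one valid cell at each edge of the buffer.
        for _ in range(steps):
            local = [0.25 * local[j - 1] + 0.5 * local[j] + 0.25 * local[j + 1]
                     for j in range(1, len(local) - 1)]
        # After `steps` shrinkings exactly the tile interior remains.
        out[start:end] = local
    return out


if __name__ == "__main__":
    u0 = [float((i * 7) % 5) for i in range(16)]
    same = naive(u0, 3) == temporally_blocked(u0, 3, 5)
    print("blocked result matches naive sweep:", same)
```

The trade-off in this sketch is redundant computation in the halo regions in exchange for fewer passes over main memory, which is the kind of exchange that matters on a machine with a low B/F number.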

References

Pages: 29-36

Page count: 8

13 entries in total

[1] [Anonymous], 2016, P PLATF ADV SCI COMP

[2] [Anonymous], 2016, P INT C HIGH PERF CO

[3] Interpolated Differential Operator (IDO) scheme for solving partial differential equations [J]. Aoki, T. COMPUTER PHYSICS COMMUNICATIONS, 1997, 102 (1-3): 132-146

[4] Multi-dimensional cosmological radiative transfer with a Variable Eddington Tensor formalism [J]. Gnedin, NY; Abel, T. NEW ASTRONOMY, 2001, 6 (07): 437-455

[5] HIGH-RESOLUTION CALCULATIONS OF THE SOLAR GLOBAL CONVECTION WITH THE REDUCED SPEED OF SOUND TECHNIQUE. I. THE STRUCTURE OF THE CONVECTION AND THE MAGNETIC FIELD WITHOUT THE ROTATION [J]. Hotta, H.; Rempel, M.; Yokoyama, T. ASTROPHYSICAL JOURNAL, 2014, 786 (01)

[6] MAKINO J, 1992, PUBL ASTRON SOC JPN, V44, P141

[7] OPTIMAL ORDER AND TIME-STEP CRITERION FOR AARSETH-TYPE N-BODY INTEGRATORS [J]. MAKINO, J. ASTROPHYSICAL JOURNAL, 1991, 369 (01): 200-212

[8] Automatic Generation of Efficient Codes from Mathematical Descriptions of Stencil Computation [J]. Muranushi, Takayuki; Nishizawa, Seiya; Tomita, Hirofumi; Nitadori, Keigo; Iwasawa, Masaki; Maruyama, Yutaka; Yashiro, Hisashi; Nakamura, Yoshifumi; Hotta, Hideyuki; Makino, Junichiro; Hosono, Natsuki; Inoue, Hikaru. FHPC'16: PROCEEDINGS OF THE 5TH INTERNATIONAL WORKSHOP ON FUNCTIONAL HIGH-PERFORMANCE COMPUTING, 2016: 17-22

[9] Muranushi T, 2016, SC '16: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, P23, DOI 10.1109/SC.2016.2

[10] Optimal Temporal Blocking for Stencil Computation [J]. Muranushi, Takayuki; Makino, Junichiro. INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE, ICCS 2015 COMPUTATIONAL SCIENCE AT THE GATES OF NATURE, 2015, 51: 1303-1312
