Automatic Generation of High-Order Finite-Difference Code with Temporal Blocking For Extreme-Scale Many-Core Systems

Cited by: 4
Authors
Tanaka, Hideyuki [1]
Ishihara, Youhei [2]
Sakamoto, Ryo [4]
Nakamura, Takashi [4]
Kimura, Yasuyuki [1]
Nitadori, Keigo [3]
Tsubouchi, Miyuki [3]
Makino, Jun [5]
Affiliations
[1] ExaScaler Inc, Chiyoda Ku, 2-1 Ogawa Machi, Tokyo 1010052, Japan
[2] Kyoto Univ, Yukawa Inst Theoret Phys, Sakyo Ku, Kyoto 6068502, Japan
[3] RIKEN, R CCS, Chuo Ku, 7-1-26 Minatojima Minami Machi, Kobe, Hyogo 6500047, Japan
[4] PEZY Comp KK, Chiyoda Ku, 1-11 Ogawa Machi, Tokyo 1010052, Japan
[5] Kobe Univ, Nada Ku, 1-1 Rokkodai Cho, Kobe, Hyogo 6578501, Japan
Source
PROCEEDINGS OF 2018 IEEE/ACM 4TH INTERNATIONAL WORKSHOP ON EXTREME SCALE PROGRAMMING MODELS AND MIDDLEWARE (ESPM2 2018) | 2018
Keywords
MERIDIONAL FLOW; ROTATION; SCHEME;
DOI
10.1109/ESPM2.2018.00008
Chinese Library Classification
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
This paper describes the basic idea, implementation, and achieved performance of Formura, our DSL for stencil computation, on systems based on the PEZY-SC2 many-core processor. From a high-level description of the differential equation and a simple description of the finite-difference stencil, Formura generates the entire simulation code, including MPI parallelization with overlapped communication and calculation, advanced temporal blocking, and parallelization for many-core processors. The achieved performance is 4.78 PF, or 21.5% of the theoretical peak, for an explicit scheme for compressible CFD that is fourth-order accurate in space and third-order accurate in time. For a slightly modified implementation of the same scheme, the efficiency was lower (17.5%), but the actual calculation per timestep was 25% faster. Temporal blocking improved performance by up to 70%. Even though the B/F (bytes per flop) ratio of PEZY-SC2 is low, around 0.02, the achieved efficiency is comparable to that of highly optimized CFD codes on machines with much higher memory bandwidth, such as the K computer. These results demonstrate that automatic generation of code with temporal blocking is an effective way to use very large-scale machines with low memory bandwidth for large-scale CFD calculations.
Pages: 29-36
Page count: 8
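
The abstract credits temporal blocking with much of the speedup on a low-bandwidth machine. The record does not describe Formura's actual blocking scheme, so the following is only a minimal sketch of the general idea, assuming a 1D three-point diffusion stencil, periodic boundaries, and the simplest overlapped variant with redundant halo computation. All function names and the coefficient `c` are hypothetical and are not taken from Formura.

```python
import numpy as np

def step(u, c=0.25):
    """One explicit 3-point stencil update; the output is 2 cells shorter."""
    return u[1:-1] + c * (u[:-2] - 2.0 * u[1:-1] + u[2:])

def naive(u, nt, c=0.25):
    """Reference: nt full sweeps over a periodic domain, one timestep per sweep."""
    for _ in range(nt):
        padded = np.concatenate((u[-1:], u, u[:1]))  # periodic halo of width 1
        u = step(padded, c)
    return u

def temporally_blocked(u, nt, tile=64, c=0.25):
    """Same result, but each tile is read once with a halo of width nt and
    advanced nt timesteps locally before anything is written back."""
    n = u.size
    out = np.empty_like(u)
    for start in range(0, n, tile):
        stop = min(start + tile, n)
        idx = np.arange(start - nt, stop + nt) % n  # periodic halo of width nt
        local = u[idx]
        for _ in range(nt):
            local = step(local, c)  # valid region shrinks by 1 cell per side
        out[start:stop] = local
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u0 = rng.random(200)
    print(np.allclose(naive(u0, 4), temporally_blocked(u0, 4, tile=64)))
```

In this sketch each tile touches main memory once per `nt` timesteps instead of once per timestep, at the cost of recomputing the halo cells. That trade of extra arithmetic for less memory traffic is the kind a low B/F ratio favors.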