Efficient CFD code implementation for the ARM-based Mont-Blanc architecture

Cited by: 16
Authors
Oyarzun, G. [1, 2]
Borrell, R. [1, 2]
Gorobets, A. [2, 3]
Mantovani, F. [4]
Oliva, A. [2]
Affiliations
[1] Termo Fluids SL, C Magi Colet 8, Sabadell 08204, Barcelona, Spain
[2] Tech Univ Catalonia, Heat & Mass Transfer Technol Ctr, ETSEIAT, C Colom 11, Terrassa 08222, Spain
[3] Keldysh Inst Appl Math RAS, 4A Miusskaya Sq, Moscow 125047, Russia
[4] Barcelona Supercomp Ctr, C Jordi Girona 3, Barcelona 08034, Spain
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2018 / Volume 79
Funding
European Union Seventh Framework Programme; Russian Science Foundation
Keywords
ARM system; Heterogeneous computing; Parallel CFD; Energy-efficient computing; MATRIX-VECTOR MULTIPLICATION; NUMERICAL-SOLUTION; PERFORMANCE;
DOI
10.1016/j.future.2017.09.029
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Since 2011, the European project Mont-Blanc has been focused on enabling ARM-based technology for HPC, developing both hardware platforms and system software. The latest Mont-Blanc prototypes use system-on-chip (SoC) devices that combine a CPU and a GPU sharing a common main memory. Specific developments of parallel computing software and well-suited implementation approaches are of crucial importance for such a heterogeneous architecture in order to efficiently exploit its potential. This paper is devoted to the optimizations carried out in the TermoFluids CFD code to efficiently run it on the Mont-Blanc system. The underlying numerical method is based on an unstructured finite-volume discretization of the Navier-Stokes equations for the numerical simulation of incompressible turbulent flows. It is implemented using a portable and modular operational approach based on a minimal set of linear algebra operations. An architecture-specific heterogeneous multilevel MPI+OpenMP+OpenCL implementation of such kernels is proposed. It includes optimizations of the storage formats, dynamic load balancing between the CPU and GPU devices, and hiding of communication overheads by overlapping computations and data transfers. A detailed performance study shows time reductions of up to 2.1x in the kernels' execution with the new heterogeneous implementation, its scalability on up to 128 Mont-Blanc nodes, and the energy savings (around 40%) achieved with the Mont-Blanc system versus the high-end hybrid supercomputer MinoTauro. © 2017 The Authors. Published by Elsevier B.V.
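The abstract does not spell out the kernels, the storage format, or the balancing rule, so the following is only an illustrative sketch under stated assumptions. It assumes the sparse matrix-vector product (named in the keywords) is one of the linear algebra kernels, uses the common CSR layout as a stand-in storage format, and uses a simple throughput-proportional rule as a stand-in for the dynamic CPU/GPU load balancing. The names `spmv_csr` and `rebalance` are hypothetical and are not taken from TermoFluids.

```python
def spmv_csr(row_ptr, col_idx, values, x, row_begin=0, row_end=None):
    """Compute y = A @ x for rows [row_begin, row_end) of a CSR matrix.

    Restricting the row range lets two devices each process their own
    slice of the same matrix (assumed here; shared memory makes this cheap).
    """
    if row_end is None:
        row_end = len(row_ptr) - 1
    y = []
    for i in range(row_begin, row_end):
        acc = 0.0
        # Nonzeros of row i live in positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def rebalance(cpu_share, t_cpu, t_gpu):
    """Return an updated CPU share of the rows (hypothetical rule).

    Each device's throughput is estimated as (share of work) / (measured
    time); the new split is proportional to those throughputs, so both
    devices would finish at the same time if the rates stayed constant.
    """
    cpu_rate = cpu_share / t_cpu
    gpu_rate = (1.0 - cpu_share) / t_gpu
    return cpu_rate / (cpu_rate + gpu_rate)


# 3x3 example matrix [[2, 0, 1], [0, 3, 0], [4, 0, 5]] in CSR form.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]

y_full = spmv_csr(row_ptr, col_idx, values, x)
# Split the rows between a "CPU" slice and a "GPU" slice.
y_split = spmv_csr(row_ptr, col_idx, values, x, 0, 1) + \
          spmv_csr(row_ptr, col_idx, values, x, 1, 3)
new_share = rebalance(0.5, t_cpu=2.0, t_gpu=1.0)
```

In this sketch, a CPU that took twice as long as the GPU on an even split has its share reduced to one third on the next iteration. A real implementation would run the two slices concurrently and would also need to overlap halo exchanges with the computation, which is not shown here.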
Pages: 786-796
Page count: 11