Loop Tiling in Large-Scale Stencil Codes at Run-Time with OPS

Cited by: 27

Authors:

Reguly, Istvan Z. [1]

Mudalige, Gihan R. [2]

Giles, Michael B. [3]

Affiliations:

[1] PPCU ITK, H-1083 Budapest, Hungary

[2] Univ Warwick, Dept Comp Sci, Coventry CV4 7AL, W Midlands, England

[3] Univ Oxford, Maths Inst, Oxford OX1 2JD, England

Source:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 2018, Vol. 29, No. 4

Funding:

Engineering and Physical Sciences Research Council (EPSRC), UK;

Keywords:

DSL; tiling; cache blocking; memory locality; OPS; stencil; structured mesh; LOCALITY; OPTIMIZATION;

DOI:

10.1109/TPDS.2017.2778161

Chinese Library Classification (CLC) number:

TP301 [Theory, methods];

Discipline classification code:

081202;

Abstract:

The key common bottleneck in most stencil codes is data movement, and prior research has shown that optimisations which improve data locality across loops are particularly effective. However, in many large PDE applications it is not possible to apply such optimisations through compilers because there are many options, execution paths and data per grid point, many dependent on run-time parameters, and the code is distributed across different compilation units. In this paper, we adapt the data-locality-improving optimisation called tiling for use in large OPS applications on both shared-memory and distributed-memory systems, relying on run-time analysis and delayed execution. We evaluate our approach on a number of applications, observing speedups of 2× on the CloverLeaf 2D/3D proxy applications, which contain 83 (2D) / 141 (3D) loops, 3.5× on the linear solver TeaLeaf, and 1.7× on the compressible Navier-Stokes solver OpenSBLI. We demonstrate strong and weak scalability on up to 4608 cores of CINECA's Marconi supercomputer. We also evaluate our algorithms on Intel's Knights Landing, demonstrating maintained throughput as the problem size grows beyond 16 GB, and we carry out scaling studies on up to 8704 cores. The approach is generally applicable to any stencil DSL that provides per-loop-nest data access information.
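The combination of delayed execution, run-time dependency analysis, and tiling across loops that the abstract describes can be illustrated with a small sketch. This is not the OPS API or the authors' algorithm: `LoopChain`, `par_loop`, `execute_tiled`, and `run_chain` are hypothetical names, the domain is 1D, and the sketch assumes every dataset is written by at most one loop in the chain and only read by later loops, so only read-after-write dependencies have to be handled.

```python
class LoopChain:
    """Toy delayed-execution runtime for 1D stencil loops (hypothetical, not the OPS API)."""

    def __init__(self, n, datasets):
        self.n = n
        self.data = {name: [0.0] * n for name in datasets}
        self.queue = []  # recorded loops: (kernel, src, dst, radius)

    def par_loop(self, kernel, src, dst, radius):
        # Delayed execution: record the loop and its access descriptor, run nothing yet.
        self.queue.append((kernel, src, dst, radius))

    def _run(self, loop, lo, hi):
        kernel, src, dst, radius = loop
        a, out = self.data[src], self.data[dst]
        for i in range(lo, hi):
            # Neighbours outside the domain are clamped to the boundary point.
            window = [a[min(max(i + d, 0), self.n - 1)]
                      for d in range(-radius, radius + 1)]
            out[i] = kernel(window)

    def execute_untiled(self):
        # Reference schedule: each loop sweeps the whole domain in turn.
        for loop in self.queue:
            self._run(loop, 0, self.n)
        self.queue = []

    def execute_tiled(self, tile):
        loops = self.queue
        # Run-time analysis, walking the chain backwards: shift[k] is how far
        # past the tile edge loop k must run so later loops find their inputs.
        need = {}  # dataset name -> extra extent required by later readers
        shift = [0] * len(loops)
        for k in reversed(range(len(loops))):
            _, src, dst, radius = loops[k]
            shift[k] = need.get(dst, 0)
            need[src] = max(need.get(src, 0), shift[k] + radius)
        # Skewed tiles: run every loop on one tile before moving to the next.
        done = [0] * len(loops)  # end of the range each loop has completed
        for end in range(tile, self.n + tile, tile):
            for k, loop in enumerate(loops):
                hi = min(end + shift[k], self.n)
                self._run(loop, done[k], hi)
                done[k] = hi
        self.queue = []


def run_chain(tiled, n=37, tile=8):
    chain = LoopChain(n, ["u0", "u1", "u2", "u3"])
    chain.data["u0"] = [float((7 * i) % 11) for i in range(n)]
    mean = lambda w: sum(w) / len(w)
    chain.par_loop(mean, "u0", "u1", 1)
    chain.par_loop(mean, "u1", "u2", 2)
    chain.par_loop(mean, "u2", "u3", 1)
    if tiled:
        chain.execute_tiled(tile)
    else:
        chain.execute_untiled()
    return chain.data["u3"]
```

Calling `run_chain(True)` and `run_chain(False)` should give identical results, since the tiled schedule only reorders the same per-point computations. In Python the sketch shows the schedule only; any locality benefit would come from a compiled implementation in which the working set of one tile fits in cache.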

Citation:

Pages: 873-886

Page count: 14

