Understanding Stencil Code Performance On MultiCore Architectures

Cited by: 23

Authors:

Rahman, Shah M. Faizur [1]

Yi, Qing [1]

Qasem, Apan [2]

Affiliations:

[1] Univ Texas San Antonio, Dept Comp Sci, San Antonio, TX 78249 USA

[2] Texas State Univ, Dept Comp Sci, San Marcos, TX USA

Source:

PROCEEDINGS OF THE 2011 8TH ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS (CF 2011) | 2011

Funding:

U.S. National Science Foundation;

Keywords:

DOI:

10.1145/2016604.2016641

Chinese Library Classification (CLC) number:

TP301 [Theory and methods];

Discipline classification code:

081202;

Abstract:

Stencil computations are the foundation of many large applications in scientific computing. Previous research has shown that several optimization mechanisms, including rectangular blocking and time skewing combined with wavefront- and pipeline-based parallelization, can be used to significantly improve the performance of stencil kernels on multi-core architectures. However, the overall performance impact of these optimizations is difficult to predict due to the interplay of load imbalance, synchronization overhead, and cache locality. This paper presents a detailed performance study of these optimizations by applying them with a wide variety of configurations, using hardware counters to monitor the efficiency of architectural components, and then developing a set of formulas via regression analysis to model their overall performance impact in terms of the affected hardware counter numbers. We have applied our methodology to three stencil computation kernels: a 7-point Jacobi, a 27-point Jacobi, and a 7-point Gauss-Seidel computation. Our experimental results show that a precise formula can be developed for each kernel to accurately model the overall performance impact of varying optimizations and thereby effectively guide the performance analysis and tuning of these kernels.
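For readers unfamiliar with the kernels named in the abstract, the following is a minimal, unoptimized sketch of one sweep of a 7-point Jacobi stencil on a 3D grid. It is an illustration only, not the authors' code: the function names `jacobi7_sweep` and `make_grid`, the equal 1/7 weights, and the fixed-boundary treatment are all assumptions made for this example.

```python
def make_grid(n, value=0.0):
    """Build an n x n x n grid as nested lists (hypothetical helper)."""
    return [[[value] * n for _ in range(n)] for _ in range(n)]


def jacobi7_sweep(src):
    """One 7-point Jacobi sweep.

    Each interior point becomes the mean of itself and its six face
    neighbours (equal weights are an assumption of this sketch).
    Reads come only from `src` and writes go only to `dst`, which is
    what distinguishes Jacobi from an in-place Gauss-Seidel update.
    """
    nx, ny, nz = len(src), len(src[0]), len(src[0][0])
    # Copy the grid so boundary points carry over unchanged.
    dst = [[row[:] for row in plane] for plane in src]
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                dst[i][j][k] = (
                    src[i][j][k]
                    + src[i - 1][j][k] + src[i + 1][j][k]
                    + src[i][j - 1][k] + src[i][j + 1][k]
                    + src[i][j][k - 1] + src[i][j][k + 1]
                ) / 7.0
    return dst


if __name__ == "__main__":
    grid = make_grid(3)
    grid[1][1][1] = 7.0          # single interior point of a 3x3x3 grid
    result = jacobi7_sweep(grid)
    print(result[1][1][1])
```

Optimizations such as rectangular blocking and time skewing reorder the iterations of loops like these to improve cache reuse across sweeps; the sketch deliberately omits them.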

References

Pages: 10

29 entries in total

[11] Eranian Stephane., 2008, MSPC '08: Proceedings of the 2008 ACM SIGPLAN workshop on Memory systems performance and correctness, P26

[12] Fraguela B., 2009, PACT 09 PARALLEL ARC

[13] *INT CORP, 2000, INT PENT 4 PROC OPT

[14] Kamil S., 2010, P 14 INT S PAR DISTR

[15] Effective automatic parallelization of stencil computations [J]. Krishnamoorthy, Sriram; Baskaran, Muthu; Bondhugula, Uday; Ramanujam, J.; Rountev, Atanas; Sadayappan, P. ACM SIGPLAN NOTICES, 2007, 42 (06): 235-244

[16] Improving Parallelism and Locality with Asynchronous Algorithms [J]. Liu, Lixia; Li, Zhiyuan. PPOPP 2010: PROCEEDINGS OF THE 2010 ACM SIGPLAN SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING, 2010: 213-222

[17] Marin G., 2008, P 2008 IEEE INT S PE

[18] Peleg N., 2007, 16 INT C PAR ARCH CO

[19] Rahman S.F., 2011, HIPEAC HIGH IN PRESS

[20] Singh Karan, 2009, Computer Architecture News, V37, P46, DOI 10.1145/1577129.1577137
