Optimization and parallelization of B-spline based orbital evaluations in QMC on multi/many-core shared memory processors

Cited by: 7
Authors
Mathuriya, Amrita [1]
Luo, Ye [2]
Benali, Anouar [2]
Shulenburger, Luke [3]
Kim, Jeongnim [1]
Affiliations
[1] Intel Corp, Santa Clara, CA 95051 USA
[2] Argonne Natl Lab, Argonne, IL 60439 USA
[3] Sandia Natl Labs, Livermore, CA 94550 USA
Source
2017 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS) | 2017
Keywords
QMC; B-spline; SoA; AoSoA; vectorization; cache-blocking data-layouts and roofline; MONTE-CARLO; DIFFUSION;
DOI
10.1109/IPDPS.2017.33
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
B-spline based orbital representations are widely used in Quantum Monte Carlo (QMC) simulations of solids, and their evaluation has historically taken as much as 50% of the total run time. Random accesses to a large four-dimensional array make it challenging to efficiently utilize the caches and wide vector units of modern CPUs. We present node-level optimizations of B-spline evaluations on multi/many-core shared memory processors. To increase SIMD efficiency and bandwidth utilization, we first apply a data layout transformation from array-of-structures (AoS) to structure-of-arrays (SoA). Then, by blocking SoA objects, we optimize cache reuse and obtain sustained throughput for a range of problem sizes. We implement efficient nested threading in B-spline orbital evaluation kernels, paving the way towards strong scaling of QMC simulations. These optimizations are portable across four distinct cache-coherent architectures and result in performance enhancements of up to 5.6x on an Intel(R) Xeon Phi(TM) processor 7250P (KNL), 5.7x on an Intel(R) Xeon Phi(TM) coprocessor 7120P, 10x on an Intel(R) Xeon(R) processor E5v4 CPU and 9.5x on a BlueGene/Q processor. Our nested threading implementation shows nearly ideal parallel efficiency on KNL up to 16 threads. We employ roofline performance analysis to model the impacts of our optimizations. This work, combined with our current efforts to optimize other QMC kernels, results in a greater than 4.5x speedup of miniQMC on KNL.
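The two layout steps named in the abstract (AoS to SoA, then blocking of SoA objects) can be illustrated with a minimal C++ sketch. This is not the authors' implementation: the record type `OrbitalAoS`, the container `OrbitalsSoA`, the helpers `to_soa`, `to_aosoa` and `sum_values`, and the choice of one value plus three gradient components per orbital are all hypothetical names and assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical AoS record: the four components of one orbital sit together,
// so a loop over orbitals touches each component with stride 4.
struct OrbitalAoS { double v, gx, gy, gz; };

// Hypothetical SoA container: one contiguous array per component,
// so a loop over orbitals reads each component with unit stride.
struct OrbitalsSoA {
    std::vector<double> v, gx, gy, gz;
    explicit OrbitalsSoA(std::size_t n) : v(n), gx(n), gy(n), gz(n) {}
    std::size_t size() const { return v.size(); }
};

// AoS -> SoA: copy each component into its own array.
OrbitalsSoA to_soa(const std::vector<OrbitalAoS>& aos) {
    OrbitalsSoA soa(aos.size());
    for (std::size_t i = 0; i < aos.size(); ++i) {
        soa.v[i]  = aos[i].v;
        soa.gx[i] = aos[i].gx;
        soa.gy[i] = aos[i].gy;
        soa.gz[i] = aos[i].gz;
    }
    return soa;
}

// Blocking (AoSoA): split the orbitals into SoA tiles of at most `tile`
// entries, so that one tile's working set can be sized to fit in cache.
std::vector<OrbitalsSoA> to_aosoa(const std::vector<OrbitalAoS>& aos,
                                  std::size_t tile) {
    assert(tile > 0);
    std::vector<OrbitalsSoA> tiles;
    for (std::size_t begin = 0; begin < aos.size(); begin += tile) {
        const std::size_t len = std::min(tile, aos.size() - begin);
        OrbitalsSoA t(len);
        for (std::size_t i = 0; i < len; ++i) {
            const OrbitalAoS& o = aos[begin + i];
            t.v[i]  = o.v;
            t.gx[i] = o.gx;
            t.gy[i] = o.gy;
            t.gz[i] = o.gz;
        }
        tiles.push_back(std::move(t));
    }
    return tiles;
}

// Example kernel: the inner loop is unit-stride within one tile, which is
// the access pattern a compiler can vectorize most easily.
double sum_values(const std::vector<OrbitalsSoA>& tiles) {
    double total = 0.0;
    for (const OrbitalsSoA& t : tiles)
        for (std::size_t i = 0; i < t.size(); ++i)
            total += t.v[i];
    return total;
}
```

In this sketch the tile size is a free parameter. A natural extension, again as an assumption rather than something stated in this record, is to hand different tiles to different threads, since each tile is an independent SoA object.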
Pages: 213-223
Page count: 11