A Case Study for Performance Portability Using OpenMP 4.5

Cited by: 21
Authors
Gayatri, Rahulkumar [1]
Yang, Charlene [1]
Kurth, Thorsten [1]
Deslippe, Jack [1]
Affiliations
[1] Lawrence Berkeley National Laboratory (LBNL), National Energy Research Scientific Computing Center (NERSC), Berkeley, CA 94720, USA
Source
Accelerator Programming Using Directives | 2019 / Vol. 11381
Keywords
OpenMP 3.0; OpenMP 4.5; OpenACC; CUDA; Parallel programming models; P100; V100; Xeon Phi; Haswell
DOI
10.1007/978-3-030-12274-4_4
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
In recent years, the HPC landscape has shifted away from traditional multi-core CPU systems toward energy-efficient architectures, such as many-core CPUs and accelerators like GPUs, to achieve high performance. The goal of performance portability is to enable developers to rapidly produce applications that run efficiently on a variety of these architectures, with little to no architecture-specific code adaptation required. We implement a key kernel from a materials science application using OpenMP 3.0, OpenMP 4.5, OpenACC, and CUDA on Intel architectures (Xeon and Xeon Phi) and NVIDIA GPUs (P100 and V100). We compare the performance of the OpenMP 4.5 implementation with that of the more architecture-specific implementations, examine the performance of the OpenMP 4.5 implementation on CPUs after back-porting, share our experience optimizing large reduction loops, and discuss the latest compiler status for OpenMP 4.5 and OpenACC.
Pages: 75-95
Page count: 21