Accommodating Thread-Level Heterogeneity in Coupled Parallel Applications

Cited by: 4
Authors
Gutierrez, Samuel K. [1, 2]
Davis, Kei [1]
Arnold, Dorian C. [2]
Baker, Randal S. [1]
Robey, Robert W. [1]
McCormick, Patrick [1]
Holladay, Daniel [1]
Dahl, Jon A. [1]
Zerr, R. Joe [1]
Weik, Florian [1]
Junghans, Christoph [1]
Affiliations
[1] Los Alamos Natl Lab, Los Alamos, NM 87545 USA
[2] Univ New Mexico, Dept Comp Sci, Albuquerque, NM 87131 USA
Source
2017 31ST IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS) | 2017
Keywords
MPI; MPI plus X; Pthreads; OpenMP
DOI
10.1109/IPDPS.2017.13
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
Hybrid parallel program models that combine message passing and multithreading (MP+MT) are becoming more popular, extending the basic message passing (MP) model that uses single-threaded processes for both inter- and intra-node parallelism. A consequence is that coupled parallel applications increasingly comprise MP libraries together with MP+MT libraries with differing preferred degrees of threading, resulting in thread-level heterogeneity. Retroactively matching threading levels between independently developed and maintained libraries is difficult; the challenge is exacerbated because contemporary parallel job launchers provide only static resource binding policies over entire application executions. A standard approach for accommodating thread-level heterogeneity is to under-subscribe compute resources such that the library with the highest degree of threading per process has one processing element per thread. This results in libraries with fewer threads per process utilizing only a fraction of the available compute resources. We present and evaluate a novel approach for accommodating thread-level heterogeneity. Our approach enables full utilization of all available compute resources throughout an application's execution by providing programmable facilities to dynamically reconfigure runtime environments for compute phases with differing threading factors and memory affinities. We show that our approach can improve overall application performance by up to 5.8x in real-world production codes. Furthermore, the practicality and utility of our approach have been demonstrated by continuous production use for over one year, and by more recent incorporation into a number of production codes.
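The under-subscription cost described in the abstract can be illustrated with a small arithmetic sketch. This is not the authors' code or method: the function name, the 32-core node, and the per-library thread counts are hypothetical values chosen only for illustration. It assumes a static launch in which every process is given as many cores as the most heavily threaded library needs.

```python
def utilization(cores_per_node, threads_per_process):
    """Fraction of a node's cores each library keeps busy when the job is
    launched once, statically, sized for the most heavily threaded library.

    Hypothetical helper for illustration; not taken from the paper.
    """
    max_threads = max(threads_per_process.values())
    # Static under-subscription: each process reserves max_threads cores,
    # so only this many processes fit on a node.
    processes_per_node = cores_per_node // max_threads
    return {
        name: processes_per_node * threads / cores_per_node
        for name, threads in threads_per_process.items()
    }


# Assumed example: a 32-core node, a single-threaded MP library coupled
# with an MP+MT library that prefers 8 threads per process.
result = utilization(32, {"mp_library": 1, "mp_mt_library": 8})
print(result)
```

Under these assumed numbers, the single-threaded phase keeps only one core in eight busy, which is the kind of idle capacity that reconfiguring the runtime between compute phases is meant to recover.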
Pages: 469-478
Page count: 10
Related papers
21 in total
[1]  
[Anonymous], P INT C HIGH PERF CO
[2]  
Arnold A., 2013, MESHFREE METHODS PAR, V89, P1, DOI 10.1007/978-3-642-32979-1_1
[3]  
Baker A.H., 2012, High-Performance Scientific Computing: Algorithms and Applications, P261, DOI 10.1007/978-1-4471-2437-5_13
[4]  
Bauer M., 2012, P INT C HIGH PERF CO, P66
[5] hwloc: a Generic Framework for Managing Hardware Affinities in HPC Applications.
Broquedis, Francois ;
Clet-Ortega, Jerome ;
Moreaud, Stephanie ;
Furmento, Nathalie ;
Goglin, Brice ;
Mercier, Guillaume ;
Thibault, Samuel ;
Namyst, Raymond .
PROCEEDINGS OF THE 18TH EUROMICRO CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING, 2010, :180-186
[6]  
Canning A., 2012, P CSC 12 C
[7]  
Chow E, 2001, UCRLJC143957 LLNL
[8]  
Drosinos N., 2004, PAR DISTR PROC S 200
[9] Kokkos: Enabling manycore performance portability through polymorphic memory access patterns.
Edwards, H. Carter ;
Trott, Christian R. ;
Sunderland, Daniel .
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2014, 74 (12) :3202-3216
[10]  
Goglin B, 2014, 2014 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING & SIMULATION (HPCS), P74, DOI 10.1109/HPCSim.2014.6903671