Towards Communication Profile, Topology and Node Failure aware Process Placement

Times cited: 1
Authors
Vardas, Ioannis [1]
Ploumidis, Manolis [1]
Marazakis, Manolis [1]
Affiliations
[1] Fdn Res & Technol Hellas FORTH, Inst Comp Sci ICS, 100 N Plastira Av, GR-70013 Iraklion, Greece
Source
2020 IEEE 32ND INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD 2020) | 2020
Keywords
Failure aware resource allocation; Resilience; MPI parallel jobs; LARGE-SCALE; PARALLEL; RELIABILITY; ALLOCATION; ALGORITHMS; CLUSTERS
DOI
10.1109/SBAC-PAD49847.2020.00041
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
HPC systems need to keep growing in size to meet the ever-increasing demand for high levels of capability and capacity, often in tight time windows for urgent computation. However, increasing the size, complexity and heterogeneity of HPC systems also increases the risk and impact of system failures, which result in resource waste and aborted jobs. A major contributor to job completion time is the cost of interprocess communication. To address performance and energy efficiency, several prior studies have targeted improvements in communication locality. To meet this goal, they derive a mapping of MPI processes to system nodes that reduces communication cost. However, such approaches disregard the effect of system failures. In this work, we propose a resource allocation approach for MPI jobs that considers both high performance and error resilience. Our approach, named Communication Profile, Topology and node Failure (CPTF), takes into account the application's communication profile, the system topology and the node failure probability when assigning job processes to nodes. We evaluate variants of CPTF through simulations of two MPI applications, one with a regular communication pattern (LAMMPS) and one with an irregular one (NPB-DT). In both cases, the variant of CPTF that strives to avoid failure-prone nodes and communication paths achieves a lower time to complete job batches than the default resource allocation policy of Slurm. It also exhibits the lowest ratio of aborted jobs. The average improvement in batch completion time is 67% for NPB-DT and 34% for LAMMPS.
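The idea of combining a communication profile, a topology and node failure probabilities in one placement decision can be illustrated with a small sketch. This is not the CPTF algorithm from the paper, whose details the abstract does not give. It is a toy exhaustive search under stated assumptions: one process per node, independent node failures, hop distance as the communication cost, and a hypothetical score equal to communication cost divided by the probability that every selected node survives. All function names (`comm_cost`, `survival_probability`, `best_placement`) are invented for illustration.

```python
from itertools import permutations
from math import prod


def comm_cost(mapping, comm, dist):
    """Sum of traffic volume times hop distance over all process pairs."""
    n = len(mapping)
    return sum(
        comm[i][j] * dist[mapping[i]][mapping[j]]
        for i in range(n)
        for j in range(i + 1, n)
    )


def survival_probability(mapping, p_fail):
    """Probability that no node used by the job fails.

    Assumes node failures are independent (an illustrative simplification).
    """
    return prod(1.0 - p_fail[node] for node in set(mapping))


def best_placement(comm, dist, p_fail):
    """Try every one-process-per-node mapping and keep the lowest score.

    The score (cost / survival probability) is a hypothetical objective,
    chosen only to show how the three inputs can interact.
    """
    n_procs = len(comm)
    best_score, best_mapping = None, None
    for mapping in permutations(range(len(dist)), n_procs):
        surv = survival_probability(mapping, p_fail)
        if surv == 0.0:
            continue  # a node that is certain to fail is never usable
        score = comm_cost(mapping, comm, dist) / surv
        if best_score is None or score < best_score:
            best_score, best_mapping = score, mapping
    return best_mapping


# Toy input: 3 processes, 4 nodes on a line (hop distance = |a - b|).
comm = [[0, 10, 0], [10, 0, 1], [0, 1, 0]]          # traffic between processes
dist = [[abs(a - b) for b in range(4)] for a in range(4)]
p_fail = [0.5, 0.01, 0.01, 0.01]                    # node 0 is failure-prone

placement = best_placement(comm, dist, p_fail)
```

In this toy input, placing the job on nodes 0 to 2 or on nodes 1 to 3 gives the same communication cost, so only the failure term separates them and the search steers away from node 0. Exhaustive search is viable only for tiny instances; a practical allocator would need a heuristic.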
Pages: 241-248
Page count: 8