Prediction of the impact of network switch utilization on application performance via active measurement

Cited by: 1
Authors
Casas, Marc [1]
Bronevetsky, Greg [2]
Affiliations
[1] Univ Politecn Cataluna, Barcelona Supercomp Ctr, Barcelona, Spain
[2] Google Corp, Mountain View, CA USA
Funding
European Research Council
Keywords
Performance modeling; Resource sharing; Measurement techniques; MPI APPLICATIONS; DESIGN;
DOI
10.1016/j.parco.2017.06.005
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Although one of the key characteristics of High Performance Computing (HPC) infrastructures is their fast interconnection networks, the increasingly large computational capacity of HPC nodes and the resulting growth of data exchanges between them constitute a potential performance bottleneck. To achieve high performance in parallel executions despite network limitations, application developers require tools to measure their codes' network utilization and to correlate the network's communication capacity with the performance of their applications. This paper presents a new methodology to measure and understand network behavior. The approach is based on two different techniques that inject extra network communication. The first technique measures the fraction of the network that is utilized by a software component (an application or an individual task) to determine the existence and severity of network contention. The second injects large amounts of network traffic to study how applications behave on less capable or fully utilized networks. The measurements obtained by these techniques are combined to predict the performance slowdown suffered by a particular software component when it shares the network with others. Predictions are obtained by considering several training sets that use raw data from the two measurement techniques. The sensitivity of the predictions to the training set size is evaluated by considering 12 different scenarios. Our results show that the optimum training set size is around 200 training points. When optimal data sets are used, the proposed methodology provides predictions with an average error of 9.6% across 36 scenarios. © 2017 Elsevier B.V. All rights reserved.
Pages: 38-56
Number of pages: 19