Simulation-based optimization and sensibility analysis of MPI applications: Variability matters

Cited by: 2
Authors
Cornebize, Tom [1]
Legrand, Arnaud [1]
Affiliations
[1] Univ Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, F-38000 Grenoble, France
Keywords
Simulation; Validation; Sensibility analysis; SimGrid; HPL
DOI
10.1016/j.jpdc.2022.04.002
Chinese Library Classification (CLC) number
TP301 [Theory, methods]
Discipline code
081202
Abstract
Finely tuning MPI applications and understanding the influence of key parameters (number of processes, granularity, collective operation algorithms, virtual topology, and process placement) is critical to obtain good performance on supercomputers. Because running applications at scale is resource-intensive, running them solely to optimize their performance is particularly costly. Inexpensive yet faithful predictions of expected performance would greatly help researchers and system administrators. The methodology we propose decouples the complexity of the platform, which is captured through statistical models of the performance of its main components (MPI communications, BLAS operations), from the complexity of adaptive applications, by emulating the application and skipping the regular non-MPI parts of the code. We demonstrate the capability of our method with High-Performance Linpack (HPL), the benchmark used to rank supercomputers in the TOP500, which requires careful tuning. We briefly present (1) how the open-source version of HPL can be slightly modified to allow a fast emulation, at the scale of a supercomputer, on a single commodity server. Then we present (2) an extensive (in)validation study that compares simulation with real experiments and demonstrates that we can consistently predict the performance of HPL within a few percent. This study allows us to identify the main modeling pitfalls that need to be considered (e.g., spatial and temporal node variability, or network heterogeneity and irregular behavior). Finally, we show (3) how our "surrogate" allows studying several subtle HPL parameter optimization problems while accounting for uncertainty on the platform. (c) 2022 Elsevier Inc. All rights reserved.
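The abstract's core idea, skipping a compute kernel and instead advancing simulated time by a duration drawn from a per-node statistical model, can be illustrated with a minimal sketch. It assumes a simple linear model of dgemm duration in M·N·K with Gaussian noise whose spread also grows with M·N·K; the model form, the coefficients, and the names `NodeDgemmModel`, `EmulatedRank`, and `total_time` are all hypothetical. This is not the paper's fitted model and not SimGrid's API; it only shows how per-node (spatial) variability changes predicted time.

```python
import random
from dataclasses import dataclass


@dataclass
class NodeDgemmModel:
    """Hypothetical per-node statistical model of dgemm duration (seconds).

    Assumption for illustration: mean and noise both scale linearly with M*N*K.
    """
    alpha: float       # mean seconds per unit of M*N*K (made-up value)
    beta: float        # fixed per-call overhead in seconds
    sigma_coef: float  # noise std-dev per unit of M*N*K (heteroscedastic)

    def mean(self, m: int, n: int, k: int) -> float:
        return self.alpha * m * n * k + self.beta

    def sample(self, m: int, n: int, k: int, rng: random.Random) -> float:
        mu = self.mean(m, n, k)
        sd = self.sigma_coef * m * n * k
        # Normal noise, clamped at zero so a duration is never negative.
        return max(0.0, rng.gauss(mu, sd))


class EmulatedRank:
    """One emulated process: tracks simulated time instead of doing the math."""

    def __init__(self, model: NodeDgemmModel, seed: int) -> None:
        self.model = model
        self.rng = random.Random(seed)  # seeded for reproducible runs
        self.now = 0.0                  # simulated clock, in seconds

    def dgemm(self, m: int, n: int, k: int) -> float:
        # Skip the real kernel; advance the clock by a sampled duration.
        duration = self.model.sample(m, n, k, self.rng)
        self.now += duration
        return duration


def total_time(model: NodeDgemmModel, calls: int, size: int, seed: int) -> float:
    """Simulated time for `calls` square dgemm calls of dimension `size`."""
    rank = EmulatedRank(model, seed)
    for _ in range(calls):
        rank.dgemm(size, size, size)
    return rank.now


if __name__ == "__main__":
    # Spatial variability: node B is about 5% slower than node A (made-up numbers).
    node_a = NodeDgemmModel(alpha=1.0e-11, beta=1.0e-6, sigma_coef=2.0e-13)
    node_b = NodeDgemmModel(alpha=1.05e-11, beta=1.0e-6, sigma_coef=2.0e-13)
    t_a = total_time(node_a, calls=100, size=1000, seed=42)
    t_b = total_time(node_b, calls=100, size=1000, seed=42)
    print(f"node A: {t_a:.4f} s, node B: {t_b:.4f} s")
```

Because both ranks use the same seed, they see identical noise draws, so the gap between their totals reflects only the difference in node speed. In a bulk-synchronous code such as HPL, the slower node would then tend to delay its peers at each synchronization point, which is one reason node variability matters for prediction.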
Pages: 111-125
Number of pages: 15