A visual performance analysis framework for task-based parallel applications running on hybrid clusters

Cited by: 13
Authors
Pinto, Vinicius Garcia [1, 2]
Schnorr, Lucas Mello [1, 2]
Stanisic, Luka [3]
Legrand, Arnaud [2]
Thibault, Samuel [4]
Danjean, Vincent [2]
Affiliations
[1] Fed Univ Rio Grande do Sul UFRGS, Inst Informat, Porto Alegre, RS, Brazil
[2] Univ Grenoble Alpes, Lab Informat Grenoble, Grenoble INP, Inria, CNRS, Grenoble, France
[3] Max Planck Comp & Data Facil, Garching, Germany
[4] Inria Bordeaux Sud Ouest, Bordeaux, France
Funding
European Union Horizon 2020
Keywords
Cholesky; heterogeneous platforms; high-performance computing; task-based applications; trace visualization; DESIGN;
DOI
10.1002/cpe.4472
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Programming paradigms in High-Performance Computing have been shifting toward task-based models that are capable of adapting readily to heterogeneous and scalable supercomputers. The performance of task-based applications depends heavily on the runtime's scheduling heuristics and its ability to exploit computing and communication resources. Unfortunately, traditional performance analysis strategies are ill-suited to understanding task-based runtime systems and applications: they expect regular behavior with distinct communication and computation phases, whereas task-based applications exhibit no clear phases. Moreover, the finer granularity of task-based applications typically induces stochastic behavior that leads to irregular structures that are difficult to analyze. Furthermore, combining application structure, scheduler, and hardware information is generally essential to understanding performance issues. This paper presents a flexible framework that combines several sources of information and lets users create custom visualization panels to understand and pinpoint performance problems caused by bad scheduling decisions in task-based applications. Three case studies using StarPU-MPI, a task-based multi-node runtime system, show how the framework can be used to study the performance of the well-known Cholesky factorization. Performance improvements include better task partitioning in the multi-(GPU, core) setting to get closer to theoretical lower bounds, improved MPI pipelining in the multi-(node, core, GPU) setting to reduce the slow start, and changes in the runtime system to increase MPI bandwidth, with gains of up to 13% in the total makespan.
Pages: 27