Automating Workload Analysis of Large-Scale Supercomputer Systems

Authors
Shvets, P. A. [1, 2]
Voevodin, V. V. [1, 2]
Zhumatiy, S. A. [1]
Affiliations
[1] Lomonosov Moscow State Univ, Moscow 119991, Russia
[2] Moscow Ctr Fundamental & Appl Math, Moscow 119991, Russia
Funding
Russian Foundation for Basic Research
Keywords
supercomputing; high-performance computing; workload analysis; efficiency; data analysis; monitoring data; system software
DOI
10.1134/S1995080221070210
Chinese Library Classification (CLC) number
O1 [Mathematics]
Discipline classification codes
0701; 070101
Abstract
The architecture of modern supercomputers is extremely complex, so it is exceedingly difficult to monitor and maintain the efficiency of their functioning. Even if it is possible to collect the necessary data on the operation of all important supercomputer components, how can one avoid drowning in this "sea of information" and missing the onset of a critical situation? This requires automating the workload analysis process. One possible solution is to create a set of rules that automatically detect critical situations, or cases of a significant decrease in the efficiency of supercomputer functioning, and notify supercomputer administrators about them. Such an approach allows the administrator to quickly identify the most interesting and important situations and to correctly prioritize the workload analysis process as a whole. This article describes the process of developing a set of 19 rules, each of which determines a way to detect the onset of a certain critical situation, describes the possible causes of its occurrence, and specifies the criticality of the situation that has arisen. These rules allow monitoring different aspects of supercomputer behavior: the efficiency of using application packages, the operation of the queue system, the load and availability of service servers, the presence of global performance issues in user applications, and the peculiarities of using separate partitions of the supercomputer. The developed rules formed the basis of a software solution that was implemented and evaluated on the petaflop-level Lomonosov-2 supercomputer.
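The rule structure described in the abstract (a detection condition, a description of possible causes, and a criticality level) might be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the `Rule` and `Criticality` types, the two example rules, the metric names, and the thresholds are all hypothetical and are not taken from the article.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, Dict, List


class Criticality(IntEnum):
    """Hypothetical criticality scale; the article's actual levels may differ."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Rule:
    """One rule: how to detect a situation, why it may occur, how critical it is."""
    name: str
    detect: Callable[[Dict[str, float]], bool]
    possible_causes: str
    criticality: Criticality


# Two made-up rules; metric names and thresholds are assumptions for illustration.
RULES: List[Rule] = [
    Rule(
        name="service_server_overloaded",
        detect=lambda m: m.get("service_load_avg", 0.0) > 0.9,
        possible_causes="Too many user sessions or a runaway process",
        criticality=Criticality.MEDIUM,
    ),
    Rule(
        name="queue_stalled",
        detect=lambda m: m.get("queued_jobs", 0) > 0
        and m.get("started_jobs_last_hour", 0) == 0,
        possible_causes="Scheduler failure or a blocked partition",
        criticality=Criticality.HIGH,
    ),
]


def evaluate(rules: List[Rule], metrics: Dict[str, float]) -> List[Rule]:
    """Return the triggered rules, most critical first."""
    triggered = [rule for rule in rules if rule.detect(metrics)]
    return sorted(triggered, key=lambda rule: rule.criticality, reverse=True)


if __name__ == "__main__":
    snapshot = {"queued_jobs": 12, "started_jobs_last_hour": 0, "service_load_avg": 0.95}
    for rule in evaluate(RULES, snapshot):
        print(f"[{rule.criticality.name}] {rule.name}: {rule.possible_causes}")
```

Sorting the triggered rules by criticality reflects the prioritization goal stated in the abstract: the administrator sees the most severe situations first.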
Pages: 1547-1559
Number of pages: 13
Published in: Lobachevskii Journal of Mathematics, 2021, 42: 1547-1559