Resilience-Aware Resource Management for Exascale Computing Systems

Cited by: 5
Authors
Dauwe, Daniel [1]
Pasricha, Sudeep [1, 2]
Maciejewski, Anthony A. [1]
Siegel, Howard Jay [1, 2]
Affiliations
[1] Colorado State Univ, Dept Elect & Comp Engn, Ft Collins, CO 80523 USA
[2] Colorado State Univ, Dept Comp Sci, Ft Collins, CO 80523 USA
Source
IEEE Transactions on Sustainable Computing | 2018 / Vol. 3 / Issue 4
Keywords
Exascale resilience; checkpoint restart; multilevel checkpointing; message logging; fault tolerance; HPC resource management
DOI
10.1109/TSUSC.2018.2797890
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
With the increases in complexity and number of nodes in large-scale high performance computing (HPC) systems over time, the probability of applications experiencing runtime failures has increased significantly. Projections indicate that exascale-sized systems are likely to operate with mean time between failures (MTBF) of as little as a few minutes. Several strategies have been proposed in recent years for enabling systems of these extreme sizes to be resilient against failures. This work provides a comparison of four state-of-the-art HPC resilience protocols that are being considered for use in exascale systems. We explore the behavior of each resilience protocol operating under the simulated execution of a diverse set of applications and study the performance degradation that a large-scale system experiences from the overhead associated with each resilience protocol as well as the re-computation needed to recover when a failure occurs. Using the results from these analyses, we examine how resource management on exascale systems can be improved by allowing the system to select the optimal resilience protocol depending upon each application's execution characteristics, as well as providing the system resource manager the ability to make scheduling decisions that are "resilience aware" through the use of more accurate execution time predictions.
Pages: 332-345
Page count: 14