Resilience-Aware Resource Management for Exascale Computing Systems

Cited by: 5
Authors
Dauwe, Daniel [1]
Pasricha, Sudeep [1, 2]
Maciejewski, Anthony A. [1]
Siegel, Howard Jay [1, 2]
Affiliations
[1] Colorado State Univ, Dept Elect & Comp Engn, Ft Collins, CO 80523 USA
[2] Colorado State Univ, Dept Comp Sci, Ft Collins, CO 80523 USA
Source
IEEE TRANSACTIONS ON SUSTAINABLE COMPUTING | 2018 / Vol. 3 / Issue 04
Keywords
Exascale resilience; checkpoint restart; multilevel checkpointing; message logging; fault tolerance; HPC resource management;
DOI
10.1109/TSUSC.2018.2797890
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
With the increases in complexity and number of nodes in large-scale high performance computing (HPC) systems over time, the probability of applications experiencing runtime failures has increased significantly. Projections indicate that exascale-sized systems are likely to operate with mean time between failures (MTBF) of as little as a few minutes. Several strategies have been proposed in recent years for enabling systems of these extreme sizes to be resilient against failures. This work provides a comparison of four state-of-the-art HPC resilience protocols that are being considered for use in exascale systems. We explore the behavior of each resilience protocol operating under the simulated execution of a diverse set of applications and study the performance degradation that a large-scale system experiences from the overhead associated with each resilience protocol as well as the re-computation needed to recover when a failure occurs. Using the results from these analyses, we examine how resource management on exascale systems can be improved by allowing the system to select the optimal resilience protocol depending upon each application's execution characteristics, as well as providing the system resource manager the ability to make scheduling decisions that are "resilience aware" through the use of more accurate execution time predictions.
Pages: 332-345
Page count: 14
Related papers
50 in total
  • [31] Deadline-aware Dynamic Resource Management in Serverless Computing Environments
    Mampage, Anupama
    Karunasekera, Shanika
    Buyya, Rajkumar
    21ST IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2021), 2021, : 483 - 492
  • [32] Network-Aware Resource Management Strategy in Cloud Computing Environments
    Abdelaal, Marwa A.
    Ebrahim, Gamal A.
    Anis, Wagdy R.
    PROCEEDINGS OF 2016 11TH INTERNATIONAL CONFERENCE ON COMPUTER ENGINEERING & SYSTEMS (ICCES), 2016, : 26 - 31
  • [33] Function-Aware Resource Management Framework for Serverless Edge Computing
    Ko, Haneul
    Pack, Sangheon
    IEEE INTERNET OF THINGS JOURNAL, 2023, 10 (02) : 1310 - 1319
  • [34] qCon: QoS-Aware Network Resource Management for Fog Computing
    Hong, Cheol-Ho
    Lee, Kyungwoon
    Kang, Minkoo
    Yoo, Chuck
    SENSORS, 2018, 18 (10)
  • [35] Sharing-Aware Resource Management Algorithms for Virtual Computing Environments
    Rampersaud, Safraz
    2015 IEEE INTERNATIONAL CONFERENCE ON CLOUD ENGINEERING (IC2E 2015), 2015, : 493 - 495
  • [36] ExaRD: introducing a framework for empowerment of resource discovery to support distributed exascale computing systems with high consistency
    Elham Adibi
    Ehsan Mousavi Khaneghah
    Cluster Computing, 2020, 23 : 3349 - 3369
  • [37] ExaRD: introducing a framework for empowerment of resource discovery to support distributed exascale computing systems with high consistency
    Adibi, Elham
    Khaneghah, Ehsan
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2020, 23 (04): : 3349 - 3369
  • [38] Resilience-aware design of interconnected supply chain networks with application to water-energy nexus
    Tsolas, Spyridon D.
    Hasan, M. M. Faruque
    AICHE JOURNAL, 2021, 67 (11)
  • [39] Security-Aware Resource Allocation for Mobile Cloud Computing Systems
    Liu, Yanchen
    Lee, Myung J.
    24TH INTERNATIONAL CONFERENCE ON COMPUTER COMMUNICATIONS AND NETWORKS ICCCN 2015, 2015,
  • [40] QoS-aware resource matching and recommendation for cloud computing systems
    Ding, Shuai
    Xia, Chengyi
    Cai, Qiong
    Zhou, Kaile
    Yang, Shanlin
    APPLIED MATHEMATICS AND COMPUTATION, 2014, 247 : 941 - 950