PREVENTIVE MIGRATION VS. PREVENTIVE CHECKPOINTING FOR EXTREME SCALE SUPERCOMPUTERS

Cited by: 14
Authors
Cappello, Franck [1]
Casanova, Henri [2]
Robert, Yves [3]
Affiliations
[1] INRIA Illinois Joint Lab Petascale Comp, Urbana, IL USA
[2] Univ Hawaii Manoa, Dept Informat & Comp Sci, Honolulu, HI USA
[3] Ecole Normale Super Lyon, Lab Informat Parallelisme, Lyon, France
Keywords
failure prediction; checkpointing; migration; parallel jobs
DOI
10.1142/S0129626411000126
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
An alternative to classical fault-tolerant approaches for large-scale clusters is failure avoidance, by which the occurrence of a fault is predicted and a preventive measure is taken. We develop analytical performance models for two types of preventive measures: preventive checkpointing and preventive migration. We instantiate these models for platform scenarios representative of current and future technology trends. We find that preventive migration is the better approach in the short term by orders of magnitude. However, in the longer term, both approaches have comparable merit with a marginal advantage for preventive checkpointing. We also develop an analytical model of the performance of fault tolerance based on periodic checkpointing and compare this approach to both failure avoidance techniques. We find that this comparison is sensitive to the nature of the stochastic distribution of the time between failures, and that failure avoidance is likely inferior to fault tolerance in the long term. Regardless, our results show that each approach is likely to achieve poor utilization for large-scale platforms (e.g., 2^20 nodes) unless the mean time between failures is large. We show how bounding parallel job size improves utilization, but conclude that achieving good utilization in future large-scale platforms will require a combination of techniques.
Pages: 111-132
Page count: 22