A Tunable Holistic Resiliency Approach for High-Performance Computing Systems

Cited by: 3
Authors
Scott, Stephen L. [1 ]
Engelmann, Christian [1 ]
Vallee, Geoffroy R. [1 ]
Naughton, Thomas [1 ]
Tikotekar, Anand [1 ]
Ostrouchov, George [1 ]
Leangsuksun, Chokchai [2 ]
Naksinehaboon, Nichamon [2 ]
Nassar, Raja [2 ]
Paun, Mihaela [2 ]
Mueller, Frank [3 ]
Wang, Chao [3 ]
Nagarajan, Arun B. [3 ]
Varma, Jyothish [3 ]
Affiliations
[1] Oak Ridge Natl Lab, Oak Ridge, TN 37831 USA
[2] Louisiana Tech Univ, Ruston, LA 71270 USA
[3] N Carolina State Univ, Raleigh, NC 27695 USA
Funding
National Science Foundation (USA)
Keywords
Design; Measurement; Performance; Reliability
DOI
10.1145/1594835.1504227
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
In order to address anticipated high failure rates, resiliency characteristics have become an urgent priority for next-generation extreme-scale high-performance computing (HPC) systems. This poster describes our past and ongoing efforts in novel fault resilience technologies for HPC. Presented work includes proactive fault resilience techniques, system and application reliability models and analyses, failure prediction, transparent process- and virtual-machine-level migration, and trade-off models for combining preemptive migration with checkpoint/restart. This poster summarizes our work and puts all individual technologies into context with a proposed holistic fault resilience framework.
Pages: 305-306
Page count: 2