A Tunable Holistic Resiliency Approach for High-Performance Computing Systems

Cited by: 3
Authors
Scott, Stephen L. [1]
Engelmann, Christian [1]
Vallee, Geoffroy R. [1]
Naughton, Thomas [1]
Tikotekar, Anand [1]
Ostrouchov, George [1]
Leangsuksun, Chokchai [2]
Naksinehaboon, Nichamon [2]
Nassar, Raja [2]
Paun, Mihaela [2]
Mueller, Frank [3]
Wang, Chao [3]
Nagarajan, Arun B. [3]
Varma, Jyothish [3]
Affiliations
[1] Oak Ridge Natl Lab, Oak Ridge, TN 37831 USA
[2] Louisiana Tech Univ, Ruston, LA 71270 USA
[3] N Carolina State Univ, Raleigh, NC 27695 USA
Funding
National Science Foundation (USA);
Keywords
Design; Measurement; Performance; Reliability;
DOI
10.1145/1594835.1504227
Chinese Library Classification
TP31 [Computer Software];
Discipline classification code
081202; 0835;
Abstract
In order to address anticipated high failure rates, resiliency characteristics have become an urgent priority for next-generation extreme-scale high-performance computing (HPC) systems. This poster describes our past and ongoing efforts in novel fault resilience technologies for HPC. Presented work includes proactive fault resilience techniques, system and application reliability models and analyses, failure prediction, transparent process- and virtual-machine-level migration, and trade-off models for combining preemptive migration with checkpoint/restart. This poster summarizes our work and puts all individual technologies into context with a proposed holistic fault resilience framework.
Pages: 305-306
Page count: 2
Related papers
50 in total
  • [21] High-Performance Processing of Text Queries with Tunable Pruned Term and Term Pair Indexes
    Broschart, Andreas
    Schenkel, Ralf
    ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2012, 30 (01)
  • [22] Imparting resiliency in biocomposite production systems: A system dynamics approach
    Piri, Imelda Saran
    Das, Oisik
    Hedenqvist, Mikael S.
    Vaisanen, Taneli
    Ikram, Shafaq
    Bhattacharyya, Debes
    JOURNAL OF CLEANER PRODUCTION, 2018, 179 : 450 - 459
  • [23] Parallel Colt: A High-Performance Java Library for Scientific Computing and Image Processing
    Wendykier, Piotr
    Nagy, James G.
    ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2010, 37 (03)
  • [24] Implementing an Affordable High-Performance Computing for Teaching-Oriented Computer Science Curriculum
    Abuzaghleh, Omar
    Goldschmidt, Kathleen
    Elleithy, Yasser
    Lee, Jeongkyu
    ACM TRANSACTIONS ON COMPUTING EDUCATION, 2013, 13 (01)
  • [25] The Long and Winding Road Toward Efficient High-Performance Computing
    Jalby, William
    Kuck, David
    Malony, Allen D.
    Masella, Michel
    Mazouz, Abdelhafid
    Popov, Mihail
    PROCEEDINGS OF THE IEEE, 2018, 106 (11) : 1985 - 2003
  • [26] RAID - HIGH-PERFORMANCE, RELIABLE SECONDARY STORAGE
    CHEN, PM
    LEE, EK
    GIBSON, GA
    KATZ, RH
    PATTERSON, DA
    ACM COMPUTING SURVEYS, 1994, 26 (02) : 145 - 185
  • [27] Reliability-oriented resource management for High-Performance Computing
    Massari, Giuseppe
    Peta, Miriam
    Campi, Alessandro
    Reghenzani, Federico
    Terraneo, Federico
    Agosta, Giovanni
    Fornaciari, William
    Ciesielski, Sebastian
    Kulczewski, Michal
    Piatek, Wojciech
    SUSTAINABLE COMPUTING-INFORMATICS & SYSTEMS, 2023, 39
  • [28] A Checkpoint of Research on Parallel I/O for High-Performance Computing
    Boito, Francieli Zanon
    Inacio, Eduardo C.
    Bez, Jean Luca
    Navaux, Philippe O. A.
    Dantas, Mario A. R.
    Denneulin, Yves
    ACM COMPUTING SURVEYS, 2018, 51 (02)
  • [29] A Memetic Approach to the Automatic Design of High-Performance Analog Integrated Circuits
    Liu, Bo
    Fernandez, Francisco V.
    Gielen, Georges
    Castro-Lopez, R.
    Roca, E.
    ACM TRANSACTIONS ON DESIGN AUTOMATION OF ELECTRONIC SYSTEMS, 2009, 14 (03)
  • [30] Ring-mesh: a scalable and high-performance approach for manycore accelerators
    Mazumdar, Somnath
    Scionti, Alberto
    JOURNAL OF SUPERCOMPUTING, 2020, 76 (09) : 6720 - 6752