A Tunable Holistic Resiliency Approach for High-Performance Computing Systems

Cited by: 3
Authors
Scott, Stephen L. [1]
Engelmann, Christian [1]
Vallee, Geoffroy R. [1]
Naughton, Thomas [1]
Tikotekar, Anand [1]
Ostrouchov, George [1]
Leangsuksun, Chokchai [2]
Naksinehaboon, Nichamon [2]
Nassar, Raja [2]
Paun, Mihaela [2]
Mueller, Frank [3]
Wang, Chao [3]
Nagarajan, Arun B. [3]
Varma, Jyothish [3]
Affiliations
[1] Oak Ridge Natl Lab, Oak Ridge, TN 37831 USA
[2] Louisiana Tech Univ, Ruston, LA 71270 USA
[3] N Carolina State Univ, Raleigh, NC 27695 USA
Funding
U.S. National Science Foundation
Keywords
Design; Measurement; Performance; Reliability;
DOI
10.1145/1594835.1504227
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
In order to address anticipated high failure rates, resiliency characteristics have become an urgent priority for next-generation extreme-scale high-performance computing (HPC) systems. This poster describes our past and ongoing efforts in novel fault resilience technologies for HPC. Presented work includes proactive fault resilience techniques, system and application reliability models and analyses, failure prediction, transparent process- and virtual-machine-level migration, and trade-off models for combining preemptive migration with checkpoint/restart. This poster summarizes our work and puts all individual technologies into context with a proposed holistic fault resilience framework.
Pages: 305-306
Number of pages: 2