A Tunable Holistic Resiliency Approach for High-Performance Computing Systems

Cited by: 3
Authors
Scott, Stephen L. [1 ]
Engelmann, Christian [1 ]
Vallee, Geoffroy R. [1 ]
Naughton, Thomas [1 ]
Tikotekar, Anand [1 ]
Ostrouchov, George [1 ]
Leangsuksun, Chokchai [2 ]
Naksinehaboon, Nichamon [2 ]
Nassar, Raja [2 ]
Paun, Mihaela [2 ]
Mueller, Frank [3 ]
Wang, Chao [3 ]
Nagarajan, Arun B. [3 ]
Varma, Jyothish [3 ]
Affiliations
[1] Oak Ridge Natl Lab, Oak Ridge, TN 37831 USA
[2] Louisiana Tech Univ, Ruston, LA 71270 USA
[3] N Carolina State Univ, Raleigh, NC 27695 USA
Funding
U.S. National Science Foundation;
Keywords
Design; Measurement; Performance; Reliability;
DOI
10.1145/1594835.1504227
Chinese Library Classification
TP31 [Computer Software];
Discipline classification codes
081202 ; 0835 ;
Abstract
In order to address anticipated high failure rates, resiliency characteristics have become an urgent priority for next-generation extreme-scale high-performance computing (HPC) systems. This poster describes our past and ongoing efforts in novel fault resilience technologies for HPC. Presented work includes proactive fault resilience techniques, system and application reliability models and analyses, failure prediction, transparent process- and virtual-machine-level migration, and trade-off models for combining preemptive migration with checkpoint/restart. This poster summarizes our work and puts all individual technologies into context with a proposed holistic fault resilience framework.
Pages: 305-306
Page count: 2