A Tunable Holistic Resiliency Approach for High-Performance Computing Systems

Cited by: 3
Authors
Scott, Stephen L. [1]
Engelmann, Christian [1]
Vallee, Geoffroy R. [1]
Naughton, Thomas [1]
Tikotekar, Anand [1]
Ostrouchov, George [1]
Leangsuksun, Chokchai [2]
Naksinehaboon, Nichamon [2]
Nassar, Raja [2]
Paun, Mihaela [2]
Mueller, Frank [3]
Wang, Chao [3]
Nagarajan, Arun B. [3]
Varma, Jyothish [3]
Affiliations
[1] Oak Ridge Natl Lab, Oak Ridge, TN 37831 USA
[2] Louisiana Tech Univ, Ruston, LA 71270 USA
[3] N Carolina State Univ, Raleigh, NC 27695 USA
Funding
U.S. National Science Foundation
Keywords
Design; Measurement; Performance; Reliability
DOI
10.1145/1594835.1504227
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
In order to address anticipated high failure rates, resiliency characteristics have become an urgent priority for next-generation extreme-scale high-performance computing (HPC) systems. This poster describes our past and ongoing efforts in novel fault resilience technologies for HPC. Presented work includes proactive fault resilience techniques, system and application reliability models and analyses, failure prediction, transparent process- and virtual-machine-level migration, and trade-off models for combining preemptive migration with checkpoint/restart. This poster summarizes our work and puts all individual technologies into context with a proposed holistic fault resilience framework.
Pages: 305-306
Page count: 2