Energy Analysis and Optimization for Resilient Scalable Linear Systems

Cited by: 5
Authors
Miao, Zheng [1]
Calhoun, Jon [1]
Ge, Rong [1]
Affiliations
[1] Clemson Univ, Clemson, SC 29631 USA
Source
2018 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER) | 2018
Funding
National Science Foundation (USA);
Keywords
Resilience; Energy-Efficiency; Forward-Recovery; HPC; FAULT-TOLERANCE; PERFORMANCE;
DOI
10.1109/CLUSTER.2018.00015
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812 ;
Abstract
Exascale computing must address energy efficiency and resilience simultaneously, because power limits constrain scalability and faults become more common. Unfortunately, energy efficiency and resilience have traditionally been studied in isolation, and optimizing one typically degrades the other. To deliver the promised performance within the given power budget, exascale computing mandates a deep understanding of the interplay among energy efficiency, resilience, and scalability. In this work, we propose novel methods to analyze and optimize the costs of resilience techniques, including checkpoint-restart and forward recovery, for large sparse linear system solvers. In particular, we present experimental and analytical methods to quantify the time and energy costs of recovery schemes on computer clusters. We further develop and prototype performance optimization and power management strategies to improve energy efficiency. Experimental results show that recovery schemes incur different time and energy overheads and that optimization techniques significantly reduce these overheads. This work suggests that resilience techniques should be adaptively adjusted to a given fault rate, system size, and power budget.
Pages: 24-34
Page count: 11