Runtime-Guided ECC Protection using Online Estimation of Memory Vulnerability

Cited by: 1
Authors
Jaulmes, Luc [1]
Moreto, Miquel [1]
Valero, Mateo [1]
Erez, Mattan [2]
Casas, Marc [1]
Affiliations
[1] Univ Politecn Catalunya UPC, Barcelona Supercomp Ctr BSC, Barcelona, Spain
[2] Univ Texas Austin, Elect & Comp Engn Dept, Austin, TX 78712 USA
Source
PROCEEDINGS OF SC20: THE INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC20) | 2020
Keywords
Vulnerability; Runtime Systems; Error Correcting Codes; DRAM
DOI
10.1109/SC41405.2020.00080
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Diminishing reliability of semiconductor technologies and decreasing power budgets per component hinder designing next-generation high performance computing (HPC) systems. Both constraints strongly impact memory subsystems, as DRAM main memory accounts for up to 30 to 50 percent of a node's overall power consumption, and is the subsystem that is most subject to faults. Improving reliability requires stronger error correcting codes (ECCs), which incur additional power and storage costs. It is critical to develop strategies to uphold memory reliability while minimising these costs, with the goal of improving the power efficiency of computing machines. We introduce a methodology to dynamically estimate the vulnerability of data, and adjust ECC protection accordingly. Our methodology relies on information readily available to runtime systems in task-based dataflow programming models, and the existing Virtualized Error Correcting Code (VECC) schemes to provide adaptable protection. Guiding VECC using vulnerability estimates offers a wide range of reliability-redundancy trade-offs, as reliable as using expensive offline profiling for guidance and up to 25% safer than VECC without guidance. Runtime-guided VECC is more efficient than a stronger uniform ECC, reducing DIMM lifetime failure from 1.84% down to 1.26% while increasing DRAM energy consumption by only 1.03x.
Pages: 14
Related papers
37 in total
[1]  
Advanced Micro Devices (AMD) Inc., 2016, PUBLICATIONS, V50742
[2]  
Alameldeen AR, 2011, ISCA 2011: PROCEEDINGS OF THE 38TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, P461, DOI 10.1145/2024723.2000118
[3]  
[Anonymous], 2013, OpenMP Application Program Interface
[4]  
Bautista-Gomez Leonardo, 2016, P INT C HIGH PERF CO, V55, P1, DOI 10.1109/SC.2016.54
[5]   Defect Analysis and Cost-effective Resilience Architecture for Future DRAM Devices [J].
Cha, Sanguhn ;
Seongil, O. ;
Shin, Hyunsung ;
Hwang, Sangjoon ;
Park, Kwangil ;
Jang, Seong Jin ;
Choi, Joo Sun ;
Jin, Gyo Young ;
Son, Young Hoon ;
Cho, Hyunyoon ;
Ahn, Jung Ho ;
Kim, Nam Sung .
2017 23RD IEEE INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTER ARCHITECTURE (HPCA), 2017, :61-72
[6]   Parallel programmability and the Chapel language [J].
Chamberlain, B. L. ;
Callahan, D. ;
Zima, H. P. .
INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2007, 21 (03) :291-312
[7]  
Chandrasekar K., 2011, Proceedings of the 2011 14th Euromicro Conference on Digital System Design. Architectures, Methods and Tools. (DSD 2011), P99, DOI 10.1109/DSD.2011.17
[8]  
Dell T. J., 1997, CISC VIS NETW IND GL
[9]   OmpSs: A PROPOSAL FOR PROGRAMMING HETEROGENEOUS MULTI-CORE ARCHITECTURES [J].
Duran, Alejandro ;
Ayguade, Eduard ;
Badia, Rosa M. ;
Labarta, Jesus ;
Martinell, Luis ;
Martorell, Xavier ;
Planas, Judit .
PARALLEL PROCESSING LETTERS, 2011, 21 (02) :173-193
[10]   Reliability-Aware Data Placement for Heterogeneous Memory Architecture [J].
Gupta, Manish ;
Sridharan, Vilas ;
Roberts, David ;
Prodromou, Andreas ;
Venkat, Ashish ;
Tullsen, Dean ;
Gupta, Rajesh .
2018 24TH IEEE INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTER ARCHITECTURE (HPCA), 2018, :583-595