Efficient detection of silent data corruption in HPC applications with synchronization-free message verification

Cited by: 2
Authors
Zhang, Guozhen [1]
Liu, Yi [1]
Yang, Hailong [1]
Qian, Depei [1]
Affiliations
[1] Beihang Univ, Sino German Joint Software Inst, Beijing, Peoples R China
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China;
Keywords
High-performance computing; Silent data corruption detection; Modular redundancy; Synchronization-free; Message verification; DRAM ERRORS; REPLICATION; SOFTWARE;
DOI
10.1007/s11227-021-03892-4
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
High-performance computing (HPC) is moving toward the exascale era. However, silent data corruption (SDC), which manifests as bit flips, can have disastrous consequences for scientific computation and jeopardizes the reliability of HPC at large scale. The most commonly used methods for addressing SDC are based on modular redundancy, which usually requires keeping execution progress consistent between replicas through synchronization and performing additional message transmission and comparison during program execution. Although such methods can detect SDC with high recall, they can introduce significant performance overhead and even stall execution progress at large scale. To our knowledge, this paper proposes the first SDC detection solution that requires neither synchronization nor additional message transmission between replicas. It combines message logging with an innovative asynchronous message comparison mechanism, which uses specialized service routines (Data-Analytic-Service, DAS) to perform progress comparison without interfering with target program execution. In addition, our solution adopts a distributed parallel architecture to run DAS and uses an innovative reference mechanism based on a single non-deterministic event to guarantee consistent execution across replicas. We implemented a user-level prototype, termed synchronization-free SDC detection (SFSD). Experimental results on the Tianhe-2 supercomputer show that SFSD is effective in detecting SDC, with low performance overhead (within 10%) and an acceptable recall rate. Moreover, SFSD exhibits good scalability when applied to large-scale program executions.
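The general idea of combining message logging with asynchronous comparison can be illustrated with a minimal sketch. This is not the SFSD implementation: the names `MessageLog`, `compare_logs`, and `flip_bit`, the use of SHA-256 digests, and the keying of entries by (rank, sequence number) are all assumptions made for illustration. Each replica appends a digest of every outgoing message to its own log, and a separate routine later compares only the entries both replicas have already produced, so neither replica waits for the other.

```python
import hashlib
from collections import defaultdict


def digest(payload: bytes) -> str:
    """Fixed-size fingerprint of a message payload (SHA-256 is an assumed choice)."""
    return hashlib.sha256(payload).hexdigest()


class MessageLog:
    """Hypothetical per-replica log of message digests, keyed by (rank, seq)."""

    def __init__(self):
        self.entries = {}
        self._next_seq = defaultdict(int)

    def record(self, rank: int, payload: bytes) -> None:
        # Number messages per sending rank so replicas can be matched up later.
        seq = self._next_seq[rank]
        self._next_seq[rank] += 1
        self.entries[(rank, seq)] = digest(payload)


def compare_logs(log_a: MessageLog, log_b: MessageLog):
    """Return the keys logged by both replicas whose digests differ.

    Only the common prefix of progress is checked, so a replica that is
    ahead of the other is never treated as a mismatch.
    """
    common = log_a.entries.keys() & log_b.entries.keys()
    return sorted(k for k in common if log_a.entries[k] != log_b.entries[k])


def flip_bit(payload: bytes, bit: int) -> bytes:
    """Simulate a silent bit flip in a message payload."""
    data = bytearray(payload)
    data[bit // 8] ^= 1 << (bit % 8)
    return bytes(data)


if __name__ == "__main__":
    replica_a, replica_b = MessageLog(), MessageLog()
    for step in range(3):
        msg = step.to_bytes(8, "little")
        replica_a.record(0, msg)
        # Corrupt the second message on replica B only.
        replica_b.record(0, flip_bit(msg, 5) if step == 1 else msg)
    print(compare_logs(replica_a, replica_b))  # prints [(0, 1)]
```

In this sketch the comparison runs after the fact over stored digests; a real system would have to run it concurrently with the application and handle non-deterministic message ordering, which the abstract attributes to the DAS routines and the reference mechanism.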
Pages: 1381-1408
Page count: 28