Towards End-to-end SDC Detection for HPC Applications Equipped with Lossy Compression

Cited by: 10
Authors
Li, Sihuan [1]
Di, Sheng [2]
Zhao, Kai [1]
Liang, Xin [3]
Chen, Zizhong [1]
Cappello, Franck [2, 4]
Affiliations
[1] Univ Calif Riverside, Riverside, CA 92521 USA
[2] Argonne Natl Lab, Lemont, IL USA
[3] Oak Ridge Natl Lab, Oak Ridge, TN USA
[4] Univ Illinois, Urbana, IL USA
Source
2020 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2020) | 2020
Funding
U.S. National Science Foundation
DOI
10.1109/CLUSTER49012.2020.00043
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Large-scale high-performance computing (HPC) applications widely demand and use data reduction techniques because of the vast volumes of data they produce and store for post-analysis. Because lossless compressors achieve very limited compression ratios, error-bounded lossy compression has become an indispensable part of many HPC applications: it can significantly reduce scientific data volume with user-acceptable data distortion. Since large-scale HPC applications equipped with lossy compression must handle vast volumes of data, soft errors, or silent data corruptions (SDCs), are non-negligible. Although SDC detection techniques have been studied for years, no studies have targeted HPC applications with lossy compression, leaving a significant gap between these applications and confidence in their execution results. To fill this gap, this paper proposes a couple of SDC detection strategies for scientific simulations with lossy compression. Experimental results on 4 widely used scientific simulation datasets show that promising detection ability can still be obtained with two popular lossy compressors. Our parallel experiments with up to 1,024 cores confirm that the time overhead can be limited to within 7.9%.
Pages: 326-336
Number of pages: 11