NUMARCK: Machine Learning Algorithm for Resiliency and Checkpointing

Cited by: 38
Authors
Chen, Zhengzhang [1 ]
Son, Seung Woo [1 ]
Hendrix, William [1 ]
Agrawal, Ankit [1 ]
Liao, Wei-keng [1 ]
Choudhary, Alok [1 ]
Affiliations
[1] Northwestern Univ, Dept Elect Engn & Comp Sci, Evanston, IL 60208 USA
Source
SC14: INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS | 2014
Keywords
COMPRESSION
DOI
10.1109/SC.2014.65
CLC number
TP301 [Theory, Methods]
Discipline code
081202
Abstract
Data checkpointing is an important fault tolerance technique in High Performance Computing (HPC) systems. As HPC systems move towards exascale, the storage space and time costs of checkpointing threaten to overwhelm not only the simulation but also the post-simulation data analysis. One common practice to address this problem is to apply compression algorithms to reduce the data size. However, traditional lossless compression techniques that look for repeated patterns are ineffective for scientific data, in which high-precision values make common patterns rare. This paper exploits the fact that in many scientific applications, the relative changes in data values from one simulation iteration to the next differ little from each other. Thus, capturing the distribution of relative changes in the data instead of storing the data itself allows us to incorporate the temporal dimension of the data and learn the evolving distribution of the changes. We show that an order of magnitude of data reduction becomes achievable within guaranteed user-defined error bounds for each data point. We propose NUMARCK, Northwestern University Machine learning Algorithm for Resiliency and ChecKpointing, which makes use of the emerging distributions of data changes between consecutive simulation iterations and encodes them into an indexing space that can be concisely represented. We evaluate NUMARCK using two production scientific simulations, FLASH and CMIP5, and demonstrate superior performance in terms of compression ratio and compression accuracy. More importantly, our algorithm allows users to specify the maximum tolerable error on a per-point basis, while compressing the data by an order of magnitude.
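The core idea in the abstract — encode each point's relative change between consecutive iterations as a small index into a table of representative change ratios, falling back to exact storage for points that no table entry approximates within the user's error bound — can be sketched as follows. This is a simplified illustration under my own assumptions, not the paper's algorithm: NUMARCK learns the bin boundaries from the evolving change distribution, whereas this sketch uses plain uniform binning, and the function names `compress`/`decompress` are hypothetical.

```python
import numpy as np

def compress(prev, curr, err=0.01, bits=8):
    # Hypothetical sketch, not the paper's implementation: quantize each
    # point's relative change into one of (2**bits - 1) uniform bins.
    # Index 0 is reserved for "incompressible" points that no bin
    # approximates within the error bound; those are stored exactly.
    nbins = 2 ** bits - 1
    ratio = np.zeros_like(curr)
    nz = prev != 0
    ratio[nz] = curr[nz] / prev[nz] - 1.0
    r = ratio[nz]
    lo, hi = (r.min(), r.max()) if r.size else (0.0, 0.0)
    width = (hi - lo) / nbins if hi > lo else 1.0
    idx = np.clip(((ratio - lo) / width).astype(np.int64), 0, nbins - 1)
    table = lo + (np.arange(nbins) + 0.5) * width   # bin centers
    decoded = prev * (1.0 + table[idx])
    # Per-point check guarantees the user-defined relative error bound.
    ok = nz & (np.abs(decoded - curr) <= err * np.abs(curr))
    indices = np.where(ok, idx + 1, 0).astype(np.uint8)  # assumes bits <= 8
    return indices, table, curr[~ok]

def decompress(prev, indices, table, exact):
    # Reconstruct from the previous iteration plus one index per point.
    out = np.empty_like(prev)
    ok = indices != 0
    out[ok] = prev[ok] * (1.0 + table[indices[ok] - 1])
    out[~ok] = exact
    return out
```

When the change distribution is well behaved, each double (8 bytes) is replaced by a one-byte index plus a shared table, which is where the order-of-magnitude reduction the abstract claims comes from; the per-point check is what makes the error bound a guarantee rather than an expectation.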
Pages: 733-744
Page count: 12
References
26 in total
[1] Agrawal A., 2013, HIGH PERFORMANCE BIG, P192.
[2] Bicer T., Yin J., Chiu D., Agrawal G., Schuchardt K. Integrating Online Compression to Accelerate Large-Scale Data Analytics Applications. IEEE 27th International Parallel and Distributed Processing Symposium (IPDPS 2013), 2013: 1205-1216.
[3] Burtscher M., Ratanaworabhan P. FPC: A High-Speed Compressor for Double-Precision Floating-Point Data. IEEE Transactions on Computers, 2009, 58(1): 18-31.
[4] Chen Z.Z., 2011, HPDC '11: Proceedings of the 20th International Symposium on High Performance Distributed Computing, P73.
[5] Chou C.H., Su M.C., Lai E. A New Cluster Validity Measure and Its Application to Image Compression. Pattern Analysis and Applications, 2004, 7(2): 205-220.
[6] Chou J.J., Piegl L.A. Data Reduction Using Cubic Rational B-Splines. IEEE Computer Graphics and Applications, 1992, 12(3): 60-68.
[7] Cover T.M., 2006, ELEMENTS INFORM THEO, 2nd ed.
[8] Donofrio D., Oliker L., Shalf J., Wehner M.F., Rowen C., Krueger J., Kamil S., Mohiyuddin M. Energy-Efficient Computing for Extreme-Scale Science. Computer, 2009, 42(11): 62-71.
[9] Frazier M.W., 1999, INTRO WAVELETS LINEA.
[10] Fryxell B., Olson K., Ricker P., Timmes F.X., Zingale M., Lamb D.Q., MacNeice P., Rosner R., Truran J.W., Tufo H. FLASH: An Adaptive Mesh Hydrodynamics Code for Modeling Astrophysical Thermonuclear Flashes. Astrophysical Journal Supplement Series, 2000, 131(1): 273-334.