Data-Aware Compression for HPC using Machine Learning

Authors
Plehn, Julius [1]
Fuchs, Anna [1]
Kuhn, Michael [2]
Luettgau, Jakob [3]
Ludwig, Thomas [4]
Affiliations
[1] Univ Hamburg, Hamburg, Germany
[2] Otto von Guericke Univ, Magdeburg, Germany
[3] Univ Tennessee, Knoxville, TN USA
[4] Deutsch Klimarechenzentrum GmbH, Hamburg, Germany
Keywords
compression; machine learning; file systems; HDF5
DOI
10.1145/3503646.3524294
Chinese Library Classification
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
While compression can provide significant storage and cost savings, its use within HPC applications is often only of secondary concern. This is in part due to the inflexibility of existing approaches where a single compression algorithm has to be used throughout the whole application but also because insights into the behaviour of the algorithms within the context of individual applications are missing. There are several different compression algorithms available, with each one also having a unique set of options. These options have a direct influence on the achieved performance and compression results. Furthermore, the algorithms and options to use for a given dataset are highly dependent on the characteristics of said dataset. This paper explores how machine learning can help with identifying fitting compression algorithms with corresponding options based on actual data structure encountered during I/O. In order to do so, a data collection and training pipeline is introduced. Inferencing is performed during regular application runs and shows promising results. Moreover, it provides valuable insights into the benefits of using certain compression algorithms and options for specific data. Further investigations into more advanced machine learning techniques and a deeper integration into existing I/O paths will provide additional benefits.
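The data dependence described in the abstract can be illustrated with a minimal sketch. This is not the paper's pipeline: it uses only the compressors in Python's standard library (zlib, bz2, lzma), and the names CANDIDATES, byte_entropy and best_setting are hypothetical. It exhaustively tries a small set of algorithm/option pairs on one buffer, which is the kind of labelled measurement a training pipeline could collect, and computes one simple data feature (byte entropy) that a model might use as input.

```python
import bz2
import lzma
import math
import zlib
from collections import Counter

# Hypothetical candidate set: (algorithm name, compress callable, option value).
CANDIDATES = [
    ("zlib", lambda data, level: zlib.compress(data, level), 1),
    ("zlib", lambda data, level: zlib.compress(data, level), 9),
    ("bz2", lambda data, level: bz2.compress(data, level), 9),
    ("lzma", lambda data, level: lzma.compress(data, preset=level), 6),
]


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def best_setting(data: bytes):
    """Try every candidate and return (name, option, ratio) with the highest
    compression ratio, where ratio = original size / compressed size."""
    results = []
    for name, compress, level in CANDIDATES:
        ratio = len(data) / len(compress(data, level))
        results.append((name, level, ratio))
    return max(results, key=lambda item: item[2])
```

Calling best_setting on a highly repetitive buffer and on a buffer of random bytes gives very different ratios, and the entropy feature separates the two cases. A learned model would aim to predict the winning pair from such features without running every compressor at I/O time.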
Pages: 62-69
Page count: 8