An End-to-end and Adaptive I/O Optimization Tool for Modern HPC Storage Systems

Cited by: 10
Authors
Yang, Bin [1, 4]
Zou, Yanliang [2, 4]
Liu, Weiguo [1, 4]
Xue, Wei [3, 4]
Affiliations
[1] Shandong Univ, Sch Software, Jinan, Peoples R China
[2] ShanghaiTech Univ, Sch Informat Sci & Technol, Shanghai, Peoples R China
[3] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
[4] Natl Supercomp Ctr Wuxi, Wuxi, Jiangsu, Peoples R China
Source
2022 IEEE 36TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS 2022) | 2022
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China;
Keywords
I/O modeling; auto-tuning; load imbalance; resource allocation; Sunway TaihuLight;
DOI
10.1109/IPDPS53621.2022.00128
Chinese Library Classification number
TP3 [Computing Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Real-world large-scale applications put increasing pressure on the storage services of modern supercomputers. Supercomputers have been introducing new storage devices and technologies to meet the performance requirements of various applications, leading to more complicated architectures. The high I/O demand of applications, combined with complicated and shared storage architectures, makes issues such as unbalanced load, I/O interference, system parameter misconfiguration, and node performance degradation more frequent. It is therefore challenging both to achieve high application-level I/O performance and to use scarce storage resources efficiently. We propose AIOT, an end-to-end and adaptive I/O optimization tool for HPC storage systems, which introduces effective I/O performance modeling and several active tuning strategies to improve both the I/O performance of applications and the utilization of storage resources. AIOT provides a global view of the whole storage system and searches for the optimal end-to-end I/O path through flow network modeling. Moreover, AIOT tunes system parameters across multiple layers of the storage system using automatically identified application I/O behaviors and the current workload status of the storage system. Through a number of real-world cases, we verified the effectiveness of AIOT in balancing I/O load, resolving I/O interference, improving I/O performance by configuring appropriate system parameters, and avoiding I/O performance degradation caused by abnormal nodes. AIOT has saved over ten million core-hours during its deployment on Sunway TaihuLight since July 2021. AIOT is also capable of managing other I/O optimization methods across various storage platforms.
Pages: 1294-1304
Page count: 11