Fingerprinting the Checker Policies of Parallel File Systems

Cited by: 6
Authors
Han, Runzhou [1]
Zhang, Duo [1]
Zheng, Mai [1]
Affiliations
[1] Iowa State Univ, Ames, IA 50011 USA
Source
PROCEEDINGS OF 2020 IEEE/ACM FIFTH INTERNATIONAL PARALLEL DATA SYSTEMS WORKSHOP (PDSW 2020) | 2020
DOI
10.1109/PDSW51947.2020.00013
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
Parallel file systems (PFSes) play an essential role in high-performance computing. To ensure integrity, many PFSes are designed with a checker component, which serves as the last line of defense to bring a corrupted PFS back to a healthy state. Motivated by real-world incidents of PFS corruption, we perform a fine-grained study of the capability of PFS checkers in this paper. We apply type-aware fault injection to specific PFS structures, and examine the detection and repair policies of PFS checkers meticulously via a well-defined taxonomy. The study results on two representative PFS checkers show that they are able to handle a wide range of corruptions on important data structures. On the other hand, neither of them is perfect: there are multiple cases where the checkers may behave sub-optimally, leading to kernel panics, wrong repairs, etc. Our work has led to a new patch on Lustre. We hope to develop our methodology into a generic framework for analyzing the checkers of diverse PFSes, and enable more elegant designs of PFS checkers for reliable high-performance computing.
Pages: 46-51
Page count: 6