PFAULT: A General Framework for Analyzing the Reliability of High-Performance Parallel File Systems

Cited by: 38
Authors
Cao, Jinrui [1]
Gatla, Om Rameshwar [1]
Zheng, Mai [1]
Dai, Dong [2]
Eswarappa, Vidya [2]
Mu, Yan [2]
Chen, Yong [2]
Affiliations
[1] New Mexico State Univ, Las Cruces, NM 88003 USA
[2] Texas Tech Univ, Lubbock, TX 79409 USA
Source
INTERNATIONAL CONFERENCE ON SUPERCOMPUTING (ICS 2018) | 2018
Funding
National Science Foundation (USA);
Keywords
Parallel file systems; reliability; high performance computing;
DOI
10.1145/3205289.3205302
Chinese Library Classification number
TP301 [Theory, methods];
Discipline classification code
081202 ;
Abstract
High-performance parallel file systems (PFSes) are of prime importance today. However, despite this importance, their reliability is much less studied than that of local storage systems, largely due to the lack of an effective analysis methodology. In this paper, we introduce PFAULT, a general framework for analyzing the failure handling of PFSes. PFAULT automatically emulates the failure state of each storage device in the target PFS based on a set of well-defined fault models, and enables systematic analysis of the recoverability of the PFS under faults. To demonstrate its practicality, we apply PFAULT to study Lustre, one of the most widely used PFSes. Our analysis reveals a number of cases where Lustre's checking and repairing utility LFSCK fails with unexpected symptoms (e.g., I/O error, hang, reboot). Moreover, with the help of PFAULT, we are able to identify a resource leak problem where a portion of Lustre's internal namespace and storage space becomes unusable even after running LFSCK. On the other hand, we also verify that the latest Lustre has made noticeable improvements in failure handling compared with a previous version. We hope our study and framework can help improve PFSes for reliable high-performance computing.
Pages: 1 / 11
Page count: 11