Fault Tolerance Performance Evaluation of Large-Scale Distributed Storage Systems: HDFS and Ceph Case Study

Cited by: 0
Authors
Arafa, Yehia [1]
Barai, Atanu [1]
Zheng, Mai [2]
Badawy, Abdel-Hameed A. [1, 3]
Affiliations
[1] New Mexico State Univ, Klipsch Sch Elect & Comp Engn, Las Cruces, NM 88003 USA
[2] Iowa State Univ, Dept Elect & Comp Engn, Ames, IA USA
[3] Los Alamos Natl Lab, Los Alamos, NM USA
Source
2018 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC) | 2018
Keywords
Fault Tolerance; Performance Evaluation; HDFS; Ceph; Distributed Storage Systems;
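DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract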
Large-scale distributed systems are collections of loosely coupled computers interconnected by a communication network. They are now an integral part of everyday life with the development of large web applications, social networks, peer-to-peer systems, wireless sensor networks, and many more. At such a scale, hardware components by themselves are prone to failure. Therefore, one key challenge in designing distributed storage systems is how to tolerate faults. To this end, fault tolerance mechanisms such as replication have been widely used for decades to provide high availability. More recently, many systems have started to support erasure coding for fault tolerance, which is expected to achieve high reliability at a lower storage cost than replication. However, the reduced storage overhead comes at the cost of more complicated recovery, which hurts performance. In this paper, we study the fault tolerance mechanisms of two representative distributed file systems: HDFS and Ceph. In addition to traditional replication, both HDFS and Ceph support erasure coding in their latest versions. We evaluate the replication and erasure coding implementations in both systems using standard benchmarks and fault injection, and quantitatively measure the performance and storage overhead. Our results demonstrate the trade-offs between replication and erasure coding techniques, and serve as a foundation for building optimal storage systems with high availability as well as high performance.
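The storage-versus-recovery trade-off described in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, and the two configurations (3-way replication and a Reed-Solomon code with 6 data and 3 parity blocks) are assumed examples, not the settings the authors measured. It also assumes a plain Reed-Solomon code, where rebuilding one lost block requires reading as many surviving blocks as there are data blocks.

```python
def replication_overhead(replicas: int) -> float:
    """Extra storage, as a fraction of the user data, for n-way replication."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return float(replicas - 1)


def erasure_coding_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Extra storage, as a fraction of the user data, for a (k, m) code."""
    if data_blocks < 1 or parity_blocks < 0:
        raise ValueError("need k >= 1 and m >= 0")
    return parity_blocks / data_blocks


def blocks_read_to_repair_one(data_blocks: int, replicated: bool) -> int:
    """Blocks read to rebuild a single lost block.

    Replication copies one surviving replica; a plain Reed-Solomon code
    (an assumption of this sketch) must read k surviving blocks to decode.
    """
    return 1 if replicated else data_blocks


# Assumed example configurations, chosen only for illustration.
REPLICAS = 3          # 3-way replication: tolerates 2 lost copies
RS_DATA, RS_PARITY = 6, 3  # RS(6, 3): tolerates 3 lost blocks per stripe

rep_extra = replication_overhead(REPLICAS)
ec_extra = erasure_coding_overhead(RS_DATA, RS_PARITY)
rep_repair = blocks_read_to_repair_one(1, replicated=True)
ec_repair = blocks_read_to_repair_one(RS_DATA, replicated=False)

print(f"replication x{REPLICAS}: {rep_extra:.0%} extra storage, "
      f"{rep_repair} block read per repair")
print(f"RS({RS_DATA},{RS_PARITY}): {ec_extra:.0%} extra storage, "
      f"{ec_repair} blocks read per repair")
```

Under these assumptions the erasure-coded layout stores far less redundant data but reads several times more blocks to repair a single failure, which is the kind of trade-off the abstract says the evaluation quantifies.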
Pages: 7