Understanding the Performance of Erasure Codes in Hadoop Distributed File System

Cited by: 3
Authors
Darrous, Jad [1]
Ibrahim, Shadi [2]
Affiliations
[1] IMT Atlantique, LS2N, INRIA, Nantes, France
[2] Univ Rennes, CNRS, IRISA, INRIA, Rennes, France
Source
PROCEEDINGS OF THE WORKSHOP ON CHALLENGES AND OPPORTUNITIES OF EFFICIENT AND PERFORMANT STORAGE SYSTEMS, CHEOPS 2022 | 2022
Keywords
Erasure coding; HDFS; Performance evaluation; MapReduce
DOI
10.1145/3503646.3524296
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Replication has long been used successfully to ensure high data availability in large-scale distributed storage systems. However, with the relentless growth of generated and collected data, replication has become expensive not only in storage cost but also in network and hardware cost. Traditionally, erasure coding (EC) has been employed as a cost-efficient alternative to replication when high data-access latency can be tolerated. However, with the continuous reduction in its CPU overhead, EC is now performed on the critical path of data access. For instance, EC has been integrated into the latest major release of the Hadoop Distributed File System (HDFS), which is the primary storage backend for data analytics frameworks (e.g., Hadoop and Spark). In this work, we measure and compare the performance of data accesses in HDFS under both replication and EC. Our analysis indicates that EC is a feasible solution for data-intensive applications and that it can outperform replication in many scenarios. Furthermore, we demonstrate that the block placement algorithm in HDFS is what most affects performance under EC.
Pages: 24-32
Page count: 9