Understanding the Performance of Erasure Codes in Hadoop Distributed File System

Cited by: 3
Authors
Darrous, Jad [1]
Ibrahim, Shadi [2]
Affiliations
[1] IMT Atlantique, LS2N, INRIA, Nantes, France
[2] Univ Rennes, CNRS, IRISA, INRIA, Rennes, France
Source
Proceedings of the Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems, CHEOPS 2022 | 2022
Keywords
Erasure coding; HDFS; Performance evaluation; MapReduce
DOI
10.1145/3503646.3524296
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Replication has been successfully employed and practiced to ensure high data availability in large-scale distributed storage systems. However, with the relentless growth of generated and collected data, replication has become expensive not only in terms of storage cost but also in terms of network cost and hardware cost. Traditionally, erasure coding (EC) is employed as a cost-efficient alternative to replication when high access latency to the data can be tolerated. However, with the continuous reduction in its CPU overhead, EC is performed on the critical path of data access. For instance, EC has been integrated into the last major release of Hadoop Distributed File System (HDFS) which is the primary storage backend for data analytic frameworks (e.g., Hadoop, Spark, etc.). In this work, we measure and compare the performance of data accesses in HDFS under both replication and EC. Our analysis indicates that EC is a feasible solution for data-intensive applications and it can outperform replication in many scenarios. Furthermore, we demonstrate that it is the block placement algorithm in HDFS that mostly impacts the performance under EC.
Pages: 24-32
Number of pages: 9