Replica-aware data recovery performance improvement for Hadoop system with NVM

Cited by: 0
Authors
Li, Xin [1 ]
Li, Huijie [1 ]
Lu, Youyou [2 ]
Zhao, Yanchao [1 ]
Qin, Xiaolin [1 ]
Affiliations
[1] Nanjing Univ Aeronaut & Astronaut, Coll Comp Sci & Technol, Nanjing, Peoples R China
[2] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
Data recovery; HDFS; MapReduce; Non-volatile memory; Performance tuning; CLUSTER; MEMORY;
DOI
10.1007/s42514-021-00066-9
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Non-volatile memory (NVM) is a promising device for storing data and accelerating big data analysis owing to its excellent I/O performance. However, we find that simply replacing the hard disk drive (HDD) with NVM does not bring the expected performance improvement. In this paper, we take the data recovery issue in the Hadoop Distributed File System (HDFS) as a case study to investigate how to exploit the performance of NVM. We analyze the data recovery mechanism in HDFS and find that the configuration of replication tasks in the DataNode significantly affects data recovery. We conduct extensive analysis and experiments on tuning this configuration and report several interesting findings. With the new configuration, data recovery performance improves by 17% to 71%, and the execution performance of MapReduce jobs improves by 28% to 59% through the optimized configuration. We also find that sudden data recovery causes disordered competition for network resources, which degrades the performance of MapReduce jobs. Hence, we present a priority-aware multi-stage data recovery method, which improves the performance of MapReduce jobs by an additional 32.5%.
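The abstract attributes the gains to tuning the DataNode replication-task configuration but does not name the exact parameters or values used. As a minimal, illustrative sketch only (not the paper's configuration), the following Java snippet sets the standard HDFS properties that throttle how aggressively the NameNode schedules re-replication after a DataNode failure; the chosen values are assumptions for demonstration.

import org.apache.hadoop.conf.Configuration;

public class ReplicationTuningSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Maximum concurrent replication streams per DataNode
        // (default 2 in recent Hadoop releases); example value only.
        conf.setInt("dfs.namenode.replication.max-streams", 8);

        // Hard upper bound that also applies to highest-priority blocks
        // (default 4); example value only.
        conf.setInt("dfs.namenode.replication.max-streams-hard-limit", 16);

        // Number of re-replication tasks the NameNode dispatches per
        // heartbeat interval, as a multiple of the number of live
        // DataNodes (default 2); example value only.
        conf.setInt("dfs.namenode.replication.work.multiplier.per.iteration", 10);

        // Print the keys that were touched to confirm the settings took effect.
        for (String key : new String[] {
                "dfs.namenode.replication.max-streams",
                "dfs.namenode.replication.max-streams-hard-limit",
                "dfs.namenode.replication.work.multiplier.per.iteration"}) {
            System.out.println(key + " = " + conf.get(key));
        }
    }
}

In a deployment these properties would normally be set in hdfs-site.xml rather than programmatically; the snippet above only illustrates which knobs govern the trade-off between recovery speed and interference with running MapReduce jobs that the abstract discusses.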
Pages: 144-156
Number of pages: 13