Hadoop Based Scalable Cluster Deduplication for Big Data

Cited by: 4
Authors
Liu, Qing [1]
Fu, Yinjin [1]
Ni, Guiqiang [1]
Hou, Rui [2]
Affiliations
[1] PLA Univ Sci & Technol, Coll Command Informat Syst, Nanjing, Jiangsu, Peoples R China
[2] Inst Elect Syst Engn, Beijing, Peoples R China
Source
2016 IEEE 36TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS WORKSHOPS (ICDCSW 2016) | 2016
Keywords
data deduplication; big data; Hadoop; HBase; index management;
DOI
10.1109/ICDCSW.2016.17
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The exponential growth of data poses a tremendous challenge to storage systems in data centers. Data deduplication, which detects and eliminates redundant data in a dataset, can greatly reduce the volume of data and improve the utilization of storage space. This paper presents Halodedu, a scalable and reliable cluster deduplication system built on a Hadoop-based cloud computing platform. Halodedu uses MapReduce for parallel deduplication processing and HDFS for data storage management. A local database on each node provides fast, distributed management of the chunk fingerprint index. To maintain the availability and reliability of metadata, HBase stores the metadata of backup files. Halodedu is evaluated with virtual machine images as the input dataset. Comparative experiments show that Halodedu improves deduplication speed and system scalability.
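The chunk fingerprint index mentioned in the abstract can be illustrated with a minimal single-process sketch. This is not Halodedu's implementation: the fixed 4096-byte chunk size, the SHA-1 fingerprint, the in-memory dictionary standing in for the per-node local database, and the function names `chunk_fingerprints`, `deduplicate`, and `restore` are all assumptions made for illustration, and the MapReduce, HDFS, and HBase layers are left out entirely.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size in bytes (not taken from the paper)


def chunk_fingerprints(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield (fingerprint, chunk) pairs for consecutive fixed-size chunks."""
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        # SHA-1 is used here only as an example fingerprint function.
        yield hashlib.sha1(chunk).hexdigest(), chunk


def deduplicate(data: bytes, index: dict, chunk_size: int = CHUNK_SIZE) -> list:
    """Store unseen chunks in the index and return the file recipe.

    The index is a plain dict (fingerprint -> chunk bytes), a stand-in for
    a per-node fingerprint database. The recipe is the ordered list of
    fingerprints needed to rebuild the input.
    """
    recipe = []
    for fingerprint, chunk in chunk_fingerprints(data, chunk_size):
        if fingerprint not in index:
            index[fingerprint] = chunk  # first occurrence: keep the chunk
        recipe.append(fingerprint)      # duplicates only add a reference
    return recipe


def restore(recipe: list, index: dict) -> bytes:
    """Rebuild the original bytes from a recipe and the chunk index."""
    return b"".join(index[fingerprint] for fingerprint in recipe)
```

With an input made of four chunks of which two are distinct, `deduplicate` returns a four-entry recipe while the index holds only two chunks, and `restore` reproduces the input byte for byte.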
Pages: 98-105
Page count: 8