Boosting Erasure-Coded Multi-Stripe Repair in Rack Architecture and Heterogeneous Clusters: Design and Analysis

Cited by: 4
Authors
Zhou, Hai [1]
Feng, Dan [1]
Affiliations
[1] Huazhong Univ Sci & Technol, Engn Res Ctr Data Storage Syst & Technol, Sch Comp Sci & Technol, Wuhan Natl Lab Optoelect, Minist Educ China, Key Lab, Wuhan 430074, Hubei, Peoples R China
Keywords
Maintenance engineering; Codes; Bandwidth; Computer architecture; Clustering algorithms; Costs; Heterogeneous networks; Erasure code; rack architecture; multiple stripes; heterogeneous network; repair time; EFFICIENT; NETWORK;
DOI
10.1109/TPDS.2023.3282180
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Large-scale storage systems have introduced erasure codes to guarantee high data reliability, yet inevitably at the expense of high repair costs. In practice, storage nodes are usually divided into different racks, and the data blocks in nodes are organized into multiple stripes, each handled independently by the erasure code. Because cross-rack bandwidth is scarce and heterogeneous, cross-rack transmission dominates the overall repair cost. When erasure codes are deployed in rack architectures, existing repair techniques are limited in several respects: they neglect the heterogeneity of cross-rack bandwidth, give little consideration to multi-stripe failures, and apply no special treatment to repair-link scheduling. In this paper, we present CMRepair, a Cross-rack Multi-stripe Repair technique that aims to reduce the repair time of multi-stripe failure repair in heterogeneous erasure-coded clusters. CMRepair first carefully chooses the nodes for reading and repairing blocks and searches for a multi-stripe repair solution. It then adjusts the solution with one of two algorithms: the Computation Time Priority (CTP) algorithm, based on a greedy approach, or the Repair Time Priority (RTP) algorithm, based on a meta-heuristic approach. Furthermore, CMRepair selectively schedules the execution order of cross-rack links, with the primary objective of saturating unused upload/download bandwidth and avoiding network congestion. Experiments show that, compared with existing repair techniques, CMRepair with the CTP algorithm reduces the repair time by 27.59%-58.12% while introducing only negligible computation overhead, and CMRepair with the RTP algorithm reduces the repair time by 33.52%-97.75% within an acceptable computation time.
Pages: 2251-2264
Page count: 14