Boosting Multi-Block Repair in Cloud Storage Systems with Wide-Stripe Erasure Coding

Cited by: 2

Authors:

Yu, Qi [1]

Wang, Lin [1]

Hu, Yuchong [1]

Xu, Yumeng [1]

Feng, Dan [1]

Fu, Jie [2]

Zhu, Xia [2]

Yao, Zhen [2]

Wei, Wenjia [2]

Affiliations:

[1] Huazhong University of Science and Technology, Wuhan, People's Republic of China

[2] Huawei Technologies Co., Ltd., Shenzhen, People's Republic of China

Source:

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS) | 2023

Funding:

National Natural Science Foundation of China

Keywords:

CODES;

DOI:

10.1109/IPDPS54959.2023.00036

Chinese Library Classification (CLC) number:

TP3 [Computing Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Cloud storage systems commonly use erasure coding, which encodes data in stripes of blocks, as a low-cost redundancy method for data reliability. Relative to traditional erasure coding, wide-stripe erasure coding, which increases the stripe size, has recently been proposed and explored to achieve lower redundancy. We observe that wide-stripe erasure coding makes multi-block failures occur much more frequently than traditional erasure coding in cloud storage systems. However, how to efficiently repair multiple blocks in wide-stripe erasure-coded storage systems remains unexplored. The conventional multi-block repair method sends available blocks from surviving nodes to a single new node that repairs all failed blocks in a centralized way, which may make the new node the bottleneck. Recent multi-block repair methods follow pipelined single-block repair methods and are simply built on them in an independent way, which may make the surviving nodes with limited bandwidth the bottlenecks. In this paper, we first analyze the effects of both the centralized and independent ways on multi-block repair and then propose HMBR, a hybrid multi-block repair mechanism that combines centralized and independent multi-block repairs to trade off the bandwidth bottlenecks caused by the new and surviving nodes, thus optimizing multi-block repair performance. We further extend HMBR to hierarchical network topologies and multi-node failures. We prototype HMBR and show on Amazon EC2 that the repair time of a multi-block failure can be reduced by up to 64.8% over state-of-the-art schemes.
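
The bottleneck trade-off described in the abstract can be illustrated with a toy model. The sketch below is not the paper's analysis or the HMBR algorithm: the function `repair_time`, the split fraction `p`, the traffic formulas, and all numeric values (64 helper blocks, 4 failed blocks, 64 MB blocks, 125 MB/s and 25 MB/s links) are assumptions made purely for illustration. It assumes a fraction `p` of every block is repaired the centralized way and the rest the independent (pipelined) way, and that repair time is set by the most heavily loaded link.

```python
def repair_time(k, f, block_mb, new_bw, surv_bw, p):
    """Toy bottleneck model (an assumption, not the paper's analysis).

    k: surviving blocks read for the repair, f: failed blocks,
    p: fraction of each block repaired the centralized way,
    new_bw / surv_bw: new-node download and helper upload bandwidth in MB/s.
    """
    # The new node downloads a p-fraction from each of the k helpers
    # (centralized part) plus one repaired (1 - p)-fraction that arrives
    # at the end of its pipeline (independent part).
    new_node_download = (p * k + (1 - p)) * block_mb
    # Each helper uploads its p-fraction once (centralized part) and its
    # (1 - p)-fraction once per failed block (f independent pipelines).
    helper_upload = (p + (1 - p) * f) * block_mb
    # The slower of the two links determines the repair time.
    return max(new_node_download / new_bw, helper_upload / surv_bw)


# Hypothetical wide-stripe setting with bandwidth-limited surviving nodes.
params = dict(k=64, f=4, block_mb=64, new_bw=125, surv_bw=25)

centralized = repair_time(p=1.0, **params)   # new node is the bottleneck
independent = repair_time(p=0.0, **params)   # helpers are the bottleneck

# Sweep the split fraction to find the best hybrid point in this model.
candidates = [i / 100 for i in range(101)]
best_p = min(candidates, key=lambda p: repair_time(p=p, **params))
hybrid = repair_time(p=best_p, **params)

print(f"centralized: {centralized:.2f} s")
print(f"independent: {independent:.2f} s")
print(f"hybrid (p={best_p:.2f}): {hybrid:.2f} s")
```

In this model the pure centralized setting overloads the new node and the pure independent setting overloads the helpers, while an intermediate `p` balances the two loads. The actual gains reported in the abstract come from the authors' prototype, not from this sketch.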


Pages: 279-289

Page count: 11

