Boosting Multi-Block Repair in Cloud Storage Systems with Wide-Stripe Erasure Coding

Cited by: 2
Authors
Yu, Qi [1]
Wang, Lin [1]
Hu, Yuchong [1]
Xu, Yumeng [1]
Feng, Dan [1]
Fu, Jie [2]
Zhu, Xia [2]
Yao, Zhen [2]
Wei, Wenjia [2]
Affiliations
[1] Huazhong Univ Sci & Technol, Wuhan, Peoples R China
[2] Huawei Technol Co Ltd, Shenzhen, Peoples R China
Source
2023 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM, IPDPS | 2023
Funding
National Natural Science Foundation of China;
Keywords
CODES;
DOI
10.1109/IPDPS54959.2023.00036
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Cloud storage systems have commonly used erasure coding that encodes data in stripes of blocks as a low-cost redundancy method for data reliability. Relative to traditional erasure coding, wide-stripe erasure coding that increases the stripe size has been recently proposed and explored to achieve lower redundancy. We observe that wide-stripe erasure coding makes multi-block failures occur much more frequently than traditional erasure coding in cloud storage systems. However, how to efficiently repair multiple blocks in wide-stripe erasure-coded storage systems remains unexplored. The conventional multi-block repair method sends available blocks from surviving nodes to one single new node to repair all failed blocks in a centralized way, which may cause the new node to be the bottleneck; recent multi-block repair methods follow pipelined single-block repair methods and the former are simply built on the latter in an independent way, which may cause the surviving nodes with limited bandwidth to be bottlenecks. In this paper, we first analyze the effects of both centralized and independent ways on the multi-block repair and then propose HMBR, a hybrid multi-block repair mechanism that combines centralized and independent multi-block repairs to trade off the bandwidth bottlenecks caused by the new and surviving nodes, thus optimizing the multi-block repair performance. We further extend HMBR for hierarchical network topology and multi-node failures. We prototype HMBR and show via Amazon EC2 that the repair time of a multi-block failure can be reduced by up to 64.8% over state-of-the-art schemes.
Pages: 279-289
Page count: 11