Exploiting Parallelism of Disk Failure Recovery via Partial Stripe Repair for an Erasure-Coded High-Density Storage Server

Cited by: 1
Authors
Wang, Lin [1]
Hu, Yuchong [1]
Du, Qian [1]
Feng, Dan [1]
Wu, Ray [2]
He, Ingo [2]
Zhang, Kevin [2]
Affiliations
[1] Huazhong Univ Sci & Technol, Wuhan, Hubei, Peoples R China
[2] Inspur, Jinan, Peoples R China
Source
51ST INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, ICPP 2022 | 2022
Funding
National Natural Science Foundation of China
Keywords
High-density storage server; Erasure coding; Disk failure recovery; DISTRIBUTED STORAGE; SCHEME;
DOI
10.1145/3545008.3545014
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
High-density storage servers (HDSSes), which pack many disks into a single server, are currently used in data centers to save costs (power, cooling, etc.). Erasure coding, which stripes data and provides high availability guarantees, is also commonly deployed in data centers at lower cost than replication. However, when applying erasure coding to a single HDSS, we find that state-of-the-art erasure coding studies that improve repair performance through parallelism mainly rely on the ample memory footprint of multiple servers, which is quite limited in a single HDSS, thus leading to a memory-competition issue for disk failure recovery. In this paper, for a single HDSS, we analyze the parallelism of its disk failure recovery, which exists within each stripe (intra-stripe) and between stripes (inter-stripe), observe that the intra-stripe and inter-stripe parallelisms are mutually restrictive, and explore how they affect the disk failure recovery time. Based on these observations, we propose, for the HDSS, partial stripe repair (HD-PSR) schemes, which exploit parallelism in both active and passive ways for single-disk recovery. We further propose a cooperative repair strategy to improve multi-disk recovery performance. We prototype HD-PSR and show via Amazon EC2 experiments that the recovery time of a single-disk failure and a multi-disk failure can be reduced by up to 71.7% and 52.5%, respectively, over existing erasure-coded repair in high-density storage.
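The tension between intra-stripe and inter-stripe parallelism described in the abstract can be illustrated with a minimal sketch. This is not the authors' HD-PSR implementation. It assumes a single-parity XOR code in place of a general erasure code, and the names `repair_partial_stripe`, `max_concurrent_stripes`, and the memory model (one read buffer per chunk in flight plus one accumulator per stripe) are hypothetical, introduced here only for illustration. The idea shown is that a lost chunk is a linear combination of the surviving chunks, so it can be accumulated from a few chunks at a time rather than from the whole stripe at once.

```python
from typing import Callable, List


def xor_into(acc: bytearray, chunk: bytes) -> None:
    """XOR `chunk` into the accumulator in place."""
    for i, b in enumerate(chunk):
        acc[i] ^= b


def repair_full_stripe(surviving: List[bytes]) -> bytes:
    """Baseline: every surviving chunk of the stripe is held in memory at once."""
    acc = bytearray(len(surviving[0]))
    for chunk in surviving:
        xor_into(acc, chunk)
    return bytes(acc)


def repair_partial_stripe(read_chunk: Callable[[int], bytes],
                          num_surviving: int,
                          chunk_size: int,
                          batch: int) -> bytes:
    """Rebuild the lost chunk while holding at most `batch` read buffers.

    `batch` plays the role of intra-stripe parallelism in this sketch:
    how many chunks of one stripe are read together in a round.
    """
    acc = bytearray(chunk_size)
    for start in range(0, num_surviving, batch):
        end = min(start + batch, num_surviving)
        buffers = [read_chunk(i) for i in range(start, end)]
        for chunk in buffers:
            xor_into(acc, chunk)  # fold this round into the partial result
    return bytes(acc)


def max_concurrent_stripes(memory_chunks: int, batch: int) -> int:
    """Assumed memory model: each in-flight stripe needs `batch` read
    buffers plus one accumulator, all of chunk size."""
    return memory_chunks // (batch + 1)


# Toy stripe: four data chunks and one XOR parity chunk; data chunk 2 is lost.
data = [bytes([d] * 8) for d in (1, 2, 3, 4)]
parity = repair_full_stripe(data)
lost = data[2]
surviving = data[:2] + data[3:] + [parity]

rebuilt = repair_partial_stripe(lambda i: surviving[i], len(surviving), 8, batch=2)
assert rebuilt == lost
```

Under this assumed model, a fixed budget of 12 chunk-sized buffers allows 2 stripes in flight when each stripe reads 4 chunks per round, but 4 stripes when each reads 2 per round, which is one simple way to picture why the two kinds of parallelism restrict each other.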
Pages: 11