Dynamic multiple node failure recovery in distributed storage systems

被引:8
作者
Itani, May [1 ]
Sharafeddine, Sanaa [2 ]
ElKabani, Islam [1 ,3 ]
机构
[1] Beirut Arab Univ, Fac Sci, Math & Comp Sci Dept, Beirut, Lebanon
[2] Lebanese Amer Univ, Sch Arts & Sci, Dept Comp Sci & Math, Beirut, Lebanon
[3] Alexandria Univ, Fac Sci, Math & Comp Sci Dept, Alexandria, Egypt
关键词
Distributed storage systems; Fractional repetition codes; Failure recovery; Genetic algorithms; Multiple failures; Heuristic optimization; REGENERATING CODES;
D O I
10.1016/j.adhoc.2017.12.007
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Our daily lives are getting more and more dependent on data centers and distributed storage systems in general, whether at the business or at the personal level. With the advent of fog computing, personal mobile devices in a given geographical area may also comprise a very dynamic distributed storage system. These paradigm changes call for the urgent need of devising efficient and reliable failure recovery mechanisms in dynamic scenarios where failures become more likely and nodes join and leave the network more frequently. Redundancy schemes in distributed storage systems have become essential for providing reliability given the fact of frequent node failures. In this work, we address the problem of multiple failure recovery with dynamic scenarios using the fractional repetition code as a redundancy scheme. The fractional repetition (FR) code is a class of regenerating codes that concatenates a maximum distance separable code (MDS) with an inner fractional repetition code where data is split into several blocks then replicated and multiple replicas of each block are stored on various system nodes. We formulate the problem as an integer linear programming problem and extend it to account for three dynamic scenarios of newly arriving blocks, nodes, and variable priority blocks allocation. The contribution of this paper is four-fold: i. we generate an optimized block distribution scheme that minimizes the total system repair cost of all dependent and independent multiple node failure scenarios; ii. we address the practical scenario of having newly arriving blocks and allocate those blocks to existing nodes without any modification to the original on-node block distribution; iii. we consider new-comer nodes and generate an updated optimized block distribution; iv. we consider optimized storage and recovery of blocks with varying priority using variable fractional repetition codes. The four problems are modeled using incidence matrices and solved heuristically. We present a range of results for our proposed algorithms in several scenarios to assess the effectiveness of the solution approaches that are shown to generate results close to optimal. (C) 2017 Published by Elsevier B.V.
引用
收藏
页码:1 / 13
页数:13
相关论文
共 29 条
[1]  
Carroll A., 2013, WHY DATA CTR ARE NEC
[2]   Network Coding for Distributed Storage Systems [J].
Dimakis, Alexandros G. ;
Godfrey, P. Brighten ;
Wu, Yunnan ;
Wainwright, Martin J. ;
Ramchandran, Kannan .
IEEE TRANSACTIONS ON INFORMATION THEORY, 2010, 56 (09) :4539-4551
[3]  
ElRouayheb S., 2010, 48 IEEE ANN ALL C CO
[4]  
Ford D., 2010, P 9 USENIX C OP SYST, P61
[5]  
Hou SW, 2011, IN C IND ENG ENG MAN, P549, DOI 10.1109/IEEM.2011.6117977
[6]   Cooperative Recovery of Distributed Storage Systems from Multiple Losses with Network Coding [J].
Hu, Yuchong ;
Xu, Yinlong ;
Wang, Xiaozhao ;
Zhan, Cheng ;
Li, Pei .
IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, 2010, 28 (02) :268-276
[7]  
Itani M, 2016, 2016 IEEE SYMPOSIUM ON COMPUTERS AND COMMUNICATION (ISCC), P901, DOI 10.1109/ISCC.2016.7543851
[8]  
Kermarrec Anne-Marie., 2011, PROC INT S NETWORK C, P1
[9]  
Khan O., 2012, FAST
[10]  
Lavalle B., 2015, PROLIFERATION REMOTE