Tools for Predicting the Reliability of Large-Scale Storage Systems

Cited by: 7
Author
Hall, Robert J. [1]
Affiliation
[1] AT&T Labs Res, 1 AT&T Way, Bedminster, NJ 07921 USA
Keywords
Tools; large scale; storage systems
DOI
10.1145/2911987
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Data-intensive applications require extreme scaling of their underlying storage systems. Such scaling, together with the fact that storage systems must be implemented in actual data centers, increases the risk of data loss from failures of underlying components. Accurate engineering requires quantitatively predicting reliability, but this remains challenging due to the need to account for extreme scale, redundancy scheme type and strength, distribution architecture, and component dependencies. This article introduces CQSIM-R, a tool suite for predicting the reliability of large-scale storage system designs and deployments. CQSIM-R includes (a) direct calculations based on an only-drives-fail failure model and (b) an event-based simulator for detailed prediction that handles failures of and failure dependencies among arbitrary (drive or nondrive) components. These are based on a common combinatorial framework for modeling placement strategies. The article demonstrates CQSIM-R using models of common storage systems, including replicated and erasure coded designs. New results, such as the poor reliability scaling of spread-placed systems and a quantification of the impact of data center distribution and rack-awareness on reliability, demonstrate the usefulness and generality of the tools. Analysis and empirical studies show the tools' soundness, performance, and scalability.
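To make the abstract's remark about spread-placed systems concrete, the following is a minimal sketch of a combinatorial placement calculation under an only-drives-fail assumption. It is not CQSIM-R's model or code; the function names, the uniform-random placement assumption, and the disjoint-group "partitioned" layout are all illustrative assumptions introduced here. It estimates the chance that a uniformly random set of r simultaneously failed drives holds every replica of at least one object.

```python
from math import comb


def fatal_fraction_spread(n_drives: int, replicas: int, n_objects: int) -> float:
    """Toy model (assumption): each object is stored on a uniformly random
    subset of `replicas` drives, independently of the others.

    Returns the probability that one particular `replicas`-drive subset
    holds all replicas of at least one object, i.e. that losing exactly
    those drives at once loses data.
    """
    total_subsets = comb(n_drives, replicas)
    # Probability that no object landed on this specific subset.
    p_unused = (1.0 - 1.0 / total_subsets) ** n_objects
    return 1.0 - p_unused


def fatal_fraction_partitioned(n_drives: int, replicas: int) -> float:
    """Toy model (assumption): drives are split into disjoint groups of size
    `replicas`, and every object lives entirely inside one group.

    Only the n_drives / replicas group subsets can lose data.
    """
    if n_drives % replicas != 0:
        raise ValueError("n_drives must be a multiple of replicas")
    return (n_drives // replicas) / comb(n_drives, replicas)


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    for n in (30, 300, 3000):
        spread = fatal_fraction_spread(n, 3, n_objects=10_000_000)
        part = fatal_fraction_partitioned(n, 3)
        print(f"n={n:5d}  spread={spread:.3e}  partitioned={part:.3e}")
```

In this toy model, spreading many objects over random drive subsets makes a large share of possible r-drive failure sets fatal, whereas a partitioned layout keeps the number of fatal sets at n/r. The sketch ignores repair times, failure rates, and nondrive components, all of which the abstract says the actual tools account for.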
Pages: 30