Reliability in grid computing systems

Cited by: 39
Author
Dabrowski, Christopher [1]
Affiliation
[1] Natl Inst Stand & Technol, Gaithersburg, MD 20899 USA
Keywords
grid computing; grid computing system; reliability; fault tolerances; dependability; RESOURCE-ALLOCATION; RECOVERY PROTOCOLS; ROLLBACK-RECOVERY; FAULT-TOLERANCE; SERVICE; PERFORMANCE; ARCHITECTURE; FRAMEWORK; MANAGEMENT; CONSENSUS;
DOI
10.1002/cpe.1410
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
In recent years, grid technology has emerged as an important tool for solving compute-intensive problems within the scientific community and in industry. To further the development and adoption of this technology, researchers and practitioners from different disciplines have collaborated to produce standard specifications for implementing large-scale, interoperable grid systems. The focus of this activity has been the Open Grid Forum, but other standards development organizations have also produced specifications that are used in grid systems. To date, these specifications have provided the basis for a growing number of operational grid systems used in scientific and industrial applications. However, if the growth of grid technology is to continue, it will be important that grid systems also provide high reliability. In particular, it will be critical to ensure that grid systems are reliable as they continue to grow in scale, exhibit greater dynamism, and become more heterogeneous in composition. Ensuring grid system reliability in turn requires that the specifications used to build these systems fully support reliable grid services. This study surveys work on grid reliability that has been done in recent years and reviews progress made toward achieving these goals. The survey identifies important issues and problems that researchers are working to overcome in order to develop reliability methods for large-scale, heterogeneous, dynamic environments. The survey also illuminates reliability issues relating to standard specifications used in grid systems, identifying existing specifications that may need to be evolved and areas where new specifications are needed to better support reliability. Published in 2009 by John Wiley & Sons, Ltd.
Pages: 927-959
Page count: 33