Reliability in grid computing systems

Cited by: 39
Authors
Dabrowski, Christopher [1]
Affiliations
[1] Natl Inst Stand & Technol, Gaithersburg, MD 20899 USA
Keywords
grid computing; grid computing system; reliability; fault tolerances; dependability; RESOURCE-ALLOCATION; RECOVERY PROTOCOLS; ROLLBACK-RECOVERY; FAULT-TOLERANCE; SERVICE; PERFORMANCE; ARCHITECTURE; FRAMEWORK; MANAGEMENT; CONSENSUS
DOI
10.1002/cpe.1410
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
In recent years, grid technology has emerged as an important tool for solving compute-intensive problems within the scientific community and in industry. To further the development and adoption of this technology, researchers and practitioners from different disciplines have collaborated to produce standard specifications for implementing large-scale, interoperable grid systems. The focus of this activity has been the Open Grid Forum, but other standards development organizations have also produced specifications that are used in grid systems. To date, these specifications have provided the basis for a growing number of operational grid systems used in scientific and industrial applications. However, if the growth of grid technology is to continue, it will be important that grid systems also provide high reliability. In particular, it will be critical to ensure that grid systems are reliable as they continue to grow in scale, exhibit greater dynamism, and become more heterogeneous in composition. Ensuring grid system reliability in turn requires that the specifications used to build these systems fully support reliable grid services. This study surveys work on grid reliability that has been done in recent years and reviews progress made toward achieving these goals. The survey identifies important issues and problems that researchers are working to overcome in order to develop reliability methods for large-scale, heterogeneous, dynamic environments. The survey also illuminates reliability issues relating to standard specifications used in grid systems, identifying existing specifications that may need to evolve and areas where new specifications are needed to better support reliability. Published in 2009 by John Wiley & Sons, Ltd.
Pages: 927-959
Number of pages: 33