Fault tolerance in computational grids: perspectives, challenges, and issues

Cited by: 8

Authors:

Haider, Sajjad [1,2]

Nazir, Babar [3]

Affiliations:

[1] SZABIST, Dept Comp Sci, H-8, Islamabad, Pakistan

[2] NUML, Dept Comp Sci, H-9, Islamabad, Pakistan

[3] COMSATS Inst Informat Technol, Dept Comp Sci, Univ Rd, Abbottabad 22060, Pakistan

Source:

SPRINGERPLUS | 2016 / Volume 5

Keywords:

Fault identification; Fault tolerance; Fault classification; Computational grid; Distributed computing; RESOURCE-ALLOCATION; FAILURE-DETECTION; LARGE-SCALE; DESIGN; SYSTEM;

DOI:

10.1186/s40064-016-3669-0

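Chinese Library Classification:

O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];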

Discipline classification codes:

07 ; 0710 ; 09 ;

Abstract:

Computational grids are established to provide shared access to hardware and software resources, with particular emphasis on increased computational capability. Fault tolerance is one of the most important issues that computational grids face. The main contribution of this survey is an extended classification of the problems that occur in computational grid environments. The proposed classification will help researchers, developers, and maintainers of grids understand the types of issues to anticipate. Moreover, different types of problems, such as omission-, interaction-, and timing-related problems, have been identified that need to be handled at various layers of the computational grid. The survey also analyzes and examines fault tolerance and fault detection mechanisms. Our conclusion is that a dependable and reliable grid can only be established when more emphasis is placed on fault identification. Moreover, our survey reveals that adaptive and intelligent fault identification and tolerance techniques can improve the dependability of grid working environments.


Pages: 20
