Fault-Tolerant Grid-Based Solvers: Combining Concepts from Sparse Grids and MapReduce

被引:12
|
作者
Larson, J. W. [1 ]
Hegland, M. [1 ]
Harding, B. [1 ]
Roberts, S. [1 ]
Stals, L. [1 ]
Rendell, A. P. [2 ]
Strazdins, P. [2 ]
Ali, M. M. [2 ]
Kowitz, C. [2 ,4 ]
Nobes, R. [3 ]
Southern, J. [3 ]
Wilson, N. [3 ]
Li, M. [3 ]
Oishi, Y. [3 ]
机构
[1] Australian Natl Univ, Inst Math Sci, GPO Box 4, Canberra, ACT 0200, Australia
[2] Australian Natl Univ, Res Sch Comp Sci, Canberra, ACT 0200, Australia
[3] Fujitsu Lab Europe, Hayes UB4 8FE, England
[4] Univ Munich, Inst Informat Tech, F-85748 Munich, Germany
来源
2013 INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE | 2013年 / 18卷
基金
澳大利亚研究理事会;
关键词
Parallel computing; partial differential equations; fault-tolerance; sparse grids; MapReduce; PARALLELIZATION; COMBINATION;
D O I
10.1016/j.procs.2013.05.176
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
A key issue confronting petascale and exascale computing is the growth in probability of soft and hard faults with increasing system size. A promising approach to this problem is the use of algorithms that are inherently fault tolerant. We introduce such an algorithm for the solution of partial differential equations, based on the sparse grid approach. Here, the solution of multiple component grids are efficiently combined to achieve a solution on a full grid. The technique also lends itself to a (modified) MapReduce framework on a cluster of processors, with the map stage corresponding to allocating each component grid for solution over a subset of the processors, and the reduce stage corresponding to their combination. We describe how the sparse grid combination method can be modified to robustly solve partial differential equations in the presence of faults. This is based on a modified combination formula that can accommodate the loss of one or two component grids. We also discuss accuracy issues associated with this formula. We give details of a prototype implementation within a MapReduce framework using the dynamic process features and asynchronous message passing facilities of MPI. Results on a two-dimensional advection problem show that the errors after the loss of one or two sub-grids are within a factor of 3 of the sparse grid solution in the presence of no faults. They also indicate that the sparse grid technique with four times the resolution has approximately the same error as a full grid, while requiring (for a sufficiently high resolution) much lower computation and memory requirements. We finally outline a MapReduce variant capable of responding to faults in ways other than re-scheduling of failed tasks. We discuss the likely software requirements for such a flexible MapReduce framework, the requirements it will impose on users' legacy codes, and the system's runtime behavior.
引用
收藏
页码:130 / 139
页数:10
相关论文
共 10 条
  • [1] Deploying Fault-Tolerant Grid-Based Wireless Sensor Networks for Environmental Applications
    Al-Turjman, Fadi M.
    Al-Fagih, Ashraf E.
    Hassanein, Hossam S.
    Ibnkahla, Mohamed A.
    IEEE LOCAL COMPUTER NETWORK CONFERENCE, 2010, : 715 - 722
  • [2] Grid-based switch fabrics: a new approach in designing fault-tolerant ATM switches
    Laskaridis, HS
    Veglis, AA
    Papadimitriou, GI
    Pombortsis, AS
    COMPUTER COMMUNICATIONS, 2001, 24 (15-16) : 1589 - 1606
  • [3] GRFM: an efficient grid-based replication and fault tolerant middleware
    Siddesh, G. M.
    Srinivasa, K. G.
    INTERNATIONAL JOURNAL OF COMPUTATIONAL SCIENCE AND ENGINEERING, 2013, 8 (02) : 133 - 147
  • [4] A Fault-Tolerant Gyrokinetic Plasma Application using the Sparse Grid Combination Technique
    Ali, Md Mohsin
    Strazdins, Peter E.
    Harding, Brendan
    Hegland, Markus
    Larson, Jay W.
    PROCEEDINGS OF THE 2015 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING & SIMULATION (HPCS 2015), 2015, : 499 - 507
  • [5] Complex scientific applications made fault-tolerant with the sparse grid combination technique
    Ali, Md Mohsin
    Strazdins, Peter E.
    Harding, Brendan
    Hegland, Markus
    INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2016, 30 (03) : 335 - 359
  • [6] Performance analysis of a dependable scheduling strategy based on a fault-tolerant grid model
    Wang Y.
    Lin C.
    Yang Y.
    Shan Z.
    Frontiers of Computer Science in China, 2007, 1 (03): : 329 - 337
  • [7] Combining qualitative model-based diagnosis and observation within fault-tolerant systems
    Schiller, F
    Schröder, J
    AI COMMUNICATIONS, 1999, 12 (1-2) : 79 - 98
  • [8] A load-balancing-based fault-tolerant mapping method in smart grid virtual networks
    Sun, Li-Qian
    Guo, Shao-Yong
    Xu, Si-Ya
    Liu, Zhu
    Wei, Lei
    2016 18TH ASIA-PACIFIC NETWORK OPERATIONS AND MANAGEMENT SYMPOSIUM (APNOMS), 2016,
  • [9] Fault-tolerant Data Aggregation Scheme for Monitoring of Critical Events in Grid based Healthcare Sensor Networks
    Saeed, Ather
    Stranieri, Andrew
    Dazeley, Richard
    HIGH PERFORMANCE COMPUTING SYMPOSIUM 2011 (HPC 2011) - 2011 SPRING SIMULATION MULTICONFERENCE - BK 6 OF 8, 2011, 43 (02): : 56 - 64
  • [10] Architecting Fault-tolerant Component-based Systems: from requirements to testing
    Bucchiarone, Antonio
    Muccini, Henry
    Pelliccione, Patrizio
    ELECTRONIC NOTES IN THEORETICAL COMPUTER SCIENCE, 2007, 168 (SPEC. ISS.) : 77 - 90