A Fault Tolerant Implementation for a Massively Parallel Seismic Framework

Cited by: 2
Authors
Kayum, Suha N. [1 ]
Alsalim, Hussain [1 ]
Tonellot, Thierry-Laurent [1 ]
Momin, Ali [1 ]
Affiliation
[1] Saudi Aramco, Dhahran, Saudi Arabia
Source
2020 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC) | 2020
Keywords
parallel seismic applications; fault tolerance; High Performance Computing
DOI
10.1109/hpec43674.2020.9286143
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
An increase in the acquisition of seismic data volumes has resulted in applications processing seismic data running for weeks or months on large supercomputers. A fault occurring during processing would jeopardize the fidelity and quality of the results, hence necessitating a resilient application. GeoDRIVE is a High-Performance Computing (HPC) software framework tailored to massive seismic applications and supercomputers. A fault tolerance mechanism that capitalizes on Boost.asio for network communication is presented and tested quantitatively and qualitatively by simulating faults using fault injection. Resource provisioning is also illustrated by adding more resources to a job during simulation. Finally, a large-scale job of 2,500 seismic experiments and 358 billion grid elements is executed on 32,000 cores. Subsets of nodes are killed at different times, validating the resilience of the mechanism at large scale. While the implementation is demonstrated in a seismic application context, it can be tailored to any HPC application with embarrassingly parallel properties.
Pages: 8
Related papers
50 in total
  • [1] Massively parallel fault tolerant computations on syntactical patterns
    Kutrib, M
    Löwe, JT
    FUTURE GENERATION COMPUTER SYSTEMS, 2002, 18 (07) : 905 - 919
  • [2] A fault-tolerant hierarchical diagnostic network for massively parallel processing systems
    Choi, YH
    Kim, YS
    COMPUTERS & ELECTRICAL ENGINEERING, 1998, 24 (05) : 349 - 361
  • [3] A Massively-Parallel, Fault-Tolerant Solver for High-Dimensional PDEs
    Heene, Mario
    Hinojosa, Alfredo Parra
    Bungartz, Hans-Joachim
    Pflueger, Dirk
    EURO-PAR 2016: PARALLEL PROCESSING WORKSHOPS, 2017, 10104 : 635 - 647
  • [4] A MASSIVELY-PARALLEL FAULT-TOLERANT ARCHITECTURE FOR TIME-CRITICAL COMPUTING
    AHMAD, I
    JOURNAL OF SUPERCOMPUTING, 1995, 9 (1-2) : 135 - 162
  • [5] A novel fault-tolerant parallel algorithm
    Wang, Panfeng
    Du, Yunfei
    Fu, Hongyi
    Zhou, Haifang
    Yang, Xuejun
    Yang, Wenjing
    ADVANCED PARALLEL PROCESSING TECHNOLOGIES, PROCEEDINGS, 2007, 4847 : 18 - 29
  • [6] Fault tolerant memory design for HW/SW co-reliability in massively parallel computing systems
    Choi, M
    Park, NJ
    George, KM
    Jin, B
    Park, N
    Kim, YB
    Lombardi, F
    SECOND IEEE INTERNATIONAL SYMPOSIUM ON NETWORK COMPUTING AND APPLICATIONS, PROCEEDINGS, 2003, : 341 - 348
  • [7] A fault tolerant MPI-IO implementation using the expand parallel file system
    Calderón, A
    García-Carballeira, F
    Carretero, J
    Pérez, JM
    Sánchez, LM
    13TH EUROMICRO CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING, PROCEEDINGS, 2005, : 274 - 281
  • [8] A Parallel Fault Tolerant Combination Technique
    Harding, Brendan
    Hegland, Markus
    PARALLEL COMPUTING: ACCELERATING COMPUTATIONAL SCIENCE AND ENGINEERING (CSE), 2014, 25 : 584 - 592
  • [9] Formal development of fault tolerant parallel systems
    Troubitsyna, Elena A.
    IMCIC 2010: INTERNATIONAL MULTI-CONFERENCE ON COMPLEXITY, INFORMATICS AND CYBERNETICS, VOL II, 2010, : 108 - 112
  • [10] A fault tolerant model for a parallel database system
    Keane, JA
    Ye, X
    EUROSIM '96 - HPCN CHALLENGES IN TELECOMP AND TELECOM: PARALLEL SIMULATION OF COMPLEX SYSTEMS AND LARGE-SCALE APPLICATIONS, 1996, : 127 - 134