A Fault Tolerant Implementation for a Massively Parallel Seismic Framework

Cited by: 2
Authors
Kayum, Suha N. [1 ]
Alsalim, Hussain [1 ]
Tonellot, Thierry-Laurent [1 ]
Momin, Ali [1 ]
Affiliations
[1] Saudi Aramco, Dhahran, Saudi Arabia
Source
2020 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC) | 2020
Keywords
parallel seismic applications; fault tolerance; high performance computing
DOI
10.1109/hpec43674.2020.9286143
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Growing seismic data volumes mean that seismic processing applications now run for weeks or months on large supercomputers. A fault occurring during processing would jeopardize the fidelity and quality of the results, so the application must be resilient. GeoDRIVE is a High-Performance Computing (HPC) software framework tailored to massive seismic applications and supercomputers. A fault tolerance mechanism that capitalizes on Boost.Asio for network communication is presented and tested quantitatively and qualitatively using fault injection. Resource provisioning is also illustrated by adding more resources to a job during simulation. Finally, a large-scale job of 2,500 seismic experiments and 358 billion grid elements is executed on 32,000 cores. Subsets of nodes are killed at different times, validating the resilience of the mechanism at large scale. While the implementation is demonstrated in a seismic application context, it can be tailored to any HPC application with embarrassingly parallel properties.
Pages: 8