A Fault Tolerant Implementation for a Massively Parallel Seismic Framework

Cited by: 2
Authors
Kayum, Suha N. [1]
Alsalim, Hussain [1]
Tonellot, Thierry-Laurent [1]
Momin, Ali [1]
Affiliation
[1] Saudi Aramco, Dhahran, Saudi Arabia
Source
2020 IEEE High Performance Extreme Computing Conference (HPEC), 2020
Keywords
parallel seismic applications; fault tolerance; High Performance Computing
DOI
10.1109/hpec43674.2020.9286143
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
An increase in the volume of acquired seismic data has resulted in seismic processing applications running for weeks or months on large supercomputers. A fault occurring during processing would jeopardize the fidelity and quality of the results, hence necessitating a resilient application. GeoDRIVE is a High-Performance Computing (HPC) software framework tailored to massive seismic applications and supercomputers. A fault tolerance mechanism that capitalizes on Boost.asio for network communication is presented and tested quantitatively and qualitatively by simulating faults using fault injection. Resource provisioning is also illustrated by adding more resources to a job during simulation. Finally, a large-scale job of 2,500 seismic experiments and 358 billion grid elements is executed on 32,000 cores. Subsets of nodes are killed at different times, validating the resilience of the mechanism at large scale. While the implementation is demonstrated in a seismic application context, it can be tailored to any HPC application with embarrassingly parallel properties.
Pages: 8