A Fault Tolerant Implementation for a Massively Parallel Seismic Framework

Cited by: 2
Authors
Kayum, Suha N. [1]
Alsalim, Hussain [1]
Tonellot, Thierry-Laurent [1]
Momin, Ali [1]
Affiliations
[1] Saudi Aramco, Dhahran, Saudi Arabia
Keywords
parallel seismic applications; fault tolerance; High Performance Computing
DOI
10.1109/hpec43674.2020.9286143
CLC number
TP3 [Computing Technology, Computer Technology]
Discipline classification code
0812
Abstract
Growing volumes of acquired seismic data have led to seismic processing applications that run for weeks or months on large supercomputers. A fault during processing can jeopardize the fidelity and quality of the results, which makes application resilience a necessity. GeoDRIVE is a High-Performance Computing (HPC) software framework tailored to massive seismic applications and supercomputers. A fault tolerance mechanism that uses Boost.Asio for network communication is presented and evaluated quantitatively and qualitatively, with faults simulated through fault injection. Resource provisioning is also demonstrated by adding resources to a job during the simulation. Finally, a large-scale job of 2,500 seismic experiments and 358 billion grid elements is executed on 32,000 cores. Subsets of nodes are killed at different times, validating the resilience of the mechanism at large scale. Although the implementation is demonstrated in a seismic application context, it can be adapted to any embarrassingly parallel HPC application.
Pages: 8