A Fault Tolerant Implementation for a Massively Parallel Seismic Framework

Cited by: 2
Authors
Kayum, Suha N. [1]
Alsalim, Hussain [1]
Tonellot, Thierry-Laurent [1]
Momin, Ali [1]
Affiliations
[1] Saudi Aramco, Dhahran, Saudi Arabia
Source
2020 IEEE High Performance Extreme Computing Conference (HPEC), 2020
Keywords
parallel seismic applications; fault tolerance; high performance computing
DOI
10.1109/hpec43674.2020.9286143
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Growth in acquired seismic data volumes has resulted in seismic processing applications that run for weeks or months on large supercomputers. A fault occurring during processing would jeopardize the fidelity and quality of the results, hence necessitating a resilient application. GeoDRIVE is a High-Performance Computing (HPC) software framework tailored to massive seismic applications and supercomputers. A fault tolerance mechanism that capitalizes on Boost.Asio for network communication is presented and tested quantitatively and qualitatively by simulating faults using fault injection. Resource provisioning is also illustrated by adding more resources to a job during simulation. Finally, a large-scale job of 2,500 seismic experiments and 358 billion grid elements is executed on 32,000 cores. Subsets of nodes are killed at different times, validating the resilience of the mechanism at large scale. While the implementation is demonstrated in a seismic application context, it can be tailored to any HPC application with embarrassingly parallel properties.
Pages: 8