From tasks graphs to asynchronous distributed checkpointing with local restart

Cited by: 5
Authors
Lion, Romain [1]
Thibault, Samuel [1]
Affiliations
[1] Univ Bordeaux, Inria Bordeaux Sud Ouest, Bordeaux, France
Source
PROCEEDINGS OF 2020 IEEE/ACM 10TH WORKSHOP ON FAULT TOLERANCE FOR HPC AT EXTREME SCALE (FTXS 2020) | 2020
Funding
European Union Horizon 2020;
Keywords
Fault tolerance; task-based programming; checkpoint-restart; buddy in-memory; RECOVERY; ROLLBACK;
DOI
10.1109/FTXS51974.2020.00009
Chinese Library Classification number
TP301 [Theory, methods];
Discipline classification code
081202;
Abstract
The ever-increasing number of computation units assembled in current HPC platforms leads to a concerning increase in fault probability. Traditional checkpoint/restart strategies avoid wasting large amounts of computation time when such a fault occurs. With the increasing amount of data handled by current applications, however, these strategies suffer from unreasonable data transfer demands or from the global synchronizations they entail. Meanwhile, the current trend towards task-based programming is an opportunity to revisit the principles of checkpoint/restart strategies. We here propose a checkpointing scheme which is closely tied to the execution of task graphs. We describe how it allows for completely asynchronous and distributed checkpointing, as well as localized node restart, thus opening the way to very large scalability. We also show how a synergy between the application data transfers and the checkpointing transfers can lead to a reasonable additional network load, measured to be lower than +10% on a dense linear algebra example.
Pages: 31-40
Number of pages: 10