Fault-Tolerant LU Factorization Is Low Cost

Cited by: 2
Authors
Coti, Camille [1]
Petrucci, Laure [1]
Gonzalez, Daniel Alberto Torres [1]
Affiliations
[1] Univ Sorbonne Paris Nord, LIPN, CNRS UMR 7030, 99 Ave Jean Baptiste Clement, F-93430 Villetaneuse, France
Source
Keywords
PROJECT;
DOI
10.1007/978-3-030-85665-6_33
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
At large scale, failures are statistically frequent and need to be taken into account. Tolerating failures has become a major challenge in parallel computing: as systems grow in size, failures become more common, and some computation units are expected to fail during the execution of a program. Algorithms used in these programs must be scalable while remaining resilient to the hardware failures that will happen during execution. In this paper, we present an algorithm that takes advantage of intrinsic properties of scalable communication-avoiding LU algorithms to make them fault-tolerant, so that the computation proceeds in spite of failures. We evaluate the overhead of the fault tolerance mechanisms with respect to failure-free execution on both tall-and-skinny matrices (TSLU) and square matrices (CALU), as well as the cost of a failure during the execution.
Pages: 536-549
Number of pages: 14