Parallel Reduction to Hessenberg Form with Algorithm-Based Fault Tolerance

Cited by: 10
Authors
Jia, Yulu [1]
Bosilca, George [1]
Dongarra, Jack J. [1]
Affiliations
[1] Univ Tennessee, Knoxville, TN 37996 USA
Source
2013 INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC) | 2013
Keywords
Algorithm-based fault tolerance; Hessenberg reduction; ScaLAPACK; Dense linear algebra; Parallel numerical libraries; QR ALGORITHM; ERROR
DOI
10.1145/2503210.2503249
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
This paper studies the resilience of a two-sided factorization and presents a generic algorithm-based approach for making two-sided factorizations resilient. We prove the correctness and numerical stability of the approach in the context of Hessenberg reduction (HR) and present scalability and performance results for a practical implementation. Our method is a hybrid algorithm that combines an Algorithm-Based Fault Tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the initial part of the matrix with checksums, and protect finished panels in the panel scope with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine), our fault-tolerant algorithm introduces very little overhead and maintains the same level of scalability. We prove that the overhead decreases as the size of the matrix or the size of the process grid increases.
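The checksum idea behind ABFT can be illustrated with a minimal sketch. This is not the paper's algorithm or the ScaLAPACK PDGEHRD code: it assumes a small dense matrix held in plain Python lists, a single row-checksum column, and a one-sided (left-applied) update only. The helper names `append_row_checksum`, `checksum_holds`, and `recover_entry` are hypothetical. It shows that a checksum column updated together with the data stays valid, so one lost entry per row can be rebuilt from it.

```python
def matmul(a, b):
    """Plain dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def append_row_checksum(a):
    """Return [A | A*e]: each row gains one extra entry holding its sum."""
    return [row + [sum(row)] for row in a]

def checksum_holds(aug, tol=1e-12):
    """Check that the last entry of every row equals the sum of the others."""
    return all(abs(sum(row[:-1]) - row[-1]) <= tol for row in aug)

def recover_entry(aug, i, j):
    """Rebuild a lost data entry (i, j) from the row checksum."""
    row = aug[i]
    return row[-1] - sum(v for k, v in enumerate(row[:-1]) if k != j)

# Illustrative data (assumption, not taken from the paper).
A = [[4.0, 1.0, 2.0],
     [2.0, 3.0, 1.0],
     [1.0, 2.0, 5.0]]

# A left-applied update: subtract half of row 0 from row 1.
L = [[1.0, 0.0, 0.0],
     [-0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

# Updating the augmented matrix updates data and checksum together,
# because L * [A | A*e] = [L*A | (L*A)*e].
aug = matmul(L, append_row_checksum(A))
still_valid = checksum_holds(aug)

# Simulate the loss of one entry and rebuild it from the checksum.
true_value = aug[1][2]
aug[1][2] = float("nan")
recovered = recover_entry(aug, 1, 2)
```

A row checksum alone survives only left-applied updates; a two-sided reduction also multiplies from the right, which is one reason the abstract describes additional protection, including diskless checkpoints for finished panels.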
Pages: 11