Scalable and fault tolerant orthogonalization based on randomized distributed data aggregation

被引:19
作者
Gansterer, Wilfried N. [1 ]
Niederbrucker, Gerhard [1 ]
Strakova, Hana [1 ]
Grotthoff, Stefan Schulze [1 ]
机构
[1] Univ Vienna, Res Grp Theory & Applicat Algorithms, A-1090 Vienna, Austria
基金
奥地利科学基金会;
关键词
Distributed reduction operation; Push-flow algorithm; Distributed orthogonalization; Distributed matrix computations; Fault tolerant matrix computations; GOSSIP; ALGORITHM;
D O I
10.1016/j.jocs.2013.01.006
中图分类号
TP39 [计算机的应用];
学科分类号
081203 ; 0835 ;
摘要
The construction of distributed algorithms for matrix computations built on top of distributed data aggregation algorithms with randomized communication schedules is investigated. For this purpose, a new aggregation algorithm for summing or averaging distributed values, the push-flow algorithm, is developed, which achieves superior resilience properties with respect to failures compared to existing aggregation methods. It is illustrated that on a hypercube topology it asymptotically requires the same number of iterations as the optimal all-to-all reduction operation and that it scales well with the number of nodes. Orthogonalization is studied as a prototypical matrix computation task. A new fault tolerant distributed orthogonalization method rdmGS, which can produce accurate results even in the presence of node failures, is built on top of distributed data aggregation algorithms. (C) 2013 Elsevier B.V. All rights reserved.
引用
收藏
页码:480 / 488
页数:9
相关论文
共 32 条
[1]   Broadcast Gossip Algorithms for Consensus [J].
Aysal, Tuncer Can ;
Yildiz, Mehmet Ercan ;
Sarwate, Anand D. ;
Scaglione, Anna .
IEEE TRANSACTIONS ON SIGNAL PROCESSING, 2009, 57 (07) :2748-2761
[2]   Twitter reciprocal reply networks exhibit assortativity with respect to happiness [J].
Bliss, Catherine A. ;
Kloumann, Isabel M. ;
Harris, Kameron Decker ;
Danforth, Christopher M. ;
Dodds, Peter Sheridan .
JOURNAL OF COMPUTATIONAL SCIENCE, 2012, 3 (05) :388-397
[3]   Randomized gossip algorithms [J].
Boyd, Stephen ;
Ghosh, Arpita ;
Prabhakar, Balaji ;
Shah, Devavrat .
IEEE TRANSACTIONS ON INFORMATION THEORY, 2006, 52 (06) :2508-2530
[4]   Algorithm-Based Fault Tolerance for Fail-Stop Failures [J].
Chen, Zizhong ;
Dongarra, Jack .
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2008, 19 (12) :1628-1641
[5]  
Chen ZZ, 2011, HPDC 11: PROCEEDINGS OF THE 20TH INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE DISTRIBUTED COMPUTING, P73
[6]   A higher order estimate of the optimum checkpoint interval for restart dumps [J].
Daly, JT .
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF GRID COMPUTING THEORY METHODS AND APPLICATIONS, 2006, 22 (03) :303-312
[7]   Geographic gossip: Efficient averaging for sensor networks [J].
Dimakis, Alexandros D. G. ;
Sarwate, Anand D. ;
Wainwright, Martin J. .
IEEE TRANSACTIONS ON SIGNAL PROCESSING, 2008, 56 (03) :1205-1216
[8]  
Engelmann C., 2009, PDCN, P189
[9]  
Eyal Ittay, 2012, Algorithms for Sensor Systems. 7th International Symposium on Algorithms for Sensor Systems, Wireless Ad Hoc Networks and Autonomous Mobile Entities, ALGOSENSORS 2011. Revised Selected Papers, P72, DOI 10.1007/978-3-642-28209-6_7
[10]   Process fault tolerance: Semantics, design and applications for high performance computing [J].
Fagg, GE ;
Gabriel, E ;
Chen, ZZ ;
Angskun, T ;
Bosillca, G ;
Pjesivac-Grbovic, J ;
Dongarra, JJ .
INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2005, 19 (04) :465-477