Communication lower bounds and optimal algorithms for numerical linear algebra

被引:65
作者
Ballard, G. [1 ]
Carson, E. [2 ]
Demmel, J. [2 ,3 ]
Hoemmen, M. [4 ]
Knight, N. [2 ]
Schwartz, O. [2 ]
机构
[1] Sandia Natl Labs, Livermore, CA 94551 USA
[2] Univ Calif Berkeley, Dept EECS, Berkeley, CA 94704 USA
[3] Univ Calif Berkeley, Dept Math, Berkeley, CA 94704 USA
[4] Sandia Natl Labs, Albuquerque, NM 87185 USA
基金
美国国家科学基金会;
关键词
EFFICIENT MATRIX MULTIPLICATION; MULTISHIFT QR ALGORITHM; OPTIMAL PARALLEL; COLLECTIVE COMMUNICATION; MINIMIZING COMMUNICATION; MODEL IMPLEMENTATION; ATTAINABLE ACCURACY; ITERATIVE METHODS; PRACTICAL USE; EXTENDED SET;
D O I
10.1017/S0962492914000038
中图分类号
O1 [数学];
学科分类号
0701 ; 070101 ;
摘要
The traditional metric for the efficiency of a numerical algorithm has been the number of arithmetic operations it performs. Technological trends have long been reducing the time to perform an arithmetic operation, so it is no longer the bottleneck in many algorithms; rather, communication, or moving data, is the bottleneck. This motivates us to seek algorithms that move as little data as possible, either between levels of a memory hierarchy or between parallel processors over a network. In this paper we summarize recent progress in three aspects of this problem. First we describe lower bounds on communication. Some of these generalize known lower bounds for dense classical (O(n(3))) matrix multiplication to all direct methods of linear algebra, to sequential and parallel algorithms, and to dense and sparse matrices. We also present lower bounds for Strassen-like algorithms, and for iterative methods, in particular Krylov subspace methods applied to sparse matrices. Second, we compare these lower bounds to widely used versions of these algorithms, and note that these widely used algorithms usually communicate asymptotically more than is necessary. Third, we identify or invent new algorithms for most linear algebra problems that do attain these lower bounds, and demonstrate large speed-ups in theory and practice.
引用
收藏
页码:1 / 155
页数:155
相关论文
共 215 条
[1]  
Aasen J. O., 1971, BIT (Nordisk Tidskrift for Informationsbehandling), V11, P233, DOI 10.1007/BF01931804
[2]  
Abdelmalek N. N., 1971, BIT (Nordisk Tidskrift for Informationsbehandling), V11, P345, DOI 10.1007/BF01939404
[3]   A HIGH-PERFORMANCE MATRIX-MULTIPLICATION ALGORITHM ON A DISTRIBUTED-MEMORY PARALLEL COMPUTER, USING OVERLAPPED COMMUNICATION [J].
AGARWAL, RC ;
GUSTAVSON, FG ;
ZUBAIR, M .
IBM JOURNAL OF RESEARCH AND DEVELOPMENT, 1994, 38 (06) :673-681
[4]   A three-dimensional approach to parallel matrix multiplication [J].
Agarwal, RC ;
Balle, SM ;
Gustavson, FG ;
Joshi, M ;
Palkar, P .
IBM JOURNAL OF RESEARCH AND DEVELOPMENT, 1995, 39 (05) :575-582
[5]   THE INPUT OUTPUT COMPLEXITY OF SORTING AND RELATED PROBLEMS [J].
AGGARWAL, A ;
VITTER, JS .
COMMUNICATIONS OF THE ACM, 1988, 31 (09) :1116-1127
[6]   COMMUNICATION COMPLEXITY OF PRAMS [J].
AGGARWAL, A ;
CHANDRA, AK ;
SNIR, M .
THEORETICAL COMPUTER SCIENCE, 1990, 71 (01) :3-28
[7]  
Ahmed N, 2000, LECT NOTES COMPUT SC, V1900, P368
[8]  
Anderson E., 1992, LAPACK Users Guide
[9]  
Anderson M., 2011, Proceedings of the 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2011), P48, DOI 10.1109/IPDPS.2011.15
[10]  
[Anonymous], CS10102 COL STAT U