Accurate and efficient floating point summation

被引:52
作者
Demmel, J [1 ]
Hida, Y
机构
[1] Univ Calif Berkeley, Div Comp Sci, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, Dept Math, Berkeley, CA 94720 USA
关键词
accurate summation; dot product; floating point; distillation; roundoff error;
D O I
10.1137/S1064827502407627
中图分类号
O29 [应用数学];
学科分类号
070104 ;
摘要
We present and analyze several simple algorithms for accurately computing the sum of n floating point numbers using a wider accumulator. Let f and F be the number of significant bits in the summands and the accumulator, respectively. Then assuming gradual underflow, no overflow, and round-to-nearest arithmetic, up to approximately 2(F-f) numbers can be added accurately by simply summing the terms in decreasing order of exponents, yielding a sum correct to within about 1.5 units in the last place (ulps). We apply this result to the floating point formats in the IEEE floating point standard. For example, a dot product of single precision vectors of length at most 33 computed using double precision and sorting is guaranteed correct to nearly 1.5 ulps. If double-extended precision is used, the vector length can be as large as 65,537. We also investigate how the cost of sorting can be reduced or eliminated while retaining accuracy.
引用
收藏
页码:1214 / 1248
页数:35
相关论文
共 25 条