External memory algorithms and data structures: Dealing with massive data

被引:299
作者
Vitter, JS
机构
[1] Duke Univ, Levine Sci Res Ctr, Dept Comp Sci, Durham, NC 27708 USA
[2] Univ Aarhus, BRICS, Aarhus, Denmark
[3] INRIA, Sophia Antipolis, France
关键词
algorithms; design; experimentation; performance; theory; batched; block; B-tree; disk; dynamic; extendible hashing; external memory; hierarchical memory; I/O; multidimensional access methods; multilevel memory; online; out-of-core; secondary storage; sorting;
D O I
10.1145/384192.384193
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
Data sets in large applications are often too massive to fit completely inside the computer's internal memory. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major performance bottleneck. In this article we survey the state of the art in the design and analysis of external memory (or EM) algorithms and data structures, where the goal is to exploit locality in order to reduce the I/O costs. We consider a variety of EM paradigms for solving hatched and online problems efficiently in external memory. For the batched problem of sorting and related problems such as permuting and fast Fourier transform, the key paradigms include distribution and merging. The paradigm of disk striping offers an elegant way to use multiple disks in parallel. For sorting, however, disk striping can be nonoptimal with respect to I/O, so to gain further improvements we discuss distribution and merging techniques for using the disks independently. We also consider useful techniques for batched EM problems involving matrices (such as matrix multiplication and transposition), geometric data (such as finding intersections and constructing convex hulls), and graphs (such as list ranking, connected components, topological sorting, and shortest paths). In the online domain, canonical EM applications include dictionary lookup and range searching. The two important classes of indexed data structures are based upon extendible hashing and B-trees. The paradigms of filtering and bootstrapping provide a convenient means in online data structures to make effective use of the data accessed from disk. We also reexamine some of the above EM problems in slightly different settings, such as when the data items are moving, when the data items are variable-length (e.g., text strings), or when the allocated amount of internal memory can change dynamically. Programming tools and environments are available for simplifying the EM programming task. During the course of the survey, we report on some experiments in the domain of spatial databases using the TPIE system (transparent parallel I/O programming environment). The newly developed EM algorithms and data structures that incorporate the paradigms we discuss are significantly faster than methods currently used in practice.
引用
收藏
页码:209 / 271
页数:63
相关论文
共 265 条
[1]   AB+-TREE STRUCTURE FOR LARGE QUADTREES [J].
ABEL, DJ .
COMPUTER VISION GRAPHICS AND IMAGE PROCESSING, 1984, 27 (01) :19-31
[2]  
ABELLO J, 1998, LECT NOTES COMPUTER, V1461, P332
[3]   Active disks: Programming model, algorithms and evaluation [J].
Acharya, A ;
Uysal, M ;
Saltz, J .
ACM SIGPLAN NOTICES, 1998, 33 (11) :81-91
[4]  
ADLER M, 1996, P IEEE S FDN COMP SC, V37, P173
[5]  
Agarwal P.K., 1999, ADV DISCRETE COMPUTA, V223, P1, DOI [10.1090/conm/223/03131, DOI 10.1090/conm/223/03131]
[6]  
Agarwal PankajK., 1998, Symposium on Discrete Algorithms, P117
[7]  
AGARWAL PK, 2001, LECT NOTES COMPUTER, V2076
[8]  
AGARWAL PK, 1999, P ACM SIAM S DISCR A, V10, P11
[9]  
AGARWAL PK, 2001, P ACM S COMP GEOM ME, V17
[10]  
AGARWAL PK, 1998, P ACM S PRINC DAT SY, V17, P169