Online random shuffling of large database tables

被引:6
作者
Jermaine, Christopher [1 ]
机构
[1] Univ Florida, CISE Dept, Gainesville, FL 32611 USA
基金
美国国家科学基金会;
关键词
sampling methods; database systems;
D O I
10.1109/TKDE.2007.250586
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Many applications require a randomized ordering of input data. Examples include algorithms for online aggregation, data mining, and various randomized algorithms. Most existing work seems to assume that accessing the records from a large database in a randomized order is not a difficult problem. However, it turns out to be extremely difficult in practice. Using existing methods, randomization is either extremely expensive at the front end ( as data are loaded), or at the back end ( as data are queried). This paper presents a simple file structure which supports both efficient, online random shuffling of a large database, as well as efficient online sampling or randomization of the database when it is queried. The key innovation of our method is the introduction of a small degree of carefully controlled, rigorously monitored nonrandomness into the file.
引用
收藏
页码:73 / 84
页数:12
相关论文
共 30 条
[1]  
[Anonymous], 1999, P 5 ACM SIGKDD INT C
[2]  
[Anonymous], P 1997 ACM SIGMOD IN
[3]  
[Anonymous], 2001, P 18 INT C MACH LEAR
[4]  
ANTOSHENKOV G, 1992, PROC INT CONF VERY L, P375
[5]  
BARRON AR, 1988, P IEEE S INFORM THEO, V38, P1437
[6]  
Bradley P. S., 1998, Proceedings Fourth International Conference on Knowledge Discovery and Data Mining, P9
[7]  
Chaudhuri Surajit, 2004, P 2004 ACM SIGMOD IN, P287, DOI [10.1145/1007568.1007602, DOI 10.1145/1007568.1007602]
[8]   Maintenance of discovered association rules in large databases: Art incremental updating technique [J].
Cheung, DW ;
Han, JW ;
Ng, VT ;
Wong, CY .
PROCEEDINGS OF THE TWELFTH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, 1996, :106-114
[9]  
Cochran W. G., 1977, WILEY SERIES PROBABI
[10]  
Garcia-Molina Hector., 2000, DATABASE SYSTEM IMPL