Sample-based quality estimation of query results in relational database environments

Times cited: 13
Authors
Ballou, DP [1]
Chengalur-Smith, IN
Wang, RY
Affiliations
[1] Univ Albany, Management Sci & Informat Syst, Albany, NY 12222 USA
[2] MIT, Sloan Sch Management, Informat Qual Program, Cambridge, MA 02142 USA
Keywords
data quality; database sampling; information product; relational algebra; quality control
DOI
10.1109/TKDE.2006.83
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The quality of data in relational databases is often uncertain, and the relationship between the quality of the underlying base tables and the set of potential query results, a type of information product (IP), that could be produced from them has not been fully investigated. This paper provides a basis for the systematic analysis of the quality of such IPs. This research uses the relational algebra framework to develop estimates for the quality of query results based on the quality estimates of samples taken from the base tables. Our procedure requires an initial sample from the base tables; these samples are then used for all possible IPs. Each specific query governs the quality assessment of the relevant samples. By using the same sample repeatedly, our approach is relatively cost-effective. We introduce the Reference-Table Procedure, which can be used for quality estimation in general. In addition, for each of the basic algebraic operators, we discuss simpler procedures that may be applicable. Special attention is devoted to the Join operation. We examine various relevant statistical issues, including how to deal with the impact on quality of missing rows in base tables. Finally, we address several implementation issues related to sampling.
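To make the sampling idea in the abstract concrete, the following is a minimal sketch and not the paper's Reference-Table Procedure. It assumes a simple random sample of rows has already been drawn from one base table, that each sampled row can be judged acceptable or not, and that the query is a single selection. The function name `estimate_result_quality`, the row layout, and the normal-approximation interval are all illustrative assumptions.

```python
import math


def estimate_result_quality(sample_rows, predicate, is_acceptable, z=1.96):
    """Estimate the fraction of acceptable rows in a selection's result.

    sample_rows   -- rows sampled once from the base table (hypothetical layout)
    predicate     -- the selection condition of the query
    is_acceptable -- quality check applied to a sampled row (assumed available)
    z             -- normal quantile for the interval (1.96 for roughly 95%)
    """
    # The query decides which sampled rows are relevant to this result.
    selected = [row for row in sample_rows if predicate(row)]
    n = len(selected)
    if n == 0:
        return None  # the sample says nothing about this query result

    # Point estimate: share of relevant sampled rows that pass the check.
    p_hat = sum(1 for row in selected if is_acceptable(row)) / n

    # Normal-approximation interval, clipped to [0, 1]; crude for small n.
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width), n


# Hypothetical sample: 8 rows from region "east" (6 acceptable), 2 from "west".
sample = (
    [{"region": "east", "ok": True}] * 6
    + [{"region": "east", "ok": False}] * 2
    + [{"region": "west", "ok": True}] * 2
)
estimate = estimate_result_quality(
    sample,
    predicate=lambda row: row["region"] == "east",
    is_acceptable=lambda row: row["ok"],
)
```

Because the same `sample` can be passed in again with a different `predicate`, one round of sampling and quality checking serves many queries, which is the cost argument the abstract makes. Joins and missing rows need the more careful treatment the paper describes and are not covered by this sketch.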
Pages: 639-650
Number of pages: 12