Validating multi-column schema matchings by type

Cited by: 9
Authors
Dai, Bing Tian [1]
Koudas, Nick [2]
Srivastava, Divesh [3]
Tung, Anthony K. H. [1]
Venkatasubramanian, Suresh [4]
Affiliations
[1] Natl Univ Singapore, Singapore 117590, Singapore
[2] Univ Toronto, Toronto, ON M5S 1A1, Canada
[3] AT&T Labs Res, Shannon Lab, Florham Pk, NJ 07932 USA
[4] Univ Utah, Salt Lake City, UT 84112 USA
Source
2008 IEEE 24TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3 | 2008
DOI
10.1109/ICDE.2008.4497420
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Validation of multi-column schema matchings is essential for successful database integration. This task is especially difficult when the databases to be integrated contain little overlapping data, as is often the case in practice (e.g., customer bases of different companies). Based on the intuition that values present in different columns related by a schema matching will have a similar "semantic type", and that this can be captured using distributions over values ("statistical types"), we develop a method for validating 1:1 and compositional schema matchings. Our technique is based on three key technical ideas. First, we propose a generic measure for comparing two columns matched by a schema matching, based on a notion of information-theoretic discrepancy that generalizes the standard geometric discrepancy; this provides the basis for 1:1 matching. Second, we present an algorithm for "splitting" the string values in a column to identify substrings that are likely to match the values in another column; this enables (multi-column) 1:m schema matching. Third, our technique provides an invalidation certificate if it fails to validate a schema matching. We complement our conceptual and algorithmic contributions with an experimental study that demonstrates the effectiveness and efficiency of our technique on a variety of database schemas and data sets.
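The idea of comparing two columns by their value distributions rather than by shared values can be illustrated with a minimal sketch. This is not the discrepancy measure proposed in the paper: it substitutes the Jensen-Shannon divergence over character-class bigrams as a simple stand-in, and the function names (`char_class`, `pattern_distribution`, `js_divergence`) are hypothetical names chosen for illustration.

```python
import math
from collections import Counter


def char_class(ch):
    """Map a character to a coarse class: 'd' for digits, 'a' for letters, else itself."""
    if ch.isdigit():
        return "d"
    if ch.isalpha():
        return "a"
    return ch


def pattern_distribution(values, q=2):
    """Normalized frequencies of character-class q-grams over all values in a column."""
    counts = Counter()
    for value in values:
        pattern = "".join(char_class(ch) for ch in str(value))
        for i in range(len(pattern) - q + 1):
            counts[pattern[i:i + q]] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two distributions; lies in [0, 1]."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mix(dist):
        # Kullback-Leibler divergence from dist to the mixture distribution.
        return sum(v * math.log2(v / mix[k]) for k, v in dist.items() if v > 0)

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)


# Two phone-number columns with no values in common, and one name column.
phones_a = ["555-1234", "555-9876"]
phones_b = ["444-0000", "212-3456"]
names = ["Alice", "Bob Smith"]

same_type = js_divergence(pattern_distribution(phones_a), pattern_distribution(phones_b))
diff_type = js_divergence(pattern_distribution(phones_a), pattern_distribution(names))
print(same_type, diff_type)
```

In this toy setting the two phone columns share no values yet produce near-zero divergence, while the phone and name columns diverge strongly, which mirrors the intuition that a matching can be checked without overlapping data.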
Pages: 120 / +
Number of pages: 2