WebTables: Exploring the Power of Tables on the Web

Cited by: 294
Authors
Cafarella, Michael J. [1]
Halevy, Alon [2]
Wang, Daisy Zhe [3]
Wu, Eugene [4]
Zhang, Yang [4]
Affiliations
[1] Univ Washington, Seattle, WA 98107 USA
[2] Google Inc, Mountain View, CA 94043 USA
[3] Univ Calif Berkeley, Berkeley, CA 94720 USA
[4] MIT, Cambridge, MA 02139 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2008 / Vol. 1 / No. 1
DOI
10.14778/1453856.1453916
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from Google's general-purpose web crawl, and used statistical classification techniques to find the estimated 154M that contain high-quality relational data. Because each relational table has its own "schema" of labeled and typed columns, each such table can be considered a small structured database. The resulting corpus of databases is larger than any other corpus we are aware of, by at least five orders of magnitude. We describe the WEBTABLES system to explore two fundamental questions about this collection of databases. First, what are effective techniques for searching for structured data at search-engine scales? Second, what additional power can be derived by analyzing such a huge corpus? To answer the first question, we develop new techniques for keyword search over a corpus of tables, and show that they can achieve substantially higher relevance than solutions based on a traditional search engine. To answer the second, we introduce a new object derived from the database corpus: the attribute correlation statistics database (AcsDB), which records corpus-wide statistics on co-occurrences of schema elements. In addition to improving search relevance, the AcsDB makes possible several novel applications: schema auto-complete, which helps a database designer to choose schema elements; attribute synonym finding, which automatically computes attribute synonym pairs for schema matching; and join-graph traversal, which allows a user to navigate between extracted schemas using automatically generated join links.
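The idea of corpus-wide co-occurrence statistics over schema elements, and their use for schema auto-complete, can be illustrated with a small sketch. This is not the WEBTABLES implementation: the function names (`build_acsdb`, `cond_prob`, `autocomplete`), the toy schemas, and the ranking by a simple pairwise conditional probability are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def build_acsdb(schemas):
    """Count how often each attribute, and each unordered attribute pair,
    appears across a collection of schemas (hypothetical helper)."""
    single = Counter()
    pair = Counter()
    for schema in schemas:
        attrs = sorted(set(schema))  # de-duplicate and fix an order for pair keys
        single.update(attrs)
        pair.update(combinations(attrs, 2))
    return single, pair


def cond_prob(single, pair, a, b):
    """Estimate p(a | b) as count(a, b) / count(b); 0.0 if b was never seen."""
    if single[b] == 0:
        return 0.0
    key = tuple(sorted((a, b)))
    return pair[key] / single[b]


def autocomplete(single, pair, seed, k=3):
    """Suggest up to k attributes that most often co-occur with `seed`.
    Ties are broken alphabetically (an arbitrary choice for this sketch)."""
    scores = {a: cond_prob(single, pair, a, seed) for a in single if a != seed}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [a for a, s in ranked[:k] if s > 0]


# Toy corpus of extracted schemas (made-up data, not from the paper).
TOY_SCHEMAS = [
    ["make", "model", "year"],
    ["make", "model", "price"],
    ["name", "address", "phone"],
    ["make", "model", "year", "price"],
]

single, pair = build_acsdb(TOY_SCHEMAS)
suggestions = autocomplete(single, pair, "make", k=2)
print(suggestions)
```

In this toy corpus, a designer who starts a schema with `make` would be offered the attributes that most frequently appear alongside it. A realistic system would need normalization of attribute names and statistics gathered at a far larger scale.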
Pages: 538-549
Page count: 12