Scalable Column Concept Determination for Web Tables Using Large Knowledge Bases

Cited by: 30
Authors
Dong Deng [1]
Yu Jiang [1]
Guoliang Li [1]
Jian Li [2]
Cong Yu [3]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci, Beijing, Peoples R China
[2] Tsinghua Univ, Inst Interdisciplinary Informat, Beijing, Peoples R China
[3] Google Res, New York, NY USA
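Source
Proceedings of the VLDB Endowment | 2013 / Vol. 6 / No. 13
Funding
National Natural Science Foundation of China
Keywords
DOI
10.14778/2536258.2536271
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract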
Tabular data on the Web has become a rich source of structured data that is useful for ordinary users to explore. Due to its potential, tables on the Web have recently attracted a number of studies with the goals of understanding the semantics of those Web tables and providing effective search and exploration mechanisms over them. An important part of table understanding and search is column concept determination, i.e., identifying the most appropriate concepts associated with the columns of the tables. The problem becomes especially challenging with the availability of increasingly rich knowledge bases that contain hundreds of millions of entities. In this paper, we focus on an important instantiation of the column concept determination problem, namely, the concepts of a column are determined by fuzzy matching its cell values to the entities within a large knowledge base. We provide an efficient MapReduce-based solution that scales with both the number of tables and the size of the knowledge base, and propose two novel techniques: knowledge concept aggregation and knowledge entity partition. We prove that both the problem of finding the optimal aggregation strategy and that of finding the optimal partition strategy are NP-hard, and propose efficient heuristic techniques by leveraging the hierarchy of the knowledge base. Experimental results on real-world datasets show that our method achieves high annotation quality and performance, and scales well.
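To make the problem statement concrete, the following is a minimal single-machine sketch of column concept determination by fuzzy matching. It is not the paper's MapReduce algorithm and does not implement knowledge concept aggregation or knowledge entity partition. The token-level Jaccard similarity, the threshold value, the majority-vote ranking, and the names `jaccard` and `column_concepts` are all assumptions chosen for illustration; the toy knowledge base is hypothetical.

```python
from collections import Counter


def tokens(text):
    """Lowercase a string and split it into a set of whitespace tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Token-level Jaccard similarity (an assumed fuzzy-matching measure)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def column_concepts(cells, kb, threshold=0.6, top_k=1):
    """Return the top_k concepts for a column.

    cells: list of cell value strings from one table column.
    kb: dict mapping entity name -> set of concept names (hypothetical layout).
    A cell votes for a concept if it fuzzily matches at least one entity
    of that concept; each cell votes at most once per concept.
    """
    votes = Counter()
    for cell in cells:
        matched = set()
        for entity, concepts in kb.items():
            if jaccard(cell, entity) >= threshold:
                matched.update(concepts)
        votes.update(matched)
    return [concept for concept, _ in votes.most_common(top_k)]


# Hypothetical toy knowledge base and column, for illustration only.
kb = {
    "New York City": {"city"},
    "Los Angeles": {"city"},
    "New York Times": {"newspaper"},
}
column = ["new york city", "los angeles", "York City"]
print(column_concepts(column, kb))
```

This brute-force version compares every cell with every entity, which is exactly the cost that becomes prohibitive when the knowledge base holds hundreds of millions of entities.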
Pages: 1606-1617
Page count: 12