DSDD: Domain-Specific Dataset Discovery on the Web

Cited by: 1
Authors
Zhang, Haoxiang [1]
Santos, Aecio [1]
Freire, Juliana [1]
Affiliations
[1] NYU, New York, NY 10012 USA
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021 | 2021
Keywords
Focused Crawling; Domain-Specific Dataset Discovery; Meta Search; Online Learning; Multi-Armed Bandit; Focused Crawler
DOI
10.1145/3459637.3482427
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
With the push for transparency and open data, many datasets and data repositories are becoming available on the Web. This opens new opportunities for data-driven exploration, from empowering analysts to answer new questions and obtain insights to improving predictive models through data augmentation. But because datasets are spread over a plethora of Web sites, finding data that are relevant to a given task is difficult. In this paper, we take a first step towards the construction of domain-specific data lakes. We propose an end-to-end dataset discovery system, targeted at domain experts, which, given a small set of keywords, automatically finds potentially relevant datasets on the Web. The system makes use of search engines to hop across Web sites, uses online learning to incrementally build a model that recognizes sites containing datasets, utilizes a set of discovery actions to broaden the search, and applies a multi-armed-bandit-based algorithm to balance the trade-offs among the different discovery actions. We report the results of an extensive experimental evaluation over multiple domains, and demonstrate that our strategy is effective and outperforms state-of-the-art content discovery methods.
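The abstract does not say which bandit algorithm the system uses or what its discovery actions are. As a rough illustration of how a multi-armed bandit can balance several discovery actions, here is a minimal sketch that assumes the standard UCB1 rule. The class `UCB1`, the helper `simulate`, the action names, and the yield rates are all hypothetical and do not come from the paper.

```python
import math
import random


class UCB1:
    """Minimal UCB1 bandit; each arm stands for one hypothetical discovery action."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.counts = {a: 0 for a in self.actions}    # times each action was tried
        self.totals = {a: 0.0 for a in self.actions}  # summed reward per action

    def select(self):
        # Try every action once before applying the UCB1 index.
        for a in self.actions:
            if self.counts[a] == 0:
                return a
        t = sum(self.counts.values())

        def score(a):
            mean = self.totals[a] / self.counts[a]
            bonus = math.sqrt(2.0 * math.log(t) / self.counts[a])
            return mean + bonus  # exploitation term plus exploration bonus

        return max(self.actions, key=score)

    def update(self, action, reward):
        self.counts[action] += 1
        self.totals[action] += reward


def simulate(yield_rates, rounds=1000, seed=0):
    """Run the bandit against made-up Bernoulli rewards (1 = a relevant site was found)."""
    rng = random.Random(seed)
    bandit = UCB1(yield_rates)
    for _ in range(rounds):
        action = bandit.select()
        reward = 1.0 if rng.random() < yield_rates[action] else 0.0
        bandit.update(action, reward)
    return bandit.counts


if __name__ == "__main__":
    # Hypothetical action names and success probabilities, for illustration only.
    rates = {"keyword_search": 0.6, "related_search": 0.3, "forward_links": 0.1}
    print(simulate(rates))
```

In this sketch the reward is 1 when an action yields a relevant site and 0 otherwise, so actions that keep paying off are selected more often while the exploration bonus still gives rarely tried actions an occasional chance.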
Pages: 2527-2536
Page count: 10