DSDD: Domain-Specific Dataset Discovery on the Web

Cited by: 1
Authors
Zhang, Haoxiang [1]
Santos, Aecio [1]
Freire, Juliana [1]
Affiliations
[1] NYU, New York, NY 10012 USA
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021 | 2021
Keywords
Focused Crawling; Domain-Specific Dataset Discovery; Meta Search; Online Learning; Multi-Armed Bandit
DOI
10.1145/3459637.3482427
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
With the push for transparency and open data, many datasets and data repositories are becoming available on the Web. This opens new opportunities for data-driven exploration, from empowering analysts to answer new questions and obtain insights to improving predictive models through data augmentation. But as datasets are spread over a plethora of Web sites, finding data that are relevant for a given task is difficult. In this paper, we take a first step towards the construction of domain-specific data lakes. We propose an end-to-end dataset discovery system, targeted at domain experts, which, given a small set of keywords, automatically finds potentially relevant datasets on the Web. The system makes use of search engines to hop across Web sites, uses online learning to incrementally build a model to recognize sites that contain datasets, utilizes a set of discovery actions to broaden the search, and applies a multi-armed bandit based algorithm to balance the trade-offs of different discovery actions. We report the results of an extensive experimental evaluation over multiple domains, and demonstrate that our strategy is effective and outperforms state-of-the-art content discovery methods.
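The abstract mentions a multi-armed bandit that balances several discovery actions but does not say which bandit algorithm is used. As a rough illustration only, the sketch below uses UCB1, a standard bandit strategy chosen here as an assumption, not the paper's method. The action names (`keyword_search`, `related_search`, `backlink_search`) and their success rates are hypothetical, and the reward is a simulated stand-in for "this action led to a site that contains datasets".

```python
import math
import random


class UCB1:
    """UCB1 bandit over a fixed set of named arms (here: discovery actions)."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}    # times each arm was played
        self.totals = {a: 0.0 for a in self.arms}  # summed reward per arm

    def select(self):
        # Play every arm once before applying the confidence bound.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        t = sum(self.counts.values())

        def score(arm):
            mean = self.totals[arm] / self.counts[arm]
            bonus = math.sqrt(2.0 * math.log(t) / self.counts[arm])
            return mean + bonus

        return max(self.arms, key=score)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward


def simulate(rounds=2000, seed=0):
    """Run the bandit against simulated (hypothetical) action success rates."""
    rng = random.Random(seed)
    # Hypothetical actions and success probabilities, for illustration only.
    rates = {
        "keyword_search": 0.2,
        "related_search": 0.5,
        "backlink_search": 0.3,
    }
    bandit = UCB1(rates)
    for _ in range(rounds):
        arm = bandit.select()
        # Reward 1.0 stands in for "the action surfaced a relevant dataset site".
        reward = 1.0 if rng.random() < rates[arm] else 0.0
        bandit.update(arm, reward)
    return bandit.counts
```

Calling `simulate()` returns how often each action was chosen. In a real crawler the simulated reward would presumably come from the online-learned site classifier the abstract describes, which this sketch does not model.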
Pages: 2527-2536
Page count: 10