DaskDB: Scalable Data Science with Unified Data Analytics and In Situ Query Processing

Cited by: 2

Authors:

Watson, Alex [1]

Das, Suvam Kumar [1]

Ray, Suprio [1]

Affiliations:

[1] Univ New Brunswick, Fredericton, NB, Canada

Source:

2021 IEEE 8TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (DSAA) | 2021

Keywords:

INDEX;

DOI:

10.1109/DSAA53316.2021.9564218

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

As data volumes rise rapidly, there is a need to analyze data efficiently and produce results quickly. However, data scientists today need to use different systems, since relational databases are primarily used for SQL querying and data science frameworks for complex data analysis. This may incur significant movement of data across multiple systems, which is expensive. Furthermore, with relational databases, the data must be completely loaded into the database before any analysis can be performed. We believe that data scientists would prefer to use a single system to perform both data analysis tasks and SQL querying, without requiring data movement between different systems. Ideally, this system would offer adequate performance, scalability, built-in data analysis functionalities, and usability. We present DaskDB, a scalable data science system with support for unified data analytics and in situ SQL query processing on heterogeneous data sources. DaskDB supports invoking Python APIs as User-Defined Functions (UDFs), so it can be easily integrated with most existing Python data science applications. Moreover, we introduce a distributed index join algorithm and a novel distributed learned index to improve join performance. Our experimental evaluation involves the TPC-H benchmark and a custom UDF benchmark for data analytics, which we developed. We demonstrate that DaskDB significantly outperforms PySpark and Hive/Hivemall.

Pages: 10
