DaskDB: Scalable Data Science with Unified Data Analytics and In Situ Query Processing

Cited by: 2
Authors
Watson, Alex [1]
Das, Suvam Kumar [1]
Ray, Suprio [1]
Affiliations
[1] Univ New Brunswick, Fredericton, NB, Canada
Source
2021 IEEE 8TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (DSAA) | 2021
Keywords
INDEX;
DOI
10.1109/DSAA53316.2021.9564218
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Rapidly rising data volumes create a need to analyze data efficiently and produce results quickly. However, data scientists today must use different systems, because relational databases are primarily used for SQL querying while data science frameworks are used for complex data analysis. This can incur significant data movement across multiple systems, which is expensive. Furthermore, with relational databases, the data must be completely loaded into the database before any analysis can be performed. We believe that data scientists would prefer to use a single system to perform both data analysis tasks and SQL querying, without requiring data movement between different systems. Ideally, this system would offer adequate performance, scalability, built-in data analysis functionality, and usability. We present DaskDB, a scalable data science system with support for unified data analytics and in situ SQL query processing on heterogeneous data sources. DaskDB supports invoking Python APIs as User-Defined Functions (UDFs), so it can be easily integrated with most existing Python data science applications. Moreover, we introduce a distributed index join algorithm and a novel distributed learned index to improve join performance. Our experimental evaluation involves the TPC-H benchmark and a custom UDF benchmark for data analytics that we developed. We demonstrate that DaskDB significantly outperforms PySpark and Hive/Hivemall.
Pages: 10