Instance-Optimized Data Layouts for Cloud Analytics Workloads

Cited by: 23
Authors
Ding, Jialin [1]
Minhas, Umar Farooq [2]
Chandramouli, Badrish [2]
Wang, Chi [2]
Li, Yinan [2]
Li, Ying [3]
Kossmann, Donald [2]
Gehrke, Johannes [3]
Kraska, Tim [1]
Affiliations
[1] MIT, Cambridge, MA 02139 USA
[2] Microsoft Res, Redmond, WA USA
[3] Microsoft, Redmond, WA USA
Source
SIGMOD '21: PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA | 2021
DOI
10.1145/3448016.3457270
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Today, businesses rely on efficiently running analytics on large amounts of operational and historical data to gain business insights and competitive advantage. Increasingly, such analytics are run using cloud-based data analytics services, such as Google BigQuery, Microsoft Azure Synapse, Amazon Redshift, and Snowflake. These services persist and process data in compressed, columnar formats, stored in large blocks, each of which contains thousands or millions of records. For these services, disk I/O from (remote) cloud storage is often one of the dominant costs for query processing. To reduce the amount of I/O, services often maintain per-block metadata, such as zone maps, which are used to skip blocks that are irrelevant to the query, leading to lower query execution times. However, the effectiveness of block skipping via zone maps is dependent on how the records are assigned to blocks. Recent work on instance-optimized data layouts aims to maximize block skipping by specializing the block assignment strategy to a specific dataset and workload. However, these existing approaches only optimize the layout for a single table. In this paper, we propose MTO, an instance-optimized data layout framework that determines the blocking strategy for all tables in a multi-table dataset in the presence of joins, such as in a star or snowflake schema common in real-world workloads. MTO takes advantage of sideways information passing through joins to jointly optimize the layout for all tables, which results in better block skipping and hence reduced query execution times. Experiments on a commercial cloud-based analytics service show that MTO achieves up to 93% reduction in blocks accessed and 75% reduction in end-to-end query times compared to state-of-the-art blocking strategies.
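The abstract's point that zone-map effectiveness depends on how records are assigned to blocks can be illustrated with a minimal single-table, single-column sketch. This is not MTO and does not model joins or sideways information passing; the `Block`, `build_blocks`, and `scan_range` names, the block size, and the synthetic data are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    records: List[int]
    zone_min: int  # zone map: smallest value stored in the block
    zone_max: int  # zone map: largest value stored in the block


def build_blocks(values: List[int], block_size: int) -> List[Block]:
    """Assign records to fixed-size blocks in the given order and
    record a min/max zone map for each block."""
    blocks = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_range(blocks: List[Block], lo: int, hi: int) -> Tuple[List[int], int]:
    """Answer lo <= v <= hi, reading only blocks whose zone map
    overlaps the predicate. Returns (matches, blocks_read)."""
    matches: List[int] = []
    blocks_read = 0
    for b in blocks:
        if b.zone_max < lo or b.zone_min > hi:
            continue  # zone map proves the block is irrelevant: skip it
        blocks_read += 1
        matches.extend(v for v in b.records if lo <= v <= hi)
    return matches, blocks_read


# Synthetic column: a scrambled permutation of 0..99 (hypothetical data).
values = [(i * 7) % 100 for i in range(100)]

arrival_layout = build_blocks(values, block_size=10)         # arrival order
sorted_layout = build_blocks(sorted(values), block_size=10)  # sorted on the column

m_arrival, read_arrival = scan_range(arrival_layout, 20, 29)
m_sorted, read_sorted = scan_range(sorted_layout, 20, 29)

print("blocks read, arrival order:", read_arrival)
print("blocks read, sorted layout:", read_sorted)
```

Both layouts return the same records, but the sorted layout lets the zone maps rule out far more blocks. Choosing the assignment of records to blocks for a specific dataset and workload, across several joined tables, is the problem the abstract describes.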
Pages: 418-431
Page count: 14