Fine-grained Partitioning for Aggressive Data Skipping

被引:60
作者
Sun, Liwen [1 ]
Franklin, Michael J. [1 ]
Krishnan, Sanjay [1 ]
Xin, Reynold S. [2 ]
机构
[1] Univ Calif Berkeley, AMPLab, Berkeley, CA 94720 USA
[2] Databricks Inc, San Francisco, CA USA
来源
SIGMOD'14: PROCEEDINGS OF THE 2014 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA | 2014年
关键词
Data warehouse; Partitioning; Query processing; Algorithms; SELECTION;
D O I
10.1145/2588555.2610515
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
Modern query engines are increasingly being required to process enormous datasets in near real-time. While much can be done to speed up the data access, a promising technique is to reduce the need to access data through data skipping. By maintaining some metadata for each block of tuples, a query may skip a data block if the metadata indicates that the block does not contain relevant data. The effectiveness of data skipping, however, depends on how well the blocking scheme matches the query filters. In this paper, we propose a fine-grained blocking technique that reorganizes the data tuples into blocks with a goal of enabling queries to skip blocks aggressively. We first extract representative filters in a workload as features using frequent itemset mining. Based on these features, each data tuple can be represented as a feature vector. We then formulate the blocking problem as a optimization problem on the feature vectors, called Balanced MaxSkip Partitioning, which we prove is NP-hard. To find an approximate solution efficiently, we adopt the bottom-up clustering framework. We prototyped our blocking techniques on Shark, an open-source data warehouse system. Our experiments on TPC-H and a real-world workload show that our blocking technique leads to 2-5x improvement in query response time over traditional range-based blocking techniques.
引用
收藏
页码:1115 / 1126
页数:12
相关论文
共 34 条
[1]  
Agrawal R., 1994, P 20 INT C VER LARG, P487, DOI DOI 10.5555/645920.672836
[2]   \ Adaptive Range Filters for Cold Data: Avoiding Trips to Siberia [J].
Alexiou, Karolina ;
Kossmann, Donald ;
Larson, Per-Ake .
PROCEEDINGS OF THE VLDB ENDOWMENT, 2013, 6 (14) :1714-1725
[3]   NP-hardness of Euclidean sum-of-squares clustering [J].
Aloise, Daniel ;
Deshpande, Amit ;
Hansen, Pierre ;
Popat, Preyas .
MACHINE LEARNING, 2009, 75 (02) :245-248
[4]  
[Anonymous], 2002, Proceedings of the 2002 ACM SIGMOD international conference on Management of data
[5]  
[Anonymous], 2007, CIDR, DOI [10.1002/per, DOI 10.1002/PER]
[6]  
[Anonymous], 2009, Sort, DOI DOI 10.14778/1687553.1687609
[7]  
[Anonymous], 2013, P 8 ACM EUR C COMP S
[8]  
[Anonymous], 2000, P 26 INT C VER LARG
[9]  
[Anonymous], 2004, SIGMOD Conference, DOI [10.1145/1007568.1007609, DOI 10.1145/1007568.1007609]
[10]  
Aouiche K, 2006, LECT NOTES COMPUT SC, V4152, P81