Multi-stage Method for Online Vertical Data Partitioning Based on Spectral Clustering

被引:0
作者
Liu P.-J. [1 ,2 ]
Li H.-Y. [1 ,2 ]
Wang T.-Y. [1 ,2 ]
Liu H. [1 ,2 ]
Sun L.-M. [1 ,2 ]
Ren Y.-F. [3 ]
Li C.-P. [1 ,2 ]
Chen H. [1 ,2 ]
机构
[1] Key Laboratory of Data Engineering and Knowledge Engineering, the Ministry of Education, Renmin University of China, Beijing
[2] School of Information, Renmin University of China, Beijing
[3] Huawei Cloud Database Innovation Lab, Shenzhen
来源
Ruan Jian Xue Bao/Journal of Software | 2023年 / 34卷 / 06期
关键词
frequent itemsets; greedy search; multistage decision; spectral clustering; vertical partitioning;
D O I
10.13328/j.cnki.jos.006496
中图分类号
学科分类号
摘要
Vertical data partitioning technology logically stores database table attributes satisfying certain semantic conditions in the same physical block, so as to reduce the cost of data accessing and improve the efficiency of query processing. Every query is usually related only to the table’s some attributes in the database, so a subset of the table’s attributes can be used to get the accurate query results. Reasonable vertical data partitioning can make most queries answered without scanning the whole table, so as to reduce the amount of data accessing and improve the efficiency of query processing. Traditional database vertical partitioning methods are mainly based on heuristic rules set by experts. The granularity of partitioning is coarse, and it can not provide different partition optimizations according to the characteristics of workload. Besides, when the scale of workload or the number of attributes becomes large, the execution time of the existing methods are too long to meet the performance requirements of online real-time tuning of database. Therefore, a method called spectral clustering based vertical partitioning (SCVP) is proposed for the online environment. The idea of phased solution is adapted to reduce the time complexity of the algorithm and speed up partitioning. Firstly, SCVP reduces the solution space by increasing the constraint conditions, that is, generating initial partitions by spectral clustering. Secondly, SCVP designs the algorithm to search solution space, that is, the initial partitions are optimized by combining frequent itemset mining and greedy search. In order to further improve the performance of SCVP under high-dimensional attributes, a new method called special clustering based vertical partitioning redesign (SCVP-R) is proposed which is an improved version of SCVP. SCVP-R optimizes the partitions combiner component of SCVP by introducing sympatric-competition mechanism, double-elimination mechanism, and loop mechanism. The experimental results on different datasets show that SCVP and SCVP-R have faster execution time and better performance than the current state-of-the-art vertical partitioning method. © 2023 Chinese Academy of Sciences. All rights reserved.
引用
收藏
页码:2804 / 2832
页数:28
相关论文
共 33 条
[1]  
Hankins RA, Patel JM., Data morphing: An adaptive, cache-conscious storage technique, Proc. of the 29th Int’l Conf. on Very Large Data Bases, pp. 417-428, (2003)
[2]  
Rodriguez L, Li XO, Cuevas-Rasgado AD, Garcia-Lamont F., DYVEP: An active database system with vertical partitioning functionality, Proc. of the 10th IEEE Int’l Conf. on Networking, Sensing and Control (ICNSC), pp. 457-462, (2013)
[3]  
Li LZ, Gruenwald L., SMOPD-C: An autonomous vertical partitioning technique for distributed databases on cluster computers, Proc. of the 15th IEEE Int’l Conf. on Information Reuse and Integration, pp. 171-178, (2014)
[4]  
Sun LW, Franklin MJ, Krishnan S, Xin RS., Fine-grained partitioning for aggressive data skipping, Proc. of the 2014 ACM SIGMOD Int’l Conf. on Management of Data, pp. 1115-1126, (2014)
[5]  
Shanbhag A, Jindal A, Madden S, Quiane J, Elmore AJ., A robust partitioning scheme for ad-hoc query workloads, Proc. of the 2017 Symp. on Cloud Computing, pp. 229-241, (2017)
[6]  
Papadomanolakis S, Ailamaki A., AutoPart: Automating schema design for large scientific databases using data partitioning, Proc. of the 16th Int’l Conf. on Scientific and Statistical Database Management, pp. 383-392, (2004)
[7]  
Grund M, Kruger J, Plattner H, Zeier A, Cudre-Mauroux P, Madden S., Hyrise: A main memory hybrid storage engine, Proc. of the VLDB Endowment, 4, 2, pp. 105-116, (2010)
[8]  
Karypis G, Kumar V., Analysis of multilevel graph partitioning, Proc. of the 1995 ACM/IEEE Conf. on Supercomputing (Supercomputing1995), (1995)
[9]  
Jindal A, Quiane-Ruiz JA, Dittrich J., Trojan data layouts: Right shoes for a running elephant, Proc. of the 2nd ACM Symp. on Cloud Computing, (2011)
[10]  
Navathe S, Ceri S, Wiederhold G, Dou JL., Vertical partitioning algorithms for database design, ACM Trans. on Database Systems, 9, 4, pp. 680-710, (1984)