RAFIKI: A Middleware for Parameter Tuning of NoSQL Datastores for Dynamic Metagenomics Workloads

Cited by: 39
Authors
Mahgoub, Ashraf [1]
Wood, Paul [1]
Ganesh, Sachandhan [1]
Mitra, Subrata [2]
Gerlach, Wolfgang [3]
Harrison, Travis [3]
Meyer, Folker [3]
Grama, Ananth [1]
Bagchi, Saurabh [1]
Chaterji, Somali [1]
Affiliations
[1] Purdue Univ, W Lafayette, IN 47907 USA
[2] Adobe Res, San Jose, CA USA
[3] Argonne Natl Lab, Argonne, IL 60439 USA
Source
PROCEEDINGS OF THE 2017 INTERNATIONAL MIDDLEWARE CONFERENCE (MIDDLEWARE'17) | 2017
Funding
National Science Foundation (United States)
Keywords
Database automatic tuning; Metagenomics workloads; NoSQL datastores; BIOINFORMATICS; MAPREDUCE; DATABASE;
DOI
10.1145/3135974.3135991
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
High performance computing (HPC) applications, such as metagenomics and other big data systems, need to store and analyze huge volumes of semi-structured data. Such applications often rely on NoSQL-based datastores, and optimizing these databases is a challenging endeavor, with over 50 configuration parameters in Cassandra alone. As the application executes, database workloads can change rapidly from read-heavy to write-heavy ones, and a system tuned with a read-optimized configuration becomes suboptimal when the workload becomes write-heavy. In this paper, we present a method and a system for optimizing NoSQL configurations for Cassandra and ScyllaDB when running HPC and metagenomics workloads. First, we identify the significance of configuration parameters using ANOVA. Next, we apply neural networks using the most significant parameters and their workload-dependent mapping to predict database throughput, as a surrogate model. Then, we optimize the configuration using genetic algorithms on the surrogate to maximize the workload-dependent performance. Using the proposed methodology in our system (RAFIKI), we can predict the throughput for unseen workloads and configuration values with an error of 7.5% for Cassandra and 6.9-7.8% for ScyllaDB. Searching the configuration spaces using the trained surrogate models, we achieve performance improvements of 41% for Cassandra and 9% for ScyllaDB over the default configuration with respect to a read-heavy workload, and also significant improvement for mixed workloads. In terms of searching speed, RAFIKI, using only 1/10,000th of the searching time of exhaustive search, reaches within 15% and 9.5% of the theoretically best achievable performances for Cassandra and ScyllaDB, respectively, supporting optimizations for highly dynamic workloads.
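The final step described in the abstract, a genetic algorithm searching over a surrogate throughput model, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the knob names (`knob_a`, `knob_b`, `knob_c`), their ranges, the default configuration, and the `toy_surrogate` formula are invented placeholders. They stand in for the ANOVA-selected parameters and the trained neural network, which this record does not specify. The sketch is not RAFIKI's implementation.

```python
import random

# Hypothetical knobs with integer ranges (placeholders, not real Cassandra settings).
KNOB_RANGES = {"knob_a": (1, 64), "knob_b": (1, 64), "knob_c": (1, 16)}
DEFAULT_CONFIG = {"knob_a": 32, "knob_b": 32, "knob_c": 2}


def toy_surrogate(config, read_ratio):
    """Invented stand-in for a trained throughput predictor (not a fitted model)."""
    # The best value of each knob shifts with the read ratio,
    # mimicking a workload-dependent optimum.
    target_a = 8 + 48 * read_ratio
    target_b = 56 - 48 * read_ratio
    target_c = 4 + 8 * read_ratio
    penalty = ((config["knob_a"] - target_a) ** 2
               + (config["knob_b"] - target_b) ** 2
               + 4 * (config["knob_c"] - target_c) ** 2)
    return 100_000.0 - 10.0 * penalty


def random_config(rng):
    """Draw every knob uniformly from its range."""
    return {k: rng.randint(lo, hi) for k, (lo, hi) in KNOB_RANGES.items()}


def crossover(a, b, rng):
    """Uniform crossover: each knob is inherited from one parent at random."""
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in KNOB_RANGES}


def mutate(config, rng, rate=0.2):
    """Resample each knob with probability `rate`."""
    child = dict(config)
    for k, (lo, hi) in KNOB_RANGES.items():
        if rng.random() < rate:
            child[k] = rng.randint(lo, hi)
    return child


def tune(read_ratio, generations=40, pop_size=30, elite=2, seed=0):
    """Search for a configuration that maximizes the surrogate's prediction."""
    rng = random.Random(seed)

    def score(config):
        return toy_surrogate(config, read_ratio)

    # Seed the population with the default so the result is never worse than it.
    population = [dict(DEFAULT_CONFIG)]
    population += [random_config(rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]
        # Elitism: carry the best configurations over unchanged.
        children = [dict(c) for c in population[:elite]]
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = children
    best = max(population, key=score)
    return best, score(best)


if __name__ == "__main__":
    for ratio in (0.1, 0.5, 0.9):
        best, predicted = tune(read_ratio=ratio)
        print(ratio, best, round(predicted, 1))
```

Because each evaluation is a cheap call to the surrogate rather than a benchmark run against a live cluster, the search can be repeated whenever the read/write mix changes, which is the property the abstract relies on for dynamic workloads.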
Pages: 28-40
Number of pages: 13