Massive Data Load on Distributed Database Systems over HBase

Cited by: 10
Authors
Azqueta-Alzuaz, Ainhoa [1]
Patino-Martinez, Marta [1]
Brondino, Ivan [2]
Jimenez-Peris, Ricardo [2]
Affiliations
[1] Univ Politecn Madrid, Madrid, Spain
[2] LeanXcale, Madrid, Spain
Source
2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID) | 2017
Funding
European Union Horizon 2020
Keywords
HBase; MapReduce; HDFS; Split table; Massive Data load
DOI
10.1109/CCGRID.2017.124
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Big Data has become a pervasive technology for managing ever-increasing volumes of data. Among Big Data solutions, scalable data stores play an important role, especially key-value data stores, which scale to thousands of nodes. The typical workflow for Big Data applications includes two phases. The first loads the data into the data store, typically as part of an ETL (Extract-Transform-Load) process. The second is the processing of the data itself. BigTable and HBase are the preferred key-value solutions based on range-partitioned data stores. However, the loading phase is inefficient and creates a single-node bottleneck. In this paper, we identify and quantify this bottleneck and propose a tool for parallel massive data loading that removes it, making the full parallelism and throughput of the underlying key-value data store available during the loading phase as well. The proposed solution has been implemented as a tool for parallel massive data loading over HBase, the key-value data store of the Hadoop ecosystem.
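
The abstract does not describe how the tool works. As a purely illustrative sketch (not the authors' implementation), the following Python shows one common way to avoid a single-node loading bottleneck in a range-partitioned store: choose split keys from a sample of the row keys before loading, so that writes are spread over several regions instead of landing in one. The function names `choose_split_keys`, `region_for`, and `load_distribution` are hypothetical, and no HBase API is called.

```python
from bisect import bisect_right
from collections import Counter


def choose_split_keys(sample_keys, num_regions):
    """Pick up to num_regions - 1 split keys at evenly spaced positions
    of the sorted, de-duplicated key sample (hypothetical helper)."""
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    ordered = sorted(set(sample_keys))
    if num_regions == 1 or not ordered:
        return []
    splits = []
    for i in range(1, num_regions):
        idx = i * len(ordered) // num_regions
        key = ordered[min(idx, len(ordered) - 1)]
        # Skip duplicates so the split keys stay strictly increasing.
        if not splits or key > splits[-1]:
            splits.append(key)
    return splits


def region_for(key, split_keys):
    """Return the index of the region whose key range contains `key`.
    Region i covers keys in [split_keys[i - 1], split_keys[i])."""
    return bisect_right(split_keys, key)


def load_distribution(keys, split_keys):
    """Count how many keys each region would receive."""
    return Counter(region_for(k, split_keys) for k in keys)


if __name__ == "__main__":
    keys = [f"row{i:04d}" for i in range(1000)]
    # Without pre-splitting, every key maps to region 0.
    print(load_distribution(keys, []))
    # With four regions, the same keys are spread evenly.
    print(load_distribution(keys, choose_split_keys(keys, 4)))
```

With an empty split list every key falls into region 0, which mirrors the single-node bottleneck; with split keys derived from the sample, each region receives a comparable share, so parallel loaders can write to different regions at once.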
Pages: 776-779
Page count: 4