SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures

Cited by: 70

Authors:

Floratou, Avrilia [1]

Minhas, Umar Farooq [1]

Ozcan, Fatma [1]

Affiliations:

[1] IBM Almaden, Res Ctr, San Jose, CA 95120 USA

Source:

PROCEEDINGS OF THE VLDB ENDOWMENT | 2014 / Vol. 7 / Issue 12

DOI:

10.14778/2732977.2733002

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

SQL query processing for analytics over Hadoop data has recently gained significant traction. Among many systems providing some SQL support over Hadoop, Hive is the first native Hadoop system that uses an underlying framework such as MapReduce or Tez to process SQL-like statements. Impala, on the other hand, represents the new emerging class of SQL-on-Hadoop systems that exploit a shared-nothing parallel database architecture over Hadoop. Both systems optimize their data ingestion via columnar storage, and promote different file formats: ORC and Parquet. In this paper, we compare the performance of these two systems by conducting a set of cluster experiments using a TPC-H-like benchmark and two TPC-DS-inspired workloads. We also closely study the I/O efficiency of their columnar formats using a set of micro-benchmarks. Our results show that Impala is 3.3X to 4.4X faster than Hive on MapReduce and 2.1X to 2.8X faster than Hive on Tez for the overall TPC-H experiments. Impala is also 8.2X to 10X faster than Hive on MapReduce and about 4.3X faster than Hive on Tez for the TPC-DS-inspired experiments. Through detailed analysis of experimental results, we identify the reasons for this performance gap and examine the strengths and limitations of each system.

Pages: 1295-1306

Page count: 12
