SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures

Cited by: 70
Authors
Floratou, Avrilia [1]
Minhas, Umar Farooq [1]
Ozcan, Fatma [1]
Affiliations
[1] IBM Almaden, Res Ctr, San Jose, CA 95120 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2014 / Volume 7 / Issue 12
Keywords
DOI
10.14778/2732977.2733002
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
SQL query processing for analytics over Hadoop data has recently gained significant traction. Among many systems providing some SQL support over Hadoop, Hive is the first native Hadoop system that uses an underlying framework such as MapReduce or Tez to process SQL-like statements. Impala, on the other hand, represents the new emerging class of SQL-on-Hadoop systems that exploit a shared-nothing parallel database architecture over Hadoop. Both systems optimize their data ingestion via columnar storage, and promote different file formats: ORC and Parquet. In this paper, we compare the performance of these two systems by conducting a set of cluster experiments using a TPC-H like benchmark and two TPC-DS inspired workloads. We also closely study the I/O efficiency of their columnar formats using a set of micro-benchmarks. Our results show that Impala is 3.3X to 4.4X faster than Hive on MapReduce and 2.1X to 2.8X faster than Hive on Tez for the overall TPC-H experiments. Impala is also 8.2X to 10X faster than Hive on MapReduce and about 4.3X faster than Hive on Tez for the TPC-DS inspired experiments. Through detailed analysis of experimental results, we identify the reasons for this performance gap and examine the strengths and limitations of each system.
Pages: 1295 - 1306
Page count: 12