Spark SQL: Relational Data Processing in Spark

Cited by: 720
Authors
Armbrust, Michael [1]
Xin, Reynold S. [1]
Lian, Cheng [1]
Huai, Yin [1]
Liu, Davies [1]
Bradley, Joseph K. [1]
Meng, Xiangrui [1]
Kaftan, Tomer [3]
Franklin, Michael J. [1, 3]
Ghodsi, Ali [1]
Zaharia, Matei [1, 2]
Affiliations
[1] Databricks Inc, San Francisco, CA 94105 USA
[2] MIT CSAIL, Cambridge, MA USA
[3] Univ Calif Berkeley, AMPLab, Berkeley, CA USA
Source
SIGMOD'15: PROCEEDINGS OF THE 2015 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA | 2015
Keywords
Databases; Data Warehouse; Machine Learning; Spark; Hadoop;
DOI
10.1145/2723372.2742797
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g., declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g., machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g., schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
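The abstract's notion of an optimizer built from composable rules can be illustrated with a toy sketch. This is not Catalyst's code or API (Catalyst is written in Scala). The class names `Literal`, `Attribute`, and `Add` and the helpers `transform_up`, `fold_constants`, `drop_zero`, and `optimize` are all hypothetical, chosen only to show how small, independent rewrite rules might be applied repeatedly to an expression tree until nothing changes.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


# Hypothetical expression-tree nodes (illustrative only, not Catalyst classes).
@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Attribute:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Attribute, Add]
Rule = Callable[[Expr], Expr]


def transform_up(expr: Expr, rule: Rule) -> Expr:
    """Apply `rule` to every node, children first, and return the new tree."""
    if isinstance(expr, Add):
        expr = Add(transform_up(expr.left, rule), transform_up(expr.right, rule))
    return rule(expr)


def fold_constants(expr: Expr) -> Expr:
    """Rule: replace Add(Literal, Literal) with a single Literal."""
    if (isinstance(expr, Add)
            and isinstance(expr.left, Literal)
            and isinstance(expr.right, Literal)):
        return Literal(expr.left.value + expr.right.value)
    return expr


def drop_zero(expr: Expr) -> Expr:
    """Rule: simplify x + 0 and 0 + x to x."""
    if isinstance(expr, Add):
        if expr.left == Literal(0):
            return expr.right
        if expr.right == Literal(0):
            return expr.left
    return expr


def optimize(expr: Expr, rules: List[Rule], max_iterations: int = 100) -> Expr:
    """Apply all rules in order, repeating until the tree stops changing."""
    for _ in range(max_iterations):
        new_expr = expr
        for rule in rules:
            new_expr = transform_up(new_expr, rule)
        if new_expr == expr:
            return expr
        expr = new_expr
    return expr


# Represents the expression x + (1 + 2).
tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))
optimized = optimize(tree, [fold_constants, drop_zero])
```

Each rule in this sketch only knows about one pattern, and new rules can be added to the list without touching the existing ones, which is the sense of "composable" assumed here.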
Pages: 1383-1394
Page count: 12