DIFF: a relational interface for large-scale data explanation

Cited by: 10
Authors
Abuzaid, Firas [1]
Kraft, Peter [1]
Suri, Sahaana [1]
Gan, Edward [1]
Xu, Eric [1]
Shenoy, Atul [2]
Ananthanarayan, Asvin [2]
Sheu, John [2]
Meijer, Erik [3]
Wu, Xi [4]
Naughton, Jeff [4]
Bailis, Peter [1]
Zaharia, Matei [1]
Affiliations
[1] Stanford Univ, Stanford DAWN Project, Stanford, CA 94305 USA
[2] Microsoft Inc, Redmond, WA USA
[3] Facebook Inc, Menlo Pk, CA USA
[4] Google Inc, Mountain View, CA USA
Keywords
Data exploration; Explanations; Big data; Data analytics; Databases; Feature selection; Query optimization; Index support
DOI
10.1007/s00778-020-00633-6
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
A range of explanation engines assist data analysts by performing feature selection over increasingly high-volume and high-dimensional data, grouping and highlighting commonalities among data points. While useful in diverse tasks such as user behavior analytics, operational event processing, and root-cause analysis, today's explanation engines are designed as stand-alone data processing tools that do not interoperate with traditional, SQL-based analytics workflows; this limits the applicability and extensibility of these engines. In response, we propose the DIFF operator, a relational aggregation operator that unifies the core functionality of these engines with declarative relational query processing. We implement both single-node and distributed versions of the DIFF operator in MB SQL, an extension of MacroBase, and demonstrate how DIFF can provide the same semantics as existing explanation engines while capturing a broad set of production use cases in industry, including at Microsoft and Facebook. Additionally, we illustrate how this declarative approach to data explanation enables new logical and physical query optimizations. We evaluate these optimizations on several real-world production applications and find that DIFF in MB SQL can outperform state-of-the-art engines by up to an order of magnitude.
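The abstract does not define the operator itself, so the following is only a rough illustration of the kind of comparison an explanation engine performs: given a "test" relation and a "control" relation, find attribute values that are over-represented in the test rows. It is a minimal sketch, not MB SQL's syntax or the authors' implementation. The function name `diff_explain`, the choice of support and risk ratio as difference metrics, the thresholds, and the sample rows are all assumptions made for illustration, and the sketch considers single attribute values only, not combinations.

```python
from collections import Counter


def diff_explain(test_rows, control_rows, attrs,
                 min_support=0.2, min_risk_ratio=2.0):
    """Hypothetical sketch: list single attribute values that are
    over-represented in test_rows relative to control_rows.

    Returns (attr, value, support, risk_ratio) tuples sorted by
    descending risk ratio.
    """
    n_test, n_control = len(test_rows), len(control_rows)
    # Count how often each (attribute, value) pair occurs in each relation.
    test_counts = Counter((a, r[a]) for r in test_rows for a in attrs)
    control_counts = Counter((a, r[a]) for r in control_rows for a in attrs)

    results = []
    for (attr, value), a_test in test_counts.items():
        a_control = control_counts.get((attr, value), 0)
        # Support: fraction of test rows carrying this value.
        support = a_test / n_test
        # Rows that do NOT carry this value.
        b_test = n_test - a_test
        b_control = n_control - a_control
        # Risk ratio (assumed metric): how much likelier a row is to be a
        # test row when it carries the value than when it does not.
        exposed_rate = a_test / (a_test + a_control)
        without = b_test + b_control
        unexposed_rate = b_test / without if without else 0.0
        risk_ratio = (exposed_rate / unexposed_rate
                      if unexposed_rate else float("inf"))
        if support >= min_support and risk_ratio >= min_risk_ratio:
            results.append((attr, value, support, risk_ratio))
    return sorted(results, key=lambda t: t[3], reverse=True)


# Made-up sample data: crash reports versus healthy sessions.
crashes = [
    {"device": "A", "version": "v2"},
    {"device": "A", "version": "v2"},
    {"device": "B", "version": "v2"},
    {"device": "A", "version": "v1"},
]
healthy = [
    {"device": "A", "version": "v1"},
    {"device": "B", "version": "v1"},
    {"device": "B", "version": "v1"},
    {"device": "A", "version": "v1"},
]

for row in diff_explain(crashes, healthy, ["device", "version"]):
    print(row)
```

With this sample data, only `version = v2` clears both thresholds: it appears in three of the four crash rows and in none of the healthy rows. A relational formulation would express the same comparison as an operator over two input relations, which is what lets it compose with ordinary SQL queries.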
Pages: 45-70
Page count: 26
Related papers
70 in total
  • [1] Abiteboul S., 1995, Foundations of databases, V8
  • [2] Agrawal R., 1994, P 20 INT C VER LARG, V1215, P487
  • [3] [Anonymous], 2015, CIDR
  • [4] [Anonymous], 2016, THESIS
  • [5] Antonakakis M, 2017, PROCEEDINGS OF THE 26TH USENIX SECURITY SYMPOSIUM (USENIX SECURITY '17), P1093
  • [6] Spark SQL: Relational Data Processing in Spark
    Armbrust, Michael
    Xin, Reynold S.
    Lian, Cheng
    Huai, Yin
    Liu, Davies
    Bradley, Joseph K.
    Meng, Xiangrui
    Kaftan, Tomer
    Franklin, Michael J.
    Ghodsi, Ali
    Zaharia, Matei
    [J]. SIGMOD'15: PROCEEDINGS OF THE 2015 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2015, : 1383 - 1394
  • [7] Avnur R, 2000, SIGMOD REC, V29, P261, DOI 10.1145/335191.335420
  • [8] Babu Shivnath, 2005, SIGMOD, P107, DOI 10.1145/1066157.1066171
  • [9] Bailis P., 2017, CIDR
  • [10] MacroBase: Prioritizing Attention in Fast Data
    Bailis, Peter
    Gan, Edward
    Madden, Samuel
    Narayanan, Deepak
    Rong, Kexin
    Suri, Sahaana
    [J]. SIGMOD'17: PROCEEDINGS OF THE 2017 ACM INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2017, : 541 - 556