DIFF: a relational interface for large-scale data explanation

Cited by: 0
Authors
Firas Abuzaid
Peter Kraft
Sahaana Suri
Edward Gan
Eric Xu
Atul Shenoy
Asvin Ananthanarayan
John Sheu
Erik Meijer
Xi Wu
Jeff Naughton
Peter Bailis
Matei Zaharia
Affiliations
[1] Stanford University, Stanford DAWN Project
[2] Microsoft Inc
[3] Facebook Inc
[4] Google Inc
Source
The VLDB Journal | 2021 / Vol. 30
Keywords
Data exploration; Explanations; Big data; Data analytics; Databases; Feature selection; Query optimization
DOI
Not available
Abstract
A range of explanation engines assist data analysts by performing feature selection over increasingly high-volume and high-dimensional data, grouping and highlighting commonalities among data points. While useful in diverse tasks such as user behavior analytics, operational event processing, and root-cause analysis, today’s explanation engines are designed as stand-alone data processing tools that do not interoperate with traditional, SQL-based analytics workflows; this limits the applicability and extensibility of these engines. In response, we propose the DIFF operator, a relational aggregation operator that unifies the core functionality of these engines with declarative relational query processing. We implement both single-node and distributed versions of the DIFF operator in MB SQL, an extension of MacroBase, and demonstrate how DIFF can provide the same semantics as existing explanation engines while capturing a broad set of production use cases in industry, including at Microsoft and Facebook. Additionally, we illustrate how this declarative approach to data explanation enables new logical and physical query optimizations. We evaluate these optimizations on several real-world production applications and find that DIFF in MB SQL can outperform state-of-the-art engines by up to an order of magnitude.
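To make the idea of a relational explanation operator concrete, here is a minimal Python sketch of the kind of computation the abstract describes: comparing a "test" relation against a "control" relation and reporting attribute-value combinations that are disproportionately common in the test rows. This is an illustration only, not the MB SQL implementation. The function name `diff_explain`, its parameters, and the choice of support and risk ratio as difference metrics are assumptions made for this example; the abstract does not specify them.

```python
from collections import Counter
from itertools import combinations


def diff_explain(test, control, attrs, min_support=0.05,
                 min_risk_ratio=2.0, max_order=2):
    """Toy explanation search (hypothetical helper, not the paper's code).

    test, control: lists of dicts (rows). attrs: column names to explain over.
    Returns (combo, support, risk_ratio) tuples, where combo is a tuple of
    (column, value) pairs of size 1..max_order.
    """
    def count_combos(rows):
        # Count how many rows contain each attribute-value combination.
        counts = Counter()
        for row in rows:
            for k in range(1, max_order + 1):
                for cols in combinations(attrs, k):
                    counts[tuple((c, row[c]) for c in cols)] += 1
        return counts

    t_counts, c_counts = count_combos(test), count_combos(control)
    n_test, n_ctrl = len(test), len(control)
    results = []
    for combo, a in t_counts.items():
        support = a / n_test            # fraction of test rows matching combo
        if support < min_support:
            continue
        b = c_counts.get(combo, 0)      # control rows matching combo
        c = n_test - a                  # test rows NOT matching combo
        exposed = a + b                 # all rows matching combo
        unexposed = c + (n_ctrl - b)    # all rows not matching combo
        # Risk ratio = (a / exposed) / (c / unexposed), computed so that
        # a zero denominator yields infinity instead of raising.
        if c == 0 or unexposed == 0:
            risk_ratio = float("inf")
        else:
            risk_ratio = (a * unexposed) / (c * exposed)
        if risk_ratio >= min_risk_ratio:
            results.append((combo, support, risk_ratio))
    # Strongest explanations first.
    results.sort(key=lambda r: (-r[2], -r[1]))
    return results


if __name__ == "__main__":
    # Made-up rows for illustration only.
    crashes = [{"device": "ios", "version": "v2"},
               {"device": "ios", "version": "v2"},
               {"device": "android", "version": "v2"},
               {"device": "ios", "version": "v1"}]
    healthy = [{"device": "ios", "version": "v1"},
               {"device": "android", "version": "v1"},
               {"device": "android", "version": "v1"},
               {"device": "android", "version": "v2"}]
    for combo, support, rr in diff_explain(crashes, healthy,
                                           ["device", "version"],
                                           min_support=0.5):
        print(combo, support, rr)
```

The sketch enumerates every combination up to `max_order` by brute force. A declarative operator inside a query engine could instead push filters and pruning into the plan, which is the sort of logical and physical optimization the abstract refers to.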
Pages: 45-70
Page count: 25