DIFF: a relational interface for large-scale data explanation

Cited by: 0
Authors
Firas Abuzaid
Peter Kraft
Sahaana Suri
Edward Gan
Eric Xu
Atul Shenoy
Asvin Ananthanarayan
John Sheu
Erik Meijer
Xi Wu
Jeff Naughton
Peter Bailis
Matei Zaharia
Affiliations
[1] Stanford University, Stanford DAWN Project
[2] Microsoft Inc
[3] Facebook Inc
[4] Google Inc
Source
The VLDB Journal | 2021 / Vol. 30
Keywords
Data exploration; Explanations; Big data; Data analytics; Databases; Feature selection; Query optimization
DOI
Not available
Abstract
A range of explanation engines assist data analysts by performing feature selection over increasingly high-volume and high-dimensional data, grouping and highlighting commonalities among data points. While useful in diverse tasks such as user behavior analytics, operational event processing, and root-cause analysis, today’s explanation engines are designed as stand-alone data processing tools that do not interoperate with traditional, SQL-based analytics workflows; this limits the applicability and extensibility of these engines. In response, we propose the DIFF operator, a relational aggregation operator that unifies the core functionality of these engines with declarative relational query processing. We implement both single-node and distributed versions of the DIFF operator in MB SQL, an extension of MacroBase, and demonstrate how DIFF can provide the same semantics as existing explanation engines while capturing a broad set of production use cases in industry, including at Microsoft and Facebook. Additionally, we illustrate how this declarative approach to data explanation enables new logical and physical query optimizations. We evaluate these optimizations on several real-world production applications and find that DIFF in MB SQL can outperform state-of-the-art engines by up to an order of magnitude.
Pages: 45 - 70
Number of pages: 25
Related papers
50 in total
  • [1] DIFF: a relational interface for large-scale data explanation
    Abuzaid, Firas
    Kraft, Peter
    Suri, Sahaana
    Gan, Edward
    Xu, Eric
    Shenoy, Atul
    Ananthanarayan, Asvin
    Sheu, John
    Meijer, Erik
    Wu, Xi
    Naughton, Jeff
    Bailis, Peter
    Zaharia, Matei
    VLDB JOURNAL, 2021, 30 (01) : 45 - 70
  • [2] DIFF: A Relational Interface for Large-Scale Data Explanation
    Abuzaid, Firas
    Kraft, Peter
    Suri, Sahaana
    Gan, Edward
    Xu, Eric
    Shenoy, Atul
    Ananthanarayan, Asvin
    Sheu, John
    Meijer, Erik
    Wu, Xi
    Naughton, Jeff
    Bailis, Peter
    Zaharia, Matei
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2018, 12 (04) : 419 - 432
  • [4] Large-Scale Data Pollution with Apache Spark
    Hildebrandt, Kai
    Panse, Fabian
    Wilcke, Niklas
    Ritter, Norbert
    IEEE TRANSACTIONS ON BIG DATA, 2020, 6 (02) : 396 - 411
  • [5] Optasia: A Relational Platform for Efficient Large-Scale Video Analytics
    Lu, Yao
    Chowdhery, Aakanksha
    Kandula, Srikanth
    PROCEEDINGS OF THE SEVENTH ACM SYMPOSIUM ON CLOUD COMPUTING (SOCC 2016), 2016, : 57 - 70
  • [6] Data Provenance in Large-Scale Distribution
    Zhu, Yunan
    Che, Wei
    Shan, Chao
    Zhao, Shen
    ARTIFICIAL INTELLIGENCE AND SECURITY, ICAIS 2022, PT III, 2022, 13340 : 28 - 42
  • [7] Intelligent approach for large-scale data mining
    Fouad, Khaled M.
    El-Bably, Doaa L.
    INTERNATIONAL JOURNAL OF COMPUTER APPLICATIONS IN TECHNOLOGY, 2020, 63 (1-2) : 93 - 113
  • [8] MAXIMIN EFFECTS IN INHOMOGENEOUS LARGE-SCALE DATA
    Meinshausen, Nicolai
    Buehlmann, Peter
    ANNALS OF STATISTICS, 2015, 43 (04) : 1801 - 1830
  • [9] Fast Plagiarism Detection in Large-Scale Data
    Szmit, Radoslaw
    BEYOND DATABASES, ARCHITECTURES AND STRUCTURES: TOWARDS EFFICIENT SOLUTIONS FOR DATA ANALYSIS AND KNOWLEDGE REPRESENTATION, 2017, 716 : 329 - 343
  • [10] Normalized Entropy Aggregation for Inhomogeneous Large-Scale Data
    Costa, Maria Conceicao
    Macedo, Pedro
    THEORY AND APPLICATIONS OF TIME SERIES ANALYSIS, 2019, : 19 - 29