Large-scale Data Exploration Using Explanatory Regression Functions

Cited by: 4
Authors
Savva, Fotis [1]
Anagnostopoulos, Christos [1]
Triantafillou, Peter [2]
Kolomvatsos, Kostas [3]
Affiliations
[1] Univ Glasgow, Glasgow G12 8QQ, Lanark, Scotland
[2] Univ Warwick, Coventry CV4 7AL, W Midlands, England
[3] Univ Thessaly, Volos 38221, Greece
Funding
UK Engineering and Physical Sciences Research Council (EPSRC); EU Horizon 2020
Keywords
Explainability; data exploration; aggregate query explanation; range query explanation; EXPLAINING QUERY ANSWERS;
DOI
10.1145/3410448
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Analysts wishing to explore multivariate data spaces typically issue queries involving selection operators, i.e., range or equality predicates, which define data subspaces of potential interest. They then apply aggregation functions, whose results determine a subspace's interestingness for further exploration and deeper analysis. However, Aggregate Query (AQ) results are scalars: they convey limited information about the queried subspaces and offer little explainability for exploratory analysis. Analysts have no way of identifying how these results are derived or how they change with respect to the query (input) parameter values. We address this shortcoming by contributing a novel explanation mechanism, based on machine learning, that helps analysts explore and understand data subspaces. We explain AQ results using functions obtained by solving a three-fold joint optimization problem; these functions take the form of explainable piecewise-linear regression functions. A key feature of the proposed solution is that the explanation functions are estimated from previously executed queries. These queries provide a coarse-grained overview of the underlying aggregate function (the one generating the AQ results) to be learned. Explanations for future, previously unseen AQs can be computed without accessing the underlying data, and can be used to further explore the queried data subspaces without issuing more queries to the backend analytics engine. We evaluate the explanation accuracy and efficiency through theoretically grounded metrics over real-world and synthetic datasets and query workloads.
Pages: 33