Scalable aggregation predictive analytics

Cited by: 18
Authors
Anagnostopoulos, Christos [1]
Savva, Fotis [1]
Triantafillou, Peter [1]
Affiliation
[1] Univ Glasgow, Sch Comp Sci, Glasgow G12 8QQ, Lanark, Scotland
Funding
UK Engineering and Physical Sciences Research Council (EPSRC); EU Horizon 2020
Keywords
Query-driven predictive analytics; Predictive modeling; Aggregation operators; Set cardinality prediction; Regression vector quantization; Self-organizing maps;
DOI
10.1007/s10489-017-1093-y
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We introduce a predictive modeling solution that provides high-quality predictive analytics over aggregation queries in Big Data environments. Our predictive methodology is generally applicable in environments in which large-scale data owners may or may not restrict access to their data and allow only aggregation operators like COUNT to be executed over their data. In this context, our methodology uses historical queries and their answers to accurately predict the answers of ad-hoc queries. We focus on the widely used set-cardinality, i.e., COUNT, aggregation query, as COUNT is a fundamental operator for both internal data system optimizations and for aggregation-oriented data exploration and predictive analytics. We contribute a novel, query-driven Machine Learning (ML) model whose goals are to: (i) learn the query-answer space from past issued queries, (ii) associate the query space with local linear regression and associative function estimators, (iii) define query similarity, and (iv) predict the cardinality of the answer set of unseen incoming queries, referred to as the Set Cardinality Prediction (SCP) problem. Our ML model incorporates incremental ML algorithms to ensure high-quality prediction results. The significance of our contribution lies in that it (i) is the only query-driven solution applicable over general Big Data environments, which include restricted-access data, (ii) offers incremental learning adjusted for arriving ad-hoc queries, which is well suited for query-driven data exploration, and (iii) offers performance (in terms of scalability, SCP accuracy, processing time, and memory requirements) that is superior to that of data-centric approaches. We provide a comprehensive performance evaluation of our model, assessing its sensitivity, scalability, and efficiency for quality predictive analytics. In addition, we report on the development and incorporation of our ML model in Spark, showing its superior performance compared to Spark's COUNT method.
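The four goals listed in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the class name QueryDrivenCountModel, the (center, radius) query encoding, the Euclidean query similarity, and the parameters max_prototypes, spawn_distance, and lr are all assumptions chosen for illustration. The sketch quantizes the query space with incrementally updated prototypes, attaches a local linear model to each prototype, and predicts the COUNT of an unseen query from its nearest prototype, using only past (query, answer) pairs and never the underlying data.

```python
import math
import random


class QueryDrivenCountModel:
    """Illustrative sketch only: prototypes partition the query space and
    each prototype carries a local linear regression of COUNT on the query."""

    def __init__(self, max_prototypes=16, spawn_distance=0.5, lr=0.05):
        self.max_prototypes = max_prototypes  # cap on the number of prototypes
        self.spawn_distance = spawn_distance  # similarity threshold for a new prototype
        self.lr = lr                          # learning rate for all incremental updates
        self.prototypes = []                  # representative query vectors
        self.weights = []                     # per prototype: [bias, slope_1, ..., slope_d]

    @staticmethod
    def distance(q1, q2):
        # Assumed query similarity: Euclidean distance between query vectors.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(q1, q2)))

    def _nearest(self, q):
        return min(range(len(self.prototypes)),
                   key=lambda i: self.distance(q, self.prototypes[i]))

    def _local_predict(self, i, q):
        w, p = self.weights[i], self.prototypes[i]
        return w[0] + sum(wj * (qj - pj) for wj, qj, pj in zip(w[1:], q, p))

    def update(self, q, count):
        """Learn incrementally from one past query and its observed COUNT."""
        if self.prototypes:
            i = self._nearest(q)
            far = self.distance(q, self.prototypes[i]) > self.spawn_distance
        else:
            i, far = None, True
        if i is None or (far and len(self.prototypes) < self.max_prototypes):
            # Unfamiliar region of the query space: start a new local model.
            self.prototypes.append(list(q))
            self.weights.append([float(count)] + [0.0] * len(q))
            return
        w, p = self.weights[i], self.prototypes[i]
        err = count - self._local_predict(i, q)
        # One stochastic gradient step on the squared prediction error.
        w[0] += self.lr * err
        for j in range(len(q)):
            w[j + 1] += self.lr * err * (q[j] - p[j])
        # Vector-quantization step: move the prototype toward the query.
        for j in range(len(q)):
            p[j] += self.lr * (q[j] - p[j])

    def predict(self, q):
        """Predict the COUNT of an unseen query without touching the data."""
        if not self.prototypes:
            raise ValueError("model has not seen any query yet")
        return self._local_predict(self._nearest(q), q)


# Synthetic demonstration: 1-D data, range queries encoded as [center, radius].
random.seed(0)
data = [random.random() for _ in range(1000)]


def true_count(center, radius):
    return sum(1 for x in data if abs(x - center) <= radius)


model = QueryDrivenCountModel(max_prototypes=16, spawn_distance=0.15)
for _ in range(2000):
    q = [random.random(), random.uniform(0.01, 0.2)]
    model.update(q, true_count(*q))

q_new = [0.4, 0.1]
print("predicted:", round(model.predict(q_new)), "actual:", true_count(*q_new))
```

Only the training loop calls true_count, which stands in for answers already returned by the data owner; predict works from the learned prototypes alone, which is the property that makes a query-driven approach usable over restricted-access data.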
Pages: 2546-2567
Number of pages: 22