Statistical Model Computation with UDFs

Cited by: 32
Authors
Ordonez, Carlos [1]
Affiliation
[1] Univ Houston, Dept Comp Sci Houston, Houston, TX 77204 USA
Funding
National Science Foundation (USA)
Keywords
DBMS; SQL; statistical model; UDF; RELATIONAL DBMS; DATABASES
DOI
10.1109/TKDE.2010.44
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Statistical models are generally computed outside a DBMS due to their mathematical complexity. We introduce techniques to efficiently compute fundamental statistical models inside a DBMS exploiting User-Defined Functions (UDFs). Specifically, we study the computation of linear regression, PCA, clustering, and Naive Bayes. Two summary matrices on the data set are mathematically shown to be essential for all models: the linear sum of points and the quadratic sum of cross products of points. We consider two layouts for the input data set: horizontal and vertical. We first introduce efficient SQL queries to compute summary matrices and score the data set. Based on the SQL framework, we introduce UDFs that work in a single table scan: aggregate UDFs to compute summary matrices for all models and a set of primitive scalar UDFs to score data sets. Experiments compare UDFs and SQL queries (running inside the DBMS) with C++ (analyzing exported files). In general, UDFs are faster than SQL queries and not much slower than C++. Considering export times, C++ is slower than UDFs and SQL queries. Statistical models based on precomputed summary matrices are computed in a few seconds. UDFs scale linearly and only require one table scan, highlighting their efficiency.
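The role of the two summary matrices can be illustrated with a minimal sketch. It is written in plain Python rather than as a DBMS UDF, and the function names (`summarize`, `mean_and_covariance`, `simple_regression`) and the sample rows are illustrative assumptions, not taken from the paper. The sketch accumulates the row count n, the linear sum L, and the quadratic sum of cross products Q in a single pass over the rows, then derives the mean, the covariance matrix, and a one-variable least-squares fit from those summaries alone, without a second pass over the data.

```python
def summarize(rows):
    """Accumulate n, L = sum of x_i, and Q = sum of x_i * x_i^T in one pass."""
    n = 0
    L = None
    Q = None
    for x in rows:
        d = len(x)
        if L is None:
            # Allocate the summaries once the dimensionality is known.
            L = [0.0] * d
            Q = [[0.0] * d for _ in range(d)]
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]
    return n, L, Q


def mean_and_covariance(n, L, Q):
    """Derive the mean vector and (population) covariance from n, L, Q only."""
    d = len(L)
    mean = [L[a] / n for a in range(d)]
    cov = [[Q[a][b] / n - mean[a] * mean[b] for b in range(d)]
           for a in range(d)]
    return mean, cov


def simple_regression(n, L, Q, x=0, y=1):
    """Least-squares slope and intercept of column y on column x,
    computed from the summaries without rereading the data."""
    sxx = Q[x][x] - L[x] * L[x] / n
    sxy = Q[x][y] - L[x] * L[y] / n
    slope = sxy / sxx
    intercept = (L[y] - slope * L[x]) / n
    return slope, intercept


# Hypothetical sample data lying exactly on y = 2x + 1.
rows = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
n, L, Q = summarize(rows)
mean, cov = mean_and_covariance(n, L, Q)
slope, intercept = simple_regression(n, L, Q)
print(n, L, Q)
print(mean, cov)
print(slope, intercept)
```

Because n, L, and Q are plain sums, partial results from separate chunks of the data can be added together, which is the property that makes them a natural fit for an aggregate function evaluated in one table scan. The covariance formula used here (Q/n minus the outer product of the mean) is the textbook one and can lose precision when values are large relative to their spread.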
Pages: 1752-1765
Page count: 14