Finding the Best Box-Cox Transformation from Massive Datasets on Spark

Cited by: 0
Authors
Fang, Huayi [1]
Yang, Baijian [2]
Zhang, Tonglin [3]
Affiliations
[1] Amazon Com Inc, 410 Terry Ave North, Seattle, WA 98109 USA
[2] Purdue Univ, Dept Comp & Informat Technol, W Lafayette, IN 47907 USA
[3] Purdue Univ, Dept Stat, W Lafayette, IN 47907 USA
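Source
2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA) | 2017
Keywords
MODEL;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract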
In order to find the best linear regression model or polynomial regression model that fits the data, traditional methods have to read the whole dataset repeatedly and incur many unnecessary, slow I/O operations. Apache Spark can train regression models significantly more efficiently on distributed clusters thanks to its in-memory computing architecture. However, if the dataset itself or the temporary data produced during computation is larger than the total physical memory of a Spark system, in-memory data has to be spilled to secondary storage (such as hard drives or solid-state disks) and read back later when it is needed. These frequent I/O operations negatively affect the efficiency of Spark computation. Building on the per-row updatable data modeling concept we proposed previously, this work investigates finding the best Box-Cox transformation model on a Spark system. The major contribution of this work is that the information needed to compute a linear regression model or a polynomial regression model can be summarized in an Information Array. The size of this information array does not grow with the dataset; rather, it depends only on the number of features and the number of models that need to be considered. Because the information array is usually very small, it can be kept in memory at all times. With the proposed information array approach, the best linear or polynomial regression model can be obtained after one scan of the raw data. The experimental results show that this approach is fast and efficient on Spark. When training 41 models, the proposed Box-Cox Information Array method is about 8 times faster than the existing Spark APIs, and it achieves better prediction performance than linear regression models.
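The one-scan idea in the abstract can be illustrated with a small sketch. It is not the authors' Information Array layout, which this record does not describe; it is a minimal example that assumes the array holds the usual least-squares sufficient statistics (X'X, plus X'z and z'z for each candidate λ, where z is the Box-Cox transformed response) and that candidates are ranked by the standard Box-Cox profile log-likelihood. All function and field names (`new_stats`, `update`, `best_lambda`, and so on) are hypothetical, and plain Python stands in for a Spark job.

```python
import math

def boxcox(y, lam):
    """Box-Cox transform of a positive response y."""
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def new_stats(p, lambdas):
    """Empty accumulator; its size depends on p and len(lambdas), not on row count."""
    return {
        "n": 0,
        "sum_log_y": 0.0,
        "xtx": [[0.0] * p for _ in range(p)],   # X'X, shared by all lambdas
        "xty": [[0.0] * p for _ in lambdas],    # X'z, one vector per lambda
        "yty": [0.0 for _ in lambdas],          # z'z, one scalar per lambda
    }

def update(stats, x, y, lambdas):
    """Fold one row into the accumulator (x carries a leading 1.0 for the intercept)."""
    p = len(x)
    stats["n"] += 1
    stats["sum_log_y"] += math.log(y)
    for i in range(p):
        for j in range(p):
            stats["xtx"][i][j] += x[i] * x[j]
    for k, lam in enumerate(lambdas):
        z = boxcox(y, lam)
        stats["yty"][k] += z * z
        for i in range(p):
            stats["xty"][k][i] += x[i] * z
    return stats

def solve(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * beta[k] for k in range(r + 1, n))
        beta[r] = (m[r][n] - s) / m[r][r]
    return beta

def best_lambda(stats, lambdas):
    """Rank every candidate lambda from the accumulator alone; no second data scan."""
    n = stats["n"]
    best = None
    for k, lam in enumerate(lambdas):
        beta = solve(stats["xtx"], stats["xty"][k])
        rss = stats["yty"][k] - sum(b * v for b, v in zip(beta, stats["xty"][k]))
        rss = max(rss, 1e-12)  # guard against round-off when the fit is exact
        loglik = -0.5 * n * math.log(rss / n) + (lam - 1.0) * stats["sum_log_y"]
        if best is None or loglik > best[0]:
            best = (loglik, lam, beta)
    return best

# Toy data (made up for illustration): sqrt(y) is exactly linear in x.
lambdas = [-1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
stats = new_stats(2, lambdas)
for xi in range(1, 21):
    update(stats, [1.0, float(xi)], (1.0 + 2.0 * xi) ** 2, lambdas)
loglik, lam, beta = best_lambda(stats, lambdas)
print(lam, beta)
```

Because every field of the accumulator is a plain sum over rows, two accumulators built on different partitions could be combined by element-wise addition, which is what would make a distributed single pass possible under these assumptions.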
Pages: 2951-2960
Page count: 10
Related papers
22 in total
  • [1] [Anonymous], 2013, TECH REP
  • [2] [Anonymous], 2011, SIGMOD 11 P 2011 INT, DOI 10.1145/1989323.1989438
  • [3] Armbrust M, 2015, PROC VLDB ENDOW, V8, P1840
  • [4] Bhosale H.S., 2014, Int J Sci Res, V4, P1
  • [5] AN ANALYSIS OF TRANSFORMATIONS
    BOX, GEP
    COX, DR
    [J]. JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY, 1964, 26 (02) : 211 - 252
  • [6] Chiba T, 2016, INT SYM PERFORM ANAL, P112, DOI 10.1109/ISPASS.2016.7482079
  • [7] Choi IS, 2015, PROCEEDINGS 2015 IEEE INTERNATIONAL CONFERENCE ON BIG DATA, P1073, DOI 10.1109/BigData.2015.7363861
  • [8] Duan M., 2015, CONCURRENCY COMPUTAT
  • [9] Estimation for the Box-Cox transformation model without assuming parametric error distribution
    Foster, AM
    Tian, L
    Wei, LJ
    [J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2001, 96 (455) : 1097 - 1101
  • [10] Islam NS, 2015, PROCEEDINGS 2015 IEEE INTERNATIONAL CONFERENCE ON BIG DATA, P243, DOI 10.1109/BigData.2015.7363761