Fast Partition-Based Cross-Validation With Centering and Scaling for XᵀX and XᵀY

Times Cited: 0
Authors
Galbo Engstrom, Ole-Christian [1 ,2 ,3 ]
Holm Jensen, Martin [1 ]
Affiliations
[1] FOSS Analyt A S, Res & Dev, Hillerod, Denmark
[2] Univ Copenhagen, Dept Comp Sci, Copenhagen, Denmark
[3] Univ Copenhagen, Dept Food Sci, Frederiksberg, Denmark
Keywords
Author keywords: algorithm design; centering and scaling; computational complexity; cross-validation; leakage by preprocessing
KeyWords Plus: PARTIAL LEAST-SQUARES; PLS-REGRESSION; DIFFERENTIATION; COMPONENTS; CHOICE; NUMBER
DOI
10.1002/cem.70008
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
We present algorithms that substantially accelerate partition-based cross-validation for machine learning models that require the matrix products XᵀX and XᵀY. Our algorithms have applications in model selection for, among others, principal component analysis (PCA), principal component regression (PCR), ridge regression (RR), ordinary least squares (OLS), and partial least squares (PLS). They support all combinations of column-wise centering and scaling of X and Y, and we demonstrate in our accompanying implementation that this adds only a manageable, practical constant over efficient variants without preprocessing. We prove the correctness of our algorithms under a fold-based partitioning scheme and show that their running time is independent of the number of folds; that is, they have the same time complexity as computing XᵀX and XᵀY, and space complexity equivalent to storing X, Y, XᵀX, and XᵀY. Importantly, unlike alternatives found in the literature, we avoid data leakage due to preprocessing. We achieve these results by eliminating redundant computations in the overlap between training partitions. Concretely, we show how to manipulate XᵀX and XᵀY using only samples from the validation partition to obtain the preprocessed training partition-wise XᵀX and XᵀY. To our knowledge, we are the first to derive correct and efficient cross-validation algorithms for any of the 16 combinations of column-wise centering and scaling, of which we also prove that only 12 give distinct matrix products.
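The core idea sketched in the abstract can be illustrated in NumPy: compute XᵀX and XᵀY once over all samples, then for each fold subtract only the validation rows' contribution to obtain the training-partition products, and recover the training means from global column sums so that centering uses training data only (no leakage). This is a minimal sketch under our own assumptions about the setup; the function name, the toy data, and the single-fold check are illustrative and are not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 100, 5, 2
X = rng.standard_normal((N, K))
Y = rng.standard_normal((N, M))

# Precompute once over the full data set.
XtX = X.T @ X          # K x K
XtY = X.T @ Y          # K x M
x_sum = X.sum(axis=0)  # global column sums, used to recover training means
y_sum = Y.sum(axis=0)

def training_products(val_idx):
    """Training-partition XtX and XtY with column-wise centering of X and Y,
    touching only the validation rows (illustrative sketch)."""
    Xv, Yv = X[val_idx], Y[val_idx]
    n_tr = N - len(val_idx)
    # Downdate the global products: remove the validation rows' contribution.
    XtX_tr = XtX - Xv.T @ Xv
    XtY_tr = XtY - Xv.T @ Yv
    # Training means from global sums minus validation sums (no leakage).
    mu_x = (x_sum - Xv.sum(axis=0)) / n_tr
    mu_y = (y_sum - Yv.sum(axis=0)) / n_tr
    # Centering identity: (X - 1 mu^T)^T (X - 1 mu^T) = X^T X - n mu mu^T.
    XtX_c = XtX_tr - n_tr * np.outer(mu_x, mu_x)
    XtY_c = XtY_tr - n_tr * np.outer(mu_x, mu_y)
    return XtX_c, XtY_c

# Sanity check against naive recomputation for one fold.
val_idx = np.arange(0, 20)
tr_idx = np.arange(20, N)
Xc = X[tr_idx] - X[tr_idx].mean(axis=0)
Yc = Y[tr_idx] - Y[tr_idx].mean(axis=0)
A, B = training_products(val_idx)
assert np.allclose(A, Xc.T @ Xc)
assert np.allclose(B, Xc.T @ Yc)
```

Because the per-fold work scales with the validation partition only, summing over all folds gives total work proportional to computing XᵀX and XᵀY once, independent of the number of folds, which matches the complexity claim in the abstract.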
Pages: 16
Related Papers
48 records
  • [1] Principal component analysis
    Abdi, Herve
    Williams, Lynne J.
    [J]. WILEY INTERDISCIPLINARY REVIEWS-COMPUTATIONAL STATISTICS, 2010, 2 (04): : 433 - 459
  • [2] A Tutorial on Near Infrared Spectroscopy and Its Calibration
    Agelet, Lidia Esteve
    Hurburgh, Charles R., Jr.
    [J]. CRITICAL REVIEWS IN ANALYTICAL CHEMISTRY, 2010, 40 (04) : 246 - 260
  • [3] Comparison of PLS algorithms when number of objects is much larger than number of variables
    Alin, Aylin
    [J]. STATISTICAL PAPERS, 2009, 50 (04) : 711 - 720
  • [4] A comparison of nine PLS1 algorithms
    Andersson, Martin
    [J]. JOURNAL OF CHEMOMETRICS, 2009, 23 (9-10) : 518 - 529
  • [5] Partial least squares for discrimination
    Barker, M
    Rayens, W
    [J]. JOURNAL OF CHEMOMETRICS, 2003, 17 (03) : 166 - 173
  • [6] STANDARD NORMAL VARIATE TRANSFORMATION AND DE-TRENDING OF NEAR-INFRARED DIFFUSE REFLECTANCE SPECTRA
    BARNES, RJ
    DHANOA, MS
    LISTER, SJ
    [J]. APPLIED SPECTROSCOPY, 1989, 43 (05) : 772 - 777
  • [7] Cross-validation of component models: A critical look at current methods
    Bro, R.
    Kjeldahl, K.
    Smilde, A. K.
    Kiers, H. A. L.
    [J]. ANALYTICAL AND BIOANALYTICAL CHEMISTRY, 2008, 390 (05) : 1241 - 1251
  • [8] Dayal BS, 1997, J CHEMOMETR, V11, P73, DOI 10.1002/(SICI)1099-128X(199701)11:1<73::AID-CEM435>3.0.CO;2-#
  • [10] COMMENTS ON THE PLS KERNEL ALGORITHM
    DEJONG, S
    TERBRAAK, CJF
    [J]. JOURNAL OF CHEMOMETRICS, 1994, 8 (02) : 169 - 174