All sparse PCA models are wrong, but some are useful. Part I: Computation of scores, residuals and explained variance

Cited by: 25
Authors
Camacho, J. [1]
Smilde, A. K. [2]
Saccenti, E. [3]
Westerhuis, J. A. [2]
Affiliations
[1] Univ Granada, Dept Signal Theory Telemat & Commun, Sch Comp Sci & Telecommun CITIC, Granada, Spain
[2] Univ Amsterdam, Biosyst Data Anal, Amsterdam, Netherlands
[3] Wageningen Univ & Res, Lab Syst & Synthet Biol, Wageningen, Netherlands
Keywords
Sparse principal component analysis; Explained variance; Scores; Residuals; Exploratory data analysis; SELECTION
DOI
10.1016/j.chemolab.2019.103907
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Sparse Principal Component Analysis (sPCA) is a popular matrix factorization approach based on Principal Component Analysis (PCA) that combines variance maximization and sparsity with the ultimate goal of improving data interpretation. When moving from PCA to sPCA, there are a number of implications that the practitioner needs to be aware of. A relevant one is that scores and loadings in sPCA may not be orthogonal. For this reason, the traditional way of computing scores, residuals and variance explained that is used in the classical PCA can lead to unexpected properties and therefore incorrect interpretations in sPCA. This also affects how sPCA components should be visualized. In this paper we illustrate this problem both theoretically and numerically using simulations for several state-of-the-art sPCA algorithms, and provide proper computation of the different elements mentioned. We show that sPCA approaches present disparate and limited performance when modeling noise-free, sparse data. In a follow-up paper, we discuss the theoretical properties that lead to this undesired behavior. We title this series of papers after the famous phrase of George Box "All models are wrong, but some are useful" with the same original meaning: sPCA models are only approximations of reality and have structural limitations that should be taken into account by the practitioner, but properly applied they can be useful tools to understand data.
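The abstract's point about non-orthogonal loadings can be illustrated with a small numerical sketch. The example below is not taken from the paper: the loading matrix `P` is a hypothetical, hand-made sparse matrix, the function names are illustrative, and the least-squares projection `T = X P (P^T P)^{-1}` is assumed here as one common way to obtain scores that are consistent with non-orthogonal loadings. The paper's exact formulation may differ.

```python
import numpy as np

def naive_scores(X, P):
    # Classical PCA formula; only a proper projection when P^T P = I.
    return X @ P

def ls_scores(X, P):
    # Least-squares scores T = X P (P^T P)^{-1}, computed with a linear
    # solve instead of an explicit inverse (P^T P is symmetric).
    return np.linalg.solve(P.T @ P, (X @ P).T).T

def explained_variance(X, T, P):
    # Fraction of the total sum of squares captured by T P^T,
    # measured through the residuals E = X - T P^T.
    E = X - T @ P.T
    return 1.0 - np.sum(E ** 2) / np.sum(X ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
X -= X.mean(axis=0)  # mean-center, as is usual before PCA

# Hypothetical sparse loadings (assumption for illustration only):
# unit-norm columns that overlap on the second variable, so P^T P != I.
P = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0],
              [0.0, 0.0]])
P /= np.linalg.norm(P, axis=0)

ev_naive = explained_variance(X, naive_scores(X, P), P)
ev_ls = explained_variance(X, ls_scores(X, P), P)
print(f"naive scores:         {ev_naive:.3f}")
print(f"least-squares scores: {ev_ls:.3f}")

# Sanity check: with orthonormal loadings the two formulas coincide.
Q, _ = np.linalg.qr(rng.standard_normal((4, 2)))
same_when_orthonormal = np.allclose(naive_scores(X, Q), ls_scores(X, Q))
```

Because the least-squares scores minimize the residual sum of squares for a fixed `P`, the residual-based explained variance they give lies between 0 and 1 and is never lower than the value obtained from the naive scores. With orthonormal loadings, as in classical PCA, both computations agree.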
Pages: 10