Comparisons among several methods for handling missing data in principal component analysis (PCA)

Cited by: 8
Authors
Loisel, Sebastien [1 ]
Takane, Yoshio [2 ]
Affiliations
[1] Heriot Watt Univ, Dept Math, Edinburgh EH14 4AS, Midlothian, Scotland
[2] Univ Victoria, Dept Psychol, 5173 Del Monte Ave, Victoria, BC V8Y 1X3, Canada
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
Homogeneity criterion; Missing data passive (MDP) method; Alternating least squares (ALS) algorithm; Weighted low rank approximation (WLRA) method; Regularized PCA (RPCA) method; Trimmed scores regression (TSR) method; Data augmentation (DA) method; Congruence coefficient; QUESTIONNAIRE DATA; VALUES; IMPUTATION; OUTLIERS;
DOI
10.1007/s11634-018-0310-9
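Chinese Library Classification number
O21 [Probability theory and mathematical statistics]; C8 [Statistics];
Subject classification codes
020208; 070103; 0714;
Abstract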
Missing data are prevalent in many data analytic situations, and those in which principal component analysis (PCA) is applied are no exception. The performance of five methods for handling missing data in PCA is investigated: the missing data passive (MDP) method, the weighted low rank approximation (WLRA) method, the regularized PCA (RPCA) method, the trimmed scores regression (TSR) method, and the data augmentation (DA) method. Three complete data sets of varying sizes were selected, in which missing data were created randomly and non-randomly. These data were then analyzed by the five methods, and their parameter recovery capability, as measured by the mean congruence coefficient between loadings obtained from full and missing data, was compared as a function of the number of extracted components (dimensionality) and the proportion of missing data (censor rate). For randomly censored data, all five methods worked well when the dimensionality and censor rate were small. Their performance deteriorated as the dimensionality and censor rate increased, and the deterioration was distinctly faster with the WLRA method. The RPCA method worked best, and the DA method came a close second in terms of parameter recovery. However, the latter, as implemented here, was found to be extremely time-consuming. For non-randomly censored data, the recovery was also affected by the degree of non-randomness in the censoring process. Again the RPCA method worked best, maintaining good to excellent recovery when the censor rate was small and the dimensionality of solutions was not excessive.
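The evaluation described in the abstract (censor a complete data set, re-estimate the loadings, and score recovery by the mean congruence coefficient) can be illustrated with a minimal sketch. This is not the authors' code. The function names are hypothetical, the column-mean imputation is a naive stand-in rather than any of the five methods compared, the data are synthetic, and the loading columns are assumed to be already matched between the two solutions, with sign flips handled by taking absolute values.

```python
import numpy as np

def pca_loadings(X, k):
    """Loadings of the first k principal components of a complete matrix X."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Scale each eigenvector by the standard deviation of its component.
    return Vt[:k].T * (s[:k] / np.sqrt(X.shape[0] - 1))

def mean_congruence(A, B):
    """Mean absolute Tucker congruence between matching columns of A and B."""
    num = np.sum(A * B, axis=0)
    den = np.sqrt(np.sum(A * A, axis=0) * np.sum(B * B, axis=0))
    return float(np.mean(np.abs(num / den)))

def censor_at_random(X, rate, rng):
    """Copy of X with roughly a fraction `rate` of entries set to NaN."""
    Xm = X.astype(float).copy()
    Xm[rng.random(X.shape) < rate] = np.nan
    return Xm

def mean_impute(Xm):
    """Naive baseline (assumption): replace each NaN with its column mean."""
    col_means = np.nanmean(Xm, axis=0)
    return np.where(np.isnan(Xm), col_means, Xm)

# Synthetic low-rank data plus noise, used only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 8))
X = X + 0.1 * rng.normal(size=(200, 8))

full = pca_loadings(X, 2)
censored = censor_at_random(X, 0.1, rng)
recovered = pca_loadings(mean_impute(censored), 2)
print(mean_congruence(full, recovered))
```

Repeating the last four lines over a grid of `rate` values and numbers of components `k` would give recovery curves of the kind the abstract describes, although with this naive baseline rather than the methods studied in the paper.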
Pages: 495-518
Page count: 24
Related papers
50 in total
  • [41] A Comparison of Three Popular Methods for Handling Missing Data: Complete-Case Analysis, Inverse Probability Weighting, and Multiple Imputation
    Little, Roderick J.
    Carpenter, James R.
    Lee, Katherine J.
    SOCIOLOGICAL METHODS & RESEARCH, 2024, 53 (03) : 1105 - 1135
  • [42] Robust skew-t factor analysis models for handling missing data
    Wang, Wan-Lun
    Liu, Min
    Lin, Tsung-I
    STATISTICAL METHODS AND APPLICATIONS, 2017, 26 (04) : 649 - 672
  • [43] SELF-PACED PROBABILISTIC PRINCIPAL COMPONENT ANALYSIS FOR DATA WITH OUTLIERS
    Zhao, Bowen
    Xiao, Xi
    Zhang, Wanpeng
    Zhang, Bin
    Gan, Guojun
    Xia, Shutao
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 3737 - 3741
  • [44] Using the Robust Principal Component Analysis to Identify Incorrect Aerological Data
    A. M. Kozin
    A. D. Lykov
    I. A. Vyazankin
    A. S. Vyazankin
    Russian Meteorology and Hydrology, 2021, 46 : 631 - 639
  • [45] Using the Robust Principal Component Analysis to Identify Incorrect Aerological Data
    Kozin, A. M.
    Lykov, A. D.
    Vyazankin, I. A.
    Vyazankin, A. S.
    RUSSIAN METEOROLOGY AND HYDROLOGY, 2021, 46 (09) : 631 - 639
  • [46] How handling missing data may impact conclusions: A comparison of six different imputation methods for categorical questionnaire data
    Stavseth, Marianne Riksheim
    Clausen, Thomas
    Roislien, Jo
    SAGE OPEN MEDICINE, 2019, 7
  • [47] Statistical analysis and handling of missing data in cluster randomised trials: protocol for a systematic review
    Fiero, Mallorie
    Huang, Shuang
    Bell, Melanie L.
    BMJ OPEN, 2015, 5 (05):
  • [48] Multivariate normality tests with two-step monotone missing data: a critical review with emphasis on the different methods of handling missing values
    Tsatsi, A.
    Batsidis, A.
    Economou, P.
    JOURNAL OF STATISTICAL COMPUTATION AND SIMULATION, 2024, 94 (16) : 3653 - 3677
  • [49] Handling missing data for the identification of charged particles in a multilayer detector: A comparison between different imputation methods
    Riggi, S.
    Riggi, D.
    Riggi, F.
    NUCLEAR INSTRUMENTS & METHODS IN PHYSICS RESEARCH SECTION A-ACCELERATORS SPECTROMETERS DETECTORS AND ASSOCIATED EQUIPMENT, 2015, 780 : 81 - 90
  • [50] Principal component analysis of turbulent combustion data: Data pre-processing and manifold sensitivity
    Parente, Alessandro
    Sutherland, James C.
    COMBUSTION AND FLAME, 2013, 160 (02) : 340 - 350