Comparisons among several methods for handling missing data in principal component analysis (PCA)

Cited by: 8
Authors
Loisel, Sebastien [1]
Takane, Yoshio [2]
Affiliations
[1] Heriot Watt Univ, Dept Math, Edinburgh EH14 4AS, Midlothian, Scotland
[2] Univ Victoria, Dept Psychol, 5173 Del Monte Ave, Victoria, BC V8Y 1X3, Canada
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
Homogeneity criterion; Missing data passive (MDP) method; Alternating least squares (ALS) algorithm; Weighted low rank approximation (WLRA) method; Regularized PCA (RPCA) method; Trimmed scores regression (TSR) method; Data augmentation (DA) method; Congruence coefficient; Questionnaire data; Values; Imputation; Outliers
DOI
10.1007/s11634-018-0310-9
Chinese Library Classification
O21 [Probability and Mathematical Statistics]; C8 [Statistics]
Subject classification codes
020208; 070103; 0714
Abstract
Missing data are prevalent in many data analytic situations, and those in which principal component analysis (PCA) is applied are no exception. The performance of five methods for handling missing data in PCA is investigated: the missing data passive (MDP) method, the weighted low rank approximation (WLRA) method, the regularized PCA (RPCA) method, the trimmed scores regression (TSR) method, and the data augmentation (DA) method. Three complete data sets of varying sizes were selected, in which missing data were created both randomly and non-randomly. These data were then analyzed by the five methods, and their parameter recovery capability, measured by the mean congruence coefficient between loadings obtained from the full and the missing data, was compared as a function of the number of extracted components (dimensionality) and the proportion of missing data (censor rate). For randomly censored data, all five methods worked well when the dimensionality and censor rate were small. Their performance deteriorated as the dimensionality and censor rate increased, and the deterioration was distinctly faster with the WLRA method. The RPCA method worked best, and the DA method came a close second in terms of parameter recovery. However, the latter, as implemented here, was found to be extremely time-consuming. For non-randomly censored data, the recovery was also affected by the degree of non-randomness in the censoring process. Again the RPCA method worked best, maintaining good to excellent recovery when the censor rate was small and the dimensionality of the solution was not excessive.
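
The recovery measure named in the abstract can be illustrated with a minimal Python sketch. It assumes the usual (Tucker) definition of the congruence coefficient, the cosine between two loading vectors, and it assumes that the components of the two solutions are already matched column by column; the absolute value is taken only to absorb the sign indeterminacy of PCA components. The function names `congruence`, `mean_congruence`, and `censor_at_random` are hypothetical and are not taken from the paper, which may align the loadings differently (for example, by rotation) before comparing them.

```python
import numpy as np


def congruence(x, y):
    """Congruence coefficient: cosine of the angle between two loading vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))


def mean_congruence(loadings_full, loadings_missing):
    """Mean absolute congruence over matched columns (components).

    Assumption: column k of both matrices refers to the same component.
    """
    a = np.asarray(loadings_full, dtype=float)
    b = np.asarray(loadings_missing, dtype=float)
    coeffs = [abs(congruence(a[:, k], b[:, k])) for k in range(a.shape[1])]
    return float(np.mean(coeffs))


def censor_at_random(data, censor_rate, rng):
    """Return a copy of data in which each cell is set to NaN with probability censor_rate."""
    censored = np.array(data, dtype=float)
    mask = rng.random(censored.shape) < censor_rate
    censored[mask] = np.nan
    return censored
```

In a study of this kind, one would censor a complete data matrix with `censor_at_random`, fit PCA with one of the missing-data methods, and pass the resulting loadings together with the full-data loadings to `mean_congruence`; values near 1 indicate good recovery.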
Pages: 495-518
Number of pages: 24