Comparisons among several methods for handling missing data in principal component analysis (PCA)

Cited by: 8
Authors
Loisel, Sebastien [1]
Takane, Yoshio [2]
Affiliations
[1] Heriot Watt Univ, Dept Math, Edinburgh EH14 4AS, Midlothian, Scotland
[2] Univ Victoria, Dept Psychol, 5173 Del Monte Ave, Victoria, BC V8Y 1X3, Canada
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
Homogeneity criterion; Missing data passive (MDP) method; Alternating least squares (ALS) algorithm; Weighted low rank approximation (WLRA) method; Regularized PCA (RPCA) method; Trimmed scores regression (TSR) method; Data augmentation (DA) method; Congruence coefficient; QUESTIONNAIRE DATA; VALUES; IMPUTATION; OUTLIERS;
DOI
10.1007/s11634-018-0310-9
Chinese Library Classification
O21 [Probability and Mathematical Statistics]; C8 [Statistics]
Subject classification codes
020208 ; 070103 ; 0714 ;
Abstract
Missing data are prevalent in many data analytic situations, and those in which principal component analysis (PCA) is applied are no exception. The performance of five methods for handling missing data in PCA is investigated: the missing data passive (MDP) method, the weighted low rank approximation (WLRA) method, the regularized PCA (RPCA) method, the trimmed scores regression (TSR) method, and the data augmentation (DA) method. Three complete data sets of varying sizes were selected, in which missing data were created randomly and non-randomly. These data were then analyzed by the five methods, and their parameter recovery capability, measured by the mean congruence coefficient between loadings obtained from full and missing data, was compared as a function of the number of extracted components (dimensionality) and the proportion of missing data (censor rate). For randomly censored data, all five methods worked well when the dimensionality and censor rate were small. Their performance deteriorated as the dimensionality and censor rate increased, and the deterioration was distinctly faster with the WLRA method. The RPCA method worked best, and the DA method came a close second in terms of parameter recovery. However, the latter, as implemented here, was found to be extremely time-consuming. For non-randomly censored data, recovery was also affected by the degree of non-randomness in the censoring process. Again the RPCA method worked best, maintaining good to excellent recovery when the censor rate was small and the dimensionality of the solution was not excessive.
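The evaluation design described in the abstract (censor a complete data set, analyze the incomplete data, then compare loadings with the mean congruence coefficient) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the synthetic data, the function names, and in particular `impute_iterative_svd`, which is a generic iterative SVD imputation used as a stand-in and is not claimed to reproduce any of the five methods as implemented in the paper. Components are matched by column order and compared up to sign, which is only reasonable when the leading singular values are well separated.

```python
import numpy as np

def pca_loadings(X, n_components):
    """Loadings (variables x components) from the SVD of column-centered X."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n_components].T * s[:n_components] / np.sqrt(X.shape[0] - 1)

def mean_congruence(A, B):
    """Mean absolute Tucker congruence between matching columns of A and B."""
    num = np.sum(A * B, axis=0)
    den = np.sqrt(np.sum(A**2, axis=0) * np.sum(B**2, axis=0))
    return float(np.mean(np.abs(num / den)))

def censor_at_random(X, rate, rng):
    """Return a copy of X with roughly a fraction `rate` of entries set to NaN."""
    X_miss = X.copy()
    X_miss[rng.random(X.shape) < rate] = np.nan
    return X_miss

def impute_iterative_svd(X_miss, n_components, n_iter=100):
    """Generic iterative low-rank imputation (illustrative stand-in only)."""
    mask = np.isnan(X_miss)
    # Start from column means of the observed entries.
    X_fill = np.where(mask, np.nanmean(X_miss, axis=0), X_miss)
    for _ in range(n_iter):
        mu = X_fill.mean(axis=0)
        U, s, Vt = np.linalg.svd(X_fill - mu, full_matrices=False)
        low_rank = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mu
        # Overwrite only the missing cells; observed cells stay fixed.
        X_fill = np.where(mask, low_rank, X_miss)
    return X_fill

# Synthetic complete data with two underlying components (hypothetical example).
rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 2))
pattern = rng.normal(size=(2, 8))
X = scores @ pattern + 0.1 * rng.normal(size=(200, 8))

X_miss = censor_at_random(X, rate=0.1, rng=rng)   # censor rate of about 10%
X_hat = impute_iterative_svd(X_miss, n_components=2)
phi = mean_congruence(pca_loadings(X, 2), pca_loadings(X_hat, 2))
print(round(phi, 3))
```

Repeating the last four lines over a grid of `rate` and `n_components` values would give recovery curves of the kind the abstract describes, with dimensionality and censor rate as the two design factors.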
Pages: 495-518
Number of pages: 24