Handling high-dimensional data with missing values by modern machine learning techniques

Cited by: 4
Authors
Chen, Sixia [1]
Xu, Chao [1]
Affiliations
[1] Univ Oklahoma, Dept Biostat & Epidemiol, Hlth Sci Ctr, Oklahoma City, OK 73126 USA
Funding
National Institutes of Health (USA);
Keywords
Deep learning; high-dimensional data; imputation; machine learning; missing data; JACKKNIFE VARIANCE-ESTIMATION; MULTIPLE IMPUTATION; FRACTIONAL IMPUTATION; ITEM NONRESPONSE; INFERENCE; VARIABLES; SELECTION;
DOI
10.1080/02664763.2022.2068514
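Chinese Library Classification number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Subject classification codes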
020208 ; 070103 ; 0714 ;
Abstract
High-dimensional data are regarded as one of the most important types of big data in practice. They arise frequently in areas such as genetic, financial, and geographical studies. Missing data in high-dimensional data analysis should be handled properly to reduce nonresponse bias. We discuss modern machine learning techniques for handling missing data with high dimensionality, including penalized regression approaches, tree-based approaches, and deep learning (DL). Specifically, the proposed methods can be used to estimate general parameters of interest, including population means and percentiles, with imputation-based estimators, propensity score estimators, and doubly robust estimators. We compare these methods through limited simulation studies and a real application. Both the simulation studies and the real application show the benefits of the DL and XGBoost approaches over the other methods in terms of balancing bias and variance.
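The three estimator families named in the abstract can be illustrated for a population mean with a minimal sketch. It uses textbook forms (mean imputation by model prediction, a Hajek-type inverse propensity weighted mean, and an augmented inverse propensity weighted mean), which are not necessarily the exact forms used in the paper. The function names and the inputs `m_hat` (outcome predictions) and `p_hat` (estimated response probabilities) are hypothetical; in practice they would come from whichever learner is chosen, such as a penalized regression, a tree-based model, or a neural network.

```python
from typing import Optional, Sequence


def imputation_mean(y: Sequence[Optional[float]],
                    observed: Sequence[bool],
                    m_hat: Sequence[float]) -> float:
    """Mean of y with each missing value replaced by its prediction m_hat[i]."""
    n = len(observed)
    return sum(y[i] if observed[i] else m_hat[i] for i in range(n)) / n


def propensity_mean(y: Sequence[Optional[float]],
                    observed: Sequence[bool],
                    p_hat: Sequence[float]) -> float:
    """Hajek-type mean: respondents weighted by 1 / estimated response probability."""
    idx = [i for i in range(len(observed)) if observed[i]]
    numerator = sum(y[i] / p_hat[i] for i in idx)
    denominator = sum(1.0 / p_hat[i] for i in idx)
    return numerator / denominator


def doubly_robust_mean(y: Sequence[Optional[float]],
                       observed: Sequence[bool],
                       m_hat: Sequence[float],
                       p_hat: Sequence[float]) -> float:
    """Augmented estimator: predictions plus inverse-weighted respondent residuals."""
    n = len(observed)
    total = 0.0
    for i in range(n):
        total += m_hat[i]
        if observed[i]:
            total += (y[i] - m_hat[i]) / p_hat[i]
    return total / n


# Toy inputs (made up for illustration): None marks a missing outcome.
y = [2.0, None, 4.0, None]
observed = [True, False, True, False]
m_hat = [2.5, 3.0, 3.5, 5.0]   # hypothetical outcome-model predictions
p_hat = [0.5, 0.5, 0.5, 0.5]   # hypothetical response-model probabilities

print(imputation_mean(y, observed, m_hat))
print(propensity_mean(y, observed, p_hat))
print(doubly_robust_mean(y, observed, m_hat, p_hat))
```

The doubly robust form combines both nuisance models, so it remains consistent if either the outcome model or the response model is correctly specified, which is the usual motivation for considering it alongside the other two.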
Pages: 786-804
Page count: 19
Related papers
50 in total
  • [31] Effective Handling of Missing Values in Datasets for Classification Using Machine Learning Methods
    Palanivinayagam, Ashokkumar
    Damasevicius, Robertas
    INFORMATION, 2023, 14 (02)
  • [32] Regularization techniques for high-dimensional data analysis
    Lu, Jiwen
    Peng, Xi
    Deng, Weihong
    Mian, Ajmal
    IMAGE AND VISION COMPUTING, 2017, 60 : 1 - 3
  • [33] The Validation and Assessment of Machine Learning: A Game of Prediction from High-Dimensional Data
    Pers, Tune H.
    Albrechtsen, Anders
    Holst, Claus
    Sorensen, Thorkild I. A.
    Gerds, Thomas A.
    PLOS ONE, 2009, 4 (08)
  • [34] A Sparse Learning Machine for High-Dimensional Data with Application to Microarray Gene Analysis
    Cheng, Qiang
    IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2010, 7 (04) : 636 - 646
  • [35] LINEAR MINIMUM MEAN-SQUARE ERROR ESTIMATION BASED ON HIGH-DIMENSIONAL DATA WITH MISSING VALUES
    Zamanighomi, Mahdi
    Wang, Zhengdao
    Slavakis, Konstantinos
    Giannakis, Georgios B.
    2014 48TH ANNUAL CONFERENCE ON INFORMATION SCIENCES AND SYSTEMS (CISS), 2014
  • [36] The critical role of evaluation metrics in handling missing data in machine learning
    Atoum, Ibrahim
    INTERNATIONAL JOURNAL OF ADVANCED AND APPLIED SCIENCES, 2025, 12 (01) : 112 - 124
  • [37] Handling high-dimensional data in air pollution forecasting tasks
    Domanska, Diana
    Lukasik, Szymon
    ECOLOGICAL INFORMATICS, 2016, 34 : 70 - 91
  • [38] High-Dimensional Matched Subspace Detection When Data are Missing
    Balzano, Laura
    Recht, Benjamin
    Nowak, Robert
    2010 IEEE INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY, 2010 : 1638 - 1642
  • [39] High-dimensional variable selection in regression and classification with missing data
    Gao, Qi
    Lee, Thomas C. M.
    SIGNAL PROCESSING, 2017, 131 : 1 - 7
  • [40] A Deep Learning-Cuckoo Search Method for Missing Data Estimation in High-Dimensional Datasets
    Leke, Collins
    Ndjiongue, Alain Richard
    Twala, Bhekisipho
    Marwala, Tshilidzi
    ADVANCES IN SWARM INTELLIGENCE, ICSI 2017, PT I, 2017, 10385 : 561 - 572