Handling high-dimensional data with missing values by modern machine learning techniques

Cited by: 4

Authors:

Chen, Sixia [1]

Xu, Chao [1]

Affiliation:

[1] Univ Oklahoma, Dept Biostat & Epidemiol, Hlth Sci Ctr, Oklahoma City, OK 73126 USA

Source:

JOURNAL OF APPLIED STATISTICS | 2023 / Vol. 50 / Issue 03

Funding:

National Institutes of Health (USA);

Keywords:

Deep learning; high-dimensional data; imputation; machine learning; missing data; JACKKNIFE VARIANCE-ESTIMATION; MULTIPLE IMPUTATION; FRACTIONAL IMPUTATION; ITEM NONRESPONSE; INFERENCE; VARIABLES; SELECTION;

DOI:

10.1080/02664763.2022.2068514

Chinese Library Classification:

O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];

Subject classification codes:

020208 ; 070103 ; 0714 ;

Abstract:

High-dimensional data are regarded as one of the most important types of big data. They arise frequently in practice, for example in genetic, financial, and geographical studies. Missing values in high-dimensional data analysis should be handled properly to reduce nonresponse bias. We discuss modern machine learning techniques for handling missing data with high dimensionality, including penalized regression approaches, tree-based approaches, and deep learning (DL). Specifically, the proposed methods can be used to estimate general parameters of interest, including population means and percentiles, with imputation-based estimators, propensity score estimators, and doubly robust estimators. We compare these methods through limited simulation studies and a real application. Both the simulation studies and the real application show the benefits of the DL and XGBoost approaches over the other methods in terms of balancing bias and variance.
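The three estimator families named in the abstract can be illustrated for a population mean with a minimal sketch. The formulas below are the standard textbook forms and are an assumption here, not necessarily the exact estimators used in the paper. The function names, and the inputs `m_hat` (predicted outcomes) and `pi_hat` (predicted response probabilities), are hypothetical; in practice these predictions could come from any fitted model, such as a penalized regression, a tree-based method, or a neural network.

```python
from typing import Optional, Sequence


def imputation_mean(y: Sequence[Optional[float]],
                    observed: Sequence[int],
                    m_hat: Sequence[float]) -> float:
    """Imputation-based estimator: keep observed y, fill missing y with m_hat."""
    filled = [yi if d else mi for yi, d, mi in zip(y, observed, m_hat)]
    return sum(filled) / len(filled)


def propensity_mean(y: Sequence[Optional[float]],
                    observed: Sequence[int],
                    pi_hat: Sequence[float]) -> float:
    """Propensity score estimator: weight each respondent by 1 / pi_hat."""
    n = len(observed)
    total = sum(yi / pi for yi, d, pi in zip(y, observed, pi_hat) if d)
    return total / n


def doubly_robust_mean(y: Sequence[Optional[float]],
                       observed: Sequence[int],
                       m_hat: Sequence[float],
                       pi_hat: Sequence[float]) -> float:
    """Doubly robust estimator: outcome prediction plus a weighted residual
    correction computed on respondents only."""
    n = len(observed)
    total = 0.0
    for yi, d, mi, pi in zip(y, observed, m_hat, pi_hat):
        total += mi
        if d:
            total += (yi - mi) / pi
    return total / n


# Toy data (made up for illustration): two respondents, two nonrespondents.
y = [2.0, 4.0, None, None]
observed = [1, 1, 0, 0]
m_hat = [2.5, 3.5, 3.0, 5.0]    # hypothetical outcome-model predictions
pi_hat = [0.5, 0.5, 0.5, 0.5]   # hypothetical response probabilities

print(imputation_mean(y, observed, m_hat))             # 3.5
print(propensity_mean(y, observed, pi_hat))            # 3.0
print(doubly_robust_mean(y, observed, m_hat, pi_hat))  # 3.5
```

The doubly robust form remains consistent if either the outcome model or the response model is correctly specified, which is the usual motivation for combining the two.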

Pages: 786 - 804

Page count: 19

50 records in total

[31] Effective Handling of Missing Values in Datasets for Classification Using Machine Learning Methods
Palanivinayagam, Ashokkumar
Damasevicius, Robertas
INFORMATION, 2023, 14 (02)
[32] Regularization techniques for high-dimensional data analysis
Lu, Jiwen
Peng, Xi
Deng, Weihong
Mian, Ajmal
IMAGE AND VISION COMPUTING, 2017, 60 : 1 - 3
[33] The Validation and Assessment of Machine Learning: A Game of Prediction from High-Dimensional Data
Pers, Tune H.
Albrechtsen, Anders
Holst, Claus
Sorensen, Thorkild I. A.
Gerds, Thomas A.
PLOS ONE, 2009, 4 (08)
[34] A Sparse Learning Machine for High-Dimensional Data with Application to Microarray Gene Analysis
Cheng, Qiang
IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2010, 7 (04) : 636 - 646
[35] LINEAR MINIMUM MEAN-SQUARE ERROR ESTIMATION BASED ON HIGH-DIMENSIONAL DATA WITH MISSING VALUES
Zamanighomi, Mahdi
Wang, Zhengdao
Slavakis, Konstantinos
Giannakis, Georgios B.
2014 48TH ANNUAL CONFERENCE ON INFORMATION SCIENCES AND SYSTEMS (CISS), 2014
[36] The critical role of evaluation metrics in handling missing data in machine learning
Atoum, Ibrahim
INTERNATIONAL JOURNAL OF ADVANCED AND APPLIED SCIENCES, 2025, 12 (01) : 112 - 124
[37] Handling high-dimensional data in air pollution forecasting tasks
Domanska, Diana
Lukasik, Szymon
ECOLOGICAL INFORMATICS, 2016, 34 : 70 - 91
[38] High-Dimensional Matched Subspace Detection When Data are Missing
Balzano, Laura
Recht, Benjamin
Nowak, Robert
2010 IEEE INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY, 2010 : 1638 - 1642
[39] High-dimensional variable selection in regression and classification with missing data
Gao, Qi
Lee, Thomas C. M.
SIGNAL PROCESSING, 2017, 131 : 1 - 7
[40] A Deep Learning-Cuckoo Search Method for Missing Data Estimation in High-Dimensional Datasets
Leke, Collins
Ndjiongue, Alain Richard
Twala, Bhekisipho
Marwala, Tshilidzi
ADVANCES IN SWARM INTELLIGENCE, ICSI 2017, PT I, 2017, 10385 : 561 - 572