Stable Variable Selection for High-Dimensional Genomic Data with Strong Correlations

Cited: 0
Authors
Sarkar R. [1 ]
Manage S. [2 ]
Gao X. [3 ]
Affiliations
[1] Department of Mathematics and Statistics, University of North Carolina at Greensboro, PO Box 26170, 116 Petty Building, Greensboro, NC 27402
[2] Department of Mathematics, Texas A&M University, Blocker Building, 3368 TAMU, 155 Ireland Street, College Station, TX 77840
[3] Meta Platforms, Menlo Park, CA
Funding
U.S. National Science Foundation
Keywords
Bi-level sparsity; Minimax concave penalty; Stability; Strong correlation; Variable selection;
DOI
10.1007/s40745-023-00481-5
Abstract
High-dimensional genomic data studies often exhibit strong correlations, which result in instability and inconsistency in the estimates obtained with commonly used regularization approaches such as the Lasso and MCP. In this paper, we perform a comparative study of regularization approaches for variable selection under different correlation structures and propose a two-stage procedure, named rPGBS, to achieve stable variable selection in a range of strong-correlation settings. The approach repeatedly runs a two-stage hierarchical procedure consisting of random pseudo-group clustering followed by bi-level variable selection. Extensive simulation studies and analyses of real high-dimensional genomic datasets demonstrate the advantage of the proposed rPGBS method over some of the most widely used regularization methods. In particular, rPGBS yields more stable variable selection across a variety of correlation settings than recent methods addressing variable selection under strong correlation: Precision Lasso (Wang et al. in Bioinformatics 35:1181–1187, 2019) and Whitening Lasso (Zhu et al. in Bioinformatics 37:2238–2244, 2021). Moreover, rPGBS is computationally efficient across various settings. © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2023.
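The abstract describes rPGBS only at a high level (repeated random pseudo-group clustering followed by bi-level selection). As a rough illustration of the repeated pseudo-grouping idea only — not the authors' method — here is a minimal Python sketch in which a plain Lasso stands in for the paper's bi-level MCP selector; the function name, parameters, and aggregation by selection frequency are all hypothetical simplifications:

```python
import numpy as np
from sklearn.linear_model import Lasso

def rpgbs_sketch(X, y, n_reps=20, n_groups=10, alpha=0.1, seed=0):
    """Illustrative sketch of the repeated pseudo-group idea.

    Hypothetical simplification: a plain Lasso replaces the bi-level
    (group + within-group) MCP selector used in the paper. Selection
    frequencies across repetitions serve as a stability measure.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_reps):
        # Stage 1: random pseudo-group clustering of the p features.
        perm = rng.permutation(p)
        groups = np.array_split(perm, n_groups)
        # Stage 2: run the stand-in selector within each pseudo-group.
        for g in groups:
            fit = Lasso(alpha=alpha).fit(X[:, g], y)
            counts[g] += (fit.coef_ != 0)
    return counts / n_reps  # per-feature selection frequency in [0, 1]

# Toy usage: a few true signals among 50 features, one pair strongly correlated.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 50))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)  # strong correlation
y = X[:, 0] + X[:, 2] + X[:, 4] + rng.normal(size=100)
freq = rpgbs_sketch(X, y)
```

Features whose selection frequency exceeds a chosen threshold would then be reported as the stable selection; the paper's actual procedure differs in its selector and aggregation details.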
Pages: 1139–1164
Page count: 25
Related papers
50 records total
  • [21] Feature selection for high-dimensional temporal data
    Tsagris, Michail
    Lagani, Vincenzo
    Tsamardinos, Ioannis
    BMC BIOINFORMATICS, 2018, 19
  • [22] Sparse Bayesian variable selection in kernel probit model for analyzing high-dimensional data
    Yang, Aijun
    Tian, Yuzhu
    Li, Yunxian
    Lin, Jinguan
    COMPUTATIONAL STATISTICS, 2020, 35 (01) : 245 - 258
  • [24] An ensemble learning method for variable selection: application to high-dimensional data and missing values
    Bar-Hen, Avner
    Audigier, Vincent
    JOURNAL OF STATISTICAL COMPUTATION AND SIMULATION, 2022, 92 (16) : 3488 - 3510
  • [26] An Additive Sparse Penalty for Variable Selection in High-Dimensional Linear Regression Model
    Lee, Sangin
    COMMUNICATIONS FOR STATISTICAL APPLICATIONS AND METHODS, 2015, 22 (02) : 147 - 157
  • [27] Variable selection and estimation for high-dimensional spatial autoregressive models
    Cai, Liqian
    Maiti, Tapabrata
    SCANDINAVIAN JOURNAL OF STATISTICS, 2020, 47 (02) : 587 - 607
  • [28] Variable Selection Methods in High-dimensional Regression: A Simulation Study
    Shahriari, Shirin
    Faria, Susana
    Goncalves, A. Manuela
    COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, 2015, 44 (10) : 2548 - 2561
  • [29] Bayesian Regression Trees for High-Dimensional Prediction and Variable Selection
    Linero, Antonio R.
    JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2018, 113 (522) : 626 - 636
  • [30] Variable selection in high-dimensional double generalized linear models
    Xu, Dengke
    Zhang, Zhongzhan
    Wu, Liucang
    STATISTICAL PAPERS, 2014, 55 (02) : 327 - 347