Stable Variable Selection for High-Dimensional Genomic Data with Strong Correlations

Cited by: 0

Authors:

Sarkar R. [1]

Manage S. [2]

Gao X. [3]

Affiliations:

[1] Department of Mathematics and Statistics, University of North Carolina at Greensboro, PO Box 26170, 116 Petty Building, Greensboro, 27402, NC

[2] Department of Mathematics, Texas A&M University, Blocker Building, 3368 TAMU, 155 Ireland Street, College Station, 77840, TX

[3] Meta Platforms, Menlo Park, CA

Source:

Annals of Data Science | 2024 / Vol. 11 / Issue 04

Funding:

National Science Foundation (USA);

Keywords:

Bi-level sparsity; Minimax concave penalty; Stability; Strong correlation; Variable selection;

DOI:

10.1007/s40745-023-00481-5

Abstract:

High-dimensional genomic data often exhibit strong correlations, which lead to unstable and inconsistent estimates from commonly used regularization approaches such as the Lasso and MCP. In this paper, we perform a comparative study of regularization approaches for variable selection under different correlation structures and propose a two-stage procedure named rPGBS to address the issue of stable variable selection in various strong correlation settings. The approach repeatedly runs a two-stage hierarchical procedure consisting of random pseudo-group clustering and bi-level variable selection. Extensive simulation studies and analyses of real high-dimensional genomic datasets demonstrate the advantage of the proposed rPGBS method over some of the most widely used regularization methods. In particular, rPGBS yields more stable variable selection across a variety of correlation settings than recent methods that address variable selection with strong correlations: Precision Lasso (Wang et al. in Bioinformatics 35:1181–1187, 2019) and Whitening Lasso (Zhu et al. in Bioinformatics 37:2238–2244, 2021). Moreover, rPGBS is shown to be computationally efficient across various settings. © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2023.

Pages: 1139-1164

Page count: 25

[1] Variable selection for high-dimensional incomplete data
Liang, Lixing
Zhuang, Yipeng
Yu, Philip L. H.
COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2024, 192
[2] VARIABLE SELECTION AND PREDICTION WITH INCOMPLETE HIGH-DIMENSIONAL DATA
Liu, Ying
Wang, Yuanjia
Feng, Yang
Wall, Melanie M.
ANNALS OF APPLIED STATISTICS, 2016, 10 (01) : 418 - 450
[3] A Variable Selection Method for High-Dimensional Survival Data
Giordano, Francesco
Milito, Sara
Restaino, Marialuisa
MATHEMATICAL AND STATISTICAL METHODS FOR ACTUARIAL SCIENCES AND FINANCE, MAF 2022, 2022, : 303 - 308
[4] Exploiting the ensemble paradigm for stable feature selection: A case study on high-dimensional genomic data
Pes, Barbara
Dessi, Nicoletta
Angioni, Marta
INFORMATION FUSION, 2017, 35 : 132 - 147
[5] Stochastic variational variable selection for high-dimensional microbiome data
Dang, Tung
Kumaishi, Kie
Usui, Erika
Kobori, Shungo
Sato, Takumi
Toda, Yusuke
Yamasaki, Yuji
Tsujimoto, Hisashi
Ichihashi, Yasunori
Iwata, Hiroyoshi
MICROBIOME, 2022, 10 (01)
[6] Variable selection for longitudinal data with high-dimensional covariates and dropouts
Zheng, Xueying
Fu, Bo
Zhang, Jiajia
Qin, Guoyou
JOURNAL OF STATISTICAL COMPUTATION AND SIMULATION, 2018, 88 (04) : 712 - 725
[7] Stochastic variational variable selection for high-dimensional microbiome data
Tung Dang
Kie Kumaishi
Erika Usui
Shungo Kobori
Takumi Sato
Yusuke Toda
Yuji Yamasaki
Hisashi Tsujimoto
Yasunori Ichihashi
Hiroyoshi Iwata
Microbiome, 10
[8] Comparison of variable selection methods for high-dimensional survival data with competing events
Gilhodes, Julia
Zemmour, Christophe
Ajana, Soufiane
Martinez, Alejandra
Delord, Jean-Pierre
Leconte, Eve
Boher, Jean-Marie
Filleron, Thomas
COMPUTERS IN BIOLOGY AND MEDICINE, 2017, 91 : 159 - 167
[9] Precision Lasso: accounting for correlations and linear dependencies in high-dimensional genomic data
Wang, Haohan
Lengerich, Benjamin J.
Aragam, Bryon
Xing, Eric P.
BIOINFORMATICS, 2019, 35 (07) : 1181 - 1187
[10] PALLADIO: a parallel framework for robust variable selection in high-dimensional data
Barbieri, Matteo
Fiorini, Samuele
Tomasi, Federico
Barla, Annalisa
PROCEEDINGS OF PYHPC2016: 6TH WORKSHOP ON PYTHON FOR HIGH-PERFORMANCE AND SCIENTIFIC COMPUTING, 2016, : 19 - 26