Variable Selection in High-Dimensional Logistic Regression Models Using a Whitening Approach

Times cited: 0
Authors
Zhu, Wencan [1, 2]
Levy-Leduc, Celine [3 ]
Ternes, Nils [4 ]
Affiliations
[1] Univ Paris Saclay, UMR MIA Paris, AgroParisTech, INRAE, F-91190 Gif Sur Yvette, France
[2] Sanofi R&D, Dept Biostat & Programming, Bridgewater, NJ 08807 USA
[3] Univ Paris Cite, CNRS, Lab Probabil Stat Modelisat LPSM, F-75006 Paris, France
[4] Sanofi R&D, Dept Biostat & Programming, Bridgewater, NJ 08807 USA
Source
IEEE TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS | 2025 / Vol. 22 / No. 02
Keywords
Biomarkers; Correlation; Feature extraction; Biological system modeling; Bioinformatics; Logistic regression; Integrated circuit modeling; Diseases; Computational modeling; Computational biology; Feature selection; highly correlated predictors; binary classification; regularized approaches; CLASSIFICATION; REGULARIZATION; LASSO
DOI
10.1109/TCBBIO.2025.3539479
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
In bioinformatics, the rapid development of sequencing technology has made it possible to collect ever larger amounts of omics data. Classification based on omics data is one of the central problems in biomedical research. However, omics data usually have a limited sample size but a high feature dimension, and it is assumed that only a few features (biomarkers) are active, i.e., informative for discriminating between categories. Identifying active biomarkers for classification has therefore become fundamental to omics data analysis. Focusing on binary classification, we propose a feature selection method designed to handle high correlations between biomarkers. Our method, WLogit, consists of whitening the design matrix to remove the correlations between biomarkers, then using a penalized criterion adapted to the logistic regression model to select features. The results of numerical experiments suggest that WLogit can identify almost all active biomarkers even when the biomarkers are highly correlated, a setting in which the other methods fail, and that this leads to higher classification accuracy. The performance of WLogit is also evaluated on two publicly available datasets, where the resulting classifier outperformed other methods in terms of prediction accuracy. Our method is implemented in the WLogit R package, available from the Comprehensive R Archive Network (CRAN).
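The two-step idea described in the abstract (whiten the design matrix, then fit a penalized logistic regression) can be illustrated with a minimal Python sketch. This is not the WLogit package or its algorithm: the function names, the ridge term added to the empirical covariance, the proximal-gradient solver, and the top-k selection rule are all assumptions made for illustration only.

```python
import numpy as np


def whiten(X, ridge=1e-3):
    """Center X and decorrelate its columns with an estimate of Sigma^{-1/2}."""
    Xc = X - X.mean(axis=0)
    n, p = Xc.shape
    # Assumption: a small ridge keeps the empirical covariance invertible when p > n.
    sigma = Xc.T @ Xc / n + ridge * np.eye(p)
    evals, evecs = np.linalg.eigh(sigma)
    # Symmetric inverse square root: V diag(1/sqrt(evals)) V^T.
    sigma_inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T
    return Xc @ sigma_inv_sqrt, sigma_inv_sqrt


def l1_logistic(X, y, lam, n_iter=3000):
    """L1-penalized logistic regression via proximal gradient (ISTA).

    The intercept is left unpenalized. Returns (intercept, coefficients).
    """
    n, p = X.shape
    A = np.hstack([np.ones((n, 1)), X])
    # Step 1/L, where L = ||A||_2^2 / (4n) bounds the curvature of the mean loss.
    step = 4.0 * n / np.linalg.norm(A, 2) ** 2
    theta = np.zeros(p + 1)
    for _ in range(n_iter):
        z = np.clip(A @ theta, -30.0, 30.0)  # avoid overflow in exp
        grad = A.T @ (1.0 / (1.0 + np.exp(-z)) - y) / n
        theta = theta - step * grad
        # Soft-thresholding applies the L1 penalty to the slopes only.
        theta[1:] = np.sign(theta[1:]) * np.maximum(
            np.abs(theta[1:]) - step * lam, 0.0
        )
    return theta[0], theta[1:]


def select_features(X, y, lam=0.02, k=3, ridge=1e-3):
    """Whiten, fit the penalized model, map back, and keep the k largest effects."""
    X_white, sigma_inv_sqrt = whiten(X, ridge)
    _, gamma = l1_logistic(X_white, y, lam)
    # X_white @ gamma equals Xc @ (sigma_inv_sqrt @ gamma), so this maps the
    # coefficients back to the scale of the original (centered) features.
    beta = sigma_inv_sqrt @ gamma
    top_k = np.sort(np.argsort(-np.abs(beta))[:k])
    return top_k, beta
```

Because the back-transformed coefficients are generally not sparse, some thresholding step is needed after mapping back; the top-k rule here is only one hypothetical choice, and `lam`, `k`, and `ridge` would have to be tuned on real data.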
Pages: 800-807
Number of pages: 8