VSOLassoBag: a variable-selection oriented LASSO bagging algorithm for biomarker discovery in omic-based translational research

Cited by: 17
Authors
Liang, Jiaqi [1, 2]
Wang, Chaoye [1]
Zhang, Di [3]
Xie, Yubin [4]
Zeng, Yanru [2]
Li, Tianqin [5]
Zuo, Zhixiang [1]
Ren, Jian [1]
Zhao, Qi [1]
Affiliations
[1] Sun Yat Sen Univ, Collaborat Innovat Ctr Canc Med, State Key Lab Oncol South China, Canc Ctr, Guangzhou 510060, Guangdong, Peoples R China
[2] Sun Yat Sen Univ, Sch Life Sci, State Key Lab Biocontrol, Guangzhou 510275, Guangdong, Peoples R China
[3] Sun Yat Sen Univ, Affiliated Hosp 6, Dept Coloproctol Surg, Guangdong Prov Key Lab Colorectal & Pelv Floor Dis, Guangzhou 510655, Guangdong, Peoples R China
[4] Sun Yat Sen Univ, Affiliated Hosp 1, Precis Med Inst, Guangzhou 510060, Guangdong, Peoples R China
[5] Carnegie Mellon Univ, Sch Comp Sci, Comp Sci Dept, Pittsburgh, PA 15213 USA
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China
Keywords
Feature selection; LASSO bagging algorithm; Biomarker discovery; Omics data; Generalized linear models; Regression shrinkage; Adaptive LASSO; Classification
DOI
10.1016/j.jgg.2022.12.005
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Screening biomolecular markers from high-dimensional biological data is a long-standing task in biomedical translational research. With its advantages in both feature shrinkage and biological interpretability, the Least Absolute Shrinkage and Selection Operator (LASSO) algorithm is one of the most popular methods for clinical biomarker development. However, in practice, applying LASSO to high-dimensional, low-sample-size omics data often yields an excessive number of predictive variables, leading to overfitting. Here, we present VSOLassoBag, a wrapped LASSO approach that integrates an ensemble learning strategy to help select efficient and stable variables with high confidence from omics-based data. Using a bagging strategy in combination with a parametric method or an inflection point search method, VSOLassoBag integrates and votes on variables generated from multiple LASSO models to determine the optimal candidates. Applications of VSOLassoBag to both simulated and real-world datasets show that the algorithm can effectively identify markers for either case-control binary classification or prognosis prediction. In addition, compared with multiple existing algorithms, VSOLassoBag shows comparable performance under different scenarios while selecting fewer features. In summary, VSOLassoBag, which is available at https://seqworld.com/VSOLassoBag/ under the GPL v3 license, provides an alternative strategy for selecting reliable biomarkers from high-dimensional omics data. For users' convenience, we implement VSOLassoBag as an R package that provides multithreading computing configurations. Copyright (c) 2022, The Authors. Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, and Genetics Society of China. Published by Elsevier Limited and Science Press.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
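The bagging-and-voting idea described in the abstract can be illustrated with a minimal sketch. This is not the VSOLassoBag R package or its API: every function name below is hypothetical, the model is a plain Gaussian LASSO without intercept fitted by coordinate descent (the paper targets binary classification and prognosis prediction), and the "largest drop" cutoff is only a crude stand-in for the parametric and inflection point search methods the abstract mentions.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=50):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1, no intercept."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X b, with b = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the partial residual (feature j added back).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def lasso_bag_counts(X, y, lam, n_bags=30, seed=0):
    """Fit LASSO on bootstrap resamples; count how often each variable is nonzero."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j, b in enumerate(beta):
            if b != 0.0:
                counts[j] += 1
    return counts


def select_by_largest_drop(counts):
    """Illustrative cutoff (an assumption): keep variables ranked above the
    largest drop in the sorted selection-frequency curve."""
    order = sorted(range(len(counts)), key=lambda j: -counts[j])
    drops = [counts[order[k]] - counts[order[k + 1]] for k in range(len(order) - 1)]
    cut = max(range(len(drops)), key=lambda k: drops[k])
    return sorted(order[: cut + 1])


def make_demo_data(n=60, p=15, seed=1):
    """Synthetic data (hypothetical): only features 0 and 1 carry signal."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0, 0.5) for row in X]
    return X, y


X, y = make_demo_data()
counts = lasso_bag_counts(X, y, lam=0.5, n_bags=30, seed=0)
selected = select_by_largest_drop(counts)
print("selection counts:", counts)
print("selected variables:", selected)
```

In this toy setting, variables that are selected in nearly every bootstrap model separate clearly from those picked only occasionally, which is the intuition behind voting across multiple LASSO models. The penalty `lam`, the number of bags, and the cutoff rule are all choices made for illustration, not values taken from the paper.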
Pages: 151-162
Page count: 12