The ability to classify patients based on gene-expression data varies by algorithm and performance metric

Cited by: 7
Authors
Piccolo, Stephen [1]
Mecham, Avery [1]
Golightly, Nathan [1]
Johnson, Jeremie L. [1]
Miller, Dustin [1]
Affiliations
[1] Brigham Young Univ, Dept Biol, Provo, UT 84602 USA
Keywords
DISTANT RECURRENCE; PAM50; RISK; BIG DATA; CLASSIFICATION; CANCER; SELECTION; SCORE; MEDICINE
DOI
10.1371/journal.pcbi.1009926
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
By classifying patients into subgroups, clinicians can provide more effective care than using a uniform approach for all patients. Such subgroups might include patients with a particular disease subtype, patients with a good (or poor) prognosis, or patients most (or least) likely to respond to a particular therapy. Transcriptomic measurements reflect the downstream effects of genomic and epigenomic variations. However, high-throughput technologies generate thousands of measurements per patient, and complex dependencies exist among genes, so it may be infeasible to classify patients using traditional statistical models. Machine-learning classification algorithms can help with this problem. However, hundreds of classification algorithms exist, and most support diverse hyperparameters, so it is difficult for researchers to know which are optimal for gene-expression biomarkers. We performed a benchmark comparison, applying 52 classification algorithms to 50 gene-expression datasets (143 class variables). We evaluated algorithms that represent diverse machine-learning methodologies and have been implemented in general-purpose, open-source, machine-learning libraries. When available, we combined clinical predictors with gene-expression data. Additionally, we evaluated the effects of performing hyperparameter optimization and feature selection using nested cross-validation. Kernel- and ensemble-based algorithms consistently outperformed other types of classification algorithms; however, even the top-performing algorithms performed poorly in some cases. Hyperparameter optimization and feature selection typically improved predictive performance, and univariate feature-selection algorithms typically outperformed more sophisticated methods. Together, our findings illustrate that algorithm performance varies considerably when other factors are held constant and thus that algorithm selection is a critical step in biomarker studies.
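The nested cross-validation design mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the synthetic data, the nearest-centroid classifier, the mean-difference feature ranking, the candidate grid, and every function name here are hypothetical choices made only to show the structure. The key point is that feature selection and the choice of how many features to keep happen inside the training portion of each outer fold, so the outer test rows never influence those decisions.

```python
import random
from statistics import mean

def make_data(n=60, n_features=30, n_informative=3, seed=0):
    """Synthetic stand-in for expression data (assumption): only the first
    n_informative features differ between the two classes."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(2.0 * label if j < n_informative else 0.0, 1.0)
                  for j in range(n_features)])
        y.append(label)
    return X, y

def k_folds(indices, k):
    """Split a list of row indices into k disjoint folds."""
    return [indices[i::k] for i in range(k)]

def rank_features(X, y, rows):
    """Univariate ranking by absolute difference of class means, using `rows` only."""
    scores = []
    for j in range(len(X[0])):
        m0 = mean(X[i][j] for i in rows if y[i] == 0)
        m1 = mean(X[i][j] for i in rows if y[i] == 1)
        scores.append((abs(m1 - m0), j))
    return [j for _, j in sorted(scores, reverse=True)]

def fit_predict(X, y, train, test, n_keep):
    """Select the top n_keep features on train, then label each test row
    with the class whose centroid is nearest."""
    feats = rank_features(X, y, train)[:n_keep]
    cent = {c: [mean(X[i][j] for i in train if y[i] == c) for j in feats]
            for c in (0, 1)}
    preds = []
    for i in test:
        dist = {c: sum((X[i][j] - m) ** 2 for j, m in zip(feats, cent[c]))
                for c in (0, 1)}
        preds.append(min(dist, key=dist.get))
    return preds

def accuracy(y, test, preds):
    return mean(1.0 if y[i] == p else 0.0 for i, p in zip(test, preds))

def nested_cv(X, y, outer_k=5, inner_k=3, grid=(1, 3, 10, 30), seed=0):
    """Return (outer-fold accuracies, n_keep chosen in each outer fold)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    outer_scores, chosen = [], []
    for test in k_folds(idx, outer_k):
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        # Inner loop: tune n_keep using the outer-training rows only.
        inner = k_folds(train, inner_k)
        best_n, best_score = None, -1.0
        for n_keep in grid:
            scores = []
            for val in inner:
                val_set = set(val)
                inner_train = [i for i in train if i not in val_set]
                preds = fit_predict(X, y, inner_train, val, n_keep)
                scores.append(accuracy(y, val, preds))
            if mean(scores) > best_score:
                best_n, best_score = n_keep, mean(scores)
        # Refit on all outer-training rows with the tuned setting, then score
        # once on the untouched outer test fold.
        preds = fit_predict(X, y, train, test, best_n)
        outer_scores.append(accuracy(y, test, preds))
        chosen.append(best_n)
    return outer_scores, chosen

if __name__ == "__main__":
    X, y = make_data()
    scores, chosen = nested_cv(X, y)
    print("outer-fold accuracy:", scores)
    print("n_keep chosen per fold:", chosen)
```

Only the outer-fold scores should be reported as the performance estimate; the inner-fold scores are used up in choosing `n_keep` and would be optimistically biased if reported instead.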
Pages: 34