The ability to classify patients based on gene-expression data varies by algorithm and performance metric

Cited by: 7

Authors:

Piccolo, Stephen [1]

Mecham, Avery [1]

Golightly, Nathan [1]

Johnson, Jeremie L. [1]

Miller, Dustin [1]

Affiliations:

[1] Brigham Young Univ, Dept Biol, Provo, UT 84602 USA

Source:

PLOS COMPUTATIONAL BIOLOGY | 2022 / Vol. 18 / Issue 03

Keywords:

DISTANT RECURRENCE; PAM50; RISK; BIG DATA; CLASSIFICATION; CANCER; SELECTION; SCORE; MEDICINE

DOI:

10.1371/journal.pcbi.1009926

Chinese Library Classification (CLC) number:

Q5 [Biochemistry]

Discipline classification codes:

071010; 081704

Abstract:

By classifying patients into subgroups, clinicians can provide more effective care than using a uniform approach for all patients. Such subgroups might include patients with a particular disease subtype, patients with a good (or poor) prognosis, or patients most (or least) likely to respond to a particular therapy. Transcriptomic measurements reflect the downstream effects of genomic and epigenomic variations. However, high-throughput technologies generate thousands of measurements per patient, and complex dependencies exist among genes, so it may be infeasible to classify patients using traditional statistical models. Machine-learning classification algorithms can help with this problem. However, hundreds of classification algorithms exist, and most support diverse hyperparameters, so it is difficult for researchers to know which are optimal for gene-expression biomarkers. We performed a benchmark comparison, applying 52 classification algorithms to 50 gene-expression datasets (143 class variables). We evaluated algorithms that represent diverse machine-learning methodologies and have been implemented in general-purpose, open-source, machine-learning libraries. When available, we combined clinical predictors with gene-expression data. Additionally, we evaluated the effects of performing hyperparameter optimization and feature selection using nested cross-validation. Kernel- and ensemble-based algorithms consistently outperformed other types of classification algorithms; however, even the top-performing algorithms performed poorly in some cases. Hyperparameter optimization and feature selection typically improved predictive performance, and univariate feature-selection algorithms typically outperformed more sophisticated methods. Together, our findings illustrate that algorithm performance varies considerably when other factors are held constant and thus that algorithm selection is a critical step in biomarker studies.
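As an illustration of the nested cross-validation mentioned in the abstract, the sketch below tunes one hyperparameter (the number of features kept by a univariate filter) in an inner loop and estimates accuracy in an outer loop. It is a toy under stated assumptions and not the authors' pipeline: the synthetic data, the mean-difference filter, the nearest-centroid classifier, and all function names are hypothetical choices made here so that the example runs with only the Python standard library.

```python
import random
from statistics import mean

def make_data(n_samples=60, n_features=30, n_informative=3, seed=0):
    """Synthetic 'expression' matrix (assumption): only the first
    n_informative features differ in mean between the two classes."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        X.append([rng.gauss(2.0 * label if j < n_informative else 0.0, 1.0)
                  for j in range(n_features)])
        y.append(label)
    return X, y

def stratified_folds(rows, y, k, rng):
    """Split row indices into k folds, balancing the two classes."""
    folds = [[] for _ in range(k)]
    for label in (0, 1):
        members = [i for i in rows if y[i] == label]
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds

def univariate_scores(X, y, rows):
    """Score each feature on its own: absolute difference in class means."""
    scores = []
    for j in range(len(X[0])):
        m0 = mean(X[i][j] for i in rows if y[i] == 0)
        m1 = mean(X[i][j] for i in rows if y[i] == 1)
        scores.append(abs(m1 - m0))
    return scores

def fit_predict(X, y, train, test, n_keep):
    """Select the n_keep top-scoring features on the training rows only,
    then classify each test row by its nearest class centroid."""
    scores = univariate_scores(X, y, train)
    keep = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:n_keep]
    centroids = {}
    for label in (0, 1):
        members = [i for i in train if y[i] == label]
        centroids[label] = [mean(X[i][j] for i in members) for j in keep]
    preds = []
    for i in test:
        dist = {label: sum((X[i][j] - c) ** 2 for j, c in zip(keep, centroids[label]))
                for label in (0, 1)}
        preds.append(min(dist, key=dist.get))
    return preds

def accuracy(y, test, preds):
    return mean(1.0 if y[i] == p else 0.0 for i, p in zip(test, preds))

def nested_cv(X, y, grid=(1, 3, 10, 30), outer_k=5, inner_k=3, seed=1):
    """Outer loop estimates performance; inner loop picks n_keep from grid."""
    rng = random.Random(seed)
    outer = stratified_folds(range(len(X)), y, outer_k, rng)
    outer_scores, chosen = [], []
    for f, test in enumerate(outer):
        train = [i for g, fold in enumerate(outer) if g != f for i in fold]
        inner = stratified_folds(train, y, inner_k, rng)

        def inner_score(n_keep):
            # Mean validation accuracy across inner folds for one grid value.
            accs = []
            for h, val in enumerate(inner):
                fit = [i for g, fold in enumerate(inner) if g != h for i in fold]
                accs.append(accuracy(y, val, fit_predict(X, y, fit, val, n_keep)))
            return mean(accs)

        best = max(grid, key=inner_score)
        chosen.append(best)
        # Refit on the full outer-training rows; the outer test fold was
        # never seen during feature selection or tuning.
        outer_scores.append(accuracy(y, test, fit_predict(X, y, train, test, best)))
    return outer_scores, chosen

X, y = make_data()
outer_scores, chosen = nested_cv(X, y)
print("outer-fold accuracy:", [round(s, 2) for s in outer_scores])
print("features kept per fold:", chosen)
```

The point of the nesting is that feature selection and hyperparameter choice happen only on training rows, so the outer-fold accuracy is not inflated by information from the held-out patients.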


Pages: 34

References: 139 in total

