Predictive overfitting in immunological applications: Pitfalls and solutions

Cited by: 18
Authors
Gygi, Jeremy P. [1]
Kleinstein, Steven H. [1, 2, 3]
Guan, Leying [1, 4, 5]
Affiliations
[1] Yale Univ, Program Computat Biol & Bioinformat, New Haven, CT USA
[2] Yale Sch Med, Dept Pathol, New Haven, CT USA
[3] Yale Sch Med, Dept Immunobiol, New Haven, CT USA
[4] Yale Sch Publ Hlth, Dept Biostat, New Haven, CT USA
[5] Yale Sch Publ Hlth, Dept Biostat, 60 Coll St, New Haven, CT 06510 USA
Funding
U.S. National Science Foundation
Keywords
Overfitting; regularization; dimension reduction; model evaluation; data diversity; distributionally robust optimization; CROSS-VALIDATION; SELECTION; MODEL; VACCINATION; REGRESSION; DIMENSION; MODULES; SYSTEMS;
DOI
10.1080/21645515.2023.2251830
Chinese Library Classification number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]
Subject classification codes
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Overfitting describes the phenomenon where a highly predictive model on the training data generalizes poorly to future observations. It is a common concern when applying machine learning techniques to contemporary medical applications, such as predicting vaccination response and disease status in infectious disease or cancer studies. This review examines the causes of overfitting and offers strategies to counteract it, focusing on model complexity reduction, reliable model evaluation, and harnessing data diversity. Through discussion of the underlying mathematical models and illustrative examples using both synthetic data and published real datasets, our objective is to equip analysts and bioinformaticians with the knowledge and tools necessary to detect and mitigate overfitting in their research.
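The phenomenon described in the abstract can be illustrated with a minimal sketch on synthetic data. This is not the authors' code or one of their examples; the sample sizes, the sparse signal, the penalty value of 10, and the helper names simulate, fit_ridge, and mse are all assumptions chosen for illustration. With more features than training samples, an unpenalized least-squares fit reproduces the training labels almost exactly while its error on fresh samples stays large; a ridge penalty is shown as one simple form of model complexity reduction.

```python
import numpy as np


def simulate(n, p, beta, rng, noise_sd=1.0):
    """Draw n samples with p Gaussian features and a noisy linear response."""
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(scale=noise_sd, size=n)
    return X, y


def fit_ridge(X, y, lam):
    """Ridge coefficients; lam = 0 gives the minimum-norm least-squares fit."""
    if lam == 0:
        return np.linalg.pinv(X) @ y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def mse(X, y, beta_hat):
    """Mean squared prediction error."""
    return float(np.mean((y - X @ beta_hat) ** 2))


rng = np.random.default_rng(0)

# Hypothetical setting: 40 training subjects, 200 features, 5 true signals.
n_train, n_test, p = 40, 1000, 200
beta = np.zeros(p)
beta[:5] = 1.0

X_tr, y_tr = simulate(n_train, p, beta, rng)
X_te, y_te = simulate(n_test, p, beta, rng)

results = {}
for lam in (0.0, 10.0):  # unpenalized fit vs. an arbitrary ridge penalty
    beta_hat = fit_ridge(X_tr, y_tr, lam)
    results[lam] = (mse(X_tr, y_tr, beta_hat), mse(X_te, y_te, beta_hat))
    print(f"lambda={lam:5.1f}  train MSE={results[lam][0]:.4f}  "
          f"test MSE={results[lam][1]:.4f}")
```

The gap between training and test error for the unpenalized fit is the signature of overfitting. In practice the penalty would be chosen by cross-validation rather than fixed in advance, and the held-out set must not be used for that choice.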
Pages: 11