A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction

Cited by: 362
Authors
Pudjihartono, Nicholas [1]
Fadason, Tayaza [1,2]
Kempa-Liehr, Andreas W. [3]
O'Sullivan, Justin M. [1,2,4,5,6]
Affiliations
[1] Univ Auckland, Liggins Inst, Auckland, New Zealand
[2] Maurice Wilkins Ctr Mol Biodiscovery, Auckland, New Zealand
[3] Univ Auckland, Dept Engn Sci, Auckland, New Zealand
[4] Univ Southampton, MRC Lifecourse Epidemiol Unit, Southampton, England
[5] ASTAR, Singapore Inst Clin Sci, Singapore, Singapore
[6] Garvan Inst Med Res, Australian Parkinsons Mission, Sydney, NSW, Australia
Source
FRONTIERS IN BIOINFORMATICS | 2022 / Volume 2
Keywords
machine learning; feature selection (FS); risk prediction; disease risk prediction; statistical approaches; GENOME-WIDE ASSOCIATION; ROBUST FEATURE-SELECTION; FALSE DISCOVERY RATE; MUTUAL INFORMATION; RANDOM FORESTS; GENE; RELEVANCE; LOCI; GWAS; DIMENSIONALITY
DOI
10.3389/fbinf.2022.927312
Chinese Library Classification (CLC) number
Q [Biological Sciences]
Discipline classification codes
07; 0710; 09
Abstract
Machine learning has shown utility in detecting patterns within large, unstructured, and complex datasets. One promising application of machine learning is precision medicine, where disease risk is predicted from patient genetic data. However, building an accurate prediction model from genotype data remains challenging because of the so-called "curse of dimensionality" (i.e., the number of features far exceeds the number of samples). The generalizability of machine learning models therefore benefits from feature selection, which aims to retain only the most "informative" features and to remove noisy, "non-informative," irrelevant, and redundant features. In this article, we provide a general overview of the different feature selection methods, their advantages, disadvantages, and use cases, focusing on the detection of relevant features (i.e., SNPs) for disease risk prediction.
Pages: 17