Machine Learning Strategies for Improved Phenotype Prediction in Underrepresented Populations

Cited by: 0
Authors
Bonet, David [1, 2]
Levin, May [1]
Montserrat, Daniel Mas [1]
Ioannidis, Alexander G. [1, 3]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Univ Politecn Cataluna, Barcelona, Spain
[3] Univ Calif Santa Cruz, Santa Cruz, CA USA
Source
BIOCOMPUTING 2024, PSB 2024 | 2024
Keywords
Genetics; Precision Medicine; Machine Learning; Phenotype Prediction; Bioinformatics; GENETIC RISK; SELECTION; GENOMICS
DOI
Not available
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Precision medicine models often perform better for populations of European ancestry due to the over-representation of this group in the genomic datasets and large-scale biobanks from which the models are constructed. As a result, prediction models may misrepresent or provide less accurate treatment recommendations for underrepresented populations, contributing to health disparities. This study introduces an adaptable machine learning toolkit that integrates multiple existing methodologies and novel techniques to enhance the prediction accuracy for underrepresented populations in genomic datasets. By leveraging machine learning techniques, including gradient boosting and automated methods, coupled with novel population-conditional re-sampling techniques, our method significantly improves the phenotypic prediction from single nucleotide polymorphism (SNP) data for diverse populations. We evaluate our approach using the UK Biobank, which is composed primarily of British individuals with European ancestry, and a minority representation of groups with Asian and African ancestry. Performance metrics demonstrate substantial improvements in phenotype prediction for underrepresented groups, achieving prediction accuracy comparable to that of the majority group. This approach represents a significant step towards improving prediction accuracy amidst current dataset diversity challenges. By integrating a tailored pipeline, our approach fosters more equitable validity and utility of statistical genetics methods, paving the way for more inclusive models and outcomes.
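The abstract mentions population-conditional re-sampling but gives no details of it. As a rough illustration only, and not the authors' method, the sketch below shows one simple form such re-sampling could take: random oversampling with replacement until every population label has as many training samples as the largest group. The function name `resample_by_population`, its arguments, and the labels in the usage note are hypothetical.

```python
import random
from collections import defaultdict


def resample_by_population(samples, populations, seed=0):
    """Oversample with replacement so each population label is equally represented.

    Assumption: this is a generic balancing scheme for illustration,
    not the re-sampling technique used in the paper.

    samples:     list of per-individual records (e.g. SNP vector and phenotype)
    populations: list of population labels, aligned with `samples`
    """
    if len(samples) != len(populations):
        raise ValueError("samples and populations must have the same length")

    rng = random.Random(seed)

    # Group sample indices by their population label.
    groups = defaultdict(list)
    for idx, pop in enumerate(populations):
        groups[pop].append(idx)

    # Every group is grown to the size of the largest group.
    target = max(len(idxs) for idxs in groups.values())

    resampled = []
    for idxs in groups.values():
        # Keep all original samples, then draw the shortfall with replacement.
        extra = [rng.choice(idxs) for _ in range(target - len(idxs))]
        resampled.extend(idxs + extra)

    rng.shuffle(resampled)
    return [samples[i] for i in resampled], [populations[i] for i in resampled]
```

For example, calling `resample_by_population(records, labels)` on a training set with 7 samples labelled "EUR", 2 labelled "AFR", and 1 labelled "ASN" returns 21 samples, 7 per label. The balanced set could then be passed to a gradient boosting model; re-sampling should be applied to the training split only, so that evaluation data stay untouched.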
Pages: 404-418
Page count: 15