A Machine-Learning-Based Approach to Prediction of Biogeographic Ancestry within Europe

Times cited: 3
Authors
Kloska, Anna [1, 2]
Gielczyk, Agata [3 ]
Grzybowski, Tomasz [1 ]
Ploski, Rafal [4 ]
Kloska, Sylwester M. [1, 2]
Marciniak, Tomasz [3 ]
Palczynski, Krzysztof [3 ]
Rogalla-Ladniak, Urszula [1 ]
Malyarchuk, Boris A. [5 ]
Derenko, Miroslava V. [5 ]
Kovacevic-Grujicic, Natasa [6 ]
Stevanovic, Milena [6, 7, 8]
Drakulic, Danijela [6 ]
Davidovic, Slobodan [9 ]
Spolnicka, Magdalena [10 ]
Zubanska, Magdalena [11 ]
Wozniak, Marcin [1 ]
Affiliations
[1] Nicolaus Copernicus Univ Torun, Dept Forens Med, Ludw Rydygier Coll Medicum Bydgoszcz, PL-85067 Bydgoszcz, Poland
[2] Bydgoszcz Univ Sci & Technol, Fac Med Sci, PL-85796 Bydgoszcz, Poland
[3] Bydgoszcz Univ Sci & Technol, Fac Telecommun Comp Sci & Elect Engn, PL-85796 Bydgoszcz, Poland
[4] Warsaw Med Univ, Dept Med Genet, PL-02106 Warsaw, Poland
[5] Russian Acad Sci, Inst Biol Problems North, Magadan 685000, Russia
[6] Univ Belgrade, Inst Mol Genet & Genet Engn, Belgrade 11042, Serbia
[7] Univ Belgrade, Fac Biol, Belgrade 11000, Serbia
[8] Serbian Acad Arts & Sci, Belgrade 11000, Serbia
[9] Univ Belgrade, Inst Biol Res Sinisa Stankovic, Natl Inst Republ Serbia, Belgrade 11060, Serbia
[10] Univ Warsaw, Ctr Forens Sciences, PL-00927 Warsaw, Poland
[11] Univ Warmia & Mazury, Fac Law & Adm, Dept Criminol & Forens Sci, PL-10726 Olsztyn, Poland
Keywords
machine learning; SVM; biogeographic origin; biogeographic ancestry; population
DOI
10.3390/ijms242015095
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Data obtained with massively parallel sequencing (MPS) can be valuable in population genetics studies. In particular, such data have the potential to distinguish samples from different populations, including samples from adjacent populations of common origin. Machine learning (ML) techniques appear especially well suited to analyzing the large datasets that MPS produces. The Slavic populations constitute about a third of the population of Europe and inhabit a large area of the continent, while being relatively closely related in population genetics terms. In this proof-of-concept study, various ML techniques were used to classify DNA samples from Slavic and non-Slavic individuals. The primary objective was to evaluate empirically whether the genetic provenance of genetically similar individuals of Slavic descent can be discerned, with the broader goal of classifying DNA samples from representatives of different Slavic populations. Raw sequencing data were pre-processed to obtain a 1200-character-long binary vector. Three classifiers were used: Random Forest, Support Vector Machine (SVM), and XGBoost. The most promising results were obtained using SVM with a linear kernel, with 99.9% accuracy and F1-scores of 0.9846-1.000 for all classes.
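The abstract reports overall accuracy together with a range of per-class F1-scores. As a minimal sketch of how those two kinds of figures relate, the following computes both from a list of true and predicted labels. The function name `per_class_f1`, the class labels "A", "B", "C", and the toy predictions are hypothetical and chosen only for illustration; they are not taken from the study and do not reproduce its data or pipeline.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return (accuracy, {class: F1}) for two equal-length label sequences.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    accuracy = sum(tp.values()) / len(y_true)

    f1 = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never occurs
        denom = 2 * tp[cls] + fp[cls] + fn[cls]
        f1[cls] = 2 * tp[cls] / denom if denom else 0.0
    return accuracy, f1


# Toy labels (assumed, not study data): one sample of class "A" is misclassified as "B".
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "C"]
accuracy, f1 = per_class_f1(y_true, y_pred)
print(f"accuracy={accuracy:.4f}", {c: round(v, 4) for c, v in f1.items()})
```

A single misclassification lowers the F1-score of both classes involved while leaving the others at 1.0, which is why a per-class range can be informative alongside a single accuracy value.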
Pages: 12