GenoCore: A simple and fast algorithm for core subset selection from large genotype datasets

Cited by: 42
Authors
Jeong, Seongmun [1]
Kim, Jae-Yoon [1, 2]
Jeong, Soon-Chun [3]
Kang, Sung-Taeg [4]
Moon, Jung-Kyung [5]
Kim, Namshin [1, 2]
Affiliations
[1] Korea Res Inst Biosci & Biotechnol, Personalized Genom Med Res Ctr, Div Strateg Res Grp, Daejeon, South Korea
[2] Korea Univ Sci & Technol, KRIBB Sch, Dept Biol Sci, Daejeon, South Korea
[3] Korea Res Inst Biosci & Biotechnol, Bioevaluat Ctr, Cheongju, Chungbuk, South Korea
[4] Dankook Univ, Dept Crop Sci & Biotechnol, Cheonan, Chungnam, South Korea
[5] Rural Dev Adm, Natl Inst Crop Sci, Jeonju, Jeonbuk, South Korea
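Funding
National Research Foundation of Singapore
Keywords
RESOURCES
DOI
10.1371/journal.pone.0181420
Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract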
Selecting core subsets from plant genotype datasets is important for improving cost-effectiveness and shortening the time required for analyses such as genome-wide association studies (GWAS) and genomics-assisted breeding of crop species. Recently, large numbers of genetic markers (>100,000 single nucleotide polymorphisms, SNPs) have been identified from high-density SNP arrays and next-generation sequencing (NGS) data. However, no software is available for selecting an efficient and consistent core subset from such large datasets. Software is needed that can coherently extract the genetically important samples in a population. Here we present a new program, GenoCore, which quickly and efficiently finds a core subset representing the entire population. We introduce simple coverage and diversity scores, which reflect genotype errors and genetic variation and help to select samples rapidly and accurately from crop genotype datasets. We compared our method with other core collection software on example datasets to validate its performance in terms of genetic distance, diversity, coverage, required system resources, and the number of selected samples. GenoCore selects the smallest, most consistent, and most representative core collection from all samples, using less memory with more efficient scores, and shows greater genetic coverage than the other software tested. GenoCore is written in R and can be accessed online, with an example dataset and test results, at https://github.com/lovemun/Genocore.
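The abstract does not give the definitions of GenoCore's coverage and diversity scores, so the following is only a generic sketch of coverage-driven core subset selection, not the authors' algorithm. It assumes that "coverage" means the fraction of distinct (marker, genotype) pairs in the full dataset that also occur in the selected subset, and it greedily adds the sample contributing the most unseen pairs. The names `genotype_pairs` and `greedy_core` and the toy data are hypothetical.

```python
from typing import List, Optional, Sequence, Set, Tuple

# A genotype call such as "AA" or "CT"; None marks a missing call.
Call = Optional[str]
Pair = Tuple[int, str]  # (marker index, genotype call)


def genotype_pairs(sample: Sequence[Call]) -> Set[Pair]:
    """Return the (marker, genotype) pairs observed in one sample."""
    return {(i, g) for i, g in enumerate(sample) if g is not None}


def greedy_core(
    genotypes: Sequence[Sequence[Call]], target_coverage: float = 1.0
) -> Tuple[List[int], float]:
    """Greedy selection (an assumed, generic scheme, not GenoCore itself).

    Repeatedly picks the sample that adds the most not-yet-covered
    (marker, genotype) pairs until the target coverage is reached.
    Ties are broken by the lowest sample index.
    """
    per_sample = [genotype_pairs(s) for s in genotypes]
    all_pairs: Set[Pair] = set().union(*per_sample)
    covered: Set[Pair] = set()
    selected: List[int] = []
    remaining = set(range(len(genotypes)))

    while remaining and all_pairs and len(covered) / len(all_pairs) < target_coverage:
        best = max(sorted(remaining), key=lambda i: len(per_sample[i] - covered))
        if not per_sample[best] - covered:
            break  # no remaining sample adds anything new
        selected.append(best)
        covered |= per_sample[best]
        remaining.remove(best)

    coverage = len(covered) / len(all_pairs) if all_pairs else 0.0
    return selected, coverage


if __name__ == "__main__":
    # Toy dataset: 4 samples x 3 markers (hypothetical values).
    samples = [
        ["AA", "CC", "GG"],
        ["AA", "CT", "GG"],
        ["AG", "CC", None],
        ["AA", "CC", "GG"],  # duplicate of the first sample
    ]
    print(greedy_core(samples))  # ([0, 1, 2], 1.0)
```

In this toy run the duplicate fourth sample is never chosen because it adds no new (marker, genotype) pair, which mirrors the general aim of a core collection: a small subset that still represents the variation in the whole population.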
Pages: 10