AN ALGORITHM FOR DECIDING THE NUMBER OF CLUSTERS AND VALIDATION USING SIMULATED DATA WITH APPLICATION TO EXPLORING CROP POPULATION STRUCTURE

Cited by: 18
Authors
Newell, Mark A. [1]
Cook, Dianne [2]
Hofmann, Heike [2]
Jannink, Jean-Luc [3]
Affiliations
[1] Samuel Roberts Noble Fdn Inc, Ardmore, OK 73401 USA
[2] Iowa State Univ, Dept Stat, Ames, IA 50011 USA
[3] Cornell Univ, Dept Plant Breeding & Genet, USDA ARS, Robert W Holley Ctr Agr & Hlth, Ithaca, NY 14853 USA
Keywords
Cluster analysis; high dimensional; low sample size; simulation; genetic marker data; visualization; bootstrap; dimension reduction; linkage disequilibrium; high-dimension; geometric representation; genome
DOI
10.1214/13-AOAS671
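Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Subject classification codes
020208; 070103; 0714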
Abstract
A first step in exploring population structure in crop plants and other organisms is to define the number of subpopulations that exist in a given data set. Genetic marker data sets have grown increasingly large over time and commonly fall in the high-dimension, low-sample-size (HDLSS) setting. An algorithm for deciding the number of clusters is proposed and validated on simulated data sets that vary in both the level of structure and the number of clusters, covering the range of variation observed empirically. The algorithm was then tested on six empirical data sets across three small grain species. It uses bootstrapping and three methods of clustering, and defines the optimum number of clusters based on a common criterion, Hubert's gamma statistic. Validation on simulated sets coupled with testing on empirical sets suggests that the algorithm can be used for a wide variety of genetic data sets.
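As a rough illustration of the selection criterion named in the abstract, the sketch below computes the normalized form of Hubert's gamma (the Pearson correlation between pairwise distances and a 0/1 "different cluster" indicator) and picks the candidate number of clusters with the highest value. It is a minimal sketch under stated assumptions, not the authors' algorithm: the bootstrapping step and the three clustering methods are omitted, the candidate partitions are written by hand, and the names `hubert_gamma`, `best_k`, `points`, and `partitions` are hypothetical.

```python
from itertools import combinations
from math import dist, sqrt


def hubert_gamma(points, labels):
    """Normalized Hubert's gamma: Pearson correlation between pairwise
    Euclidean distances and an indicator that is 1 when two points are
    assigned to different clusters and 0 otherwise."""
    pairs = list(combinations(range(len(points)), 2))
    d = [dist(points[i], points[j]) for i, j in pairs]
    s = [0.0 if labels[i] == labels[j] else 1.0 for i, j in pairs]
    n = len(pairs)
    mean_d, mean_s = sum(d) / n, sum(s) / n
    cov = sum((a - mean_d) * (b - mean_s) for a, b in zip(d, s))
    var_d = sum((a - mean_d) ** 2 for a in d)
    var_s = sum((b - mean_s) ** 2 for b in s)
    if var_d == 0 or var_s == 0:
        # Undefined, e.g. when every point sits in a single cluster.
        return float("nan")
    return cov / sqrt(var_d * var_s)


def best_k(points, partitions):
    """Score each candidate partition (k -> labels) and return the k with
    the largest defined gamma, together with all scores."""
    scores = {k: hubert_gamma(points, lab) for k, lab in partitions.items()}
    valid = {k: v for k, v in scores.items() if v == v}  # drop NaN scores
    return max(valid, key=valid.get), scores


# Toy data (an assumption for illustration): three tight groups in 2-D.
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]

# Hand-made candidate partitions standing in for clustering output.
partitions = {
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1],  # first two groups merged
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],  # one label per group
    4: [0, 0, 1, 2, 2, 2, 3, 3, 3],  # first group split in two
}

k, scores = best_k(points, partitions)
print(k, scores)
```

In a fuller version, the hand-made partitions would be replaced by the output of real clustering methods run on bootstrap resamples, with the gamma values summarized across resamples before choosing the number of clusters.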
Pages: 1898-1916
Number of pages: 19