AN ALGORITHM FOR DECIDING THE NUMBER OF CLUSTERS AND VALIDATION USING SIMULATED DATA WITH APPLICATION TO EXPLORING CROP POPULATION STRUCTURE

Cited by: 18

Authors:

Newell, Mark A. ^{[1]}

Cook, Dianne ^{[2]}

Hofmann, Heike ^{[2]}

Jannink, Jean-Luc ^{[3]}

Affiliations:

[1] Samuel Roberts Noble Fdn Inc, Ardmore, OK 73401 USA

[2] Iowa State Univ, Dept Stat, Ames, IA 50011 USA

[3] Cornell Univ, Dept Plant Breeding & Genet, USDA ARS, Robert W Holley Ctr Agr & Hlth, Ithaca, NY 14853 USA

Source:

ANNALS OF APPLIED STATISTICS | 2013 / Vol. 7 / Issue 04

Keywords:

Cluster analysis; high dimensional; low sample size; simulation; genetic marker data; visualization; bootstrap; dimension reduction; LINKAGE DISEQUILIBRIUM; HIGH-DIMENSION; GEOMETRIC REPRESENTATION; GENOME;

DOI:

10.1214/13-AOAS671

Chinese Library Classification (CLC) number:

O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];

Discipline classification codes:

020208 ; 070103 ; 0714 ;

Abstract:

A first step in exploring population structure in crop plants and other organisms is to define the number of subpopulations that exist for a given data set. The genetic marker data sets being generated have become increasingly large over time and commonly fall into the high-dimension, low sample size (HDLSS) setting. An algorithm for deciding the number of clusters is proposed, and is validated on simulated data sets varying in both the level of structure and the number of clusters, covering the range of variation observed empirically. The algorithm was then tested on six empirical data sets across three small grain species. The algorithm uses bootstrapping and three methods of clustering, and defines the optimum number of clusters based on a common criterion, Hubert's gamma statistic. Validation on simulated sets coupled with testing on empirical sets suggests that the algorithm can be used for a wide variety of genetic data sets.
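
The general idea in the abstract (bootstrap the data, cluster each resample for several candidate numbers of clusters, and score every partition with Hubert's gamma) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the normalized form of Hubert's gamma (the Pearson correlation between pairwise distances and a 0/1 "different cluster" indicator), assumes the bootstrap resamples marker columns, and substitutes a single naive average-linkage clustering for the three clustering methods the paper uses. The function names `hubert_gamma`, `average_linkage`, and `choose_k`, and the toy data, are hypothetical.

```python
import math
import random
from itertools import combinations
from statistics import mean


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hubert_gamma(rows, labels):
    """Normalized Hubert's gamma (assumed form): Pearson correlation between
    pairwise distances and an indicator that is 1 when a pair is split
    across clusters and 0 otherwise."""
    pairs = list(combinations(range(len(rows)), 2))
    d = [euclidean(rows[i], rows[j]) for i, j in pairs]
    s = [0.0 if labels[i] == labels[j] else 1.0 for i, j in pairs]
    md, ms = mean(d), mean(s)
    cov = sum((x - md) * (y - ms) for x, y in zip(d, s))
    var_d = sum((x - md) ** 2 for x in d)
    var_s = sum((y - ms) ** 2 for y in s)
    if var_d == 0 or var_s == 0:
        return 0.0  # correlation undefined; treat as no structure
    return cov / math.sqrt(var_d * var_s)


def average_linkage(rows, k):
    """Naive agglomerative clustering with average linkage (a stand-in for
    the paper's clustering methods); returns one cluster label per row."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            dist = mean(euclidean(rows[i], rows[j])
                        for i in clusters[a] for j in clusters[b])
            if best is None or dist < best[0]:
                best = (dist, a, b)
        _, a, b = best
        clusters[a] += clusters[b]  # merge the closest pair (a < b)
        del clusters[b]
    labels = [0] * len(rows)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels


def choose_k(rows, k_values, n_boot=20, seed=0):
    """Resample marker columns with replacement (an assumption), cluster each
    resample for every candidate k, and pick the k with the highest mean gamma."""
    rng = random.Random(seed)
    n_cols = len(rows[0])
    scores = {k: [] for k in k_values}
    for _ in range(n_boot):
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        sample = [[r[c] for c in cols] for r in rows]
        for k in k_values:
            scores[k].append(hubert_gamma(sample, average_linkage(sample, k)))
    avg = {k: mean(v) for k, v in scores.items()}
    return max(avg, key=avg.get), avg


# Toy data (hypothetical): 3 groups of 4 individuals scored on 6 markers.
noise = random.Random(1)
centers = [[0, 10, 20, 0, 10, 20], [10, 20, 0, 10, 20, 0], [20, 0, 10, 20, 0, 10]]
rows = [[c + noise.uniform(-0.5, 0.5) for c in centers[g]]
        for g in range(3) for _ in range(4)]
best_k, avg_gamma = choose_k(rows, [2, 3, 4], n_boot=10)
print(best_k, avg_gamma)
```

The naive clustering loop is only practical for small toy inputs; a real analysis would rely on an established clustering library and on the distance measure appropriate for the marker data.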

References

Pages: 1898-1916

Page count: 19

29 entries in total

[1] The high-dimension, low-sample-size geometric representation holds under mild conditions
Ahn, Jeongyoun
Marron, J. S.
Muller, Keith M.
Chi, Yueh-Yun
[J]. BIOMETRIKA, 2007, 94 (03) : 760 - 766
[2] [Anonymous], ADAPTIVE CONTROL PRO
[3] [Anonymous], 1993, J COMPUT GRAPH STAT, DOI [10.2307/1390644, DOI 10.2307/1390644]
[4] Accuracy and Training Population Design for Genomic Selection on Quantitative Traits in Elite North American Oats
Asoro, Franco G.
Newell, Mark A.
Beavis, William D.
Scott, M. Paul
Jannink, Jean-Luc
[J]. PLANT GENOME, 2011, 4 (02) : 132 - 144
[5] Chang WC, 1983, J ROY STAT SOC C, V32, P267, DOI 10.2307/2347949
[6] Population- and genome-specific patterns of linkage disequilibrium and SNP variation in spring and winter wheat (Triticum aestivum L.)
Chao, Shiaoman
Dubcovsky, Jorge
Dvorak, Jan
Luo, Ming-Cheng
Baenziger, Stephen P.
Matnyazov, Rustam
Clark, Dale R.
Talbert, Luther E.
Anderson, James A.
Dreisigacker, Susanne
Glover, Karl
Chen, Jianli
Campbell, Kim
Bruckner, Phil L.
Rudd, Jackie C.
Haley, Scott
Carver, Brett F.
Perry, Sid
Sorrells, Mark E.
Akhunov, Eduard D.
[J]. BMC GENOMICS, 2010, 11
[7] RELATIONSHIPS AMONG ANALYTICAL METHODS USED TO STUDY GENOTYPIC VARIATION AND GENOTYPE-BY-ENVIRONMENT INTERACTION IN PLANT-BREEDING MULTI ENVIRONMENT EXPERIMENTS
COOPER, M
DELACY, IH
[J]. THEORETICAL AND APPLIED GENETICS, 1994, 88 (05) : 561 - 572
[8] Model-based clustering, discriminant analysis, and density estimation
Fraley, C
Raftery, AE
[J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2002, 97 (458) : 611 - 631
[9] Enhanced model-based clustering, density estimation, and discriminant analysis software: MCLUST
Fraley, C
Raftery, AE
[J]. JOURNAL OF CLASSIFICATION, 2003, 20 (02) : 263 - 286
[10] Fraley C., 2011, mclust: Model-based clustering / normal mixture modeling
