PCA and Clustering Reveal Alternate mtDNA Phylogeny of N and M Clades

Cited by: 12
Authors
Alexe, G. [1, 2]
Vijaya Satya, R. [3]
Seiler, M. [4]
Platt, D. [5]
Bhanot, T. [6]
Hui, S. [4]
Tanaka, M. [7]
Levine, A. J. [1]
Bhanot, G. [1, 4, 8, 9, 10]
Affiliations
[1] Simons Ctr Syst Biol, Inst Adv Study, Princeton, NJ 08540 USA
[2] Broad Inst MIT & Harvard, Cambridge, MA 02142 USA
[3] Univ Cent Florida, Sch Comp Sci, Orlando, FL 32816 USA
[4] Rutgers State Univ, BioMaPS Inst, Piscataway, NJ 08854 USA
[5] IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA
[6] Rutgers State Univ, Grad Program Microbiol & Mol Genet, Piscataway, NJ 08854 USA
[7] Tokyo Metropolitan Inst Gerontol, Itabashi Ku, Tokyo 1730015, Japan
[8] Rutgers State Univ, Dept Phys, Piscataway, NJ 08854 USA
[9] Rutgers State Univ, Dept Mol Biol & Biochem, Piscataway, NJ 08854 USA
[10] Canc Inst New Jersey, New Brunswick, NJ 08903 USA
Keywords
mtDNA phylogeny; Principal component analysis; Unsupervised consensus ensemble clustering; Clade tree; Homoplasy; Time to most recent common ancestor
DOI
10.1007/s00239-008-9148-7
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
Phylogenetic trees based on mtDNA polymorphisms are often used to infer the history of recent human migrations. However, there is no consensus on which method to use. Most methods make strong assumptions which may bias the choice of polymorphisms and result in computational complexity which limits the analysis to a few samples/polymorphisms. For example, parsimony minimizes the number of mutations, which biases the results to minimizing homoplasy events. Such biases may miss the global structure of the polymorphisms altogether, with the risk of identifying a "common" polymorphism as ancient without an internal check on whether it either is homoplasic or is identified as ancient because of sampling bias (from oversampling the population with the polymorphism). A signature of this problem is that different methods applied to the same data, or the same method applied to different datasets, result in different tree topologies. When the results of such analyses are combined, the consensus trees have a low internal branch consensus. We determine human mtDNA phylogeny from 1737 complete sequences using a new, direct method based on principal component analysis (PCA) and unsupervised consensus ensemble clustering. PCA identifies polymorphisms representing robust variations in the data and consensus ensemble clustering creates stable haplogroup clusters. The tree is derived from the bifurcating network obtained when the data are split into k = 2, 3, 4, ... (max) clusters, with equal sampling from each haplogroup. Our method assumes only that the data can be clustered into groups based on mutations, is fast, is stable to sample perturbation, uses all significant polymorphisms in the data, works for arbitrary sample sizes, and avoids sample choice and haplogroup size bias. The internal branches of our tree have a 90% consensus accuracy.
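The general idea of PCA followed by consensus clustering can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`pca_scores`, `kmeans_labels`, `consensus_matrix`, `toy_data`), the use of plain k-means as the only base clusterer, the 80% resampling fraction, and the synthetic binary polymorphism matrix are all assumptions made for illustration. The equal sampling per haplogroup and the tree construction from successive values of k are omitted.

```python
import numpy as np

def pca_scores(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans_labels(Z, k, rng, n_iter=50):
    """Plain Lloyd's k-means; a stand-in for whatever base clusterers are used."""
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's center unchanged
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

def consensus_matrix(Z, k, n_runs=50, frac=0.8, seed=0):
    """Fraction of resampled runs in which each pair of samples co-clusters."""
    rng = np.random.default_rng(seed)
    n = len(Z)
    together = np.zeros((n, n))
    sampled = np.zeros((n, n))
    for _ in range(n_runs):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = kmeans_labels(Z[idx], k, rng)
        same = (labels[:, None] == labels[None, :]).astype(float)
        together[np.ix_(idx, idx)] += same
        sampled[np.ix_(idx, idx)] += 1.0
    return np.divide(together, sampled,
                     out=np.zeros_like(together), where=sampled > 0)

def toy_data(seed=0):
    """Synthetic 0/1 polymorphism matrix with two groups (illustration only)."""
    rng = np.random.default_rng(seed)
    n_per, n_sites = 20, 30
    X = (rng.random((2 * n_per, n_sites)) < 0.05).astype(float)  # sparse noise
    X[:n_per, :5] = 1.0     # marker sites for the first group
    X[n_per:, 5:10] = 1.0   # marker sites for the second group
    return X

if __name__ == "__main__":
    X = toy_data()
    C = consensus_matrix(pca_scores(X, n_components=2), k=2)
    print("within-group consensus:", C[:20, :20].mean())
    print("between-group consensus:", C[:20, 20:].mean())
```

Entries of the consensus matrix near 1 mark pairs of samples that cluster together under resampling, and entries near 0 mark pairs that are consistently separated. Repeating the procedure for increasing k and tracking how clusters split is one way a bifurcating network of the kind described above could be assembled.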
In conclusion, our tree recreates the standard phylogeny of the N, M, L0/L1, L2, and L3 clades, confirming the African origin of modern humans and showing that the M and N clades arose in almost coincident migrations. However, the N clade haplogroups split along an East-West geographic divide, with a "European R clade" containing the haplogroups H, V, H/V, J, T, and U and a "Eurasian N subclade" including haplogroups B, R5, F, A, N9, I, W, and X. The haplogroup pairs (N9a, N9b) and (M7a, M7b) within N and M are placed in nonnearest locations in agreement with their expected large TMRCA from studies of their migrations into Japan. For comparison, we also construct consensus maximum likelihood, parsimony, neighbor joining, and UPGMA-based trees using the same polymorphisms and show that these methods give consistent results only for the clade tree. For recent branches, the consensus accuracy for these methods is in the range of 1-20%. From a comparison of our haplogroups to two chimp and one bonobo sequences, and assuming a chimp-human coalescent time of 5 million years before present, we find a human mtDNA TMRCA of 206,000 +/- 14,000 years before present.
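The abstract does not state how the TMRCA is computed from the calibration point. One common approach, assumed here purely for illustration, is a strict molecular clock in which time scales linearly with sequence divergence. The symbols $d_{H}$ (mean divergence from the human root to present-day human sequences) and $d_{CH}$ (mean divergence from the chimp-human ancestor to present-day sequences) are hypothetical names, not taken from the source.

```latex
% Assumed strict-clock scaling (illustrative, not the authors' stated formula)
T_{\mathrm{human}} \approx T_{CH}\,\frac{d_{H}}{d_{CH}},
\qquad T_{CH} = 5 \times 10^{6}\ \text{years}

% Ratio implied by the reported values
\frac{T_{\mathrm{human}}}{T_{CH}} = \frac{206{,}000}{5{,}000{,}000} \approx 0.041

% Relative uncertainty of the reported estimate
\frac{14{,}000}{206{,}000} \approx 6.8\%
```

Under this assumption the estimate scales in direct proportion to the calibration date, so a different chimp-human coalescent time would shift the TMRCA by the same factor.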
Pages: 465-487
Page count: 23