From gene trees to species trees

被引:113
作者
Ma, B
Li, M
Zhang, LX
机构
[1] City Univ Hong Kong, Dept Comp Sci, Kowloon, Hong Kong, Peoples R China
[2] Univ Waterloo, Dept Comp Sci, Waterloo, ON N2L 3G1, Canada
[3] Kent Ridge Digital Labs, Bioinformat Ctr, Singapore 119613, Singapore
关键词
gene trees; species trees; NP-hardness; algorithms;
D O I
10.1137/S0097539798343362
中图分类号
TP301 [理论、方法];
学科分类号
081202 ;
摘要
This paper studies various algorithmic issues in reconstructing a species tree from gene trees under the duplication and the mutation cost model. This is a fundamental problem in computational molecular biology. Our main results are as follows. 1. A linear time algorithm is presented for computing all the losses in duplications associated with the least common ancestor mapping from a gene tree to a species tree. This answers a problem raised recently by Eulenstein, Mirkin, and Vingron [J. Comput. Bio., 5 ( 1998), pp. 135-148]. 2. The complexity of finding an optimal species tree from gene trees is studied. The problem is proved to be NP-hard for the duplication cost and for the mutation cost. Further, the concept of reconciled trees was introduced by Goodman et al. and formalized by Page for visualizing the relationship between gene and species trees. We show that constructing an optimal reconciled tree for gene trees is also NP-hard. Finally, we consider a general reconstruction problem and show it to be NP-hard even for the well-known nearest neighbor interchange distance. 3. A new and efficiently computable metric is defined based on the duplication cost. We show that the problem of finding an optimal species tree from gene trees is NP-hard under this new metric but it can be approximated within factor 2 in polynomial time. Using this approximation result, we propose a heuristic method for finding a species tree from gene trees with uniquely labeled leaves under the duplication cost. Our experimental tests demonstrate that when the number of species is larger than 15 and gene trees are close to each other, our heuristic method is significantly better than the existing program in Page's GeneTree 1.0 that starts the search from a random tree.
引用
收藏
页码:729 / 752
页数:24
相关论文
共 34 条
[1]  
[Anonymous], 1979, Computers and Intractablity: A Guide to the Theoryof NP-Completeness
[2]  
Cormen T. H., 1990, INTRO ALGORITHMS
[3]   Duplication-based measures of difference between gene and species trees [J].
Eulenstein, O ;
Mirkin, B ;
Vingron, M .
JOURNAL OF COMPUTATIONAL BIOLOGY, 1998, 5 (01) :135-148
[4]  
FARACH M, 1994, PROCEEDINGS OF THE FIFTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P481
[5]  
Fellows M, 1998, LECT NOTES COMPUT SC, V1533, P347
[6]   PHYLOGENIES FROM MOLECULAR SEQUENCES - INFERENCE AND RELIABILITY [J].
FELSENSTEIN, J .
ANNUAL REVIEW OF GENETICS, 1988, 22 :521-565
[7]   CONSTRUCTION OF PHYLOGENETIC TREES [J].
FITCH, WM ;
MARGOLIASH, E .
SCIENCE, 1967, 155 (3760) :279-+
[8]   DISTINGUISHING HOMOLOGOUS FROM ANALOGOUS PROTEINS [J].
FITCH, WM .
SYSTEMATIC ZOOLOGY, 1970, 19 (02) :99-&
[9]  
Galil Z., 1977, Theoretical Computer Science, V5, P179, DOI 10.1016/0304-3975(77)90005-6
[10]   FITTING THE GENE LINEAGE INTO ITS SPECIES LINEAGE, A PARSIMONY STRATEGY ILLUSTRATED BY CLADOGRAMS CONSTRUCTED FROM GLOBIN SEQUENCES [J].
GOODMAN, M ;
CZELUSNIAK, J ;
MOORE, GW ;
ROMEROHERRERA, AE ;
MATSUDA, G .
SYSTEMATIC ZOOLOGY, 1979, 28 (02) :132-163