A Fast Quartet tree heuristic for hierarchical clustering

被引:21
作者
Cilibrasi, Rudi L. [1 ]
Vitanyi, Paul M. B. [1 ,2 ]
机构
[1] CWI, Natl Res Inst Math & Comp Sci CWI, NL-1098 XG Amsterdam, Netherlands
[2] Univ Amsterdam, Amsterdam, Netherlands
关键词
Data and knowledge visualization; Pattern matching-Clustering-Algorithms/Similarity measures; Pattern matching-Applications; Hierarchical clustering; Global optimization; Monte Carlo method; Quartet tree; Randomized hill-climbing; INFORMATION;
D O I
10.1016/j.patcog.2010.08.033
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
The Minimum Quartet Tree Cost problem is to construct an optimal weight tree from the 3((n)(4)) weighted quartet topologies on n objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as nonoptimal topologies). We present a Monte Carlo heuristic, based on randomized hill-climbing, for approximating the optimal weight tree, given the quartet topology weights. The method repeatedly transforms a dendrogram, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. The problem and the solution heuristic has been extensively used for general hierarchical clustering of nontree-like (non-phylogeny) data in various domains and across domains with heterogeneous data. We also present a greatly improved heuristic, reducing the running time by a factor of order a thousand to ten thousand. All this is implemented and available, as part of the CompLearn package. We compare performance and running time of the original and improved versions with those of UPGMA. BioNJ, and NJ, as implemented in the SplitsTree package on genomic data for which the latter are optimized. (C) 2010 Elsevier Ltd. All rights reserved.
引用
收藏
页码:662 / 677
页数:16
相关论文
共 44 条
[1]  
[Anonymous], 1989, Kolmogorov Complexity and Its Applications
[2]  
[Anonymous], 2000, Pattern Classification
[3]  
Bao LC, 2008, LECT NOTES COMPUT SC, V5165, P469
[4]   Constructing phylogenies from quartets: Elucidation of eutherian superordinal relationships [J].
Ben-Dor, A ;
Chor, B ;
Graur, D ;
Ophir, R ;
Pelleg, D .
JOURNAL OF COMPUTATIONAL BIOLOGY, 1998, 5 (03) :377-390
[5]   Chain letters & evolutionary histories [J].
Bennett, CH ;
Li, M ;
Ma, B .
SCIENTIFIC AMERICAN, 2003, 288 (06) :76-81
[6]   Information distance [J].
Bennett, CH ;
Gacs, P ;
Li, M ;
Vitanyi, FMB ;
Zurek, WH .
IEEE TRANSACTIONS ON INFORMATION THEORY, 1998, 44 (04) :1407-1423
[7]  
Berry V, 1999, LECT NOTES COMPUT SC, V1643, P313
[8]   CALCULATING SUMS OF INFINITE SERIES [J].
BRADEN, B .
AMERICAN MATHEMATICAL MONTHLY, 1992, 99 (07) :649-655
[9]  
BUNEMAN P, 1971, P ANGL ROM C ROY SOC, P387
[10]   Shared information and program plagiarism detection [J].
Chen, X ;
Francia, B ;
Li, M ;
McKinnon, B ;
Seker, A .
IEEE TRANSACTIONS ON INFORMATION THEORY, 2004, 50 (07) :1545-1551