Convex Clustering: An Attractive Alternative to Hierarchical Clustering

Times cited: 34
Authors
Chen, Gary K. [1]
Chi, Eric C. [2]
Ranola, John Michael O. [3]
Lange, Kenneth [4, 5, 6]
Affiliations
[1] Univ So Calif, Dept Prevent Med, Div Biostat, Los Angeles, CA 90089 USA
[2] Rice Univ, Dept Elect & Comp Engn, Houston, TX 77251 USA
[3] Univ Washington, Dept Stat, Seattle, WA 98195 USA
[4] Univ Calif Los Angeles, Dept Biomath, Los Angeles, CA USA
[5] Univ Calif Los Angeles, Dept Human Genet, Los Angeles, CA USA
[6] Univ Calif Los Angeles, Dept Stat, Los Angeles, CA USA
Funding
US National Institutes of Health (NIH)
Keywords
POPULATION; ANCESTRY; DISEASE;
DOI
10.1371/journal.pcbi.1004228
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data, where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER, which incorporates the algorithm, is implemented on ATI and NVIDIA graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/
Pages: 31