Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics

被引:24
作者
Ambroise, Christophe [1 ]
Dehman, Alia [2 ]
Neuvial, Pierre [3 ]
Rigaill, Guillem [1 ,4 ]
Vialaneix, Nathalie [5 ]
机构
[1] Univ Evry Val Essonne, CNRS, UMR 8071, Lab Math & Modelisat Evry, 23 Blvd France, F-91037 Evry, France
[2] Hyphen Stat, 195 Route Espagne, F-31036 Toulouse, France
[3] Univ Toulouse, Inst Math Toulouse, UPS IMT, CNRS,UMR5219, F-31062 Toulouse 9, France
[4] INRA, CNRS, Inst Plant Sci Paris Saclay IPS2, Gif Sur Yvette, France
[5] Univ Toulouse, MIAT, INRA, Castanet Tolosan, France
关键词
Hierarchical agglomerative clustering; Adjacency constraint; Segmentation; Ward's linkage; Similarity; Min heap; Genome-Wide Association Studies and Hi-C; CHANGE-POINT DETECTION;
D O I
10.1186/s13015-019-0157-4
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Background: Genomic data analyses such as Genome-Wide Association Studies (GWAS) or Hi-C studies are often faced with the problem of partitioning chromosomes into successive regions based on a similarity matrix of high-resolution, locus-level measurements. An intuitive way of doing this is to perform a modified Hierarchical Agglomerative Clustering (HAC), where only adjacent clusters (according to the ordering of positions within a chromosome) are allowed to be merged. But a major practical drawback of this method is its quadratic time and space complexity in the number of loci, which is typically of the order of 10(4) to 10(5) for each chromosome. Results: By assuming that the similarity between physically distant objects is negligible, we are able to propose an implementation of adjacency-constrained HAC with quasi-linear complexity. This is achieved by pre-calculating specific sums of similarities, and storing candidate fusions in a min-heap. Our illustrations on GWAS and Hi-C datasets demonstrate the relevance of this assumption, and show that this method highlights biologically meaningful signals. Thanks to its small time and memory footprint, the method can be run on a standard laptop in minutes or even seconds. Availability and implementation: Software and sample data are available as an R package, adjclust, that can be downloaded from the Comprehensive R Archive Network (CRAN).
引用
收藏
页数:14
相关论文
共 36 条
[1]  
[Anonymous], ACM J EXPT ALGORITHM
[2]  
Arlot S, 2016, ARXIV12023878
[3]  
Arlot S., 2016, CAPUSHE CALIBRATING
[4]   THEORY OF REPRODUCING KERNELS [J].
ARONSZAJN, N .
TRANSACTIONS OF THE AMERICAN MATHEMATICAL SOCIETY, 1950, 68 (MAY) :337-404
[5]   STABILITY OF 2 HIERARCHICAL GROUPING TECHNIQUES CASE 1 - SENSITIVITY TO DATA ERRORS [J].
BAKER, FB .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1974, 69 (346) :440-445
[6]   Determination of the number of zones in a biostratigraphical sequence [J].
Bennett, KD .
NEW PHYTOLOGIST, 1996, 132 (01) :155-170
[7]  
Bostrom Henrik, 2016, ADV INTELLIGENT DATA
[8]   New efficient algorithms for multiple change-point detection with reproducing kernels [J].
Celisse, A. ;
Marot, G. ;
Pierre-Jean, M. ;
Rigaill, G. J. .
COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2018, 128 :200-220
[9]   <bold>ClustGeo</bold>: an R package for hierarchical clustering with spatial constraints [J].
Chavent, Marie ;
Kuentz-Simonet, Vanessa ;
Labenne, Amaury ;
Saracco, Jerome .
COMPUTATIONAL STATISTICS, 2018, 33 (04) :1799-1822
[10]  
Clayton D., 2015, snpStats: SnpMatrix and XSnpMatrix classes and methods