Fast estimation of posterior probabilities in change-point analysis through a constrained hidden Markov model

Cited by: 9
Authors
Luong, The Minh [1]
Rozenholc, Yves [1]
Nuel, Gregory [1]
Affiliations
[1] Univ Paris 05, MAP5, F-75006 Paris, France
Keywords
Change-point estimation; Segmentation; Posterior distribution of change-points; Constrained hidden Markov model; Forward backward algorithm; Fast computation; COMPARATIVE GENOMIC HYBRIDIZATION; ARRAY CGH DATA; CIRCULAR BINARY SEGMENTATION; COPY NUMBER; REGRESSION; ALGORITHM;
DOI
10.1016/j.csda.2013.06.020
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
The detection of change-points in heterogeneous sequences is a statistical challenge with applications across a wide variety of fields. In bioinformatics, a large body of methodology exists to identify an ideal set of change-points for detecting Copy Number Variation (CNV). While many efficient algorithms are currently available for finding the best segmentation of CNV data, relatively few approaches consider the important problem of assessing the uncertainty of the change-point locations. Asymptotic and stochastic approaches exist but often require additional model assumptions to speed up the computations, while exact methods generally have quadratic complexity, which may be intractable for large data sets of tens of thousands of points or more. A hidden Markov model, with constraints specifically chosen to correspond to a segment-based change-point model, provides an exact method for obtaining the posterior distribution of change-points with linear complexity. The methods are implemented in the R package postCP, which uses the results of a given change-point detection algorithm to estimate the probability that each observation is a change-point. The results include an application of postCP to a publicly available CNV data set (n = 120). Due to its frequentist framework, postCP obtains less conservative confidence intervals than previously published Bayesian methods, but with linear complexity instead of quadratic. Simulations showed that postCP provided comparable loss to a Bayesian MCMC method when estimating posterior means, specifically when assessing larger scale changes, while being more computationally efficient. On another high-resolution CNV data set (n = 14,241), the implementation processed the data in less than one second on a mid-range laptop computer. © 2013 Elsevier B.V. All rights reserved.
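The constrained hidden Markov model idea in the abstract can be illustrated with a small sketch. This is not the postCP implementation; it is a minimal Python illustration under stated assumptions: Gaussian observations with a known common standard deviation `sigma`, segment means `means` supplied by some prior segmentation step, and equal weights for "stay" and "advance" transitions (every valid path then has the same prior weight, which amounts to a uniform prior over segmentations into K segments). The function name `changepoint_posteriors` and its arguments are hypothetical. The hidden state is the segment index, it may only stay or advance by one, and the path must start in the first segment and end in the last. The double loop visits each observation once for each of the K states, so the cost is linear in the sequence length for fixed K.

```python
import math

NEG_INF = -math.inf


def _logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b)) that tolerates -inf inputs."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def changepoint_posteriors(x, means, sigma):
    """Posterior change-point probabilities under a constrained HMM (sketch).

    Returns post[k][i] = P(segment k ends at x[i] and segment k+1 starts
    at x[i+1] | x), for k = 0..K-2 and i = 0..n-2 (0-based indices).
    Assumes Gaussian emissions with known means and common sigma, and
    requires len(x) >= len(means).
    """
    n, K = len(x), len(means)

    def log_emit(i, k):
        # Gaussian log-density of x[i] under segment k.
        z = (x[i] - means[k]) / sigma
        return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

    # Forward pass: F[i][k] = log P(x[0..i], state at i is k).
    F = [[NEG_INF] * K for _ in range(n)]
    F[0][0] = log_emit(0, 0)  # constraint: the path starts in segment 0
    for i in range(1, n):
        for k in range(K):
            stay = F[i - 1][k]
            advance = F[i - 1][k - 1] if k > 0 else NEG_INF
            F[i][k] = _logaddexp(stay, advance) + log_emit(i, k)

    # Backward pass: B[i][k] = log P(x[i+1..n-1], path ends in K-1 | state at i is k).
    B = [[NEG_INF] * K for _ in range(n)]
    B[n - 1][K - 1] = 0.0  # constraint: the path ends in the last segment
    for i in range(n - 2, -1, -1):
        for k in range(K):
            stay = B[i + 1][k] + log_emit(i + 1, k)
            advance = (B[i + 1][k + 1] + log_emit(i + 1, k + 1)
                       if k + 1 < K else NEG_INF)
            B[i][k] = _logaddexp(stay, advance)

    log_lik = F[n - 1][K - 1]

    # Posterior of a k -> k+1 transition between positions i and i+1.
    return [
        [math.exp(F[i][k] + log_emit(i + 1, k + 1) + B[i + 1][k + 1] - log_lik)
         for i in range(n - 1)]
        for k in range(K - 1)
    ]


# Toy usage: one level shift between the third and fourth observation.
toy_post = changepoint_posteriors([0.0, 0.0, 0.0, 5.0, 5.0, 5.0], [0.0, 5.0], 1.0)
```

Each row of the result is a probability distribution over the location of one change-point, so it can be summarised by a mode or by a credible set of positions.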
Pages: 129-140
Number of pages: 12