A model selection approach for multiple sequence segmentation and dimensionality reduction

Cited by: 3
Authors
Castro, Bruno M. [1]
Lemes, Renan B. [2]
Cesar, Jonatas [2]
Hunemeier, Tabita [2]
Leonardi, Florencia [3]
Affiliations
[1] Univ Fed Rio Grande do Norte, Dept Estat, Natal, RN, Brazil
[2] Univ Sao Paulo, Inst Biociencias, Sao Paulo, Brazil
[3] Univ Sao Paulo, Inst Matemat & Estat, Sao Paulo, Brazil
Funding
São Paulo Research Foundation (FAPESP), Brazil;
Keywords
WHOLE-GENOME ASSOCIATION; COMPUTATION; PLINK;
DOI
10.1016/j.jmva.2018.05.006
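Chinese Library Classification
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Discipline classification codes
020208 ; 070103 ; 0714 ;
Abstract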
In this paper we consider the problem of segmenting n aligned random sequences of equal length m into a finite number of independent blocks. We propose a penalized maximum likelihood criterion to infer simultaneously the number of points of independence as well as the position of each point. We show how to compute exactly the estimator by means of a dynamic programming algorithm with time complexity O(m^2 n). We also propose another method, called hierarchical algorithm, that provides an approximation to the estimator when the sample size increases and runs in time O(m ln(m) n). Our main theoretical results are the strong consistency of both estimators when the sample size n grows to infinity. We illustrate the convergence of these algorithms through some simulation examples and we apply the method to identify recombination hotspots in real SNPs data. © 2018 Elsevier Inc. All rights reserved.
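The dynamic programming idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `block_loglik_table` and `segment`, the weight `lam`, and the BIC-style penalty `lam * (alphabet_size ** length - 1)` per block are all assumptions made for illustration, since the abstract does not state the exact penalty. The sketch fills a table of block log-likelihoods in O(m^2 n) by refining row groupings one column at a time, then picks the best set of cut points. Because the assumed penalty grows exponentially with block length, the sketch is only suitable for short sequences.

```python
import math
from collections import Counter


def block_loglik_table(seqs):
    """ll[i][j] = maximized log-likelihood of columns i..j-1 treated as one block.

    For a fixed start i, rows are grouped by the word they show on the block;
    adding a column only refines the grouping, so each (i, j) costs O(n).
    """
    n, m = len(seqs), len(seqs[0])
    ll = [[0.0] * (m + 1) for _ in range(m + 1)]
    for i in range(m):
        labels = [0] * n  # every row starts in the same group
        for j in range(i, m):
            ids = {}
            # new group id = (old group, symbol in column j), renumbered from 0
            labels = [ids.setdefault((labels[r], seqs[r][j]), len(ids))
                      for r in range(n)]
            counts = Counter(labels)
            ll[i][j + 1] = sum(c * math.log(c / n) for c in counts.values())
    return ll


def segment(seqs, lam, alphabet_size):
    """Return sorted cut positions maximizing the penalized log-likelihood.

    A cut at position c means a block boundary between columns c-1 and c.
    The penalty (assumed, not taken from the paper) is
    lam * (alphabet_size ** block_length - 1) for each block.
    """
    m = len(seqs[0])
    ll = block_loglik_table(seqs)
    best = [-math.inf] * (m + 1)  # best[j]: best score for columns 0..j-1
    best[0] = 0.0
    prev = [0] * (m + 1)          # prev[j]: start of the last block ending at j
    for j in range(1, m + 1):
        for i in range(j):
            penalty = lam * (alphabet_size ** (j - i) - 1)
            score = best[i] + ll[i][j] - penalty
            if score > best[j]:
                best[j], prev[j] = score, i
    cuts, j = [], m
    while j > 0:  # walk back through the stored block starts
        if prev[j] > 0:
            cuts.append(prev[j])
        j = prev[j]
    return sorted(cuts)


# Toy data: columns 0-1 are copies of one bit, columns 2-3 of another,
# and the two bits vary independently across the 100 rows.
seqs = [(a, a, b, b) for a in (0, 1) for b in (0, 1)] * 25
cuts = segment(seqs, lam=0.5 * math.log(len(seqs)), alphabet_size=2)
print(cuts)
```

On this toy input the single cut between the two pairs of columns scores best, because merging the pairs adds parameters without improving the fit, while splitting a pair discards the dependence inside it.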
Pages: 319-330
Page count: 12