Resolving Multicopy Duplications de novo Using Polyploid Phasing

Cited by: 17

Authors:

Chaisson, Mark J. [1]

Mukherjee, Sudipto [2]

Kannan, Sreeram [2]

Eichler, Evan E. [1,3]

Affiliations:

[1] Univ Washington, Dept Genome Sci, Seattle, WA 98195 USA

[2] Univ Washington, Dept Elect Engn, Seattle, WA 98195 USA

[3] Univ Washington, Howard Hughes Med Inst, Seattle, WA 98195 USA

Source:

RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2017 | 2017 / Vol. 10229

Funding:

National Institutes of Health (USA);

Keywords:

HAPLOTYPE ASSEMBLY PROBLEM; HUMAN GENOME; MATRIX COMPLETION; ALGORITHM; INFORMATION; GRAPHS;

DOI:

10.1007/978-3-319-56970-3_8

Chinese Library Classification (CLC):

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

While single-molecule sequencing systems have enabled an unprecedented improvement in the ability to assemble complex regions of the genome, long segmental duplications remain a challenging frontier in assembly. Segmental duplications are both gene rich and prone to large structural rearrangements, making the resolution of their sequences important in medical and evolutionary studies. Duplicated sequences that are collapsed in mammalian de novo assemblies are rarely identical; after a sequence is duplicated, it begins to acquire paralog-specific variants. In this paper, we study the problem of resolving the variations in multicopy, long segmental duplications by developing and utilizing algorithms for polyploid phasing. We develop two algorithms. The first is targeted at maximizing the likelihood of observing the reads given the underlying haplotypes, using discrete matrix completion. The second is based on correlation clustering and exploits an assumption, often satisfied in these duplications, that each paralog has a sizable number of paralog-specific variants. We develop a detailed simulation methodology and demonstrate the superior performance of the proposed algorithms on an array of simulated datasets. We measure the likelihood score as well as reconstruction accuracy, i.e., the fraction of reads that are clustered correctly. On both performance metrics, we find that our algorithms dominate existing algorithms on more than 93% of the datasets. While discrete matrix completion performs better on likelihood score, the correlation-clustering algorithm performs better on reconstruction accuracy due to the stronger regularization inherent in the algorithm. We also show that our correlation-clustering algorithm can reconstruct on average 7.0 haplotypes in 10-copy duplication datasets, whereas existing algorithms reconstruct fewer than one copy on average.
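Illustrative note (not part of the original record): the abstract does not specify how the correlation-clustering algorithm is implemented, so the following is only a minimal sketch of the general idea, using the generic randomized pivot scheme for correlation clustering rather than the authors' method. The read representation (a dict from variant site to observed allele), the agreement score, and all function names are assumptions made for illustration.

```python
import random

def agreement(r1, r2):
    """Matches minus mismatches over variant sites covered by both reads (assumed score)."""
    shared = set(r1) & set(r2)
    return sum(1 if r1[s] == r2[s] else -1 for s in shared)

def pivot_correlation_clustering(reads, seed=0):
    """Generic pivot scheme: pick a pivot, group every remaining read that agrees with it."""
    rng = random.Random(seed)
    remaining = list(range(len(reads)))
    rng.shuffle(remaining)
    clusters = []
    while remaining:
        pivot, cluster, rest = remaining[0], [remaining[0]], []
        for i in remaining[1:]:
            # Positive score: the read looks like it comes from the pivot's paralog.
            (cluster if agreement(reads[pivot], reads[i]) > 0 else rest).append(i)
        clusters.append(sorted(cluster))
        remaining = rest
    return clusters

def consensus(reads, cluster):
    """Majority allele per site within one cluster (ties broken arbitrarily)."""
    votes = {}
    for i in cluster:
        for site, allele in reads[i].items():
            votes.setdefault(site, []).append(allele)
    return {site: max(set(a), key=a.count) for site, a in votes.items()}

# Toy, error-free reads from two hypothetical paralogs (alleles 0 vs 1 at sites 0..3).
reads = [
    {0: 0, 1: 0, 2: 0},
    {1: 0, 2: 0, 3: 0},
    {0: 1, 1: 1, 2: 1},
    {1: 1, 2: 1, 3: 1},
    {0: 0, 2: 0, 3: 0},
]
clusters = sorted(pivot_correlation_clustering(reads))
haplotypes = [consensus(reads, c) for c in clusters]
```

On this toy input every pair of reads from the same paralog has a positive score and every cross-paralog pair a negative one, so the grouping does not depend on the pivot order. With sequencing errors and sparse overlaps that no longer holds, which is where the assumption of a sizable number of paralog-specific variants per copy becomes important.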


Pages: 117-133

Page count: 17

