Resolving Multicopy Duplications de novo Using Polyploid Phasing
Cited by: 17
Authors:
Chaisson, Mark J. [1]
Mukherjee, Sudipto [2]
Kannan, Sreeram [2]
Eichler, Evan E. [1,3]
Affiliations:
[1] Univ Washington, Dept Genome Sci, Seattle, WA 98195 USA
[2] Univ Washington, Dept Elect Engn, Seattle, WA 98195 USA
[3] Univ Washington, Howard Hughes Med Inst, Seattle, WA 98195 USA
Source:
RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2017 | 2017 / Vol. 10229
Funding:
US National Institutes of Health (NIH);
Keywords:
HAPLOTYPE ASSEMBLY PROBLEM;
HUMAN GENOME;
MATRIX COMPLETION;
ALGORITHM;
INFORMATION;
GRAPHS;
DOI:
10.1007/978-3-319-56970-3_8
Chinese Library Classification (CLC) number:
Q5 [Biochemistry];
Discipline classification codes:
071010 ;
081704 ;
Abstract:
While the rise of single-molecule sequencing systems has enabled unprecedented gains in the ability to assemble complex regions of the genome, long segmental duplications remain a challenging frontier in assembly. Segmental duplications are both gene rich and prone to large structural rearrangements, making the resolution of their sequences important in medical and evolutionary studies. Duplicated sequences that are collapsed in mammalian de novo assemblies are rarely identical; after a sequence is duplicated, it begins to acquire paralog-specific variants. In this paper, we study the problem of resolving the variations in multicopy, long segmental duplications by developing and applying algorithms for polyploid phasing. We develop two algorithms. The first aims to maximize the likelihood of observing the reads given the underlying haplotypes, using discrete matrix completion. The second is based on correlation clustering and exploits an assumption, often satisfied in these duplications, that each paralog has a sizable number of paralog-specific variants. We develop a detailed simulation methodology and demonstrate the superior performance of the proposed algorithms on an array of simulated datasets. We measure the likelihood score as well as reconstruction accuracy, i.e., the fraction of reads that are clustered correctly. On both performance metrics, we find that our algorithms dominate existing algorithms on more than 93% of the datasets. While discrete matrix completion performs better on likelihood score, the correlation-clustering algorithm performs better on reconstruction accuracy due to the stronger regularization inherent in the algorithm. We also show that our correlation-clustering algorithm can reconstruct on average 7.0 haplotypes in 10-copy duplication datasets, whereas existing algorithms reconstruct fewer than one copy on average.
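As an illustration of the reconstruction-accuracy metric mentioned in the abstract (the fraction of reads clustered correctly), the following is a minimal sketch and not the authors' implementation. It assumes each read carries a true paralog label and a predicted cluster label, and that predicted clusters are matched to true paralogs by the one-to-one relabeling that maximizes agreement; the function name `reconstruction_accuracy` and the brute-force matching are assumptions made for this example.

```python
from itertools import permutations


def reconstruction_accuracy(true_labels, predicted_labels):
    """Fraction of reads placed in the correct paralog.

    Cluster names are arbitrary, so the score is taken over the best
    one-to-one mapping from predicted clusters to true paralogs.
    Labels are assumed to be hashable and never None.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        return 0.0

    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(predicted_labels))

    # Pad the targets with None so that surplus predicted clusters can
    # be mapped to "no paralog" and simply count as incorrect.
    padding = max(0, len(pred_ids) - len(true_ids))
    targets = true_ids + [None] * padding

    best = 0
    # Brute force over injective mappings; fine for a handful of copies,
    # but factorial in the number of clusters.
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        correct = sum(
            mapping[p] == t for p, t in zip(predicted_labels, true_labels)
        )
        best = max(best, correct)
    return best / len(true_labels)


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]
    predicted = [1, 1, 0, 0, 2, 0]
    print(reconstruction_accuracy(truth, predicted))
```

For larger copy numbers, the exhaustive search over relabelings would normally be replaced by an assignment solver; the brute-force version is used here only to keep the sketch dependency-free.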
Pages: 117-133
Page count: 17