Better Identification of Repeats in Metagenomic Scaffolding

被引:6
作者
Ghurye, Jay [1 ]
Pop, Mihai [1 ]
机构
[1] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA
来源
ALGORITHMS IN BIOINFORMATICS | 2016年 / 9838卷
关键词
Metagenomics; Random forest; Betweenness centrality; Scaffolding; Algorithms; Graph; ALGORITHM; GENOMES;
D O I
10.1007/978-3-319-43681-4_14
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Genomic repeats are the most important challenge in genomic assembly. While for single genomes the effect of repeats is largely addressed by modern long-read sequencing technologies, in metagenomic data intra-genome and, more importantly, inter-genome repeats continue to be a significant impediment to effective genome reconstruction. Detecting repeats in metagenomic samples is complicated by characteristic features of these data, primarily uneven depths of coverage and the presence of genomic polymorphisms. The scaffolder Bambus 2 introduced a new strategy for repeat detection based on the betweenness centrality measure - a concept originally used in social network analysis. The exact computation of the betweenness centrality measure is, however, computationally intensive and impractical in large metagenomic datasets. Here we explore the effectiveness of approximate algorithms for network centrality to accurately detect genomic repeats within metagenomic samples. We show that an approximate measure of centrality achieves much higher computational efficiencies with a minimal loss in the accuracy of detecting repeats in metagenomic data. We also show that the combination of multiple features of the scaffold graph provides a more effective strategy for identifying metagenomic repeats, significantly outperforming all other commonly used approaches.
引用
收藏
页码:174 / 184
页数:11
相关论文
共 24 条
[1]  
[Anonymous], 1979, Computers and Intractablity: A Guide to the Theory of NP-Completeness
[2]   A faster algorithm for betweenness centrality [J].
Brandes, U .
JOURNAL OF MATHEMATICAL SOCIOLOGY, 2001, 25 (02) :163-177
[3]   Random forests [J].
Breiman, L .
MACHINE LEARNING, 2001, 45 (01) :5-32
[4]  
Dayarian A., 2010, BMC BIOINFORMATICS, V11, P1
[5]  
Delcher Arthur L, 2003, Curr Protoc Bioinformatics, VChapter 10, DOI 10.1002/0471250953.bi1003s00
[6]  
Fass J.N., SICKLE SLIDING WINDO
[7]   Opera: Reconstructing Optimal Genomic Scaffolds with High-Throughput Paired-End Sequences [J].
Gao, Song ;
Sung, Wing-Kin ;
Nagarajan, Niranjan .
JOURNAL OF COMPUTATIONAL BIOLOGY, 2011, 18 (11) :1681-1691
[8]  
Geisberger R, 2008, SIAM PROC S, P90
[9]   The greedy path-merging algorithm for Contig Scaffolding [J].
Huson, DH ;
Reinert, K ;
Myers, EW .
JOURNAL OF THE ACM, 2002, 49 (05) :603-615
[10]   COMBINATORIAL ALGORITHMS FOR DNA-SEQUENCE ASSEMBLY [J].
KECECIOGLU, JD ;
MYERS, EW .
ALGORITHMICA, 1995, 13 (1-2) :7-51