SeedsGraph: an efficient assembler for next-generation sequencing data

Cited by: 2
Authors
Wang, Chunyu [1]
Guo, Maozu [1]
Liu, Xiaoyan [1]
Liu, Yang [1]
Zou, Quan [2]
Affiliations
[1] Harbin Inst Technol, Sch Comp Sci & Technol, 92 West Dazhi St, Harbin 150001, Peoples R China
[2] Xiamen Univ, Dept Comp Sci, Xiamen 361005, Peoples R China
Funding
National Natural Science Foundation of China; Specialized Research Fund for the Doctoral Program of Higher Education
Keywords
ALGORITHMS; GENOMES;
DOI
10.1186/1755-8794-8-S2-S13
Chinese Library Classification number
Q3 [Genetics];
Discipline classification codes
071007 ; 090102 ;
Abstract
DNA sequencing technology has been evolving rapidly and now produces short reads in fast-growing volumes. This has led to a resurgence of research in whole-genome shotgun assembly algorithms. Our assembly algorithm begins by clustering the short reads in a cloud computing framework; the clustering groups fragments according to the similarity of their original consensus long sequences. We condense each group of reads into a chain of seeds, a kind of substring with reads aligned to it, and build a graph from these chains. Finally, we analyze the graph to find Euler paths, assemble the reads along those paths into contigs, and then lay out the contigs into scaffolds using mate-pair information. The results show that the algorithm is efficient and feasible for large read sets such as those produced by next-generation sequencing technology.
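The Euler-path step mentioned in the abstract can be illustrated with a minimal sketch. This is not the SeedsGraph method: the abstract does not specify how seeds are chosen or chained, so the sketch substitutes a plain k-mer (de Bruijn-style) graph and skips clustering, error handling, reverse complements, and scaffolding. The function names `build_graph`, `euler_path`, and `assemble` are hypothetical, and the path search is the standard Hierholzer algorithm.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Build a de Bruijn-style graph (an assumption, not the paper's seed graph).

    Nodes are (k-1)-mers; each distinct k-mer in the reads contributes one edge
    from its prefix to its suffix.
    """
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def euler_path(graph):
    """Return an Euler path as a list of nodes, using Hierholzer's algorithm.

    Assumes the graph actually has an Euler path; no validation is done.
    """
    if not graph:
        return []
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start where out-degree exceeds in-degree by one, if such a node exists;
    # otherwise (Euler cycle) any node with outgoing edges will do.
    start = next(iter(graph))
    for u, targets in graph.items():
        if len(targets) - in_deg[u] == 1:
            start = u
            break
    remaining = {u: list(targets) for u, targets in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: emit node while backtracking
    return path[::-1]


def assemble(reads, k):
    """Spell out a contig by walking the Euler path node by node."""
    path = euler_path(build_graph(reads, k))
    if not path:
        return ""
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    reads = ["ATGGCGT", "GGCGTGC", "GTGCA"]
    print(assemble(reads, 4))  # prints ATGGCGTGCA
```

In this toy input the three overlapping reads share no repeated 3-mers, so the Euler path is unique and spells out a single contig. With real data, repeats create branching nodes and several valid paths, which is where read clustering and mate-pair information of the kind the abstract describes become necessary.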
Pages: 9