Denoising DNA deep sequencing data: high-throughput sequencing errors and their correction

Cited by: 200
Authors
Laehnemann, David [2, 3, 4]
Borkhardt, Arndt [5]
McHardy, Alice Carolyn [1, 6]
Affiliations
[1] Helmholtz Ctr Infect Res, Computat Biol Infect Res, Inhoffenstr 7, D-38124 Braunschweig, Germany
[2] Univ Dusseldorf, Dept Algorithm Bioinformat & Paediat Oncol, Dusseldorf, Germany
[3] Univ Dusseldorf, Dept Haematol, Dusseldorf, Germany
[4] Univ Dusseldorf, Dept Immunol, Dusseldorf, Germany
[5] Heinrich Heine Univ Hosp, Dept Paediat Oncol Haematol & Immunol, Dusseldorf, Germany
[6] Univ Dusseldorf, Dept Algorithm Bioinformat, Dusseldorf, Germany
Keywords
next-generation sequencing; high-throughput sequencing; error profile; error correction; error model; bias; SHORT-READ DATA; MICROBIAL GENOMES; HYBRID APPROACH; ION TORRENT; EFFICIENT; ACCURATE; PARALLEL; IDENTIFICATION; ASSEMBLIES; SAMPLE
DOI
10.1093/bib/bbv029
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Characterizing the errors generated by common high-throughput sequencing platforms and telling true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies. Both random and systematic errors can show a specific occurrence profile for each of the six prominent sequencing platforms surveyed here: 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing. There is a large variety of programs available for error removal in sequencing read data, which differ in the error models and statistical techniques they use, the features of the data they analyse, the parameters they determine from them and the data structures and algorithms they use. We highlight the assumptions they make and for which data types these hold, providing guidance which tools to consider for benchmarking with regard to the data properties. While no benchmarking results are included here, such specific benchmarks would greatly inform tool choices and future software development. The development of stand-alone error correctors, as well as single nucleotide variant and haplotype callers, could also benefit from using more of the knowledge about error profiles and from (re)combining ideas from the existing approaches presented here.
Pages: 154-179
Page count: 26