SAMStat: monitoring biases in next generation sequencing data

Cited by: 116
Authors
Lassmann, Timo [1]
Hayashizaki, Yoshihide [1]
Daub, Carsten O. [1]
Affiliations
[1] Riken Yokohama Inst, Omics Sci Ctr, Tsurumi Ku, Yokohama, Kanagawa 2300045, Japan
Keywords
GENOMES; FORMAT
DOI
10.1093/bioinformatics/btq614
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: The sequence alignment/map (SAM) format is commonly used to store the alignments between millions of short reads and a reference genome. Certain positions within the reads are often inherently more likely to contain errors because of the protocols used to prepare the samples. Such biases can adversely affect both mapping rate and accuracy. To understand the relationship between potential protocol biases and poor mapping, we wrote SAMStat, a simple C program that plots nucleotide overrepresentation and other statistics for mapped and unmapped reads in a concise HTML page. Collecting such statistics also makes it easy to highlight problems in the data processing and enables non-experts to track data quality over time.
Results: We demonstrate that studying sequence features in mapped data can identify biases particular to one sequencing protocol. Once identified, such biases can be accounted for in the downstream analysis or even removed by read trimming or filtering.
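To make the idea of per-position nucleotide statistics for mapped and unmapped reads concrete, here is a minimal Python sketch. It is an illustration only and not SAMStat's implementation (SAMStat is written in C and its internals are not described in this record). The function names `original_orientation`, `position_counts` and `position_frequencies`, and the example records, are hypothetical. The sketch relies only on the SAM FLAG bits 0x4 (read unmapped) and 0x10 (sequence stored reverse-complemented).

```python
from collections import Counter, defaultdict

UNMAPPED = 0x4   # SAM FLAG bit: read is unmapped
REVERSE = 0x10   # SAM FLAG bit: SEQ is stored reverse-complemented

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def original_orientation(seq, flag):
    """Return the read as sequenced, undoing the reverse complement if needed."""
    if flag & REVERSE:
        return seq.translate(_COMPLEMENT)[::-1]
    return seq


def position_counts(sam_lines):
    """Count nucleotides per read position, separately for mapped and unmapped reads."""
    counts = {"mapped": defaultdict(Counter), "unmapped": defaultdict(Counter)}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, seq = int(fields[1]), fields[9].upper()
        if seq == "*":
            continue  # sequence not stored for this record
        group = "unmapped" if flag & UNMAPPED else "mapped"
        for pos, base in enumerate(original_orientation(seq, flag)):
            counts[group][pos][base] += 1
    return counts


def position_frequencies(per_position):
    """Turn per-position counts into per-position relative frequencies."""
    freqs = {}
    for pos, counter in per_position.items():
        total = sum(counter.values())
        freqs[pos] = {base: n / total for base, n in counter.items()}
    return freqs


# Hypothetical records: one forward-mapped, one reverse-mapped, one unmapped.
example = [
    "@HD\tVN:1.0",
    "r1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t200\t37\t4M\t*\t0\t0\tAACC\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII",
]
counts = position_counts(example)
mapped_freqs = position_frequencies(counts["mapped"])
```

Comparing the frequency tables of the two groups position by position is one simple way to spot a base that is overrepresented at a given read position among unmapped reads; a real analysis would need many reads and a suitable statistical test.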
Pages: 130-131
Number of pages: 2