Summarizing and correcting the GC content bias in high-throughput sequencing

Cited by: 598
Authors
Benjamini, Yuval [1]
Speed, Terence P. [1,2]
Affiliations
[1] Univ Calif Berkeley, Dept Stat, Berkeley, CA 94720 USA
[2] Walter & Eliza Hall Inst Med Res, Bioinformat Div, Parkville, Vic 3052, Australia
Funding
National Institutes of Health (USA); National Science Foundation (USA)
Keywords
HUMAN GENOME; ILLUMINA; ALIGNMENT;
DOI
10.1093/nar/gks001
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
GC content bias describes the dependence between fragment count (read coverage) and GC content found in Illumina sequencing data. This bias can dominate the signal of interest for analyses that focus on measuring fragment abundance within a genome, such as copy number estimation (DNA-seq). The bias is not consistent between samples, and there is no consensus on the best method to remove it in a single sample. We analyze regularities in the GC bias patterns, and find a compact description for this unimodal curve family. It is the GC content of the full DNA fragment, not only the sequenced read, that most influences fragment count. This GC effect is unimodal: both GC-rich fragments and AT-rich fragments are underrepresented in the sequencing results. This empirical evidence strengthens the hypothesis that PCR is the most important cause of the GC bias. We propose a model that produces predictions at the base pair level, allowing strand-specific GC-effect correction regardless of the downstream smoothing or binning. These GC modeling considerations can inform other high-throughput sequencing analyses such as ChIP-seq and RNA-seq.
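The idea of relating fragment count to the GC content of the full fragment can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single fixed fragment length, one strand, and no mappability filtering, and the names `gc_count`, `gc_rate_curve`, and `predicted_count` are hypothetical. It stratifies every possible fragment start position by the GC count of the fragment that would begin there, estimates a mean fragment rate per stratum, and sums those rates to get a GC-based prediction for any window.

```python
from collections import Counter, defaultdict


def gc_count(seq):
    """Number of G or C bases in a sequence (case-insensitive)."""
    s = seq.upper()
    return s.count("G") + s.count("C")


def gc_rate_curve(genome, fragment_starts, frag_len):
    """Mean observed fragments per position, stratified by fragment GC count.

    genome          -- reference sequence as a string
    fragment_starts -- 0-based start position of each observed fragment
    frag_len        -- assumed fixed fragment length (a simplification)
    """
    starts = Counter(fragment_starts)
    n_positions = defaultdict(int)  # positions falling in each GC stratum
    n_fragments = defaultdict(int)  # fragments starting at those positions
    for pos in range(len(genome) - frag_len + 1):
        gc = gc_count(genome[pos:pos + frag_len])
        n_positions[gc] += 1
        n_fragments[gc] += starts.get(pos, 0)
    return {gc: n_fragments[gc] / n_positions[gc] for gc in n_positions}


def predicted_count(genome, curve, frag_len, start, end):
    """Sum of per-position predicted rates for start positions in [start, end)."""
    last = min(end, len(genome) - frag_len + 1)
    return sum(
        curve.get(gc_count(genome[pos:pos + frag_len]), 0.0)
        for pos in range(start, last)
    )


# Toy data: a 12 bp reference and four observed fragment starts.
genome = "AAAAGGGGAAAA"
curve = gc_rate_curve(genome, [2, 2, 6, 3], frag_len=4)
total = predicted_count(genome, curve, 4, 0, len(genome))
```

Because the prediction is built per base pair, it can be summed over any bin or smoothing window chosen later; dividing an observed bin count by its predicted count then gives one simple form of GC-normalized coverage.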
Pages: 14