Impact of short-read sequencing on the misassembly of a plant genome

Cited by: 4
Authors
Wang, Peipei [1 ,2 ]
Meng, Fanrui [1 ,2 ]
Moore, Bethany M. [1 ,3 ]
Shiu, Shin-Han [1 ,2 ,3 ,4 ]
Affiliations
[1] Michigan State Univ, Dept Plant Biol, E Lansing, MI 48824 USA
[2] Michigan State Univ, DOE Great Lake Bioenergy Res Ctr, E Lansing, MI 48824 USA
[3] Michigan State Univ, Ecol Evolut & Behav Biol Program, E Lansing, MI 48824 USA
[4] Michigan State Univ, Dept Computat Math Sci & Engn, E Lansing, MI 48824 USA
Funding
National Science Foundation (USA)
Keywords
Genome misassembly; Read coverage; Machine learning; Solanum lycopersicum; QUALITY ASSESSMENT; DNA; EVOLUTION; SIGNATURES; TOOL;
DOI
10.1186/s12864-021-07397-5
Chinese Library Classification (CLC) number
Q81 [Bioengineering (biotechnology)]; Q93 [Microbiology]
Discipline classification code
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Background: The availability of plant genome sequences has led to significant advances. However, with few exceptions, most existing genome assemblies are derived from short-read sequencing technologies and show highly uneven read coverage, indicative of sequencing and assembly issues that could significantly affect downstream analyses of plant genomes. In tomato, for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of the short-read-based assembly had significantly higher and lower coverage than background, respectively.
Results: To understand the causes of this uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverage and found that high-coverage regions tend to have higher simple sequence repeat and tandem gene densities than background regions. To determine whether the high-coverage regions were misassembled, we examined a recently available long-read-based tomato assembly and found that 27.8% (1.41 Mb) of high-coverage regions were potentially misassembled duplicate sequences, compared with 1.4% of background regions. In addition, using a predictive model that can distinguish correctly from incorrectly assembled high-coverage regions, we found that misassembled high-coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposable elements.
Conclusions: Our study provides insights into the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads. The generality of these causes and factors should be tested further in other species.
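As a rough consistency check on the figures quoted in the abstract, the percentages and sizes can be back-calculated against each other. The sketch below uses only the rounded values stated above; the helper name `implied_total` is hypothetical, and the results are approximate because the inputs are rounded.

```python
# Figures quoted in the abstract: (percent, size in Mb).
high_pct, high_mb = 0.6, 5.1                    # high-coverage share of assembly
low_pct, low_mb = 9.7, 79.6                     # low-coverage share of assembly
misassembled_pct, misassembled_mb = 27.8, 1.41  # share of high-coverage regions


def implied_total(part_mb, pct):
    """Total size (Mb) implied by a part and its percentage share."""
    return part_mb / (pct / 100.0)


# Assembly size implied by each coverage class (rounded inputs, so approximate).
assembly_from_high = implied_total(high_mb, high_pct)
assembly_from_low = implied_total(low_mb, low_pct)

# Total high-coverage sequence implied by the misassembled fraction.
high_from_misassembled = implied_total(misassembled_mb, misassembled_pct)

print(f"assembly size implied by high-coverage figures: {assembly_from_high:.0f} Mb")
print(f"assembly size implied by low-coverage figures:  {assembly_from_low:.0f} Mb")
print(f"high-coverage total implied by misassembly:     {high_from_misassembled:.2f} Mb")
```

The two implied assembly sizes fall in roughly the 820 to 850 Mb range, and the high-coverage total implied by the misassembled fraction is close to the 5.1 Mb stated in the Background, so the reported numbers appear mutually consistent within rounding.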
Pages: 18
Related papers
50 in total
  • [21] Evaluating Short-Read Whole-Genome Sequencing Accuracy through Pseudo-Replication
    Herzig, A.
    Velo-Suarez, L.
    Le Folgoc, G.
    Genin, E.
    HUMAN HEREDITY, 2020, 84 (4-5) : 210 - 210
  • [22] Long-read genome sequencing secondary processing pipelines provide variant call accuracy that exceeds current clinical standards for short-read genome sequencing
    Holt, James
    Handley, Lori
    Lawlor, James
    Hiatt, Susan
    Cooper, Gregory
    Grimwood, Jane
    Nakouzi, Ghunwa
    GENETICS IN MEDICINE, 2022, 24 (03) : S89 - S89
  • [23] Short-read genome sequencing allows 'en route' diagnosis of patients with atypical Friedreich ataxia
    Fleszar, Zofia
    Dufke, Claudia
    Sturm, Marc
    Schuele, Rebecca
    Schoels, Ludger
    Haack, Tobias B.
    Synofzik, Matthis
    JOURNAL OF NEUROLOGY, 2023, 270 (08) : 4112 - 4117
  • [24] Determining Streptococcus suis serotype from short-read whole-genome sequencing data
    Athey, Taryn B. T.
    Teatero, Sarah
    Lacouture, Sonia
    Takamatsu, Daisuke
    Gottschalk, Marcelo
    Fittipaldi, Nahuel
    BMC MICROBIOLOGY, 2016, 16
  • [25] Short-Read Whole-Genome Sequencing for Laboratory-Based Surveillance of Bordetella pertussis
    Marchand-Austin, Alex
    Tsang, Raymond S. W.
    Guthrie, Jennifer L.
    Ma, Jennifer H.
    Lim, Gillian H.
    Crowcroft, Natasha S.
    Deeks, Shelley L.
    Farrell, David J.
    Jamieson, Frances B.
    JOURNAL OF CLINICAL MICROBIOLOGY, 2017, 55 (05) : 1446 - 1453
  • [27] Comparative Analysis of Structural Variant Callers on Short-Read Whole-Genome Sequencing Data
    Mkrtchyan, A. A.
    Grammatikati, K. S.
    Kazakova, P. G.
    Mitrofanov, S. I.
    Zemsky, P. U.
    Ivashechkin, A. A.
    Pilipenko, M. N.
    Svetlichny, D. V.
    Sergeev, A. P.
    Snigir, E. A.
    Frolova, L. V.
    Shpakova, T. A.
    Yudin, V. S.
    Keskinov, A. A.
    Yudin, S. M.
    Skvortsova, V. I.
    RUSSIAN JOURNAL OF GENETICS, 2023, 59 (06) : 595 - 613
  • [30] Genome sequencing in cytogenetics: Comparison of short-read and linked-read approaches for germline structural variant detection and characterization
    Uguen, Kevin
    Jubin, Claire
    Duffourd, Yannis
    Bardel, Claire
    Malan, Valerie
    Dupont, Jean-Michel
    El Khattabi, Laila
    Chatron, Nicolas
    Vitobello, Antonio
    Rollat-Farnier, Pierre-Antoine
    Baulard, Celine
    Lelorch, Marc
    Leduc, Aurelie
    Tisserant, Emilie
    Mau-Them, Frederic Tran
    Danjean, Vincent
    Delepine, Marc
    Till, Marianne
    Meyer, Vincent
    Lyonnet, Stanislas
    Mosca-Boidron, Anne-laure
    Thevenon, Julien
    Faivre, Laurence
    Thauvin-Robinet, Christel
    Schluth-Bolard, Caroline
    Boland, Anne
    Olaso, Robert
    Callier, Patrick
    Romana, Serge
    Deleuze, Jean-Francois
    Sanlaville, Damien
    MOLECULAR GENETICS & GENOMIC MEDICINE, 2020, 8 (03):