Impact of short-read sequencing on the misassembly of a plant genome

Cited by: 4
Authors
Wang, Peipei [1, 2]
Meng, Fanrui [1, 2]
Moore, Bethany M. [1, 3]
Shiu, Shin-Han [1, 2, 3, 4]
Affiliations
[1] Michigan State Univ, Dept Plant Biol, E Lansing, MI 48824 USA
[2] Michigan State Univ, DOE Great Lake Bioenergy Res Ctr, E Lansing, MI 48824 USA
[3] Michigan State Univ, Ecol Evolut & Behav Biol Program, E Lansing, MI 48824 USA
[4] Michigan State Univ, Dept Computat Math Sci & Engn, E Lansing, MI 48824 USA
Funding
National Science Foundation (USA);
Keywords
Genome misassembly; Read coverage; Machine learning; Solanum lycopersicum; QUALITY ASSESSMENT; DNA; EVOLUTION; SIGNATURES; TOOL;
DOI
10.1186/s12864-021-07397-5
Chinese Library Classification number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];
Discipline classification code
071005 ; 0836 ; 090102 ; 100705 ;
Abstract
Background: The availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short-read sequencing technologies and show highly uneven read coverage, indicative of sequencing and assembly issues that could significantly affect downstream analyses of plant genomes. In tomato, for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of the short-read-based assembly had significantly higher and lower coverage than background, respectively.
Results: To understand the causes of such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverage and found that high-coverage regions tend to have higher simple sequence repeat and tandem gene densities than background regions. To determine whether the high-coverage regions were misassembled, we examined a recently available tomato long-read-based assembly and found that 27.8% (1.41 Mb) of high-coverage regions were potential misassemblies of duplicate sequences, compared to 1.4% of background regions. In addition, using a predictive model that can distinguish correctly assembled from incorrectly assembled high-coverage regions, we found that misassembled high-coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposable elements.
Conclusions: Our study provides insights into the causes of variable-coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads. The generality of these causes and factors should be tested further in other species.
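The abstract does not state how regions with "significantly higher and lower coverage compared to background" were called. As a rough illustration only, the sketch below labels fixed-size windows by how far their mean read depth lies from the genome-wide median, in units of the scaled median absolute deviation (MAD). The function name `classify_windows`, the threshold `k`, and the MAD rule are all assumptions made for this example and are not taken from the paper.

```python
from statistics import median


def classify_windows(coverages, k=3.0):
    """Label each window as 'high', 'low', or 'background'.

    Hypothetical rule (not the authors' method): a window is an outlier
    when its mean coverage is more than k scaled MADs from the median.
    """
    med = median(coverages)
    deviations = [abs(c - med) for c in coverages]
    mad = median(deviations)
    # 1.4826 makes the MAD comparable to a standard deviation for
    # normally distributed data; fall back to 1.0 if the MAD is zero.
    scale = 1.4826 * mad if mad > 0 else 1.0

    labels = []
    for c in coverages:
        z = (c - med) / scale
        if z > k:
            labels.append("high")
        elif z < -k:
            labels.append("low")
        else:
            labels.append("background")
    return labels


# Toy per-window mean depths (made-up numbers, not data from the study).
depths = [30, 31, 29, 30, 32, 28, 30, 95, 2, 30]
labels = classify_windows(depths)
```

On real data the per-window depths would come from a read alignment, and the fraction of windows in each class could then be compared with the percentages reported in the abstract.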
Pages: 18
Related papers
50 in total
  • [31] An assessment of bioinformatics tools for the detection of human endogenous retroviral insertions in short-read genome sequencing data
    Bowles, Harry
    Kabiljo, Renata
    Al Khleifat, Ahmad
    Jones, Ashley
    Quinn, John P.
    Dobson, Richard J. B.
    Swanson, Chad M.
    Al-Chalabi, Ammar
    Iacoangeli, Alfredo
    FRONTIERS IN BIOINFORMATICS, 2023, 2
  • [32] Comparative analysis of 7 short-read sequencing platforms using the Korean Reference Genome: MGI and Illumina sequencing benchmark for whole-genome sequencing
    Kim, Hak-Min
    Jeon, Sungwon
    Chung, Oksung
    Jun, Je Hoon
    Kim, Hui-Su
    Blazyte, Asta
    Lee, Hwang-Yeol
    Yu, Youngseok
    Cho, Yun Sung
    Bolser, Dan M.
    Bhak, Jong
    GIGASCIENCE, 2021, 10 (03):
  • [33] Indel variant analysis of short-read sequencing data with Scalpel
    Fang, Han
    Bergmann, Ewa A.
    Arora, Kanika
    Vacic, Vladimir
    Zody, Michael C.
    Iossifov, Ivan
    O'Rawe, Jason A.
    Wu, Yiyang
    Barron, Laura T. Jimenez
    Rosenbaum, Julie
    Ronemus, Michael
    Lee, Yoon-ha
    Wang, Zihua
    Dikoglu, Esra
    Jobanputra, Vaidehi
    Lyon, Gholson J.
    Wigler, Michael
    Schatz, Michael C.
    Narzisi, Giuseppe
    NATURE PROTOCOLS, 2016, 11 (12) : 2529 - 2548
  • [34] Leveraging Short-Read Sequencing to Explore the Genomics of Sepiolid Squid
    Heath-Heckman, Elizabeth
    Nishiguchi, Michele K.
    INTEGRATIVE AND COMPARATIVE BIOLOGY, 2021, 61 (05) : 1753 - 1761
  • [35] Polypolish: Short-read polishing of long-read bacterial genome assemblies
    Wick, Ryan R.
    Holt, Kathryn E.
    PLOS COMPUTATIONAL BIOLOGY, 2022, 18 (01)
  • [36] Analysis of Short-read Aligners using Genome Sequence Complexity
    Quang Tran
    Nam Sy Vo
    Hicks, Eric
    Tin Nguyen
    Vinhthuy Phan
    2020 12TH INTERNATIONAL CONFERENCE ON KNOWLEDGE AND SYSTEMS ENGINEERING (IEEE KSE 2020), 2020 : 312 - 317
  • [37] Illumina short-read sequencing data, de novo assembly and annotations of the Drosophila nasuta nasuta genome
    DSouza, Stafny
    Ponnanna, Koushik
    Chokkanna, Amruthavalli
    Ramachandra, Nallur
    DATA IN BRIEF, 2021, 34
  • [38] Advancing molecular diagnostics of myotonic dystrophy type 1 using short-read whole genome sequencing
    Lojova, Ingrid
    Kucharik, Marcel
    Pos, Zuzana
    Balaz, Andrej
    Zatkova, Andrea
    Tarova, Eva Tothova
    Budis, Jaroslav
    Kadasi, Ludevit
    Szemes, Tomas
    Radvanszky, Jan
    MOLECULAR AND CELLULAR PROBES, 2025, 79
  • [39] Short-read whole genome sequencing identifies causative variants in most individuals with previously unexplained aniridia
    Hall, Hildegard Nikki
    Parry, David
    Halachev, Mihail
    Williamson, Kathleen A.
    Donnelly, Kevin
    Campos Parada, Jose
    Bhatia, Shipra
    Joseph, Jeffrey
    Holden, Simon
    Prescott, Trine E.
    Bitoun, Pierre
    Kirk, Edwin P.
    Newbury-Ecob, Ruth
    Lachlan, Katherine
    Bernar, Juan
    van Heyningen, Veronica
    Fitzpatrick, David R.
    Meynert, Alison
    JOURNAL OF MEDICAL GENETICS, 2024, 61 (03) : 250 - 261
  • [40] Short-read DNA Sequencing Yields Microsatellite Markers for Rheum
    Gilmore, Barbara S.
    Bassil, Nahla V.
    Barney, Danny L.
    Knaus, Brian J.
    Hummer, Kim E.
    JOURNAL OF THE AMERICAN SOCIETY FOR HORTICULTURAL SCIENCE, 2014, 139 (01) : 22 - 29