K-mer counting and curated libraries drive efficient annotation of repeats in plant genomes

Cited by: 7
Authors
Contreras-Moreira, Bruno [1]
Filippi, Carla V. [1, 2, 3, 4]
Naamati, Guy [1]
Giron, Carlos Garcia [1]
Allen, James E. [1]
Flicek, Paul [1]
Affiliations
[1] European Bioinformat Inst, European Mol Biol Lab, Wellcome Genome Campus, Cambridge CB10 1SD, England
[2] Inst Nacl Tecnol Agr INTA, Ctr Invest Ciencias Veterinarias & Agronom CICVyA, Inst Biotecnol, Nicolas Repetto & Los Reseros S-N 1686, Buenos Aires, DF, Argentina
[3] INTA Consejo Nacl Invest Cient & Tecn CONICET, Inst Agrobiotecnol & Biol Mol IABIMO, Nicolas Repetto & Los Reseros S-N 1686, Buenos Aires, DF, Argentina
[4] Consejo Nacl Invest Cient & Tecn, Av Rivadavia 1917,C1033AAJ, Buenos Aires, Argentina
Funding
National Science Foundation (USA);
Keywords
SEQUENCE; COMPLEXITY; DISCOVERY; DATABASE; SPACE; TOOL;
DOI
10.1002/tpg2.20143
Chinese Library Classification
Q94 [Botany];
Discipline classification code
071001 ;
Abstract
The annotation of repetitive sequences within plant genomes can help in the interpretation of observed phenotypes. Moreover, repeat masking is required for tasks such as whole-genome alignment, promoter analysis, or pangenome exploration. Although homology-based annotation methods are computationally expensive, k-mer strategies for masking are orders of magnitude faster. Here, we benchmarked a two-step approach, where repeats were first called by k-mer counting and then annotated by comparison to curated libraries. This hybrid protocol was tested on 20 plant genomes from Ensembl, with the k-mer-based Repeat Detector (Red) and two repeat libraries (REdat, last updated in 2013, and nrTEplants, curated for this work). Custom libraries produced by RepeatModeler were also tested. We obtained repeated genome fractions that matched those reported in the literature but with shorter repeated elements than those produced directly by sequence homology. Inspection of the masked regions that overlapped genes revealed no preference for specific protein domains. Most Red-masked sequences could be successfully classified by sequence similarity, with the complete protocol taking less than 2 h on a desktop Linux box. A guide to curating your own repeat libraries and the scripts for masking and annotating plant genomes can be obtained at https://github.com/Ensembl/plant-scripts.
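The two-step idea in the abstract (call repeats by k-mer counting, then annotate them against a curated library) can be illustrated with a toy sketch. This is not the algorithm used by Red and not the code in the plant-scripts repository. The function names, the fixed count threshold, and the use of shared k-mers as a stand-in for a real sequence-similarity search are all assumptions made for illustration.

```python
from collections import Counter


def mask_repeats(seq: str, k: int = 4, min_count: int = 3) -> str:
    """Soft-mask (lowercase) every base covered by a k-mer seen >= min_count times.

    Hypothetical helper: a fixed threshold is a simplification, not Red's model.
    """
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    masked = [False] * len(seq)
    for i, kmer in enumerate(kmers):
        if counts[kmer] >= min_count:
            for j in range(i, i + k):
                masked[j] = True
    return "".join(c.lower() if m else c for c, m in zip(seq, masked))


def masked_intervals(masked_seq: str) -> list[tuple[int, int]]:
    """Return half-open (start, end) intervals of lowercase (masked) runs."""
    intervals = []
    start = None
    for i, c in enumerate(masked_seq):
        if c.islower():
            if start is None:
                start = i
        elif start is not None:
            intervals.append((start, i))
            start = None
    if start is not None:
        intervals.append((start, len(masked_seq)))
    return intervals


def kmer_set(seq: str, k: int) -> set[str]:
    """All distinct k-mers of a sequence, uppercased."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def annotate(masked_seq: str, library: dict[str, str], k: int = 4):
    """Label each masked interval with the library entry sharing the most k-mers.

    Shared k-mers are an assumed proxy for a proper alignment-based comparison.
    Intervals with no shared k-mer get the label None.
    """
    lib_kmers = {name: kmer_set(ref, k) for name, ref in library.items()}
    annotations = []
    for start, end in masked_intervals(masked_seq):
        frag = kmer_set(masked_seq[start:end], k)
        best_name, best_shared = None, 0
        for name, ref_kmers in lib_kmers.items():
            shared = len(frag & ref_kmers)
            if shared > best_shared:
                best_name, best_shared = name, shared
        annotations.append((start, end, best_name))
    return annotations


if __name__ == "__main__":
    # Made-up toy genome and library entries, not real repeat families.
    genome = "ACGTACGTACGTTTGCA"
    toy_library = {"toy_LTR": "ACGTACGT", "toy_DNA": "GGGGCCCC"}
    soft_masked = mask_repeats(genome, k=4, min_count=3)
    print(soft_masked)
    print(annotate(soft_masked, toy_library, k=4))
```

In this sketch the masking step never consults the library, which mirrors why k-mer masking can be much cheaper than homology search: the comparison against library sequences is restricted to the intervals that were already masked.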
Pages: 14
Related papers
71 entries in total
  • [1] Transposons played a major role in the diversification between the closely related almond and peach genomes: results from the almond genome sequence
    Alioto, Tyler
    Alexiou, Konstantinos G.
    Bardil, Amelie
    Barteri, Fabio
    Castanera, Raul
    Cruz, Fernando
    Dhingra, Amit
    Duval, Henri
    Fernandez i Marti, Angel
    Frias, Leonor
    Galan, Beatriz
    Garcia, Jose L.
    Howad, Werner
    Gomez-Garrido, Jessica
    Gut, Marta
    Julca, Irene
    Morata, Jordi
    Puigdomenech, Pere
    Ribeca, Paolo
    Rubio Cabetas, Maria J.
    Vlasova, Anna
    Wirthensohn, Michelle
    Garcia-Mas, Jordi
    Gabaldon, Toni
    Casacuberta, Josep M.
    Arus, Pere
    [J]. PLANT JOURNAL, 2020, 101 (02) : 455 - 472
  • [2] Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
    Altschul, SF
    Madden, TL
    Schaffer, AA
    Zhang, JH
    Zhang, Z
    Miller, W
    Lipman, DJ
    [J]. NUCLEIC ACIDS RESEARCH, 1997, 25 (17) : 3389 - 3402
  • [3] RepetDB: a unified resource for transposable element references
    Amselem, Joelle
    Cornut, Guillaume
    Choisne, Nathalie
    Alaux, Michael
    Alfama-Depauw, Francoise
    Jamilloux, Veronique
    Maumus, Florian
    Letellier, Thomas
    Luyten, Isabelle
    Pommier, Cyril
    Adam-Blondon, Anne-Francoise
    Quesneville, Hadi
    [J]. MOBILE DNA, 2019, 10 (1)
  • [4] The sunflower genome provides insights into oil metabolism, flowering and Asterid evolution
    Badouin, Helene
    Gouzy, Jerome
    Grassa, Christopher J.
    Murat, Florent
    Staton, S. Evan
    Cottret, Ludovic
    Lelandais-Briere, Christine
    Owens, Gregory L.
    Carrere, Sebastien
    Mayjonade, Baptiste
    Legrand, Ludovic
    Gill, Navdeep
    Kane, Nolan C.
    Bowers, John E.
    Hubner, Sariel
    Bellec, Arnaud
    Berard, Aurelie
    Berges, Helene
    Blanchet, Nicolas
    Boniface, Marie-Claude
    Brunel, Dominique
    Catrice, Olivier
    Chaidir, Nadia
    Claudel, Clotilde
    Donnadieu, Cecile
    Faraut, Thomas
    Fievet, Ghislain
    Helmstetter, Nicolas
    King, Matthew
    Knapp, Steven J.
    Lai, Zhao
    Le Paslier, Marie-Christine
    Lippi, Yannick
    Lorenzon, Lolita
    Mandel, Jennifer R.
    Marage, Gwenola
    Marchand, Gwenaelle
    Marquand, Elodie
    Bret-Mestries, Emmanuelle
    Morien, Evan
    Nambeesan, Savithri
    Thuy Nguyen
    Pegot-Espagnet, Prune
    Pouilly, Nicolas
    Raftis, Frances
    Sallet, Erika
    Schiex, Thomas
    Thomas, Justine
    Vandecasteele, Celine
    Vares, Didier
    [J]. NATURE, 2017, 546 (7656) : 148 - +
  • [5] Repbase Update, a database of repetitive elements in eukaryotic genomes
    Bao, Weidong
    Kojima, Kenji K.
    Kohany, Oleksiy
    [J]. MOBILE DNA, 2015, 6
  • [6] UniProt: a worldwide hub of protein knowledge
    Bateman, Alex
    Martin, Maria-Jesus
    Orchard, Sandra
    Magrane, Michele
    Alpi, Emanuele
    Bely, Benoit
    Bingley, Mark
    Britto, Ramona
    Bursteinas, Borisas
    Busiello, Gianluca
    Bye-A-Jee, Hema
    Da Silva, Alan
    De Giorgi, Maurizio
    Dogan, Tunca
    Castro, Leyla Garcia
    Garmiri, Penelope
    Georghiou, George
    Gonzales, Daniel
    Gonzales, Leonardo
    Hatton-Ellis, Emma
    Ignatchenko, Alexandr
    Ishtiaq, Rizwan
    Jokinen, Petteri
    Joshi, Vishal
    Jyothi, Dushyanth
    Lopez, Rodrigo
    Luo, Jie
    Lussi, Yvonne
    MacDougall, Alistair
    Madeira, Fabio
    Mahmoudy, Mahdi
    Menchi, Manuela
    Nightingale, Andrew
    Onwubiko, Joseph
    Palka, Barbara
    Pichler, Klemens
    Pundir, Sangya
    Qi, Guoying
    Raj, Shriya
    Renaux, Alexandre
    Lopez, Milagros Rodriguez
    Saidi, Rabie
    Sawford, Tony
    Shypitsyna, Aleksandra
    Speretta, Elena
    Turner, Edward
    Tyagi, Nidhi
    Vasudev, Preethi
    Volynkin, Vladimir
    Wardell, Tony
    [J]. NUCLEIC ACIDS RESEARCH, 2019, 47 (D1) : D506 - D515
  • [7] Baud A., 2019, TRACES PAST TRANPOSA, DOI 10.1101/547877
  • [8] Bias in resistance gene prediction due to repeat masking
    Bayer, Philipp E.
    Edwards, David
    Batley, Jacqueline
    [J]. NATURE PLANTS, 2018, 4 (10) : 762 - 765
  • [9] Kmasker plants - a tool for assessing complex sequence space in plant species
    Beier, Sebastian
    Ulpinnis, Chris
    Schwalbe, Markus
    Muench, Thomas
    Hoffie, Robert
    Koeppel, Iris
    Hertig, Christian
    Budhagatapalli, Nagaveni
    Hiekel, Stefan
    Pathi, Krishna M.
    Hensel, Goetz
    Grosse, Martin
    Chamas, Sindy
    Gerasimova, Sophia
    Kumlehn, Jochen
    Scholz, Uwe
    Schmutzer, Thomas
    [J]. PLANT JOURNAL, 2020, 102 (03) : 631 - 642
  • [10] Tandem repeats finder: a program to analyze DNA sequences
    Benson, G
    [J]. NUCLEIC ACIDS RESEARCH, 1999, 27 (02) : 573 - 580