The Dfam database of repetitive DNA families

Cited by: 441
Authors
Hubley, Robert [1]
Finn, Robert D. [2]
Clements, Jody [3]
Eddy, Sean R. [4]
Jones, Thomas A. [4]
Bao, Weidong [5]
Smit, Arian F. A. [1]
Wheeler, Travis J. [6]
Affiliations
[1] Inst Syst Biol, Seattle, WA 98109 USA
[2] European Bioinformat Inst EMBL EBI, European Mol Biol Lab, Wellcome Trust Genome Campus, Cambridge CB10 1RQ, England
[3] HHMI Janelia Res Campus, Ashburn, VA 20147 USA
[4] Harvard Univ, Howard Hughes Med Inst, Cambridge, MA 02138 USA
[5] Genet Informat Res Inst, Los Altos, CA 94022 USA
[6] Univ Montana, Missoula, MT 59812 USA
Funding
US National Institutes of Health (NIH);
Keywords
DE-NOVO IDENTIFICATION; INTERSPERSED REPEATS; ELEMENTS; ORGANIZATION; MATRICES; REPBASE; SEARCH; MOUSE; SINES;
DOI
10.1093/nar/gkv1272
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Subject classification codes
071010; 081704;
Abstract
Repetitive DNA, especially that due to transposable elements (TEs), makes up a large fraction of many genomes. Dfam is an open access database of families of repetitive DNA elements, in which each family is represented by a multiple sequence alignment and a profile hidden Markov model (HMM). The initial release of Dfam, featured in the 2013 NAR Database Issue, contained 1143 families of repetitive elements found in humans, and was used to produce more than 100 Mb of additional annotation of TE-derived regions in the human genome, with improved speed. Here, we describe recent advances, most notably expansion to 4150 total families including a comprehensive set of known repeat families from four new organisms (mouse, zebrafish, fly and nematode). We describe improvements to coverage, and to our methods for identifying and reducing false annotation. We also describe updates to the website interface. The Dfam website has moved to http://dfam.org. Seed alignments, profile HMMs, hit lists and other underlying data are available for download.
Pages: D81-D89
Page count: 9