A new repeat-masking method enables specific detection of homologous sequences

Cited by: 113
Author
Frith, Martin C. [1]
Affiliation
[1] Computat Biol Res Ctr, Inst Adv Ind Sci & Technol, Sequence Anal Team, Koto Ku, Tokyo 1350064, Japan
Keywords
ACID SUBSTITUTION MATRICES; COMPARATIVE GENOMICS; DNA-SEQUENCES; DATABASE;
DOI
10.1093/nar/gkq1212
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
Biological sequences are often analyzed by detecting homologous regions between them. Homology search is confounded by simple repeats, which give rise to strong similarities that are not homologies. Standard repeat-masking methods fail to eliminate this problem, and they are especially ill-suited to AT-rich DNA such as malaria and slime-mould genomes. We present a new repeat-masking method, tantan, which is motivated by the mechanisms that create simple repeats. This method thoroughly eliminates spurious homology predictions for DNA-DNA, protein-protein and DNA-protein comparisons. Moreover, it enables accurate homology search for non-coding DNA with extreme A + T composition.
Pages: 8