A simple method to control over-alignment in the MAFFT multiple sequence alignment program

Cited by: 440
Authors
Katoh, Kazutaka [1]
Standley, Daron M. [1, 2]
Affiliations
[1] Osaka Univ, Immunol Frontier Res Ctr, Suita, Osaka 5650871, Japan
[2] Kyoto Univ, Inst Virus Res, Kyoto 6068507, Japan
Keywords
ACID SUBSTITUTION MATRICES; HIDDEN MARKOV-MODELS; ACCURACY; CLUSTAL; REFINEMENT; GENERATION; STRATEGY; MUSCLE; ERRORS; COFFEE;
DOI
10.1093/bioinformatics/btw108
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification codes
071010; 081704;
Abstract
Motivation: We present a new feature of the MAFFT multiple alignment program for suppressing over-alignment (aligning unrelated segments). Conventional MAFFT is highly sensitive in aligning conserved regions in remote homologs, but the risk of over-alignment has recently been growing as low-quality or noisy sequences become more common in protein sequence databases, owing, for example, to sequencing errors and the difficulty of gene prediction.
Results: The proposed method uses a variable scoring matrix for different pairs of sequences (or groups) within a single multiple sequence alignment, chosen according to the global similarity of each pair. This method significantly increases the number of correctly gapped sites in real examples and in simulations under various conditions. Regarding sensitivity, the effect of the proposed method is slightly negative in real protein-based benchmarks and mostly neutral in simulation-based benchmarks. The approach is based on natural biological reasoning and should be compatible with many dynamic-programming-based methods for multiple sequence alignment.
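The idea of choosing a scoring matrix per pair from that pair's global similarity can be illustrated with a small sketch. This is not MAFFT's implementation: the k-mer Jaccard similarity, the thresholds, the matrix names and the toy (match, mismatch) scores are all assumptions made for illustration, as is the rule that more similar pairs receive a stricter matrix.

```python
from itertools import combinations

# Hypothetical toy "matrices": (match, mismatch) scores only.
# Real aligners use full 20x20 substitution matrices; these names are illustrative.
TOY_MATRICES = {
    "strict": (2, -3),      # assumed choice for globally similar pairs
    "moderate": (2, -1),
    "permissive": (2, 0),   # assumed choice for globally divergent pairs
}


def kmer_set(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def global_similarity(a, b, k=3):
    """Jaccard index of shared k-mers, used here as a stand-in similarity measure."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def choose_matrix(similarity, high=0.5, low=0.2):
    """Map a similarity in [0, 1] to a matrix name; thresholds are arbitrary."""
    if similarity >= high:
        return "strict"
    if similarity >= low:
        return "moderate"
    return "permissive"


def assign_matrices(seqs):
    """Pick one matrix name for every unordered pair of named sequences."""
    return {
        (i, j): choose_matrix(global_similarity(seqs[i], seqs[j]))
        for i, j in combinations(sorted(seqs), 2)
    }


example = {"a": "MKTAYIAKQR", "b": "MKTAYIAKQR", "c": "GGGGSSSSPP"}
pair_matrices = assign_matrices(example)
```

In a progressive aligner, the per-pair (or per-group) choice would then be passed to the dynamic-programming step in place of a single fixed matrix; how that is done in practice depends on the aligner.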
Pages: 1933-1942
Number of pages: 10