A statistical score for assessing the quality of multiple sequence alignments

Cited by: 38
Authors
Ahola, Virpi [1]
Aittokallio, Tero
Vihinen, Mauno
Uusipaikka, Esa
Affiliations
[1] MTT Agrifood Res Finland, Jokioinen, Finland
[2] Univ Turku, Dept Stat, Turku, Finland
[3] Univ Turku, Dept Math, Turku, Finland
[4] Univ Tampere, Inst Med Technol, FIN-33101 Tampere, Finland
[5] Tampere Univ Hosp, Res Unit, Tampere, Finland
[6] Inst Pasteur, Syst Biol Unit, Paris, France
DOI
10.1186/1471-2105-7-484
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Background: Multiple sequence alignment is the foundation of many important applications in bioinformatics that aim at detecting functionally important regions, predicting protein structures, building phylogenetic trees, etc. Although the automatic construction of a multiple sequence alignment for a set of remotely related sequences is a very challenging and error-prone task, many downstream analyses still rely heavily on the accuracy of the alignments.
Results: To address the need for an objective evaluation framework, we introduce a statistical score that assesses the quality of a given multiple sequence alignment. The quality assessment is based on counting the number of significantly conserved positions in the alignment, using an importance sampling method in conjunction with a statistical profile analysis framework. We first evaluate a novel objective function used in the alignment quality score for measuring positional conservation. The results for the Src homology 2 (SH2) domain, Ras-like proteins, peptidase M13, subtilase and beta-lactamase families demonstrate that the score can distinguish sequence patterns with different degrees of conservation. Secondly, we evaluate the quality of the alignments produced by several widely used multiple sequence alignment programs using the novel alignment quality score and the commonly used sum-of-pairs method. According to these results, the Mafft strategy L-INS-i outperforms the other methods, although the differences among Probcons, TCoffee and Muscle are mostly insignificant. The novel alignment quality score gives results similar to those of the sum-of-pairs method.
Conclusion: The results indicate that the proposed statistical score is useful in assessing the quality of multiple sequence alignments.
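The abstract compares the new score against a sum-of-pairs method but does not define it. As a rough illustration only, the sketch below uses one common benchmark-style definition: the fraction of residue pairs aligned in a reference alignment that are also aligned in the test alignment. This definition, the gap character `-`, and the function names `aligned_pairs` and `sum_of_pairs_score` are assumptions made for the example; they are not taken from the paper.

```python
from itertools import combinations


def aligned_pairs(alignment, gap="-"):
    """Return the set of residue pairs that share a column.

    `alignment` is a list of equal-length gapped strings (an assumed input
    format). Each residue is identified by (sequence index, ungapped
    position), so pairs are comparable across different alignments of the
    same sequences.
    """
    counters = [0] * len(alignment)  # ungapped position reached in each row
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for seq_idx, row in enumerate(alignment):
            if row[col] != gap:
                present.append((seq_idx, counters[seq_idx]))
                counters[seq_idx] += 1
        # Every two residues in the same column form one aligned pair.
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


if __name__ == "__main__":
    reference = ["AC-G", "ACTG"]
    test = ["ACG-", "ACTG"]
    print(sum_of_pairs_score(test, reference))
```

In this toy case the reference aligns three residue pairs and the test alignment recovers two of them, so the score is 2/3. A score of this kind needs a trusted reference alignment, whereas the statistical score described in the abstract is computed from the alignment itself by counting significantly conserved positions.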
Pages: 19