Characterization of pairwise and multiple sequence alignment errors

Cited by: 31
Authors
Landan, Giddy [1]
Graur, Dan [1]
Affiliations
[1] Univ Houston, Dept Biol & Biochem, Houston, TX 77204 USA
Keywords
Multiple sequence alignment; Pairwise sequence alignment; Alignment errors; STATISTICAL SIGNIFICANCE; ACCURACY; PROGRAMS; RELIABILITY; SENSITIVITY; CONSISTENCY; PHYLOGENY; BIAS;
DOI
10.1016/j.gene.2008.05.016
Chinese Library Classification (CLC) number
Q3 [Genetics];
Discipline classification code
071007 ; 090102 ;
Abstract
We characterize pairwise and multiple sequence alignment (MSA) errors by comparing true alignments from simulations of sequence evolution with reconstructed alignments. The vast majority of reconstructed alignments contain many errors. Error rates increase rapidly with sequence divergence; thus, even for intermediate degrees of sequence divergence, more than half of the columns of a reconstructed alignment may be expected to be erroneous. In closely related sequences, most errors consist of the erroneous positioning of a single indel event, and their effect is local. As sequences diverge, errors become more complex as a result of the simultaneous mis-reconstruction of many indel events, and the lengths of the affected MSA segments increase dramatically. We found a systematic bias towards underestimation of the number of gaps, which leads to the reconstructed MSA being on average shorter than the true one. Alignment errors are unavoidable even when the evolutionary parameters are known in advance. Correct reconstruction can only be guaranteed when the likelihood of the true alignment is uniquely optimal. However, true alignment features are very frequently sub-optimal or co-optimal, with the result that optimal albeit erroneous features are incorporated into the reconstructed MSA. Progressive MSA utilizes a guide-tree in the reconstruction of MSAs. The quality of the guide-tree was found to affect MSA error levels only marginally. (C) 2008 Elsevier B.V. All rights reserved.
Pages: 141-147
Number of pages: 7
Related papers
41 in total
[1]  
[Anonymous], 1995, Introduction to computational biology: maps, sequences and genomes
[2]  
[Anonymous], 1997, ACM SIGACT NEWS
[3]   ProbCons: Probabilistic consistency-based multiple sequence alignment [J].
Do, CB ;
Mahabhashyam, MSP ;
Brudno, M ;
Batzoglou, S .
GENOME RESEARCH, 2005, 15 (02) :330-340
[4]   Multiple sequence alignment [J].
Edgar, Robert C. ;
Batzoglou, Serafim .
CURRENT OPINION IN STRUCTURAL BIOLOGY, 2006, 16 (03) :368-373
[5]   EFFECTS OF SEQUENCE ALIGNMENT ON THE PHYLOGENY OF SARCOCYSTIS DEDUCED FROM 18S RDNA SEQUENCES [J].
ELLIS, J ;
MORRISON, D .
PARASITOLOGY RESEARCH, 1995, 81 (08) :696-699
[6]   PROGRESSIVE SEQUENCE ALIGNMENT AS A PREREQUISITE TO CORRECT PHYLOGENETIC TREES [J].
FENG, DF ;
DOOLITTLE, RF .
JOURNAL OF MOLECULAR EVOLUTION, 1987, 25 (04) :351-360
[7]   On the significance of sequence alignments when using multiple scoring matrices [J].
Frommlet, F ;
Futschik, A ;
Bogdan, M .
BIOINFORMATICS, 2004, 20 (06) :881-887
[8]   POISSON, COMPOUND POISSON AND PROCESS APPROXIMATIONS FOR TESTING STATISTICAL SIGNIFICANCE IN SEQUENCE COMPARISONS [J].
GOLDSTEIN, L ;
WATERMAN, MS .
BULLETIN OF MATHEMATICAL BIOLOGY, 1992, 54 (05) :785-812
[9]   Mind the gaps: Evidence of bias in estimates of multiple sequence alignments [J].
Golubchik, Tanya ;
Wise, Michael J. ;
Easteal, Simon ;
Jermiin, Lars S. .
MOLECULAR BIOLOGY AND EVOLUTION, 2007, 24 (11) :2433-2442
[10]   CONSISTENCY OF OPTIMAL SEQUENCE ALIGNMENTS [J].
GOTOH, O .
BULLETIN OF MATHEMATICAL BIOLOGY, 1990, 52 (04) :509-525