Multiple alignment by aligning alignments

Cited by: 179
Authors
Wheeler, Travis J. [1]
Kececioglu, John D. [1]
Affiliations
[1] Univ Arizona, Dept Comp Sci, Tucson, AZ 85721 USA
DOI
10.1093/bioinformatics/btm226
Chinese Library Classification
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: Multiple sequence alignment is a fundamental task in bioinformatics. Current tools typically form an initial alignment by merging subalignments, and then polish this alignment by repeated splitting and merging of subalignments to obtain an improved final alignment. In general this form-and-polish strategy consists of several stages, and a profusion of methods have been tried at every stage. We carefully investigate: (1) how to utilize a new algorithm for aligning alignments that optimally solves the common subproblem of merging subalignments, and (2) what is the best choice of method for each stage to obtain the highest quality alignment.
Results: We study six stages in the form-and-polish strategy for multiple alignment: parameter choice, distance estimation, merge-tree construction, sequence-pair weighting, alignment merging, and polishing. For each stage, we consider novel approaches as well as standard ones. Interestingly, the greatest gains in alignment quality come from (i) estimating distances by a new approach using normalized alignment costs, and (ii) polishing by a new approach using 3-cuts. Experiments with a parameter-value oracle suggest large gains in quality may be possible through an input-dependent choice of alignment parameters, and we present a promising approach for building such an oracle. Combining the best approaches to each stage yields a new tool we call Opal that on benchmark alignments matches the quality of the top tools, without employing alignment consistency or hydrophobic gap penalties.
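The form-and-polish loop described in the abstract can be illustrated with a small toy sketch. This is not Opal and not the aligning-alignments algorithm from the paper. It rests on several simplifying assumptions made only for illustration: a sum-of-pairs cost with a linear gap cost (mismatch 1, gap 2, both arbitrary values), merging in input order rather than along a merge tree, and polishing by splitting off a single sequence rather than by 3-cuts. All function names (`merge`, `form`, `polish`, and so on) are hypothetical. Under this toy cost, the cost of a merge decomposes column by column, so a plain dynamic program over columns finds the cheapest merge of two subalignments.

```python
from itertools import combinations

GAP = "-"

def pair_cost(a, b, mismatch=1, gap=2):
    """Toy cost of one aligned character pair (linear gap cost, an assumption)."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return 0 if a == b else mismatch

def sp_cost(aln):
    """Sum-of-pairs cost of an alignment given as equal-length strings."""
    return sum(pair_cost(x, y) for col in zip(*aln) for x, y in combinations(col, 2))

def col_cost(ca, cb):
    """Cost of placing column ca of one alignment against column cb of another."""
    return sum(pair_cost(x, y) for x in ca for y in cb)

def merge(A, B):
    """Merge alignments A and B by dynamic programming over their columns."""
    ca, cb = list(zip(*A)), list(zip(*B))
    ga, gb = (GAP,) * len(A), (GAP,) * len(B)   # all-gap columns
    n, m = len(ca), len(cb)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + col_cost(ca[i - 1], gb)
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + col_cost(ga, cb[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + col_cost(ca[i - 1], cb[j - 1]),
                          D[i - 1][j] + col_cost(ca[i - 1], gb),
                          D[i][j - 1] + col_cost(ga, cb[j - 1]))
    cols, i, j = [], n, m                        # trace back one optimal path
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + col_cost(ca[i - 1], cb[j - 1]):
            cols.append(ca[i - 1] + cb[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + col_cost(ca[i - 1], gb):
            cols.append(ca[i - 1] + gb); i -= 1
        else:
            cols.append(ga + cb[j - 1]); j -= 1
    cols.reverse()
    return ["".join(row) for row in zip(*cols)]

def strip_gap_columns(aln):
    """Drop columns that contain only gaps."""
    cols = [c for c in zip(*aln) if any(x != GAP for x in c)]
    return ["".join(r) for r in zip(*cols)] if cols else ["" for _ in aln]

def form(seqs):
    """Form stage: merge sequences one at a time in input order (no merge tree)."""
    aln = [seqs[0]]
    for s in seqs[1:]:
        aln = merge(aln, [s])
    return aln

def polish(aln, rounds=3):
    """Polish stage: split off one sequence, re-merge it, keep strict improvements."""
    best = sp_cost(aln)
    for _ in range(rounds):
        for k in range(len(aln)):
            rest = strip_gap_columns(aln[:k] + aln[k + 1:])
            merged = merge(rest, [aln[k].replace(GAP, "")])
            cand = merged[:k] + [merged[-1]] + merged[k:-1]  # restore row order
            cost = sp_cost(cand)
            if cost < best:
                aln, best = cand, cost
    return aln
```

A call such as `polish(form(["ACGT", "AGT", "ACT"]))` returns a list of equal-length gapped strings, one per input sequence and in input order. Because `polish` only accepts a candidate whose cost is strictly lower, its result never costs more than the alignment it was given. The sketch assumes non-empty input sequences.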
Pages: I559-I568
Number of pages: 10