The whole alignment and nothing but the alignment: the problem of spurious alignment flanks

Cited by: 16
Authors
Frith, Martin C. [2]
Park, Yonil [1]
Sheetlin, Sergey L. [1]
Spouge, John L. [1]
Affiliations
[1] Natl Inst Hlth, Natl Ctr Biotechnol Informat, Natl Lib Med, Bethesda, MD 20894 USA
[2] Natl Inst Adv Ind Sci & Technol, Computat Biol Res Ctr, Tokyo 1350064, Japan
Funding
US National Institutes of Health
DOI
10.1093/nar/gkn579
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Subject classification code
071010; 081704
Abstract
Pairwise sequence alignment is a ubiquitous tool for inferring the evolution and function of DNA, RNA and protein sequences. It is therefore essential to identify alignments arising by chance alone, i.e. spurious alignments. On the one hand, if an entire alignment is spurious, statistical techniques for identifying and eliminating it are well known. On the other hand, if only a part of the alignment is spurious, elimination is much more problematic. In practice, even the sizes and frequencies of spurious subalignments remain unknown. This article shows that some common scoring schemes tend to overextend alignments and generate spurious alignment flanks up to hundreds of base pairs or amino acids in length. For example, in the UCSC genome database, spurious flanks probably comprise >18% of the human-fugu genome alignment. To evaluate the possibility that chance alone generated a particular flank on a particular pairwise alignment, we provide a simple 'overalignment' P-value. The overalignment P-value can identify spurious alignment flanks, thereby eliminating potentially misleading inferences about evolution and function. Moreover, by explicitly demonstrating the tradeoff between over- and under-alignment, our methods guide the rational choice of scoring schemes for various alignment tasks.
Pages: 5863-5871
Page count: 9