The whole alignment and nothing but the alignment: the problem of spurious alignment flanks

Cited by: 16

Authors:

Frith, Martin C. ^{[2]}

Park, Yonil ^{[1]}

Sheetlin, Sergey L. ^{[1]}

Spouge, John L. ^{[1]}

Affiliations:

[1] Natl Inst Hlth, Natl Ctr Biotechnol Informat, Natl Lib Med, Bethesda, MD 20894 USA

[2] Natl Inst Adv Ind Sci & Technol, Computat Biol Res Ctr, Tokyo 1350064, Japan

Source:

NUCLEIC ACIDS RESEARCH | 2008 / Vol. 36 / Issue 18

Funding:

National Institutes of Health (USA);

Keywords:

DOI:

10.1093/nar/gkn579

Chinese Library Classification number:

Q5 [Biochemistry]; Q7 [Molecular Biology];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Pairwise sequence alignment is a ubiquitous tool for inferring the evolution and function of DNA, RNA and protein sequences. It is therefore essential to identify alignments arising by chance alone, i.e. spurious alignments. On one hand, if an entire alignment is spurious, statistical techniques for identifying and eliminating it are well known. On the other hand, if only a part of the alignment is spurious, elimination is much more problematic. In practice, even the sizes and frequencies of spurious subalignments remain unknown. This article shows that some common scoring schemes tend to overextend alignments and generate spurious alignment flanks up to hundreds of base pairs/amino acids in length. In the UCSC genome database, for example, spurious flanks probably comprise >18% of the human-fugu genome alignment. To evaluate the possibility that chance alone generated a particular flank on a particular pairwise alignment, we provide a simple 'overalignment' P-value. The overalignment P-value can identify spurious alignment flanks, thereby eliminating potentially misleading inferences about evolution and function. Moreover, by explicitly demonstrating the tradeoff between over- and under-alignment, our methods guide the rational choice of scoring schemes for various alignment tasks.


Pages: 5863-5871

Number of pages: 9
