Uncertainty in homology inferences: Assessing and improving genomic sequence alignment

Cited by: 91
Authors
Lunter, Gerton [1]
Rocco, Andrea [2]
Mimouni, Naila [2]
Heger, Andreas [1]
Caldeira, Alexandre [2]
Hein, Jotun [2]
Affiliations
[1] Univ Oxford, MRC Funct Genet Unit, Dept Physiol Anat & Genet, Oxford OX1 3QX, England
[2] Univ Oxford, Oxford Ctr Gene Funct, Dept Stat, Oxford OX1 2TG, England
Funding
Medical Research Council (UK); Biotechnology and Biological Sciences Research Council (UK);
Keywords
MULTIPLE ALIGNMENT; PATTERN-RECOGNITION; MAMMALIAN EVOLUTION; MAXIMUM-LIKELIHOOD; LOCAL RELIABILITY; GENERAL-METHOD; DNA-SEQUENCES; SUBSTITUTION; MOUSE; RATES;
DOI
10.1101/gr.6725608
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010; 081704;
Abstract
Sequence alignment underpins all of comparative genomics, yet it remains an incompletely solved problem. In particular, the statistical uncertainty within inferred alignments is often disregarded, while parametric or phylogenetic inferences are considered meaningless without confidence estimates. Here, we report on a theoretical and simulation study of pairwise alignments of genomic DNA at human-mouse divergence. We find that >15% of aligned bases are incorrect in existing whole-genome alignments, and we identify three types of alignment error, each leading to systematic biases in all algorithms considered. Careful modeling of the evolutionary process improves alignment quality; however, these improvements are modest compared with the remaining alignment errors, even with exact knowledge of the evolutionary model, emphasizing the need for statistical approaches to account for uncertainty. We develop a new algorithm, Marginalized Posterior Decoding (MPD), which explicitly accounts for uncertainties, is less biased and more accurate than other algorithms we consider, and reduces the proportion of misaligned bases by a third compared with the best existing algorithm. To our knowledge, this is the first nonheuristic algorithm for DNA sequence alignment to show robust improvements over the classic Needleman-Wunsch algorithm. Despite this, considerable uncertainty remains even in the improved alignments. We conclude that a probabilistic treatment is essential, both to improve alignment quality and to quantify the remaining uncertainty. This is becoming increasingly relevant with the growing appreciation of the importance of noncoding DNA, whose study relies heavily on alignments. Alignment errors are inevitable, and should be considered when drawing conclusions from alignments. Software and alignments to assist researchers in doing this are provided at http://genserv.anat.ox.ac.uk/grape/.
Pages: 298-309
Page count: 12
Related papers
62 in total
  • [1] LOCALLY OPTIMAL SUBALIGNMENTS USING NONLINEAR SIMILARITY FUNCTIONS
    ALTSCHUL, SF
    ERICKSON, BW
    [J]. BULLETIN OF MATHEMATICAL BIOLOGY, 1986, 48 (5-6) : 633 - 660
  • [2] DNA sequence evolution with neighbor-dependent mutation
    Arndt, PF
    Burge, CB
    Hwa, T
    [J]. JOURNAL OF COMPUTATIONAL BIOLOGY, 2003, 10 (3-4) : 313 - 322
  • [3] The many faces of sequence alignment
    Batzoglou, S
    [J]. BRIEFINGS IN BIOINFORMATICS, 2005, 6 (01) : 6 - 22
  • [4] Aligning multiple genomic sequences with the threaded blockset aligner
    Blanchette, M
    Kent, WJ
    Riemer, C
    Elnitski, L
    Smit, AFA
    Roskin, KM
    Baertsch, R
    Rosenbloom, K
    Clawson, H
    Green, ED
    Haussler, D
    Miller, W
    [J]. GENOME RESEARCH, 2004, 14 (04) : 708 - 715
  • [5] MAVID: Constrained ancestral alignment of multiple sequences
    Bray, N
    Pachter, L
    [J]. GENOME RESEARCH, 2004, 14 (04) : 693 - 699
  • [6] LAGAN and Multi-LAGAN: Efficient tools for large-scale multiple alignment of genomic DNA
    Brudno, M
    Do, CB
    Cooper, GM
    Kim, MF
    Davydov, E
    Green, ED
    Sidow, A
    Batzoglou, S
    [J]. GENOME RESEARCH, 2003, 13 (04) : 721 - 731
  • [7] Automated whole-genome multiple alignment of rat, mouse, and human
    Brudno, M
    Poliakov, A
    Salamov, A
    Cooper, GM
    Sidow, A
    Rubin, EM
    Solovyev, V
    Batzoglou, S
    Dubchak, I
    [J]. GENOME RESEARCH, 2004, 14 (04) : 685 - 692
  • [8] DETERMINING ALL OPTIMAL AND NEAR-OPTIMAL SOLUTIONS WHEN SOLVING SHORTEST-PATH PROBLEMS BY DYNAMIC-PROGRAMMING
    BYERS, TH
    WATERMAN, MS
    [J]. OPERATIONS RESEARCH, 1984, 32 (06) : 1381 - 1384
  • [9] CHAO KM, 1993, COMPUT APPL BIOSCI, V9, P387
  • [10] Chiaromonte F, 2002, Pac Symp Biocomput, P115