AGE: defining breakpoints of genomic structural variants at single-nucleotide resolution, through optimal alignments with gap excision

被引:68
作者
Abyzov, Alexej [1 ,2 ]
Gerstein, Mark [1 ,2 ,3 ]
机构
[1] Yale Univ, Program Computat Biol & Bioinformat, New Haven, CT 06520 USA
[2] Yale Univ, Dept Mol Biophys & Biochem, New Haven, CT 06520 USA
[3] Yale Univ, Dept Comp Sci, New Haven, CT 06520 USA
基金
美国国家卫生研究院;
关键词
ALGORITHM; SEQUENCE;
D O I
10.1093/bioinformatics/btq713
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Motivation: Defining the precise location of structural variations (SVs) at single-nucleotide breakpoint resolution is an important problem, as it is a prerequisite for classifying SVs, evaluating their functional impact and reconstructing personal genome sequences. Given approximate breakpoint locations and a bridging assembly or split read, the problem essentially reduces to finding a correct sequence alignment. Classical algorithms for alignment and their generalizations guarantee finding the optimal (in terms of scoring) global or local alignment of two sequences. However, they cannot generally be applied to finding the biologically correct alignment of genomic sequences containing SVs because of the need to simultaneously span the SV (e. g. make a large gap) and perform precise local alignments at the flanking ends. Results: Here, we formulate the computations involved in this problem and describe a dynamic-programming algorithm for its solution. Specifically, our algorithm, called AGE for Alignment with Gap Excision, finds the optimal solution by simultaneously aligning the 5' and 3' ends of two given sequences and introducing a 'large-gap jump' between the local end alignments to maximize the total alignment score. We also describe extensions allowing the application of AGE to tandem duplications, inversions and complex events involving two large gaps. We develop a memory-efficient implementation of AGE (allowing application to long contigs) and make it available as a downloadable software package. Finally, we applied AGE for breakpoint determination and standardization in the 1000 Genomes Project by aligning locally assembled contigs to the human genome.
引用
收藏
页码:595 / 603
页数:9
相关论文
共 19 条
[1]  
ABYZOV A, 2011, GENOME RES UNPUB
[2]   A map of human genome variation from population-scale sequencing [J].
Altshuler, David ;
Durbin, Richard M. ;
Abecasis, Goncalo R. ;
Bentley, David R. ;
Chakravarti, Aravinda ;
Clark, Andrew G. ;
Collins, Francis S. ;
De la Vega, Francisco M. ;
Donnelly, Peter ;
Egholm, Michael ;
Flicek, Paul ;
Gabriel, Stacey B. ;
Gibbs, Richard A. ;
Knoppers, Bartha M. ;
Lander, Eric S. ;
Lehrach, Hans ;
Mardis, Elaine R. ;
McVean, Gil A. ;
Nickerson, DebbieA. ;
Peltonen, Leena ;
Schafer, Alan J. ;
Sherry, Stephen T. ;
Wang, Jun ;
Wilson, Richard K. ;
Gibbs, Richard A. ;
Deiros, David ;
Metzker, Mike ;
Muzny, Donna ;
Reid, Jeff ;
Wheeler, David ;
Wang, Jun ;
Li, Jingxiang ;
Jian, Min ;
Li, Guoqing ;
Li, Ruiqiang ;
Liang, Huiqing ;
Tian, Geng ;
Wang, Bo ;
Wang, Jian ;
Wang, Wei ;
Yang, Huanming ;
Zhang, Xiuqing ;
Zheng, Huisong ;
Lander, Eric S. ;
Altshuler, David L. ;
Ambrogio, Lauren ;
Bloom, Toby ;
Cibulskis, Kristian ;
Fennell, Tim J. ;
Gabriel, Stacey B. .
NATURE, 2010, 467 (7319) :1061-1073
[3]  
Chao K M, 1994, J Comput Biol, V1, P271, DOI 10.1089/cmb.1994.1.271
[4]   AN IMPROVED ALGORITHM FOR MATCHING BIOLOGICAL SEQUENCES [J].
GOTOH, O .
JOURNAL OF MOLECULAR BIOLOGY, 1982, 162 (03) :705-708
[5]   LINEAR SPACE ALGORITHM FOR COMPUTING MAXIMAL COMMON SUBSEQUENCES [J].
HIRSCHBERG, DS .
COMMUNICATIONS OF THE ACM, 1975, 18 (06) :341-343
[6]   A generalized global alignment algorithm [J].
Huang, XQ ;
Chao, KM .
BIOINFORMATICS, 2003, 19 (02) :228-233
[7]  
Kent WJ, 2002, GENOME RES, V12, P656, DOI [10.1101/gr.229202, 10.1101/gr.229202. Article published online before March 2002]
[8]   Mapping and sequencing of structural variation from eight human genomes (Reprinted from Nature, vol 453, pg 56-64, 2008) [J].
Kidd, Jeffrey M. ;
Cooper, Gregory M. ;
Donahue, William F. ;
Hayden, Hillary S. ;
Sampas, Nick ;
Graves, Tina ;
Hansen, Nancy ;
Teague, Brian ;
Alkan, Can ;
Antonacci, Francesca ;
Haugen, Eric ;
Zerr, Troy ;
Yamada, N. Alice ;
Tsang, Peter ;
Newman, Tera L. ;
Tuzun, Eray ;
Cheng, Ze ;
Ebling, Heather M. ;
Tusneem, Nadeem ;
David, Robert ;
Gillett, Will ;
Phelps, Karen A. ;
Weaver, Molly ;
Saranga, David ;
Brand, Adrianne ;
Tao, Wei ;
Gustafson, Erik ;
McKernan, Kevin ;
Chen, Lin ;
Malig, Maika ;
Smith, Joshua D. ;
Korn, Joshua M. ;
McCarroll, Steven A. ;
Altshuler, David A. ;
Peiffer, Daniel A. ;
Dorschner, Michael ;
Stamatoyannopoulos, John ;
Schwartz, David ;
Nickerson, Deborah A. ;
Mullikin, James C. ;
Wilson, Richard K. ;
Bruhn, Laurakay ;
Olson, Maynard V. ;
Kaul, Rajinder ;
Smith, Douglas R. ;
Eichler, Evan E. .
NATURE GENETICS, 2009, :S22-S30
[9]   PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data [J].
Korbel, Jan O. ;
Abyzov, Alexej ;
Mu, Xinmeng Jasmine ;
Carriero, Nicholas ;
Cayting, Philip ;
Zhang, Zhengdong ;
Snyder, Michael ;
Gerstein, Mark B. .
GENOME BIOLOGY, 2009, 10 (02)
[10]   Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library [J].
Lam, Hugo Y. K. ;
Mu, Xinmeng Jasmine ;
Stuetz, Adrian M. ;
Tanzer, Andrea ;
Cayting, Philip D. ;
Snyder, Michael ;
Kim, Philip M. ;
Korbel, Jan O. ;
Gerstein, Mark B. .
NATURE BIOTECHNOLOGY, 2010, 28 (01) :47-U76