A novel sequence alignment algorithm based on deep learning of the protein folding code

Cited by: 13
Authors
Gao, Mu [1]
Skolnick, Jeffrey [1]
Affiliations
[1] Georgia Inst Technol, Sch Biol Sci, Ctr Study Syst Biol, Atlanta, GA 30332 USA
Keywords
HOMOLOGY DETECTION; TWILIGHT ZONE; PSI-BLAST; IDENTIFICATION; RECOGNITION; TOOL
DOI
10.1093/bioinformatics/btaa810
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Motivation: From evolutionary inference and function annotation to structural prediction, protein sequence comparison has provided crucial biological insights. While many sequence alignment algorithms have been developed, existing approaches often cannot detect hidden structural relationships in the 'twilight zone' of low sequence identity. To address this critical problem, we introduce a computational algorithm that performs protein Sequence Alignments from deep-Learning of Structural Alignments (SAdLSA, silent 'd'). The key idea is to implicitly learn the protein folding code from many thousands of structural alignments using experimentally determined protein structures.
Results: To demonstrate that the folding code was learned, we first show that SAdLSA trained on pure alpha-helical proteins successfully recognizes pairs of structurally related pure beta-sheet protein domains. Subsequent training and benchmarking on larger, highly challenging datasets show significant improvement over established approaches. For challenging cases, SAdLSA is approximately 150% better than HHsearch for generating pairwise alignments and approximately 50% better for identifying the proteins with the best alignments in a sequence library. The time complexity of SAdLSA is O(N) thanks to GPU acceleration.
Pages: 490-496
Number of pages: 7