DeepDigest: Prediction of Protein Proteolytic Digestion with Deep Learning

Cited by: 28
Authors
Yang, Jinghan [1,2]
Gao, Zhiqiang [1,2]
Ren, Xiuhan [3]
Sheng, Jie [4]
Xu, Ping [4]
Chang, Cheng [4]
Fu, Yan [1,2]
Affiliations
[1] Chinese Acad Sci, Acad Math & Syst Sci, CEMS, NCMIS,RCSDS, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Sch Math Sci, Beijing 100049, Peoples R China
[3] China Univ Min & Technol, Sch Sci, Beijing 100083, Peoples R China
[4] Beijing Inst Lifeom, Natl Ctr Prot Sci Beijing, Beijing Proteome Res Ctr, State Key Lab Proteom, Beijing 102206, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China;
Keywords
PEPTIDE IDENTIFICATION; TRYPTIC PEPTIDES; PROTEOMICS; SEQUENCE; TRYPSIN; CONFIDENCE; PROTEASES; CLEAVAGE; SITES;
DOI
10.1021/acs.analchem.0c04704
Chinese Library Classification number
O65 [Analytical Chemistry];
Discipline classification code
070302 ; 081704 ;
Abstract
Proteolytic digestion of proteins by one or more proteases is a key step in shotgun proteomics, in which the proteolytic products, i.e., peptides, are taken as surrogates of their parent proteins for further qualitative or quantitative analysis. Proteases generally cleave proteins at specific amino acid residue sites, but digestion is rarely complete, and missed cleavage sites are widespread. Therefore, accurately modeling and predicting the digestion behaviors of proteases would greatly help both prior experimental design and posterior data analysis. At present, systematic studies of the proteases commonly used in proteomics are insufficient, and there is a lack of easy-to-use tools to predict the cleavage sites of different proteases. Here, we propose a novel sequence-based deep learning algorithm, DeepDigest, which integrates convolutional neural networks and long short-term memory networks for protein digestion prediction. DeepDigest can predict the cleavage probability of each potential cleavage site on protein sequences for eight popular proteases: trypsin, ArgC, chymotrypsin, GluC, LysC, AspN, LysN, and LysargiNase. We compared DeepDigest with three traditional machine learning algorithms, i.e., logistic regression, random forest, and support vector machine. On the eight training data sets, the 10-fold cross-validation AUCs of DeepDigest were 0.956-0.982, significantly higher than those of the three traditional algorithms. On the 11 independent test data sets, DeepDigest achieved AUCs between 0.849 and 0.978, outperforming the traditional algorithms in most cases. Transfer learning further improved the prediction accuracy. In addition, some interesting characteristics of different proteases were revealed and discussed.
Ultimately, as an application, we used DeepDigest to predict the digestibilities of peptides and demonstrated that peptide digestibility is an informative new feature to discriminate between correct and incorrect peptide identifications.
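
To make the terms "potential cleavage site" and "missed cleavage" concrete, the following is a minimal Python sketch of in silico digestion. It is not DeepDigest and does not predict cleavage probabilities; it only enumerates the candidate sites that such a model would score. It assumes a simplified trypsin rule (cleavage C-terminal to K or R, with no other sequence-context exceptions), and the names `candidate_sites` and `digest` are hypothetical helpers introduced for illustration.

```python
from typing import FrozenSet, List

# Assumption: simplified trypsin specificity (cut after K or R).
# Real digestion depends on sequence context, which is what a
# predictor such as the one described in the abstract tries to model.
TRYPSIN_RESIDUES: FrozenSet[str] = frozenset("KR")


def candidate_sites(protein: str,
                    residues: FrozenSet[str] = TRYPSIN_RESIDUES) -> List[int]:
    """Return indices i where a cut may occur between protein[i] and protein[i+1]."""
    # The last residue is skipped: there is nothing after it to cut from.
    return [i for i, aa in enumerate(protein[:-1]) if aa in residues]


def digest(protein: str, max_missed: int = 0,
           residues: FrozenSet[str] = TRYPSIN_RESIDUES) -> List[str]:
    """Enumerate peptides containing at most `max_missed` uncleaved candidate sites."""
    if not protein:
        return []
    # Cut positions expressed as slice boundaries, including both termini.
    cuts = [0] + [i + 1 for i in candidate_sites(protein, residues)] + [len(protein)]
    peptides = []
    for start in range(len(cuts) - 1):
        # Spanning k + 1 consecutive segments leaves k sites uncleaved.
        last = min(start + 2 + max_missed, len(cuts))
        for end in range(start + 1, last):
            peptides.append(protein[cuts[start]:cuts[end]])
    return peptides


if __name__ == "__main__":
    seq = "MAKGGRLL"  # toy sequence, not taken from the paper
    print(candidate_sites(seq))
    print(digest(seq, max_missed=0))
    print(digest(seq, max_missed=1))
```

With `max_missed=0` the sketch models complete digestion; raising it adds the longer peptides that arise when a candidate site is left uncleaved, which is the situation the abstract describes as widespread in practice.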
Pages: 6094-6103
Page count: 10
Related papers
52 in total
  • [1] Abadi M., PROC 12 USENIX C OPE, DOI 10.1126/SCIENCE.AAB4113.4
  • [2] Next-generation proteomics: towards an integrative view of proteome dynamics
    Altelaar, A. F. Maarten
    Munoz, Javier
    Heck, Albert J. R.
    [J]. NATURE REVIEWS GENETICS, 2013, 14 (01) : 35 - 48
  • [3] Comprehensive identification of peptides in tandem mass spectra using an efficient open search engine
    Chi, Hao
    Liu, Chao
    Yang, Hao
    Zeng, Wen-Feng
    Wu, Long
    Zhou, Wen-Jing
    Wang, Rui-Min
    Niu, Xiu-Nan
    Ding, Yue-He
    Zhang, Yao
    Wang, Zhao-Wei
    Chen, Zhen-Lin
    Sun, Rui-Xiang
    Liu, Tao
    Tan, Guang-Ming
    Dong, Meng-Qiu
    Xu, Ping
    Zhang, Pei-Heng
    He, Si-Min
    [J]. NATURE BIOTECHNOLOGY, 2018, 36 (11) : 1059 - +
  • [4] Chollet F., 2015, KERAS, V60, P105
  • [5] iHPDM: In Silico Human Proteome Digestion Map with Proteolytic Peptide Analysis and Graphical Visualizations
    Choong, Wai-Kok
    Chen, Ching-Tai
    Wang, Jen-Hung
    Sung, Ting-Yi
    [J]. JOURNAL OF PROTEOME RESEARCH, 2019, 18 (12) : 4124 - 4132
  • [6] Andromeda: A Peptide Search Engine Integrated into the MaxQuant Environment
    Cox, Juergen
    Neuhauser, Nadin
    Michalski, Annette
    Scheltema, Richard A.
    Olsen, Jesper V.
    Mann, Matthias
    [J]. JOURNAL OF PROTEOME RESEARCH, 2011, 10 (04) : 1794 - 1805
  • [7] MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification
    Cox, Juergen
    Mann, Matthias
    [J]. NATURE BIOTECHNOLOGY, 2008, 26 (12) : 1367 - 1372
  • [8] Elbasir A, 2018, IEEE INT C BIOINFORM, P2747, DOI 10.1109/BIBM.2018.8621202
  • [9] Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry
    Elias, Joshua E.
    Gygi, Steven P.
    [J]. NATURE METHODS, 2007, 4 (03) : 207 - 214
  • [10] Comet: An open-source MS/MS sequence database search tool
    Eng, Jimmy K.
    Jahan, Tahmina A.
    Hoopmann, Michael R.
    [J]. PROTEOMICS, 2013, 13 (01) : 22 - 24