Sequence-to-sequence translation from mass spectra to peptides with a transformer model

Cited by: 10
Authors
Yilmaz, Melih [1]
Fondrie, William E. [2]
Bittremieux, Wout [3]
Melendez, Carlo F. [4]
Nelson, Rowan [4]
Ananth, Varun [1]
Oh, Sewoong [1]
Noble, William Stafford [1, 4]
Affiliations
[1] Univ Washington, Paul G Allen Sch Comp Sci Engn, Seattle, WA 98195 USA
[2] Talus Biosci, Seattle, WA USA
[3] Univ Antwerp, Dept Comp Sci, Antwerp, Belgium
[4] Univ Washington, Dept Genome Sci, Seattle, WA 98195 USA
Funding
U.S. National Science Foundation;
Keywords
IDENTIFICATION; STRATEGY;
DOI
10.1038/s41467-024-49731-x
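Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code
07; 0710; 09;
Abstract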
A fundamental challenge in mass spectrometry-based proteomics is the identification of the peptide that generated each acquired tandem mass spectrum. Approaches that leverage known peptide sequence databases cannot detect unexpected peptides and can be impractical or impossible to apply in some settings. Thus, the ability to assign peptide sequences to tandem mass spectra without prior information (de novo peptide sequencing) is valuable for tasks including antibody sequencing, immunopeptidomics, and metaproteomics. Although many methods have been developed to address this problem, it remains an outstanding challenge in part due to the difficulty of modeling the irregular data structure of tandem mass spectra. Here, we describe Casanovo, a machine learning model that uses a transformer neural network architecture to translate the sequence of peaks in a tandem mass spectrum into the sequence of amino acids that comprise the generating peptide. We train a Casanovo model from 30 million labeled spectra and demonstrate that the model outperforms several state-of-the-art methods on a cross-species benchmark dataset. We also develop a version of Casanovo that is fine-tuned for non-enzymatic peptides. Finally, we demonstrate that Casanovo's superior performance improves the analysis of immunopeptidomics and metaproteomics experiments and allows us to delve deeper into the dark proteome.
Pages: 13
Related papers
58 in total
  • [1] Mass-spectrometric exploration of proteome structure and function
    Aebersold, Ruedi
    Mann, Matthias
    [J]. NATURE, 2016, 537 (7620) : 347 - 355
  • [2] [Anonymous], 2020, Nucleic Acids Res, V48, pW449, DOI 10.1093/nar/gkaa379
  • [3] Effective gene expression prediction from sequence by integrating long-range interactions
    Avsec, Ziga
    Agarwal, Vikram
    Visentin, Daniel
    Ledsam, Joseph R.
    Grabska-Barwinska, Agnieszka
    Taylor, Kyle R.
    Assael, Yannis
    Jumper, John
    Kohli, Pushmeet
    Kelley, David R.
    [J]. NATURE METHODS, 2021, 18 (10) : 1196 - +
  • [4] Comprehensive evaluation of peptide de novo sequencing tools for monoclonal antibody assembly
    Beslic, Denis
    Tscheuschner, Georg
    Renard, Bernhard Y.
    Weller, Michael G.
    Muth, Thilo
    [J]. BRIEFINGS IN BIOINFORMATICS, 2023, 24 (01)
  • [5] A learned embedding for efficient joint analysis of millions of mass spectra
    Bittremieux, Wout
    May, Damon H.
    Bilmes, Jeffrey
    Noble, William Stafford
    [J]. NATURE METHODS, 2022, 19 (06) : 675 - +
  • [6] spectrum_utils: A Python Package for Mass Spectrometry Data Processing and Visualization
    Bittremieux, Wout
    [J]. ANALYTICAL CHEMISTRY, 2020, 92 (01) : 659 - 661
  • [7] Quality control in mass spectrometry-based proteomics
    Bittremieux, Wout
    Tabb, David L.
    Impens, Francis
    Staes, An
    Timmerman, Evy
    Martens, Lennart
    Laukens, Kris
    [J]. MASS SPECTROMETRY REVIEWS, 2018, 37 (05) : 697 - 711
  • [8] De novo peptide sequencing via tandem mass spectrometry
    Dancík, V
    Addona, TA
    Clauser, KR
    Vath, JE
    Pevzner, PA
    [J]. JOURNAL OF COMPUTATIONAL BIOLOGY, 1999, 6 (3-4) : 327 - 342
  • [9] Devlin J, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P4171
  • [10] Faster SEQUEST Searching for Peptide Identification from Tandem Mass Spectra
    Diament, Benjamin J.
    Noble, William Stafford
    [J]. JOURNAL OF PROTEOME RESEARCH, 2011, 10 (09) : 3871 - 3879