Amino acid sequence assignment from single molecule peptide sequencing data using a two-stage classifier

Cited by: 4
Authors
Smith, Matthew Beauregard [1]
Simpson, Zack Booth [2]
Marcotte, Edward [3]
Affiliations
[1] Univ Texas Austin, Oden Inst, Austin, TX 78712 USA
[2] Erisyon Inc, Austin, TX 78701 USA
[3] Univ Texas Austin, Dept Mol Biosci, Austin, TX 78712 USA
Keywords
PROTEIN; IDENTIFICATION; MODEL
DOI
10.1371/journal.pcbi.1011157
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
We present a machine learning-based interpretive framework (whatprot) for analyzing single molecule protein sequencing data produced by fluorosequencing, a recently developed proteomics technology that determines sparse amino acid sequences for many individual peptide molecules in a highly parallelized fashion. Whatprot uses Hidden Markov Models (HMMs) to represent the states of each peptide undergoing the various chemical processes during fluorosequencing, and applies these in a Bayesian classifier, in combination with pre-filtering by a k-Nearest Neighbors (kNN) classifier trained on large volumes of simulated fluorosequencing data. We have found that by combining the HMM based Bayesian classifier with the kNN pre-filter, we are able to retain the benefits of both, achieving both tractable runtimes and acceptable precision and recall for identifying peptides and their parent proteins from complex mixtures, outperforming the capabilities of either classifier on its own. Whatprot's hybrid kNN-HMM approach enables the efficient interpretation of fluorosequencing data using a full proteome reference database and should now also enable improved sequencing error rate estimates.
Author summary
Scientists often wish to know which proteins, and at what quantities, are present in a sample. The field of proteomics offers a number of technologies that aid in this, such as tandem mass spectrometry and immunoassays, that provide different tradeoffs between sensitivity, throughput, and generality. One new technology, known as fluorosequencing, detects and provides partial sequences for individual peptide or protein molecules from a sample in a highly parallelized fashion. However, as only partial sequences are measured, the resulting sequencing reads must be matched to a reference database of possible proteins, such as might be obtained from the human genome.
We describe a suitable computer algorithm for performing this matching of fluorosequencing reads to a reference database while accounting for the most prevalent types of sequencing errors. We detail its performance and implementation, and describe a number of uncommon algorithmic improvements and approximations which allow this approach to scale to classification against the whole human proteome. The resulting software, known as whatprot, allows researchers to interpret fluorosequencing reads and better apply this emergent single molecule protein sequencing technology.
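The two-stage idea described in the abstract (a cheap kNN pre-filter that shortlists candidate peptides, followed by a Bayesian classifier that scores only the shortlist) can be illustrated with a small Python sketch. This is not whatprot's code or API. All names here (`knn_shortlist`, `gaussian_log_likelihood`, `classify`, the toy peptides `pepA`/`pepB`/`pepC`, and their intensity profiles) are hypothetical, and the per-cycle Gaussian likelihood is a deliberately simple stand-in for the HMM likelihood the paper uses.

```python
import math

def knn_shortlist(read, training, k):
    """Stage 1: labels of the k simulated reads nearest to `read` (Euclidean)."""
    ranked = sorted(training, key=lambda item: math.dist(read, item[0]))
    return {label for _, label in ranked[:k]}

def gaussian_log_likelihood(read, expected, sigma):
    """ASSUMPTION: independent Gaussian noise per cycle, standing in for an HMM."""
    return sum(
        -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
        for x, mu in zip(read, expected)
    )

def classify(read, training, models, priors, k=4, sigma=0.5):
    """Stage 2: Bayesian posterior computed only over the kNN shortlist."""
    candidates = knn_shortlist(read, training, k)
    log_scores = {
        c: math.log(priors[c]) + gaussian_log_likelihood(read, models[c], sigma)
        for c in candidates
    }
    # Normalize in log space (log-sum-exp) to avoid underflow.
    top = max(log_scores.values())
    total = sum(math.exp(s - top) for s in log_scores.values())
    posteriors = {c: math.exp(s - top) / total for c, s in log_scores.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors[best]

# Hypothetical expected intensity per sequencing cycle for three toy peptides.
models = {
    "pepA": [2.0, 2.0, 1.0, 1.0, 0.0],
    "pepB": [2.0, 1.0, 1.0, 0.0, 0.0],
    "pepC": [1.0, 1.0, 1.0, 1.0, 1.0],
}
priors = {label: 1.0 / len(models) for label in models}

# Toy "simulated" training reads: each profile shifted by a small fixed offset.
training = [
    ([m + d for m in profile], label)
    for label, profile in models.items()
    for d in (-0.1, 0.0, 0.1)
]

read = [1.9, 1.1, 0.9, 0.1, 0.0]
best, posterior = classify(read, training, models, priors, k=4)
print(best, round(posterior, 3))
```

The point of the design is that the more expensive likelihood is evaluated only for the shortlisted labels, so its cost scales with `k` rather than with the size of the reference database; the normalization is likewise over the shortlist only, which is an approximation to the full posterior.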
Pages: 26