A NESTED MIXTURE MODEL FOR PROTEIN IDENTIFICATION USING MASS SPECTROMETRY

Cited by: 22
Authors
Li, Qunhua [1]
MacCoss, Michael J. [2]
Stephens, Matthew [3]
Affiliations
[1] Univ Washington, Dept Stat, Seattle, WA 98195 USA
[2] Univ Washington, Dept Genome Sci, Seattle, WA 98195 USA
[3] Univ Chicago, Dept Stat & Human Genet, Chicago, IL 60637 USA
Keywords
Mixture model; nested structure; EM algorithm; protein identification; peptide identification; mass spectrometry; proteomics; STATISTICAL-MODEL; SPECTRAL DATA; PEPTIDE IDENTIFICATIONS; PROTEOMICS DATASETS; SHOTGUN PROTEOMICS; SEQUENCE DATABASES; PROBABILITY MODEL; VALIDATION; CONFIDENCE; SEARCH;
DOI
10.1214/09-AOAS316
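Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Subject classification codes
020208; 070103; 0714
Abstract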
Mass spectrometry provides a high-throughput way to identify proteins in biological samples. In a typical experiment, proteins in a sample are first broken into their constituent peptides. The resulting mixture of peptides is then subjected to mass spectrometry, which generates thousands of spectra, each characteristic of its generating peptide. Here we consider the problem of inferring, from these spectra, which proteins and peptides are present in the sample. We develop a statistical approach to the problem, based on a nested mixture model. In contrast to commonly used two-stage approaches, this model provides a one-stage solution that simultaneously identifies which proteins are present and which peptides are correctly identified. In this way our model incorporates the evidence feedback between proteins and their constituent peptides. Using simulated data and a yeast data set, we compare and contrast our method with existing widely used approaches (PeptideProphet/ProteinProphet) and with a recently published approach, HSM. For peptide identification, our single-stage approach yields consistently more accurate results. For protein identification, the methods have similar accuracy in most settings, although we exhibit some scenarios in which the existing methods perform poorly.
Pages: 962-987
Page count: 26