Addressing Statistical Biases in Nucleotide-Derived Protein Databases for Proteogenomic Search Strategies

Cited by: 68
Authors
Blakeley, Paul [1]
Overton, Ian M. [2]
Hubbard, Simon J. [1]
Affiliations
[1] Univ Manchester, Fac Life Sci, Manchester M13 9PT, Lancs, England
[2] Univ Edinburgh, Western Gen Hosp, MRC Human Genet Unit, MRC Inst Genet & Mol Med, Edinburgh EH4 2XU, Midlothian, Scotland
Funding
UK Biotechnology and Biological Sciences Research Council (BBSRC); UK Medical Research Council (MRC)
Keywords
proteogenomics; peptide spectrum match; false discovery rate; posterior error probability; expressed sequence tag; FALSE DISCOVERY RATES; MASS-SPECTROMETRY; PEPTIDE IDENTIFICATIONS; GENOME ANNOTATION; SHOTGUN PROTEOMICS; TANDEM; CONFIDENCE; REVEALS; GENES; SENSITIVITY;
DOI
10.1021/pr300411q
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Proteogenomics has the potential to advance genome annotation through high-quality peptide identifications derived from mass spectrometry experiments, which demonstrate that a given gene or isoform is expressed and translated at the protein level. This can advance our understanding of genome function by discovering novel genes and gene structures that have not yet been identified or validated. Because of the high-throughput shotgun nature of most proteomics experiments, it is essential to control carefully for false positives and prevent any potential misannotation. A number of statistical procedures to deal with this are in wide use in proteomics, calculating false discovery rate (FDR) and posterior error probability (PEP) values for groups of peptide spectrum matches (PSMs) and for individual PSMs, respectively. These methods control for multiple testing and exploit decoy databases to estimate statistical significance. Here, we show that database choice has a major effect on these confidence estimates, leading to significant differences in the number of PSMs reported. We note that standard target:decoy approaches using six-frame translations of nucleotide sequences, such as assembled transcriptome data, apparently underestimate the confidence assigned to the PSMs. This error stems from the inflated and unusual nature of the six-frame database, where for every target sequence there exist five "incorrect" targets that are unlikely to code for protein. The attendant FDR and PEP estimates lead to fewer accepted PSMs at fixed thresholds, and we show that this effect is a product of the database and statistical modeling and not the search engine. A variety of approaches to limit database size and remove noncoding target sequences are examined and discussed in terms of the altered statistical estimates generated and PSMs reported. These results are of importance to groups carrying out proteogenomics, aiming to maximize the validation and discovery of gene structure in sequenced genomes, while still controlling for false positives.
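As an illustration of the target:decoy procedure the abstract refers to, the following is a minimal sketch, not the authors' implementation. It assumes each PSM is a hypothetical `(score, is_decoy)` pair with higher scores being better, and uses the simple estimate FDR ≈ decoys / targets above a score threshold, converted to q-values. The function names `target_decoy_fdr` and `count_accepted` are invented for this example.

```python
from typing import List, Tuple

PSM = Tuple[float, bool]  # (score, is_decoy); hypothetical representation


def target_decoy_fdr(psms: List[PSM]) -> List[Tuple[float, bool, float]]:
    """Return (score, is_decoy, q_value) tuples sorted by descending score.

    Assumes the simple estimator FDR = decoys / targets at each threshold.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Running FDR estimate as the score threshold is lowered.
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(min(1.0, decoys / max(targets, 1)))

    # q-value: the smallest FDR at this threshold or any more permissive one.
    q_values = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [(s, d, q) for (s, d), q in zip(ranked, q_values)]


def count_accepted(psms: List[PSM], threshold: float = 0.01) -> int:
    """Count target PSMs whose q-value is at or below the threshold."""
    return sum(
        1 for _, is_decoy, q in target_decoy_fdr(psms)
        if not is_decoy and q <= threshold
    )


if __name__ == "__main__":
    example = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, False)]
    print(count_accepted(example, threshold=0.01))
    print(count_accepted(example, threshold=0.25))
```

Under this estimator, a database padded with non-coding targets (as in a six-frame translation) yields more high-scoring decoy matches relative to true targets, so fewer target PSMs pass a fixed threshold; this is the kind of effect the abstract describes.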
Pages: 5221-5234
Page count: 14