1
A Cautionary Tale on the Inclusion of Variable Posttranslational Modifications in Database-Dependent Searches of Mass Spectrometry Data. Methods Enzymol 2017; 586:433-452. [PMID: 28137575 DOI: 10.1016/bs.mie.2016.11.007] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 02/09/2023]
Abstract
Mass spectrometry-based proteomics allows, in principle, the identification of unknown target proteins of posttranslational modifications and of their sites of attachment. Including a variety of posttranslational modifications in database-dependent searches of high-throughput mass spectrometry data promises to gain spectrum assignments to modified peptides, thereby increasing the number of assigned spectra, and to identify potentially interesting modification events. However, these potential benefits come at the price of an increased search space, which can lead to reduced scores, increased score thresholds, and erroneous peptide spectrum matches. We have assessed here the advantages and disadvantages of including the variable posttranslational modifications methionine oxidation, protein N-terminal acetylation, cysteine carbamidomethylation, transformation of N-terminal glutamine to pyroglutamic acid (Gln→pyro-Glu), and deamidation of asparagine and glutamine. Based on calculations of local false discovery rates and comparisons to known features of the respective modifications, we recommend that searches of samples that were not enriched for specific posttranslational modifications include only methionine oxidation, protein N-terminal acetylation, and peptide N-terminal Gln→pyro-Glu as variable modifications. The principle of the validation strategy adopted here can also be applied to assess the inclusion of posttranslational modifications for differently prepared samples, or of additional modifications. In addition, we have reassessed the special properties of the ubiquitin footprint, which is the remainder of ubiquitin moieties attached to lysines after tryptic digestion. We show here that the ubiquitin footprint often breaks off as a neutral loss and that it can be distinguished from dicarbamidomethylation events.
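To illustrate why variable modifications enlarge the search space, the following is a minimal sketch, not the authors' method. It assumes each modifiable residue is independently either modified or unmodified, so a peptide with n such residues yields 2^n candidate forms. The table `VARIABLE_MODS` and both function names are hypothetical; the mass shifts are the usual monoisotopic values for oxidation and deamidation.

```python
from itertools import product

# Hypothetical variable-modification table: residue -> monoisotopic mass shift (Da).
VARIABLE_MODS = {
    "M": 15.9949,  # methionine oxidation
    "N": 0.9840,   # asparagine deamidation
    "Q": 0.9840,   # glutamine deamidation
}


def count_modified_forms(peptide, mods=VARIABLE_MODS):
    """Number of candidate forms when each modifiable residue is on or off."""
    sites = sum(1 for aa in peptide if aa in mods)
    return 2 ** sites


def enumerate_forms(peptide, mods=VARIABLE_MODS):
    """Yield (annotated sequence, total mass shift) for every on/off combination.

    A modified residue is marked with a trailing '*'.
    """
    options = [
        ((aa, 0.0), (aa + "*", mods[aa])) if aa in mods else ((aa, 0.0),)
        for aa in peptide
    ]
    for combo in product(*options):
        sequence = "".join(symbol for symbol, _ in combo)
        shift = sum(mass for _, mass in combo)
        yield sequence, shift


if __name__ == "__main__":
    peptide = "MNQK"
    print(count_modified_forms(peptide))                   # 8 candidate forms
    print(count_modified_forms(peptide, {"M": 15.9949}))   # 2 with oxidation only
```

In this toy case, allowing deamidation in addition to oxidation quadruples the number of candidates for a single peptide, which is the kind of growth the abstract warns about.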
2
Sheynkman GM, Shortreed MR, Cesnik AJ, Smith LM. Proteogenomics: Integrating Next-Generation Sequencing and Mass Spectrometry to Characterize Human Proteomic Variation. Annu Rev Anal Chem (Palo Alto Calif) 2016; 9:521-45. [PMID: 27049631 PMCID: PMC4991544 DOI: 10.1146/annurev-anchem-071015-041722] [Citation(s) in RCA: 73] [Impact Index Per Article: 9.1] [Indexed: 05/09/2023]
Abstract
Mass spectrometry-based proteomics has emerged as the leading method for detection, quantification, and characterization of proteins. Nearly all proteomic workflows rely on proteomic databases to identify peptides and proteins, but these databases typically contain a generic set of proteins that lack variations unique to a given sample, precluding their detection. Fortunately, proteogenomics enables the detection of such proteomic variations and can be defined, broadly, as the use of nucleotide sequences to generate candidate protein sequences for mass spectrometry database searching. Proteogenomics is experiencing heightened significance due to two developments: (a) advances in DNA sequencing technologies that have made complete sequencing of human genomes and transcriptomes routine, and (b) the unveiling of the tremendous complexity of the human proteome as expressed at the levels of genes, cells, tissues, individuals, and populations. We review here the field of human proteogenomics, with an emphasis on its history, current implementations, the types of proteomic variations it reveals, and several important applications.
Affiliation(s)
- Gloria M Sheynkman
- Center for Cancer Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215
- Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115
- Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
- Michael R Shortreed
- Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
- Anthony J Cesnik
- Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
- Lloyd M Smith
- Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
- Genome Center of Wisconsin, University of Wisconsin, Madison, Wisconsin 53706
3
Perez-Riverol Y, Alpi E, Wang R, Hermjakob H, Vizcaíno JA. Making proteomics data accessible and reusable: current state of proteomics databases and repositories. Proteomics 2015; 15:930-49. [PMID: 25158685 PMCID: PMC4409848 DOI: 10.1002/pmic.201400302] [Citation(s) in RCA: 141] [Impact Index Per Article: 15.7] [Received: 07/01/2014] [Revised: 08/06/2014] [Accepted: 08/22/2014] [Indexed: 01/10/2023]
Abstract
Compared to other data-intensive disciplines such as genomics, public deposition and storage of MS-based proteomics data are still less developed due to, among other reasons, the inherent complexity of the data and the variety of data types and experimental workflows. To address this need, several public repositories for MS proteomics experiments have been developed, each with different purposes in mind. The most established resources are the Global Proteome Machine Database (GPMDB), PeptideAtlas, and the PRIDE database. Additionally, there are other useful (in many cases recently developed) resources such as ProteomicsDB, Mass Spectrometry Interactive Virtual Environment (MassIVE), Chorus, MaxQB, PeptideAtlas SRM Experiment Library (PASSEL), Model Organism Protein Expression Database (MOPED), and the Human Proteinpedia. In addition, the ProteomeXchange consortium has recently been established to enable better integration of public repositories and the coordinated sharing of proteomics information, maximizing its benefit to the scientific community. Here, we review each of the major proteomics resources independently, along with some tools that enable the integration, mining, and reuse of the data. We also discuss some of the major challenges and current pitfalls in the integration and sharing of the data.
Affiliation(s)
- Yasset Perez-Riverol
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
4
Arabidopsis proteomics: a simple and standardizable workflow for quantitative proteome characterization. Methods Mol Biol 2014; 1072:275-88. [PMID: 24136529 DOI: 10.1007/978-1-62703-631-3_20] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 12/13/2022]
Abstract
Arabidopsis is the model plant of choice for large-scale proteome analyses, because its genome is well annotated, essentially free of sequencing errors, and relatively small with little redundancy. Furthermore, most Arabidopsis organs are susceptible to standard protein solubilization protocols making protein extraction relatively simple. Many different facets of functional plant proteomics were established with Arabidopsis such as mapping the subcellular proteomes of organelles, proteo-genomic peptide mapping, and numerous studies on the dynamic changes in protein modification and protein abundances. As most standard proteomics technologies are now routinely applied, research interest is increasingly shifting towards the reverse genetic characterization of gene function at the proteome level, i.e., by profiling the quantitative proteome of wild type in comparison with mutant plant tissue. We report here a simple, standardizable protocol for the large-scale comparative quantitative proteome characterization of different Arabidopsis organs based on normalized spectral counting and suggest a statistical framework for data interpretation. Based on existing organellar proteome maps, proteins can be assigned to organelles, thus allowing the identification of organelle-specific responses.
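The abstract does not give the normalization it uses for spectral counting. As a hedged illustration only, the sketch below computes the normalized spectral abundance factor (NSAF), one widely used form of normalized spectral counting; the protocol's own normalization may differ. The function name `nsaf` and the input dictionaries are hypothetical.

```python
def nsaf(spectral_counts, lengths):
    """Normalized spectral abundance factor per protein.

    spectral_counts: {protein_id: number of assigned spectra}
    lengths:         {protein_id: protein length in residues}
    Each count is divided by protein length, then scaled so all values sum to 1.
    """
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    return {p: (value / total if total else 0.0) for p, value in saf.items()}


if __name__ == "__main__":
    counts = {"protA": 10, "protB": 10}
    lengths = {"protA": 100, "protB": 200}
    # Same spectral count, but protA is half as long, so it gets twice the share.
    print(nsaf(counts, lengths))
```

Dividing by length compensates for longer proteins producing more peptides, so the values are comparable between proteins within one sample.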
5
He L, Han X, Ma B. De novo sequencing with limited number of post-translational modifications per peptide. J Bioinform Comput Biol 2013; 11:1350007. [DOI: 10.1142/s0219720013500078] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
Abstract
De novo sequencing derives the peptide sequence from a tandem mass spectrum without the assistance of protein databases. This analysis has been indispensable for the identification of novel or modified peptides in a biological sample. Currently, the speed of de novo sequencing algorithms is not heavily affected by the number of post-translational modification (PTM) types under consideration. However, the accuracy of the algorithms can be degraded by the increased search space. Most peptides in a proteomics study contain only a small number of PTMs per peptide, yet the types of PTMs can come from a large number of choices. Therefore, it is desirable to include a large number of PTM types in a de novo sequencing algorithm while limiting the number of PTM occurrences in each peptide to increase accuracy. In this paper, we present an efficient de novo sequencing algorithm, DeNovoPTM, for this purpose. The implemented software is downloadable from http://www.cs.uwaterloo.ca/~l22he/denovo_ptm .
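The effect of capping the number of PTM occurrences per peptide can be sketched with a simple count. This is not the DeNovoPTM algorithm; it is an upper-bound illustration that assumes every PTM type can occur at every site and that each site carries at most one PTM. The function name `candidate_count` is hypothetical.

```python
from math import comb


def candidate_count(n_sites, n_types, max_ptms):
    """Number of modification patterns with at most max_ptms modified sites.

    Choose i of the n_sites to modify, then one of n_types for each chosen site:
    sum over i of C(n_sites, i) * n_types**i.
    """
    limit = min(max_ptms, n_sites)
    return sum(comb(n_sites, i) * n_types ** i for i in range(limit + 1))


if __name__ == "__main__":
    # A 10-residue peptide with 5 PTM types under consideration.
    for cap in (0, 1, 2, 10):
        print(cap, candidate_count(10, 5, cap))
```

With 10 sites and 5 types, a cap of 2 allows 1,176 patterns, whereas no cap allows 6^10, roughly 60 million, which shows why limiting occurrences keeps a large PTM vocabulary tractable.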
Affiliation(s)
- Lin He
- David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
- Xi Han
- David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
- Bin Ma
- David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
6
Deep proteome profiling of Trichoplax adhaerens reveals remarkable features at the origin of metazoan multicellularity. Nat Commun 2013; 4:1408. [PMID: 23360999 DOI: 10.1038/ncomms2424] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.9] [Received: 05/11/2012] [Accepted: 12/21/2012] [Indexed: 01/05/2023]
Abstract
Genome sequencing of arguably the simplest known animal, Trichoplax adhaerens, uncovered a rich array of transcription factor and signalling pathway genes. Although the existence of such genes allows speculation about the presence of complex regulatory events, it does not reveal the level of actual protein expression and functionalization through posttranslational modifications. Using high-resolution mass spectrometry, we here semi-quantify 6,516 predicted proteins, revealing evidence of horizontal gene transfer and the presence at the protein level of nodes important in animal signalling pathways. Moreover, our data demonstrate a remarkably high activity of tyrosine phosphorylation, in line with the hypothesized burst of tyrosine-regulated signalling at the instance of animal multicellularity. Together, this Trichoplax proteomics data set offers significant new insight into the mechanisms underlying the emergence of metazoan multicellularity and provides a resource for interested researchers.
7
An M, Zou X, Wang Q, Zhao X, Wu J, Xu LM, Shen HY, Xiao X, He D, Ji J. High-confidence de novo peptide sequencing using positive charge derivatization and tandem MS spectra merging. Anal Chem 2013; 85:4530-7. [PMID: 23536960 DOI: 10.1021/ac4001699] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Indexed: 11/29/2022]
Abstract
De novo peptide sequencing holds great promise for discovering new protein sequences and modifications but has often been hindered by a low success rate of mass spectra interpretation, mainly due to the diversity of fragment ion types and insufficient information for each ion series. Here, we describe a novel methodology that combines highly efficient on-tip charge derivatization and tandem MS spectra merging, which greatly boosts the performance of interpretation. TMPP-Ac-OSu (succinimidyloxycarbonylmethyl tris(2,4,6-trimethoxyphenyl)phosphonium bromide) was used to derivatize peptides at their N-termini on tips to reduce mass spectra complexity. Then, a novel spectra merging approach was adopted to combine the benefits of collision-induced dissociation (CID) and electron transfer dissociation (ETD) fragmentation. We applied this methodology to rat C6 glioma cells and to Cyprinus carpio and searched the resulting peptide sequences against the protein database. We obtained thousands of high-confidence peptide sequences, a level that conventional de novo sequencing methods could not reach. Next, we identified dozens of novel peptide sequences by homology searching of sequences that were fully backbone covered but unmatched during the database search. Furthermore, we randomly chose 34 sequences discovered in rat C6 cells and verified them. We conclude that this methodology, which combines on-tip positive charge derivatization and tandem MS spectra merging, will greatly facilitate the discovery of novel proteins and the proteome analysis of nonmodel organisms.
Affiliation(s)
- Mingrui An
- State Key Laboratory of Protein and Plant Gene Research, College of Life Sciences, Peking University, Beijing 100871, China
8
Abstract
Discovery or shotgun proteomics has emerged as the most powerful technique to comprehensively map out a proteome. Reconstruction of protein identities from the raw mass spectrometric data constitutes a cornerstone of any shotgun proteomics workflow. The inherent uncertainty of mass spectrometric data and the complexity of a proteome render protein inference and the statistical validation of protein identifications a non-trivial task, still being a subject of ongoing research. This review aims to survey the different conceptual approaches to the different tasks of inferring and statistically validating protein identifications and to discuss their implications on the scope of proteome exploration.
Affiliation(s)
- Manfred Claassen
- Computer Science Department, Stanford University, Stanford, CA 94305-9010, USA.
9
Helmy M, Sugiyama N, Tomita M, Ishihama Y. Mass spectrum sequential subtraction speeds up searching large peptide MS/MS spectra datasets against large nucleotide databases for proteogenomics. Genes Cells 2012; 17:633-44. [PMID: 22686349 DOI: 10.1111/j.1365-2443.2012.01615.x] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 03/19/2012] [Accepted: 04/14/2012] [Indexed: 01/18/2023]
Abstract
We have developed a novel bioinformatics method called mass spectrum sequential subtraction (MSSS) to search large peptide spectra datasets produced by liquid chromatography/mass spectrometry (LC-MS/MS) against protein and large nucleotide sequence databases. The main principle of MSSS is to search the peptide spectra set against the protein database and then remove the spectra corresponding to the identified peptides, creating a smaller set of remaining peptide spectra for searching against the nucleotide sequence database. In this way, we reduce the number of spectra to be searched and thus limit the peptide search space. A comparison of MSSS with the conventional search approach using a dataset of 27 LC-MS/MS runs of rice culture cells indicated that MSSS reduced the search queries to 50% and the search time to 75% on average. In addition, MSSS had no effect on the identification false-positive rate (FPR) or on the ability to identify novel peptide sequences. We used MSSS to analyze another dataset of 34 LC-MS/MS runs, which identified 74 additional novel peptides. Proteogenomic analysis with these additional peptides yielded 47 new genomic features in 24 rice genes plus 24 intergenic peptides. These results show the utility of MSSS in searching large databases with large MS/MS datasets for proteogenomics.
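The two-stage principle described above can be sketched as follows. This is an illustration of the subtraction idea only, not the published MSSS implementation. The two search callables are hypothetical stand-ins for a real search engine; each is assumed to take a list of spectrum IDs and return a dictionary mapping identified spectrum IDs to peptides.

```python
def sequential_subtraction_search(spectra, search_protein_db, search_nucleotide_db):
    """Search the protein database first, then search only unidentified spectra
    against the (much larger) nucleotide database.

    Returns (protein_hits, nucleotide_hits, remaining_spectra).
    """
    protein_hits = search_protein_db(spectra)
    # Subtract every spectrum that was already explained in stage 1.
    remaining = [s for s in spectra if s not in protein_hits]
    nucleotide_hits = search_nucleotide_db(remaining)
    return protein_hits, nucleotide_hits, remaining


if __name__ == "__main__":
    # Toy stand-ins for real searches (hypothetical data).
    known = {"s1": "PEPTIDEK", "s2": "SAMPLER"}
    novel = {"s4": "NOVELPEPK"}

    def protein_search(ids):
        return {i: known[i] for i in ids if i in known}

    def nucleotide_search(ids):
        return {i: novel[i] for i in ids if i in novel}

    hits1, hits2, rest = sequential_subtraction_search(
        ["s1", "s2", "s3", "s4"], protein_search, nucleotide_search
    )
    print(hits1, hits2, rest)
```

The saving comes from the second call: only the spectra left over from stage 1 are queried against the expensive nucleotide database.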
Affiliation(s)
- Mohamed Helmy
- Institute for Advanced Biosciences, Keio University, Tsuruoka, Yamagata 997-0017, Japan
10
Renard BY, Xu B, Kirchner M, Zickmann F, Winter D, Korten S, Brattig NW, Tzur A, Hamprecht FA, Steen H. Overcoming species boundaries in peptide identification with Bayesian information criterion-driven error-tolerant peptide search (BICEPS). Mol Cell Proteomics 2012; 11:M111.014167. [PMID: 22493179 PMCID: PMC3394943 DOI: 10.1074/mcp.m111.014167] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Indexed: 12/31/2022]
Abstract
Currently, the reliable identification of peptides and proteins is only feasible when thoroughly annotated sequence databases are available. Although sequencing capacities continue to grow, many organisms remain without reliable, fully annotated reference genomes required for proteomic analyses. Standard database search algorithms fail to identify peptides that are not exactly contained in a protein database. De novo searches are generally hindered by their restricted reliability, and current error-tolerant search strategies are limited by global, heuristic tradeoffs between database and spectral information. We propose a Bayesian information criterion-driven error-tolerant peptide search (BICEPS) and offer an open source implementation based on this statistical criterion to automatically balance the information of each single spectrum and the database, while limiting the run time. We show that BICEPS performs as well as current database search algorithms when such algorithms are applied to sequenced organisms, whereas BICEPS only uses a remotely related organism database. For instance, we use a chicken instead of a human database corresponding to an evolutionary distance of more than 300 million years (International Chicken Genome Sequencing Consortium (2004) Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432, 695–716). We demonstrate the successful application to cross-species proteomics with a 33% increase in the number of identified proteins for a filarial nematode sample of Litomosoides sigmodontis.
Affiliation(s)
- Bernhard Y Renard
- Research Group Bioinformatics (NG4), Robert Koch Institute, Berlin 13353, Germany.
11
A support for the identification of non-tryptic peptides based on low resolution tandem and sequential mass spectrometry data: The INSPIRE software. Anal Chim Acta 2012; 718:70-7. [DOI: 10.1016/j.aca.2012.01.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 08/29/2011] [Revised: 12/28/2011] [Accepted: 01/02/2012] [Indexed: 11/17/2022]
12
Agrawal GK, Bourguignon J, Rolland N, Ephritikhine G, Ferro M, Jaquinod M, Alexiou KG, Chardot T, Chakraborty N, Jolivet P, Doonan JH, Rakwal R. Plant organelle proteomics: collaborating for optimal cell function. Mass Spectrom Rev 2011; 30:772-853. [PMID: 21038434 DOI: 10.1002/mas.20301] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.4] [Received: 10/07/2009] [Revised: 02/02/2010] [Accepted: 02/02/2010] [Indexed: 05/10/2023]
Abstract
Organelle proteomics is the study of the proteins present in an organelle at a particular instant during its life cycle in a cell. Organelles are specialized membrane-bound structures within a cell that function by interacting with cytosolic and luminal soluble proteins, which makes the protein composition of each organelle dynamic. Depending on the organism, the total number of organelles within a cell varies, indicating their evolution with respect to protein number and function. For example, one of the striking differences between plant and animal cells is the presence of plastids in plants. Organelles have their own proteins, and a few organelles, such as mitochondria and chloroplasts, have their own genomes to synthesize proteins for specific functions while also requiring nuclear-encoded proteins. Enormous work has been performed on animal organelle proteomics. However, plant organelle proteomics has seen limited work, mainly due to (i) inter-plant and inter-tissue complexity, (ii) difficulties in isolating subcellular compartments, and (iii) their enrichment and purity. Despite these concerns, the field of organelle proteomics is growing in plants such as Arabidopsis, rice, and maize. The available data are beginning to help us better understand organelles and their distinct and/or overlapping functions in different plant tissues, organs, or cell types and, more importantly, how the protein components of organelles behave during development and in response to the surrounding environment. Studies on organelles have provided a few good reviews, but none of them is comprehensive. Here, we present a comprehensive review of plant organelle proteomics, from the significance of organelles in cells, to organelle isolation, to protein identification, and to biology and beyond. To put together such a systematic, in-depth review and to translate acquired knowledge into a proper and adequate form, we join minds to provide discussion and viewpoints on the collaborative nature of organelles in the cell, their proper function, and their evolution.
Affiliation(s)
- Ganesh Kumar Agrawal
- Research Laboratory for Biotechnology and Biochemistry (RLABB), P.O. Box 13265, Sanepa, Kathmandu, Nepal.
13
Gfeller A, Baerenfaller K, Loscos J, Chételat A, Baginsky S, Farmer EE. Jasmonate controls polypeptide patterning in undamaged tissue in wounded Arabidopsis leaves. Plant Physiol 2011; 156:1797-807. [PMID: 21693672 PMCID: PMC3149931 DOI: 10.1104/pp.111.181008] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Received: 06/01/2011] [Accepted: 06/20/2011] [Indexed: 05/20/2023]
Abstract
Wounding initiates a strong and largely jasmonate-dependent remodelling of the transcriptome in the leaf blades of Arabidopsis (Arabidopsis thaliana). How much control do jasmonates exert on wound-induced protein repatterning in leaves? Replicated shotgun proteomic analyses of 2.5-mm-wide leaf strips adjacent to wounds revealed 106 differentially regulated proteins. Many of these gene products have not emerged as being wound regulated in transcriptomic studies. From experiments using the jasmonic acid (JA)-deficient allene oxide synthase mutant we estimated that approximately 95% of wound-stimulated changes in protein levels were deregulated in the absence of JA. The levels of two tonoplast proteins already implicated in defense response regulation, TWO-PORE CHANNEL1 and the calcium-V-ATPase ACA4 increased on wounding, but their transcripts were not wound inducible. The data suggest new roles for jasmonate in controlling the levels of calcium-regulated pumps and transporters, proteins involved in targeted proteolysis, a putative bacterial virulence factor target, a light-dependent catalyst, and a key redox-controlled enzyme in glutathione synthesis. Extending the latter observation we found that wounding increased the proportion of oxidized glutathione in leaves, but only in plants able to synthesize JA. The oxidizing conditions generated through JA signaling near wounds help to define the cellular environment in which proteome remodelling occurs.
14
Diament BJ, Noble WS. Faster SEQUEST searching for peptide identification from tandem mass spectra. J Proteome Res 2011; 10:3871-9. [PMID: 21761931 DOI: 10.1021/pr101196n] [Citation(s) in RCA: 115] [Impact Index Per Article: 8.8] [Indexed: 01/19/2023]
Abstract
Computational analysis of mass spectra remains the bottleneck in many proteomics experiments. SEQUEST was one of the earliest software packages to identify peptides from mass spectra by searching a database of known peptides. Though still popular, SEQUEST performs slowly. Crux and TurboSEQUEST have successfully sped up SEQUEST by adding a precomputed index to the search, but the demand for ever-faster peptide identification software continues to grow. Tide, introduced here, is a software program that implements the SEQUEST algorithm for peptide identification and that achieves a dramatic speedup over Crux and SEQUEST. The optimization strategies detailed here employ a combination of algorithmic and software engineering techniques to achieve speeds up to 170 times faster than a recent version of SEQUEST that uses indexing. For example, on a single Xeon CPU, Tide searches 10,000 spectra against a tryptic database of 27,499 Caenorhabditis elegans proteins at a rate of 1550 spectra per second, which compares favorably with a rate of 8.8 spectra per second for a recent version of SEQUEST with index running on the same hardware.
Affiliation(s)
- Benjamin J Diament
- Department of Computer Science and Engineering, University of Washington, Seattle, Washington, United States
15
Baerenfaller K, Hirsch-Hoffmann M, Svozil J, Hull R, Russenberger D, Bischof S, Lu Q, Gruissem W, Baginsky S. pep2pro: a new tool for comprehensive proteome data analysis to reveal information about organ-specific proteomes in Arabidopsis thaliana. Integr Biol (Camb) 2011; 3:225-37. [PMID: 21264403 DOI: 10.1039/c0ib00078g] [Citation(s) in RCA: 63] [Impact Index Per Article: 4.8] [Indexed: 12/13/2022]
Abstract
pep2pro is a comprehensive proteome analysis database specifically suitable for flexible proteome data analysis. The pep2pro database schema offers solutions to the various challenges of developing a proteome data analysis database and because data integrated in pep2pro are in relational format, it enables flexible and detailed data analysis. The information provided here will facilitate building proteome data analysis databases for other organisms or applications. The capacity of the pep2pro database for the integration and analysis of large proteome datasets was demonstrated by creating the pep2pro dataset, which is an organ-specific characterisation of the Arabidopsis thaliana proteome containing 14 522 identified proteins based on 2.6 million peptide spectrum assignments. This dataset provides evidence of protein expression and reveals organ-specific processes. The high coverage and density of the dataset are essential for protein quantification by normalised spectral counting and allowed us to extract information that is usually not accessible in low-coverage datasets. With this quantitative protein information we analysed organ- and organelle-specific sub-proteomes. In addition we matched spectra to regions in the genome that were not predicted to have protein coding capacity and provide PCR validation for selected revised gene models. Furthermore, we analysed the peptide features that distinguish detected from non-detected peptides and found substantial disagreement between predicted and detected proteotypic peptides, suggesting that large-scale proteomics data are essential for efficient selection of proteotypic peptides in targeted proteomics surveys. The pep2pro dataset is available as a resource for plant systems biology at www.pep2pro.ethz.ch.
Affiliation(s)
- Katja Baerenfaller
- Department of Biology, ETH Zurich, Universitaetstrasse 2, 8092 Zurich, Switzerland.
16
Zhou C, Chi H, Wang LH, Li Y, Wu YJ, Fu Y, Sun RX, He SM. Speeding up tandem mass spectrometry-based database searching by longest common prefix. BMC Bioinformatics 2010; 11:577. [PMID: 21108792 PMCID: PMC3000425 DOI: 10.1186/1471-2105-11-577] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 07/10/2010] [Accepted: 11/25/2010] [Indexed: 11/10/2022]
Abstract
Background: Tandem mass spectrometry-based database searching has become an important technology for peptide and protein identification. One of the key challenges in database searching is the remarkable increase in computational demand brought about by the expansion of protein databases, semi- or non-specific enzymatic digestion, post-translational modifications, and other factors. Some software tools use peptide indexing to accelerate processing. However, peptide indexing requires a large amount of time and space for construction, especially for non-specific digestion, and it is not flexible to use. Results: We developed an algorithm based on the longest common prefix (ABLCP) to efficiently organize a protein sequence database. The longest common prefix is a data structure that is always coupled to the suffix array. It eliminates redundant candidate peptides in databases and reduces the corresponding peptide-spectrum matching times, thereby decreasing the identification time. The algorithm is based on the property of the longest common prefix. Although enzymatic digestion poses a challenge to this property, adjustments can be made to the algorithm to ensure that no candidate peptides are omitted. Compared with peptide indexing, ABLCP requires much less time and space for construction and is subject to fewer restrictions. Conclusions: The ABLCP algorithm can help to improve data analysis efficiency. A software tool implementing this algorithm is available at http://pfind.ict.ac.cn/pfind2dot5/index.htm
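The pairing of a suffix array with a longest-common-prefix (LCP) array, and how it exposes redundant candidates, can be sketched as below. This is a naive illustration, not the ABLCP algorithm: it ignores enzymatic cleavage rules and builds the arrays by direct comparison rather than with a linear-time method. All function names are hypothetical.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[k] = length of the common prefix of suffixes sa[k-1] and sa[k]; lcp[0] = 0."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = text[sa[k - 1]:], text[sa[k]:]
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        lcp[k] = n
    return lcp


def distinct_substrings(text, length):
    """Distinct substrings of a given length, found without a hash set.

    A suffix contributes a new substring only if it is long enough and shares
    fewer than `length` leading characters with the previous sorted suffix.
    """
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    return [
        text[start:start + length]
        for k, start in enumerate(sa)
        if len(text) - start >= length and lcp[k] < length
    ]


if __name__ == "__main__":
    sa = suffix_array("banana")
    print(sa)                               # [5, 3, 1, 0, 4, 2]
    print(lcp_array("banana", sa))          # [0, 1, 3, 0, 0, 2]
    print(distinct_substrings("banana", 2)) # ['an', 'ba', 'na']
```

Whenever the LCP value is at least the candidate length, the candidate duplicates the previous one in sorted order and can be skipped, which is how repeated peptide sequences avoid being scored twice.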
Affiliation(s)
- Chen Zhou
- Key Lab of Intelligent Information Processing, Chinese Academy of Sciences, Beijing 100190, China
17
Krug K, Nahnsen S, Macek B. Mass spectrometry at the interface of proteomics and genomics. Mol Biosyst 2010; 7:284-91. [PMID: 20967315 DOI: 10.1039/c0mb00168f] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Indexed: 12/25/2022]
Abstract
With the onset of modern DNA sequencing technologies, genomics is experiencing a revolution in terms of quantity and quality of sequencing data. Rapidly growing numbers of sequenced genomes and metagenomes present a tremendous challenge for bioinformatics tools that predict protein-coding regions. Experimental evidence of expressed genomic regions, both at the RNA and protein level, is becoming invaluable for genome annotation and training of gene prediction algorithms. Evidence of gene expression at the protein level using mass spectrometry-based proteomics is increasingly used in refinement of raw genome sequencing data. In a typical "proteogenomics" experiment, the whole proteome of an organism is extracted, digested into peptides and measured by a mass spectrometer. The peptide fragmentation spectra are identified by searching against a six-frame translation of the raw genomic assembly, thus enabling the identification of hitherto unpredicted protein-coding genomic regions. Application of mass spectrometry to genome annotation presents a range of challenges to the standard workflows in proteomics, especially in terms of proteome coverage and database search strategies. Here we provide an overview of the field and argue that the latest mass spectrometry technologies that enable high mass accuracy at high acquisition rates will prove to be especially well suited for proteogenomics applications.
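The six-frame translation mentioned above can be sketched in a few lines. This is a minimal illustration assuming the standard genetic code and an input containing only A, C, G, and T; stop codons are rendered as `*` and any incomplete or unrecognized codon as `X`. The function names are hypothetical, and production pipelines would normally rely on an established library.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate a DNA string codon by codon, dropping a trailing partial codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return translations of the three forward and three reverse-complement frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [
        translate(strand[offset:])
        for strand in (dna, reverse)
        for offset in range(3)
    ]


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))           # MA*
    print(six_frame_translation("ATGGCC"))  # ['MA', 'W', 'G', 'GH', 'A', 'P']
```

Each of the six strings would then be split at stop codons and digested in silico to produce the candidate peptides that spectra are searched against.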
Affiliation(s)
- Karsten Krug
- Proteome Center Tuebingen, Interdepartmental Institute for Cell Biology, University of Tuebingen, Auf der Morgenstelle 15, 72076 Tuebingen, Germany
|
18
|
Nesvizhskii AI. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J Proteomics 2010; 73:2092-123. [PMID: 20816881 DOI: 10.1016/j.jprot.2010.08.009] [Citation(s) in RCA: 358] [Impact Index Per Article: 25.6] [Received: 07/02/2010] [Revised: 08/25/2010] [Accepted: 08/25/2010] [Indexed: 12/18/2022]
Abstract
This manuscript provides a comprehensive review of the peptide and protein identification process using tandem mass spectrometry (MS/MS) data generated in shotgun proteomic experiments. The commonly used methods for assigning peptide sequences to MS/MS spectra are critically discussed and compared, from basic strategies to advanced multi-stage approaches. Particular attention is paid to the problem of false-positive identifications. Existing statistical approaches for assessing the significance of peptide to spectrum matches are surveyed, ranging from single-spectrum approaches such as expectation values to global error rate estimation procedures such as false discovery rates and posterior probabilities. The importance of using auxiliary discriminant information (mass accuracy, peptide separation coordinates, digestion properties, etc.) is discussed, and advanced computational approaches for joint modeling of multiple sources of information are presented. This review also includes a detailed analysis of the issues affecting the interpretation of data at the protein level, including the amplification of error rates when going from peptide to protein level, and the ambiguities in inferring the identities of sample proteins in the presence of shared peptides. Commonly used methods for computing protein-level confidence scores are discussed in detail. The review concludes with a discussion of several outstanding computational issues.
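As an illustration of a global false discovery rate estimate, the sketch below uses the target-decoy approach, which is one common way to obtain such estimates; the abstract does not name a specific procedure, so this choice is an assumption. The input layout (a list of `(score, is_decoy)` pairs) and the function name are hypothetical, and the simple decoys/targets ratio is only one of several estimators in use.

```python
def target_decoy_q_values(psms):
    """Return (score, is_decoy, q_value) tuples sorted by descending score.

    psms is a list of (score, is_decoy) pairs, one per peptide-spectrum match,
    obtained from a search against target and decoy sequences.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # FDR estimate at each score threshold: decoys accepted / targets accepted.
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the smallest FDR at which this match would still be accepted,
    # i.e. the running minimum of the FDR taken from the lowest score upward.
    q_values = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ranked, q_values)]
```

Accepting all target matches with a q-value at or below 0.01 would correspond to the 1% threshold frequently reported in shotgun proteomics studies.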
|
19
|
Proteogenomics of Pristionchus pacificus reveals distinct proteome structure of nematode models. Genome Res 2010; 20:837-46. [PMID: 20237107 DOI: 10.1101/gr.103119.109] [Citation(s) in RCA: 134] [Impact Index Per Article: 9.6] [Indexed: 11/25/2022]
Abstract
Pristionchus pacificus is a nematode model organism whose genome has recently been sequenced. To refine the genome annotation we performed transcriptome and proteome analysis and gathered comprehensive experimental information on gene expression. Transcriptome analysis on a 454 Life Sciences (Roche) FLX platform generated >700,000 expressed sequence tags (ESTs) from two normalized EST libraries, whereas proteome analysis on an LTQ-Orbitrap mass spectrometer detected >27,000 nonredundant peptide sequences from more than 4000 proteins at sub-parts-per-million (ppm) mass accuracy and a false discovery rate of <1%. Retraining of the SNAP gene prediction algorithm using the gene expression data led to a decrease in the number of previously predicted protein-coding genes from 29,000 to 24,000 and refinement of numerous gene models. The P. pacificus proteome contains a high proportion of small proteins with no known homologs in other species ("pioneer" proteins). Some of these proteins appear to be products of highly homologous genes, pointing to their common origin. We show that >50% of all pioneer genes are transcribed under standard culture conditions and that pioneer proteins significantly contribute to a unimodal distribution of predicted protein sizes in P. pacificus, which has an unusually low median size of 240 amino acids (26.8 kDa). In contrast, the predicted proteome of Caenorhabditis elegans follows a distinct bimodal protein size distribution, with significant functional differences between small and large protein populations. Combined, these results provide the first catalog of the expressed genome of P. pacificus, refinement of its genome annotation, and the first comparison of related nematode models at the proteome level.
|
20
|
Li Y, Chi H, Wang LH, Wang HP, Fu Y, Yuan ZF, Li SJ, Liu YS, Sun RX, Zeng R, He SM. Speeding up tandem mass spectrometry based database searching by peptide and spectrum indexing. Rapid Commun Mass Spectrom 2010; 24:807-814. [PMID: 20187083 DOI: 10.1002/rcm.4448] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.6] [Indexed: 05/28/2023]
Abstract
Database searching is the technique of choice for shotgun proteomics, and to date much research effort has been spent on improving its effectiveness. However, database searching faces a serious efficiency challenge, given the large numbers of mass spectra and the ever faster growth of peptide databases resulting from genome translations, enzymatic digestions, and post-translational modifications. In this study, we conducted systematic research on speeding up database search engines for protein identification and illustrate the key points with the specific design of the pFind 2.1 search engine as a running example. First, by constructing peptide indexes, pFind achieves a two- to threefold speedup over searching without peptide indexes. Second, by constructing indexes for observed precursor and fragment ions, pFind achieves a further twofold speedup. As a result, pFind compares very favorably with predominant search engines such as Mascot, SEQUEST and X!Tandem.
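A peptide index of the general kind described here can be sketched as a mass-sorted list queried by binary search. This is not the pFind data structure: the simplified trypsin rule (cleave after K or R unless followed by P, no missed cleavages), the minimum length, the ppm tolerance, and all function names are assumptions made for illustration.

```python
import bisect

# Monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565


def tryptic_peptides(protein, min_len=6):
    # Simplified in-silico digestion: cleave after K/R unless the next residue is P.
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        if is_last or (aa in "KR" and protein[i + 1] != "P"):
            if i + 1 - start >= min_len:
                peptides.append(protein[start:i + 1])
            start = i + 1
    return peptides


def peptide_mass(peptide):
    return sum(MONO[aa] for aa in peptide) + WATER


def build_index(proteins):
    # The set removes peptides shared between proteins; sorting by mass
    # makes range queries possible with binary search.
    entries = sorted({(peptide_mass(p), p) for prot in proteins for p in tryptic_peptides(prot)})
    masses = [mass for mass, _ in entries]
    return masses, entries


def candidates(index, precursor_mass, tol_ppm=10.0):
    # Return all indexed peptides whose mass lies within the precursor tolerance.
    masses, entries = index
    delta = precursor_mass * tol_ppm * 1e-6
    lo = bisect.bisect_left(masses, precursor_mass - delta)
    hi = bisect.bisect_right(masses, precursor_mass + delta)
    return [peptide for _, peptide in entries[lo:hi]]
```

The index is built once per database and reused for every spectrum, which is where the speedup over digesting the database again for each query would come from.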
Affiliation(s)
- You Li
- Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
|
21
|
Baginsky S. Plant proteomics: concepts, applications, and novel strategies for data interpretation. Mass Spectrom Rev 2009; 28:93-120. [PMID: 18618656 DOI: 10.1002/mas.20183] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.0] [Indexed: 05/08/2023]
Abstract
Proteomics is an essential source of information about biological systems because it generates knowledge about the concentrations, interactions, functions, and catalytic activities of proteins, which are the major structural and functional determinants of cells. In the last few years significant technology development has taken place both at the level of data analysis software and mass spectrometry hardware. Conceptual progress in proteomics has made possible the analysis of entire proteomes at unprecedented density and accuracy. New concepts have emerged that comprise quantitative analyses of full proteomes, database-independent protein identification strategies, targeted quantitative proteomics approaches with proteotypic peptides and the systematic analysis of an increasing number of posttranslational modifications at high temporal and spatial resolution. Although plant proteomics is making progress, there are still several analytical challenges that await experimental and conceptual solutions. With this review I will highlight the current status of plant proteomics and put it into the context of the aforementioned conceptual progress in the field, illustrate some of the plant-specific challenges and present my view on the great opportunities for plant systems biology offered by proteomics.
Affiliation(s)
- Sacha Baginsky
- Institute of Plant Sciences, Swiss Federal Institute of Technology, Universitätsstrasse 2, 8092 Zurich, Switzerland.
|
22
|
Eng JK, Fischer B, Grossmann J, Maccoss MJ. A fast SEQUEST cross correlation algorithm. J Proteome Res 2008; 7:4598-602. [PMID: 18774840 DOI: 10.1021/pr800420s] [Citation(s) in RCA: 167] [Impact Index Per Article: 10.4] [Indexed: 11/29/2022]
Abstract
The SEQUEST program was the first and remains one of the most widely used tools for assigning a peptide sequence within a database to a tandem mass spectrum. The cross correlation score is the primary score function implemented within SEQUEST and it is this score that makes the tool particularly sensitive. Unfortunately, this score is computationally expensive to calculate, and thus, to make the score manageable, SEQUEST uses a less sensitive but fast preliminary score and restricts the cross correlation to just the top 500 peptides returned by the preliminary score. Classically, the cross correlation score has been calculated using Fast Fourier Transforms (FFT) to generate the full correlation function. We describe an alternate method of calculating the cross correlation score that does not require FFTs and can be computed efficiently in a fraction of the time. The fast calculation allows all candidate peptides to be scored by the cross correlation function, potentially mitigating the need for the preliminary score, and enables an E-value significance calculation based on the cross correlation score distribution calculated on all candidate peptide sequences obtained from a sequence database.
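The principle of computing the cross correlation score without the full correlation function can be sketched as follows. This is an illustration of the general idea rather than the published implementation: spectra are assumed to be binned intensity lists of equal length, shifts that run past the ends are treated as zero, and the window size, the normalization by twice the window, and the function names are assumptions.

```python
def xcorr_direct(observed, theoretical, window=75):
    # Reference computation: dot product at zero shift minus the mean of the
    # dot products over all non-zero shifts within +/- window bins.
    n = len(observed)

    def dot_at(shift):
        return sum(
            theoretical[i] * observed[i + shift]
            for i in range(n)
            if 0 <= i + shift < n
        )

    background = sum(dot_at(t) for t in range(-window, window + 1) if t != 0)
    return dot_at(0) - background / (2 * window)


def preprocess_observed(observed, window=75):
    # Done once per spectrum: subtract from each bin the scaled sum of its
    # neighbouring bins, so the background term is folded into the spectrum.
    n = len(observed)
    processed = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neighbours = sum(observed[lo:hi]) - observed[i]
        processed.append(observed[i] - neighbours / (2 * window))
    return processed


def xcorr_fast(processed, theoretical):
    # Per candidate peptide: a single dot product with the preprocessed spectrum.
    return sum(t * p for t, p in zip(theoretical, processed))
```

Because the preprocessing depends only on the observed spectrum, its cost is paid once, and each additional candidate peptide then costs a single dot product, which is what makes scoring every candidate feasible.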
Affiliation(s)
- Jimmy K Eng
- Department of Genome Sciences, University of Washington, Seattle, Washington, USA.
|
23
|
Baerenfaller K, Grossmann J, Grobei MA, Hull R, Hirsch-Hoffmann M, Yalovsky S, Zimmermann P, Grossniklaus U, Gruissem W, Baginsky S. Genome-Scale Proteomics Reveals Arabidopsis thaliana Gene Models and Proteome Dynamics. Science 2008; 320:938-41. [DOI: 10.1126/science.1157956] [Citation(s) in RCA: 425] [Impact Index Per Article: 26.6] [Indexed: 12/20/2022]
|