1
Wen B, Freestone J, Riffle M, MacCoss MJ, Noble WS, Keich U. Assessment of false discovery rate control in tandem mass spectrometry analysis using entrapment. bioRxiv 2024:2024.06.01.596967. [PMID: 38895431 PMCID: PMC11185562 DOI: 10.1101/2024.06.01.596967] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/21/2024]
Abstract
A pressing statistical challenge in the field of mass spectrometry proteomics is how to assess whether a given software tool provides accurate error control. Each software tool for searching such data uses its own internally implemented methodology for reporting and controlling the error. Many of these software tools are closed source, with incompletely documented methodology, and the strategies for validating the error are inconsistent across tools. In this work, we identify three different methods for validating false discovery rate (FDR) control in use in the field, one of which is invalid, one of which can only provide a lower bound rather than an upper bound, and one of which is valid but under-powered. The result is that the field has a very poor understanding of how well we are doing with respect to FDR control, particularly for the analysis of data-independent acquisition (DIA) data. We therefore propose a new, more powerful method for evaluating FDR control in this setting, and we then employ that method, along with an existing lower bounding technique, to characterize a variety of popular search tools. We find that the search tools for analysis of data-dependent acquisition (DDA) data generally seem to control the FDR at the peptide level, whereas none of the DIA search tools consistently controls the FDR at the peptide level across all the datasets we investigated. Furthermore, this problem becomes much worse when the latter tools are evaluated at the protein level. These results may have significant implications for various downstream analyses, since proper FDR control has the potential to reduce noise in discovery lists and thereby boost statistical power.
Affiliation(s)
- Bo Wen
  - Department of Genome Sciences, University of Washington
- Jack Freestone
  - School of Mathematics and Statistics, University of Sydney
- William S Noble
  - Department of Genome Sciences, University of Washington
  - Paul G. Allen School of Computer Science and Engineering, University of Washington
- Uri Keich
  - School of Mathematics and Statistics, University of Sydney
2
Freestone J, Noble WS, Keich U. Analysis of Tandem Mass Spectrometry Data with CONGA: Combining Open and Narrow Searches with Group-Wise Analysis. J Proteome Res 2024. [PMID: 38652578 DOI: 10.1021/acs.jproteome.3c00399] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/25/2024]
Abstract
Searching tandem mass spectrometry proteomics data against a database is a well-established method for assigning peptide sequences to observed spectra but typically cannot identify peptides harboring unexpected post-translational modifications (PTMs). Open modification searching aims to address this problem by allowing a spectrum to match a peptide even if the spectrum's precursor mass differs from the peptide mass. However, expanding the search space in this way can lead to a loss of statistical power to detect peptides. We therefore developed a method, called CONGA (combining open and narrow searches with group-wise analysis), that takes into account results from both types of searches (a traditional "narrow window" search and an open modification search) while carrying out rigorous false discovery rate control. The result is an algorithm that provides the best of both worlds: the ability to detect unexpected PTMs without a concomitant loss of power to detect unmodified peptides.
Affiliation(s)
- Jack Freestone
  - School of Mathematics and Statistics F07, University of Sydney, NSW 2006, Australia
- William S Noble
  - Department of Genome Sciences, University of Washington, Seattle, Washington 98195, United States
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, United States
- Uri Keich
  - School of Mathematics and Statistics F07, University of Sydney, NSW 2006, Australia
3
Madej D, Lam H. On the use of tandem mass spectra acquired from samples of evolutionarily distant organisms to validate methods for false discovery rate estimation. Proteomics 2024:e2300398. [PMID: 38491400 DOI: 10.1002/pmic.202300398] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2023] [Revised: 03/01/2024] [Accepted: 03/06/2024] [Indexed: 03/18/2024]
Abstract
Estimating the false discovery rate (FDR) of peptide identifications is a key step in proteomics data analysis, and many methods have been proposed for this purpose. Recently, an entrapment-inspired protocol to validate methods for FDR estimation appeared in articles showcasing new spectral library search tools. That validation approach involves generating incorrect spectral matches by searching spectra from evolutionarily distant organisms (entrapment queries) against the original target search space. Although this approach may appear similar to the solutions using entrapment databases, it represents a distinct conceptual framework whose correctness has not been verified yet. In this viewpoint, we first discussed the background of the entrapment-based validation protocols and then conducted a few simple computational experiments to verify the assumptions behind them. The results reveal that entrapment databases may, in some implementations, be a reasonable choice for validation, while the assumptions underpinning validation protocols based on entrapment queries are likely to be violated in practice. This article also highlights the need for well-designed frameworks for validating FDR estimation methods in proteomics.
Affiliation(s)
- Dominik Madej
  - Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
- Henry Lam
  - Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
4
Luo D, Ebadi A, Emery K, He Y, Noble WS, Keich U. Competition-based control of the false discovery proportion. Biometrics 2023; 79:3472-3484. [PMID: 36652258 DOI: 10.1111/biom.13830] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2022] [Revised: 10/12/2022] [Accepted: 01/02/2023] [Indexed: 01/19/2023]
Abstract
Recently, Barber and Candès laid the theoretical foundation for a general framework for false discovery rate (FDR) control based on the notion of "knockoffs." A closely related FDR control methodology has long been employed in the analysis of mass spectrometry data, referred to there as "target-decoy competition" (TDC). However, any approach that aims to control the FDR, which is defined as the expected value of the false discovery proportion (FDP), suffers from a problem. Specifically, even when successfully controlling the FDR at level α, the FDP in the list of discoveries can significantly exceed α. We offer FDP-SD, a new procedure that rigorously controls the FDP in the knockoff/TDC competition setup by guaranteeing that the FDP is bounded by α at a desired confidence level. Compared with the recently published framework of Katsevich and Ramdas, FDP-SD generally delivers more power and often substantially so in simulated and real data.
Affiliation(s)
- Dong Luo
  - School of Mathematics and Statistics, University of Sydney, New South Wales, Australia
- Arya Ebadi
  - School of Mathematics and Statistics, University of Sydney, New South Wales, Australia
- Kristen Emery
  - School of Mathematics and Statistics, University of Sydney, New South Wales, Australia
- Yilun He
  - School of Mathematics and Statistics, University of Sydney, New South Wales, Australia
- Uri Keich
  - School of Mathematics and Statistics, University of Sydney, New South Wales, Australia
5
Tabb DL, Jeong K, Druart K, Gant MS, Brown KA, Nicora C, Zhou M, Couvillion S, Nakayasu E, Williams JE, Peterson HK, McGuire MK, McGuire MA, Metz TO, Chamot-Rooke J. Comparing Top-Down Proteoform Identification: Deconvolution, PrSM Overlap, and PTM Detection. J Proteome Res 2023. [PMID: 37235544 DOI: 10.1021/acs.jproteome.2c00673] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Indexed: 05/28/2023]
Abstract
Generating top-down tandem mass spectra (MS/MS) from complex mixtures of proteoforms benefits from improvements in fractionation, separation, fragmentation, and mass analysis. The algorithms to match MS/MS to sequences have undergone a parallel evolution, with both spectral alignment and match-counting approaches producing high-quality proteoform-spectrum matches (PrSMs). This study assesses state-of-the-art algorithms for top-down identification (ProSight PD, TopPIC, MSPathFinderT, and pTop) in their yield of PrSMs while controlling false discovery rate. We evaluated deconvolution engines (ThermoFisher Xtract, Bruker AutoMSn, Matrix Science Mascot Distiller, TopFD, and FLASHDeconv) in both ThermoFisher Orbitrap-class and Bruker maXis Q-TOF data (PXD033208) to produce consistent precursor charges and mass determinations. Finally, we sought post-translational modifications (PTMs) in proteoforms from bovine milk (PXD031744) and human ovarian tissue. Contemporary identification workflows produce excellent PrSM yields, although approximately half of all identified proteoforms from these four pipelines were specific to only one workflow. Deconvolution algorithms disagree on precursor masses and charges, contributing to identification variability. Detection of PTMs is inconsistent among algorithms. In bovine milk, 18% of PrSMs produced by pTop and TopMG were singly phosphorylated, but this percentage fell to 1% for one algorithm. Applying multiple search engines produces more comprehensive assessments of experiments. Top-down algorithms would benefit from greater interoperability.
Affiliation(s)
- David L Tabb
  - Université Paris Cité, Institut Pasteur, CNRS UAR 2024, Mass Spectrometry for Biology Unit, Paris 75015, France
- Kyowon Jeong
  - Applied Bioinformatics, Computer Science Department, University of Tübingen, Tübingen 72076, Germany
- Karen Druart
  - Université Paris Cité, Institut Pasteur, CNRS UAR 2024, Mass Spectrometry for Biology Unit, Paris 75015, France
- Megan S Gant
  - Université Paris Cité, Institut Pasteur, CNRS UAR 2024, Mass Spectrometry for Biology Unit, Paris 75015, France
- Kyle A Brown
  - School of Medicine and Public Health, University of Wisconsin, Madison, Wisconsin 53705, United States
- Carrie Nicora
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Mowei Zhou
  - Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, Washington 99354, United States
- Sneha Couvillion
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Ernesto Nakayasu
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Janet E Williams
  - Department of Animal, Veterinary, and Food Sciences, University of Idaho, Moscow, Idaho 83844, United States
- Haley K Peterson
  - Department of Animal, Veterinary, and Food Sciences, University of Idaho, Moscow, Idaho 83844, United States
- Michelle K McGuire
  - Margaret Ritchie School of Family and Consumer Sciences, University of Idaho, Moscow, Idaho 83844, United States
- Mark A McGuire
  - Department of Animal, Veterinary, and Food Sciences, University of Idaho, Moscow, Idaho 83844, United States
- Thomas O Metz
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Julia Chamot-Rooke
  - Université Paris Cité, Institut Pasteur, CNRS UAR 2024, Mass Spectrometry for Biology Unit, Paris 75015, France
6
Lin A, Short T, Noble WS, Keich U. Improving Peptide-Level Mass Spectrometry Analysis via Double Competition. J Proteome Res 2022; 21:2412-2420. [PMID: 36166314 PMCID: PMC10108709 DOI: 10.1021/acs.jproteome.2c00282] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022]
Abstract
The analysis of shotgun proteomics data often involves generating lists of inferred peptide-spectrum matches (PSMs) and/or of peptides. The canonical approach for generating these discovery lists is by controlling the false discovery rate (FDR), most commonly through target-decoy competition (TDC). At the PSM level, TDC is implemented by competing each spectrum's best-scoring target (real) peptide match with its best match against a decoy database. This PSM-level procedure can be adapted to the peptide level by selecting the top-scoring PSM per peptide prior to FDR estimation. Here, we first highlight and empirically augment a little known previous work by He et al., which showed that TDC-based PSM-level FDR estimates can be liberally biased. We thus propose that researchers instead focus on peptide-level analysis. We then investigate three ways to carry out peptide-level TDC and show that the most common method ("PSM-only") offers the lowest statistical power in practice. An alternative approach that carries out a double competition, first at the PSM and then at the peptide level ("PSM-and-peptide"), is the most powerful method, yielding an average increase of 17% more discovered peptides at 1% FDR threshold relative to the PSM-only method.
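The PSM-level competition described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `tdc_accepted_targets`, the tie-breaking rule (ties count as decoy wins), and the `(decoys + 1) / targets` estimate are assumptions chosen for the example, and the peptide-level variants discussed in the abstract are not shown.

```python
def tdc_accepted_targets(target_scores, decoy_scores, alpha=0.01):
    """Hypothetical helper: PSM-level target-decoy competition (TDC).

    target_scores[i] and decoy_scores[i] are the best target and best decoy
    match scores for spectrum i (higher is better). Returns the scores of
    the target PSMs accepted at an estimated FDR of at most alpha.
    """
    # Competition: each spectrum keeps only its better match.
    # Assumption: a tie is counted as a decoy win, which is conservative.
    winners = [(max(t, d), t > d) for t, d in zip(target_scores, decoy_scores)]
    winners.sort(key=lambda w: w[0], reverse=True)

    targets = decoys = 0
    cutoff = 0  # largest prefix length whose estimated FDR is <= alpha
    for k, (_, is_target) in enumerate(winners, start=1):
        if is_target:
            targets += 1
        else:
            decoys += 1
        # Assumed estimator: decoy wins (plus one) stand in for the
        # unobserved number of false target wins above this score.
        estimated_fdr = (decoys + 1) / max(targets, 1)
        if estimated_fdr <= alpha:
            cutoff = k

    return [score for score, is_target in winners[:cutoff] if is_target]


if __name__ == "__main__":
    targets = [10, 9, 8, 7, 1]
    decoys = [1, 2, 3, 4, 5]
    print(tdc_accepted_targets(targets, decoys, alpha=0.3))
```

With only a handful of PSMs the `+1` term dominates, so small thresholds such as 1% cannot be met in the toy data; realistic inputs contain thousands of spectra.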
Affiliation(s)
- Andy Lin
  - Chemical and Biological Signatures, Pacific Northwest National Laboratory, Seattle, Washington 98109, United States
- Temana Short
  - School of Mathematics & Statistics, University of Sydney, New South Wales, 2006, Australia
- William Stafford Noble
  - Department of Genome Sciences, University of Washington, Seattle, Washington 98195, United States
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, United States
- Uri Keich
  - School of Mathematics & Statistics, University of Sydney, New South Wales, 2006, Australia
7
Miller RM, Jordan BT, Mehlferber MM, Jeffery ED, Chatzipantsiou C, Kaur S, Millikin RJ, Dai Y, Tiberi S, Castaldi PJ, Shortreed MR, Luckey CJ, Conesa A, Smith LM, Deslattes Mays A, Sheynkman GM. Enhanced protein isoform characterization through long-read proteogenomics. Genome Biol 2022; 23:69. [PMID: 35241129 PMCID: PMC8892804 DOI: 10.1186/s13059-022-02624-y] [Citation(s) in RCA: 30] [Impact Index Per Article: 15.0] [Received: 07/13/2021] [Accepted: 02/02/2022] [Indexed: 02/04/2023] Open
Abstract
BACKGROUND: The detection of physiologically relevant protein isoforms encoded by the human genome is critical to biomedicine. Mass spectrometry (MS)-based proteomics is the preeminent method for protein detection, but isoform-resolved proteomic analysis relies on accurate reference databases that match the sample; neither a subset nor a superset database is ideal. Long-read RNA sequencing (e.g., PacBio or Oxford Nanopore) provides full-length transcripts which can be used to predict full-length protein isoforms. RESULTS: We describe here a long-read proteogenomics approach for integrating sample-matched long-read RNA-seq and MS-based proteomics data to enhance isoform characterization. We introduce a classification scheme for protein isoforms, discover novel protein isoforms, and present the first protein inference algorithm for the direct incorporation of long-read transcriptome data to enable detection of protein isoforms previously intractable to MS-based detection. We have released an open-source Nextflow pipeline that integrates long-read sequencing in a proteomic workflow for isoform-resolved analysis. CONCLUSIONS: Our work suggests that the incorporation of long-read sequencing and proteomic data can facilitate improved characterization of human protein isoform diversity. Our first-generation pipeline provides a strong foundation for future development of long-read proteogenomics and its adoption for both basic and translational research.
Affiliation(s)
- Rachel M. Miller
  - Department of Chemistry, University of Wisconsin-Madison, Madison, WI, USA (GRID grid.14003.36; ISNI 0000 0001 2167 3675)
- Ben T. Jordan
  - Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA (GRID grid.27755.32; ISNI 0000 0000 9136 933X)
- Madison M. Mehlferber
  - Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA (GRID grid.27755.32; ISNI 0000 0000 9136 933X)
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, USA (GRID grid.27755.32; ISNI 0000 0000 9136 933X)
- Erin D. Jeffery
  - Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA (GRID grid.27755.32; ISNI 0000 0000 9136 933X)
- Simi Kaur
  - Department of Chemistry, University of Wisconsin-Madison, Madison, WI, USA (GRID grid.14003.36; ISNI 0000 0001 2167 3675)
- Robert J. Millikin
  - Department of Chemistry, University of Wisconsin-Madison, Madison, WI, USA (GRID grid.14003.36; ISNI 0000 0001 2167 3675)
- Yunxiang Dai
  - Department of Chemistry, University of Wisconsin-Madison, Madison, WI, USA (GRID grid.14003.36; ISNI 0000 0001 2167 3675)
- Simone Tiberi
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland (GRID grid.7400.3; ISNI 0000 0004 1937 0650)
  - Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland (GRID grid.7400.3; ISNI 0000 0004 1937 0650)
- Peter J. Castaldi
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA, USA (GRID grid.62560.37; ISNI 0000 0004 0378 8294)
  - Division of General Medicine and Primary Care, Brigham and Women’s Hospital, Boston, MA, USA (GRID grid.62560.37; ISNI 0000 0004 0378 8294)
- Michael R. Shortreed
  - Department of Chemistry, University of Wisconsin-Madison, Madison, WI, USA (GRID grid.14003.36; ISNI 0000 0001 2167 3675)
- Chance John Luckey
  - Department of Pathology, University of Virginia, Charlottesville, VA, USA (GRID grid.27755.32; ISNI 0000 0000 9136 933X)
- Ana Conesa
  - Institute for Integrative Systems Biology, Spanish National Research Council (CSIC), Paterna, Spain (GRID grid.4711.3; ISNI 0000 0001 2183 4846)
  - Microbiology and Cell Science Department, Institute for Food and Agricultural Sciences, University of Florida, Gainesville, FL, USA (GRID grid.15276.37; ISNI 0000 0004 1936 8091)
- Lloyd M. Smith
  - Department of Chemistry, University of Wisconsin-Madison, Madison, WI, USA (GRID grid.14003.36; ISNI 0000 0001 2167 3675)
- Anne Deslattes Mays
  - Office of Data Science and Sharing, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Rockville, MD, USA (GRID grid.420089.7; ISNI 0000 0000 9635 8082)
- Gloria M. Sheynkman
  - Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA (GRID grid.27755.32; ISNI 0000 0000 9136 933X)
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA (GRID grid.27755.32; ISNI 0000 0000 9136 933X)
  - UVA Cancer Center, University of Virginia, Charlottesville, VA, USA (GRID grid.27755.32; ISNI 0000 0000 9136 933X)
8
Lin A, Plubell DL, Keich U, Noble WS. Accurately Assigning Peptides to Spectra When Only a Subset of Peptides Are Relevant. J Proteome Res 2021; 20:4153-4164. [PMID: 34236864 PMCID: PMC8489664 DOI: 10.1021/acs.jproteome.1c00483] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 12/22/2022]
Abstract
The standard proteomics database search strategy involves searching spectra against a peptide database and estimating the false discovery rate (FDR) of the resulting set of peptide-spectrum matches. One assumption of this protocol is that all the peptides in the database are relevant to the hypothesis being investigated. However, in settings where researchers are interested in a subset of peptides, alternative search and FDR control strategies are needed. Recently, two methods were proposed to address this problem: subset-search and all-sub. We show that both methods fail to control the FDR. For subset-search, this failure is due to the presence of "neighbor" peptides, which are defined as irrelevant peptides with a similar precursor mass and fragmentation spectrum as a relevant peptide. Not considering neighbors compromises the FDR estimate because a spectrum generated by an irrelevant peptide can incorrectly match well to a relevant peptide. Therefore, we have developed a new method, "subset-neighbor search" (SNS), that accounts for neighbor peptides. We show evidence that SNS controls the FDR when neighbors are present and that SNS outperforms group-FDR, the only other method that appears to control the FDR relative to a subset of relevant peptides.
Affiliation(s)
- Andy Lin
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Deanna L. Plubell
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Uri Keich
  - School of Mathematics and Statistics, University of Sydney, NSW, Australia
- William S. Noble
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
  - Paul G. Allen School for Computer Science and Engineering, University of Washington, Seattle, WA, USA
9
Lucke K, Pennington J, Kreitzberg P, Käll L, Serang O. Performing Selection on a Monotonic Function in Lieu of Sorting Using Layer-Ordered Heaps. J Proteome Res 2021; 20:1849-1854. [PMID: 33529032 DOI: 10.1021/acs.jproteome.0c00711] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Abstract
Nonparametric statistical tests are an integral part of scientific experiments in a diverse range of fields. When performing such tests, it is standard to sort values; however, this requires Ω(n log(n)) time to sort n values. Thus given enough data, sorting becomes the computational bottleneck, even with very optimized implementations such as the C++ standard library routine, std::sort. Frequently, a nonparametric statistical test is only used to partition values above and below a threshold in the sorted ordering, where the threshold corresponds to a significant statistical result. Linear-time selection and partitioning algorithms cannot be directly used because the selection and partitioning are performed on the transformed statistical significance values rather than on the sorted statistics. Usually, those transformed statistical significance values (e.g., the p value when investigating the family-wise error rate and q values when investigating the false discovery rate (FDR)) can only be computed at a threshold. Because this threshold is unknown, this leads to sorting the data. Layer-ordered heaps, which can be constructed in O(n), only partially sort values and thus can be used to get around the slow runtime required to fully sort. Here we introduce a layer-ordering-based method for selection and partitioning on the transformed values (e.g., p values or q values). We demonstrate the use of this method to partition peptides using an FDR threshold. This approach is applied to speed up Percolator, a postprocessing algorithm used in mass-spectrometry-based proteomics to evaluate the quality of peptide-spectrum matches (PSMs), by >70% on data sets with 100 million PSMs.
Affiliation(s)
- Kyle Lucke
  - Department of Computer Science, University of Montana, Missoula, Montana 59812, United States
- Jake Pennington
  - Department of Mathematics, University of Montana, Missoula, Montana 59812, United States
- Patrick Kreitzberg
  - Department of Mathematics, University of Montana, Missoula, Montana 59812, United States
- Lukas Käll
  - Science for Life Laboratory, KTH Royal Institute of Technology, Solna 171 65, Sweden
- Oliver Serang
  - Department of Computer Science, University of Montana, Missoula, Montana 59812, United States
10
Ramachandran S, Thomas T. A Frequency-Based Approach to Predict the Low-Energy Collision-Induced Dissociation Fragmentation Spectra. ACS Omega 2020; 5:12615-12622. [PMID: 32548445 PMCID: PMC7288360 DOI: 10.1021/acsomega.9b03935] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 11/18/2019] [Accepted: 05/12/2020] [Indexed: 06/11/2023]
Abstract
Peptide identification algorithms rely on the comparison between the experimental tandem mass spectrometry spectrum and the theoretical spectrum to identify a peptide from the tandem mass spectra. Hence, it is important to understand the fragmentation process and predict the tandem mass spectra for high-throughput proteomics research. In this study, a novel method was developed to predict the theoretical ion trap collision-induced dissociation (CID) tandem mass spectra of singly, doubly, and triply charged tryptic peptides. The fragmentation statistics of the ion trap CID spectra were used to predict the theoretical tandem mass spectra of the peptide sequence. The study estimated the relative cleavage frequency for each pair of adjacent amino acids along the peptide length. The study showed that the cleavage frequency can be directly used to predict the tandem mass spectra. The predicted spectra show a high correlation with the experimental spectra used in this study; 99.73% of the high-quality reference spectra have correlation scores greater than 0.8. The spectra predicted by the new method correlate significantly better with the experimental spectra than those of the existing spectrum prediction tools OpenMS_Simulator, MS2PIP, and MS2PBPI, for which only 80%, 85.76%, and 85.80% of the spectral count, respectively, have a correlation score greater than 0.8.
11
Lombard-Banek C, Schiel JE. Mass Spectrometry Advances and Perspectives for the Characterization of Emerging Adoptive Cell Therapies. Molecules 2020; 25:E1396. [PMID: 32204371 PMCID: PMC7144572 DOI: 10.3390/molecules25061396] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 01/17/2020] [Revised: 03/06/2020] [Accepted: 03/11/2020] [Indexed: 12/12/2022] Open
Abstract
Adoptive cell therapy is an emerging anti-cancer modality, whereby the patient's own immune cells are engineered to express a T-cell receptor (TCR) or a chimeric antigen receptor (CAR). CAR-T cell therapies have advanced the furthest, with recent approvals by the Food and Drug Administration of two treatments, Kymriah (tisagenlecleucel) and Yescarta (axicabtagene ciloleucel). Recent developments in proteomic analysis by mass spectrometry (MS) make this technology uniquely suited to enable the comprehensive identification and quantification of the relevant biochemical architecture of CAR-T cell therapies and fulfill current unmet needs for CAR-T product knowledge. These advances include improved sample preparation methods, enhanced separation technologies, and the extension of MS-based proteomics to single cells. Innovative technologies such as proteomic analysis of raw material quality attributes (MQA) and final product quality attributes (PQA) may provide insights that could ultimately fuel development strategies and lead to broad implementation.
Affiliation(s)
- Camille Lombard-Banek
  - National Institute of Standards and Technology, Gaithersburg, MD 20899, USA
  - Institute for Bioscience and Biotechnology Research, Rockville, MD 20850, USA
- John E. Schiel
  - National Institute of Standards and Technology, Gaithersburg, MD 20899, USA
  - Institute for Bioscience and Biotechnology Research, Rockville, MD 20850, USA
12
Response to "Mass spectrometrists should search for all peptides, but assess only the ones they care about". Nat Methods 2017; 14:644. [PMID: 28661496 DOI: 10.1038/nmeth.4339] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 11/08/2022]
13
Bittremieux W, Tabb DL, Impens F, Staes A, Timmerman E, Martens L, Laukens K. Quality control in mass spectrometry-based proteomics. Mass Spectrometry Reviews 2018; 37:697-711. [PMID: 28802010 DOI: 10.1002/mas.21544] [Citation(s) in RCA: 67] [Impact Index Per Article: 11.2] [Received: 03/04/2017] [Revised: 07/24/2017] [Accepted: 07/24/2017] [Indexed: 05/21/2023]
Abstract
Mass spectrometry is a highly complex analytical technique and mass spectrometry-based proteomics experiments can be subject to a large variability, which forms an obstacle to obtaining accurate and reproducible results. Therefore, a comprehensive and systematic approach to quality control is an essential requirement to inspire confidence in the generated results. A typical mass spectrometry experiment consists of multiple different phases including the sample preparation, liquid chromatography, mass spectrometry, and bioinformatics stages. We review potential sources of variability that can impact the results of a mass spectrometry experiment occurring in all of these steps, and we discuss how to monitor and remedy the negative influences on the experimental results. Furthermore, we describe how specialized quality control samples of varying sample complexity can be incorporated into the experimental workflow and how they can be used to rigorously assess detailed aspects of the instrument performance.
Affiliation(s)
- Wout Bittremieux
  - Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
  - Biomedical Informatics Research Center Antwerp (Biomina), University of Antwerp/Antwerp University Hospital, Edegem, Belgium
- David L Tabb
  - Division of Molecular Biology and Human Genetics, Stellenbosch University Faculty of Medicine and Health Sciences, Tygerberg Hospital, Cape Town, South Africa
- Francis Impens
  - VIB Proteomics Core, Ghent, Belgium
  - VIB-UGent Center for Medical Biotechnology, Ghent, Belgium
  - Faculty of Medicine and Health Sciences, Department of Biochemistry, Ghent University, Ghent, Belgium
- An Staes
  - VIB Proteomics Core, Ghent, Belgium
  - VIB-UGent Center for Medical Biotechnology, Ghent, Belgium
  - Faculty of Medicine and Health Sciences, Department of Biochemistry, Ghent University, Ghent, Belgium
- Evy Timmerman
  - VIB Proteomics Core, Ghent, Belgium
  - VIB-UGent Center for Medical Biotechnology, Ghent, Belgium
  - Faculty of Medicine and Health Sciences, Department of Biochemistry, Ghent University, Ghent, Belgium
- Lennart Martens
  - VIB-UGent Center for Medical Biotechnology, Ghent, Belgium
  - Faculty of Medicine and Health Sciences, Department of Biochemistry, Ghent University, Ghent, Belgium
  - Bioinformatics Institute Ghent, Ghent University, Zwijnaarde, Belgium
- Kris Laukens
  - Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
  - Biomedical Informatics Research Center Antwerp (Biomina), University of Antwerp/Antwerp University Hospital, Edegem, Belgium
14
|
Abstract
Background: In mass spectrometry-based proteomics, protein identification is an essential task. Evaluating the statistical significance of the protein identification result is critical to the success of proteomics studies. Controlling the false discovery rate (FDR) is the most common method for assuring the overall quality of the set of identifications. Existing FDR estimation methods either rely on specific assumptions or use a two-stage calculation that first estimates the error rates at the peptide level and then combines them at the protein level. We propose to estimate the FDR in a non-parametric way with fewer assumptions and to avoid the two-stage calculation. Results: We propose a new protein-level FDR estimation framework. The framework contains two major components: the Permutation+BH (Benjamini–Hochberg) FDR estimation method and the logistic regression-based null inference method. In Permutation+BH, the null distribution of a sample is generated by searching the data against a large number of permuted random protein databases and therefore does not rely on specific assumptions. Then, p-values of proteins are calculated from the null distribution and the BH procedure is applied to the p-values to obtain the relationship between the FDR and the number of protein identifications. Because the Permutation+BH method generates the null distribution by permutation, it is inefficient for online identification. The logistic regression model is therefore proposed to infer the null distribution of a new sample from existing null distributions obtained with the Permutation+BH method. Conclusions: In our experiments on three publicly available datasets, our Permutation+BH method achieves consistently better performance than MAYU, which is chosen as the benchmark FDR calculation method for this study. The null distribution inference result shows that the logistic regression model achieves a reasonable result both in the shape of the null distribution and in the corresponding FDR estimate.
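The two steps named in this abstract, empirical p-values taken from a permutation null followed by the BH procedure, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the +1 pseudocount, and the assumption that higher scores are better are choices made here for the example only.

```python
from bisect import bisect_left


def empirical_p_values(observed, null_scores):
    """Empirical p-value per observed score: share of null scores >= that score.

    Assumes higher scores are better. The +1 pseudocount (an assumption of this
    sketch) keeps p-values strictly positive.
    """
    null_sorted = sorted(null_scores)
    n = len(null_sorted)
    p_values = []
    for score in observed:
        n_ge = n - bisect_left(null_sorted, score)  # null scores >= observed score
        p_values.append((n_ge + 1) / (n + 1))
    return p_values


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Proteins whose adjusted value falls at or below a chosen threshold (for example 0.01) would be reported at that FDR level; sweeping the threshold gives the FDR versus number-of-identifications relationship the abstract refers to.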
Affiliation(s)
- Guanying Wu
  - The Dental Center of China-Japan Friendship Hospital, Beijing, China
- Xiang Wan
  - ShenZhen Research Institute of Big Data, ShenZhen, China
- Baohua Xu
  - The Dental Center of China-Japan Friendship Hospital, Beijing, China
15
Clark DJ, Hu Y, Bocik W, Chen L, Schnaubelt M, Roberts R, Shah P, Whiteley G, Zhang H. Evaluation of NCI-7 Cell Line Panel as a Reference Material for Clinical Proteomics. J Proteome Res 2018; 17:2205-2215. [PMID: 29718670 PMCID: PMC6670293 DOI: 10.1021/acs.jproteome.8b00165] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 12/22/2022]
Abstract
Reference materials are vital to benchmarking the reproducibility of clinical tests and essential for monitoring laboratory performance for clinical proteomics. The reference material utilized for mass spectrometric analysis of the human proteome would ideally contain enough proteins to be suitably representative of the human proteome, as well as exhibit a stable protein composition in different batches of sample regeneration. Previously, the Clinical Proteomic Tumor Analysis Consortium (CPTAC) utilized PDX-derived comparative reference (CompRef) materials for the longitudinal assessment of proteomic performance; however, inherent drawbacks of PDX-derived material, including the extended time needed to grow tumors and the high level of expertise needed, have resulted in efforts to identify a new source of CompRef material. In this study, we examined the utility of using a panel of seven cancer cell lines, the NCI-7 Cell Line Panel, as a reference material for mass spectrometric analysis of the human proteome. Our results showed that not only is the NCI-7 material suitable for benchmarking laboratory sample preparation methods, but also NCI-7 sample generation is highly reproducible at both the global and phosphoprotein levels. In addition, the predicted genomic and experimental coverage of the NCI-7 proteome suggests the NCI-7 material may also have applications as a universal standard proteomic reference.
Affiliation(s)
- David J. Clark
  - Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, United States
- Yingwei Hu
  - Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, United States
- William Bocik
  - Antibody Characterization Laboratory, Cancer Research Technology Program, Frederick National Laboratory for Cancer Research, Frederick, Maryland 21701, United States
- Lijun Chen
  - Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, United States
- Michael Schnaubelt
  - Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, United States
- Rhonda Roberts
  - Antibody Characterization Laboratory, Cancer Research Technology Program, Frederick National Laboratory for Cancer Research, Frederick, Maryland 21701, United States
- Punit Shah
  - Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, United States
- Gordon Whiteley
  - Antibody Characterization Laboratory, Cancer Research Technology Program, Frederick National Laboratory for Cancer Research, Frederick, Maryland 21701, United States
- Hui Zhang
  - Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, United States
16
The M, Edfors F, Perez-Riverol Y, Payne SH, Hoopmann MR, Palmblad M, Forsström B, Käll L. A Protein Standard That Emulates Homology for the Characterization of Protein Inference Algorithms. J Proteome Res 2018; 17:1879-1886. [PMID: 29631402 DOI: 10.1021/acs.jproteome.7b00899] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Indexed: 11/28/2022]
Abstract
A natural way to benchmark the performance of an analytical experimental setup is to use samples of known composition and see to what degree one can correctly infer the content of such a sample from the data. For shotgun proteomics, one of the inherent problems of interpreting data is that the measured analytes are peptides and not the actual proteins themselves. As some proteins share proteolytic peptides, there might be more than one possible causative set of proteins resulting in a given set of peptides, and there is a need for mechanisms that infer proteins from lists of detected peptides. A weakness of commercially available samples of known content is that they consist of proteins that are deliberately selected for producing tryptic peptides that are unique to a single protein. Unfortunately, such samples do not expose any complications in protein inference. Hence, for a realistic benchmark of protein inference procedures, there is a need for samples of known content where the present proteins share peptides with known absent proteins. Here, we present such a standard, based on E. coli-expressed human protein fragments. To illustrate the application of this standard, we benchmark a set of different protein inference procedures on the data. We observe that inference procedures excluding shared peptides provide more accurate estimates of errors compared to methods that include information from shared peptides, while still giving a reasonable performance in terms of the number of identified proteins. We also demonstrate that using a sample of known protein content without proteins with shared tryptic peptides can give a false sense of accuracy for many protein inference methods.
Affiliation(s)
- Matthew The
  - Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH - Royal Institute of Technology, Box 1031, 17121 Solna, Sweden
- Fredrik Edfors
  - Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH - Royal Institute of Technology, Box 1031, 17121 Solna, Sweden
- Yasset Perez-Riverol
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- Samuel H Payne
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Michael R Hoopmann
  - Institute for Systems Biology, Seattle, Washington 98109, United States
- Magnus Palmblad
  - Center for Proteomics and Metabolomics, Leiden University Medical Center, 2300 RC Leiden, The Netherlands
- Björn Forsström
  - Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH - Royal Institute of Technology, Box 1031, 17121 Solna, Sweden
- Lukas Käll
  - Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH - Royal Institute of Technology, Box 1031, 17121 Solna, Sweden
17
Kim M, Eetemadi A, Tagkopoulos I. DeepPep: Deep proteome inference from peptide profiles. PLoS Comput Biol 2017; 13:e1005661. [PMID: 28873403 PMCID: PMC5600403 DOI: 10.1371/journal.pcbi.1005661] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 01/25/2017] [Revised: 09/15/2017] [Accepted: 06/27/2017] [Indexed: 11/24/2022]
Abstract
Protein inference, the identification of the protein set that is the origin of a given peptide profile, is a fundamental challenge in proteomics. We present DeepPep, a deep-convolutional neural network framework that predicts the protein set from a proteomics mixture, given the sequence universe of possible proteins and a target peptide profile. At its core, DeepPep quantifies the change in probabilistic score of peptide-spectrum matches in the presence or absence of a specific protein, hence selecting as candidates the proteins with the largest impact on the peptide profile. Application of the method across datasets argues for its competitive predictive ability (AUC of 0.80±0.18, AUPR of 0.84±0.28) in inferring proteins without need of peptide detectability, on which the most competitive methods rely. We find that the convolutional neural network architecture outperforms the traditional artificial neural network architectures without convolution layers in protein inference. We expect that similar deep learning architectures that allow learning nonlinear patterns can be further extended to problems in metagenome profiling and cell type inference. The source code of DeepPep and the benchmark datasets used in this study are available at https://deeppep.github.io/DeepPep/.
The accurate identification of proteins in a proteomics sample, called the protein inference problem, is a fundamental challenge in biomedical sciences. Current approaches are based on applications of traditional neural networks, linear optimization and Bayesian techniques. We here present DeepPep, a deep-convolutional neural network framework that predicts the protein set from a standard proteomics mixture, given all protein sequences and a peptide profile. Comparison to leading methods shows that DeepPep has the most robust performance with various instruments and datasets. Our results provide evidence that using sequence-level location information of a peptide in the context of the proteome sequence can result in more accurate and robust protein inference. We conclude that deep learning on protein sequence leads to superior platforms for protein inference that can be further refined with additional features and extended for far-reaching applications.
Affiliation(s)
- Minseung Kim
  - Department of Computer Science, University of California, Davis, Davis, California, United States of America
  - Genome Center, University of California, Davis, Davis, California, United States of America
- Ameen Eetemadi
  - Department of Computer Science, University of California, Davis, Davis, California, United States of America
  - Genome Center, University of California, Davis, Davis, California, United States of America
- Ilias Tagkopoulos
  - Department of Computer Science, University of California, Davis, Davis, California, United States of America
  - Genome Center, University of California, Davis, Davis, California, United States of America
18
Abstract
In computational proteomics, the identification of peptides with an unlimited number of post-translational modification (PTM) types is a challenging task. The computational cost associated with database search increases exponentially with respect to the number of modified amino acids and linearly with respect to the number of potential PTM types at each amino acid. The problem becomes intractable very quickly if we want to enumerate all possible PTM patterns. To address this issue, one group of methods, named restricted tools (including Mascot, Comet, and MS-GF+), allows only a small number of PTM types in the database search process. Alternatively, the other group of methods, named unrestricted tools (including MS-Alignment, ProteinProspector, and MODa), avoids enumerating PTM patterns by using an alignment-based approach to localize and characterize modified amino acids. However, because of the large search space and the PTM localization issue, the sensitivity of these unrestricted tools is low. This paper proposes a novel method named PIPI to achieve PTM-invariant peptide identification. PIPI belongs to the category of unrestricted tools. It first codes peptide sequences into Boolean vectors and codes experimental spectra into real-valued vectors. For each coded spectrum, it then searches the coded sequence database to find the top-scored peptide sequences as candidates. After that, PIPI uses dynamic programming to localize and characterize modified amino acids in each candidate. We used simulation experiments and real data experiments to evaluate the performance in comparison with restricted tools (i.e., Mascot, Comet, and MS-GF+) and unrestricted tools (i.e., Mascot with error tolerant search, MS-Alignment, ProteinProspector, and MODa). Comparison with restricted tools shows that PIPI has comparable sensitivity and running speed. Comparison with unrestricted tools shows that PIPI has the highest sensitivity except for Mascot with error tolerant search and ProteinProspector. These two tools simplify the task by considering at most one modified amino acid in each peptide, which results in a higher sensitivity but makes it difficult to deal with multiple modified amino acids. The simulation experiments also show that PIPI has the lowest false discovery proportion, the highest PTM characterization accuracy, and the shortest running time among the unrestricted tools.
Affiliation(s)
- Fengchao Yu
  - Division of Biomedical Engineering, The Hong Kong University of Science and Technology, Hong Kong, China
- Ning Li
  - Division of Biomedical Engineering, The Hong Kong University of Science and Technology, Hong Kong, China
  - Division of Life Science, The Hong Kong University of Science and Technology, Hong Kong, China
- Weichuan Yu
  - Division of Biomedical Engineering, The Hong Kong University of Science and Technology, Hong Kong, China
  - Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China
19
The M, Tasnim A, Käll L. How to talk about protein-level false discovery rates in shotgun proteomics. Proteomics 2016; 16:2461-9. [PMID: 27503675 PMCID: PMC5096025 DOI: 10.1002/pmic.201500431] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 11/12/2015] [Revised: 05/12/2016] [Accepted: 07/20/2016] [Indexed: 12/04/2022]
Abstract
A frequently sought output from a shotgun proteomics experiment is a list of proteins that we believe to have been present in the analyzed sample before proteolytic digestion. The standard technique to control for errors in such lists is to enforce a preset threshold for the false discovery rate (FDR). Many consider protein-level FDRs a difficult and vague concept, as the measurement entities, spectra, are manifestations of peptides and not proteins. Here, we argue that this confusion is unnecessary and provide a framework on how to think about protein-level FDRs, starting from its basic principle: the null hypothesis. Specifically, we point out that two competing null hypotheses are used concurrently in today's protein inference methods, which has gone unnoticed by many. Using simulations of a shotgun proteomics experiment, we show how confusing one null hypothesis for the other can lead to serious discrepancies in the FDR. Furthermore, we demonstrate how the same simulations can be used to verify FDR estimates of protein inference methods. In particular, we show that, for a simple protein inference method, decoy models can be used to accurately estimate protein-level FDRs for both competing null hypotheses.
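As a rough illustration of how decoys can be turned into an FDR estimate, the sketch below computes q-values from a scored list of target and decoy proteins. It is a generic target-decoy calculation written for this listing, not the simulation or the inference method of the paper: the function name, the decoys/targets ratio as the estimator, and the assumption that higher scores are better are all illustrative choices.

```python
def decoy_q_values(scores, is_decoy):
    """Estimate q-values from a mixed list of target and decoy scores.

    At each score threshold the FDR is estimated as (#decoys / #targets)
    among entries scoring at or above that threshold (ties are broken by
    input order). The q-value of an entry is the smallest estimated FDR
    at which it would still be accepted.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fdr_at_rank = []
    targets = 0
    decoys = 0
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
        # max(targets, 1) avoids division by zero when decoys rank first.
        fdr_at_rank.append(min(1.0, decoys / max(targets, 1)))

    q_values = [0.0] * len(scores)
    running_min = 1.0
    # Sweep from the worst score upwards so q-values are monotone in rank.
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdr_at_rank[rank])
        q_values[order[rank]] = running_min
    return q_values
```

Which null hypothesis such an estimate corresponds to depends on how the decoys are constructed and scored, which is the distinction the abstract draws.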
Affiliation(s)
- Matthew The
  - Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology - KTH, Solna, Sweden
- Ayesha Tasnim
  - Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology - KTH, Solna, Sweden
- Lukas Käll
  - Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology - KTH, Solna, Sweden
20
21
Protein inference: A protein quantification perspective. Comput Biol Chem 2016; 63:21-29. [DOI: 10.1016/j.compbiolchem.2016.02.006] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 01/15/2016] [Accepted: 02/01/2016] [Indexed: 01/04/2023]
22
Bennett KL, Wang X, Bystrom CE, Chambers MC, Andacht TM, Dangott LJ, Elortza F, Leszyk J, Molina H, Moritz RL, Phinney BS, Thompson JW, Bunger MK, Tabb DL. The 2012/2013 ABRF Proteomic Research Group Study: Assessing Longitudinal Intralaboratory Variability in Routine Peptide Liquid Chromatography Tandem Mass Spectrometry Analyses. Mol Cell Proteomics 2015; 14:3299-309. [PMID: 26435129 DOI: 10.1074/mcp.o115.051888] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 06/01/2015] [Indexed: 11/06/2022]
Abstract
Questions concerning longitudinal data quality and reproducibility of proteomic laboratories spurred the Protein Research Group of the Association of Biomolecular Resource Facilities (ABRF-PRG) to design a study to systematically assess the reproducibility of proteomic laboratories over an extended period of time. Developed as an open study, initially 64 participants were recruited from the broader mass spectrometry community to analyze provided aliquots of a six bovine protein tryptic digest mixture every month for a period of nine months. Data were uploaded to a central repository, and the operators answered an accompanying survey. Ultimately, 45 laboratories submitted a minimum of eight LC-MSMS raw data files collected in data-dependent acquisition (DDA) mode. No standard operating procedures were enforced; rather the participants were encouraged to analyze the samples according to usual practices in the laboratory. Unlike previous studies, this investigation was not designed to compare laboratories or instrument configuration, but rather to assess the temporal intralaboratory reproducibility. The outcome of the study was reassuring with 80% of the participating laboratories performing analyses at a medium to high level of reproducibility and quality over the 9-month period. For the groups that had one or more outlying experiments, the major contributing factor that correlated to the survey data was the performance of preventative maintenance prior to the LC-MSMS analyses. Thus, the Protein Research Group of the Association of Biomolecular Resource Facilities recommends that laboratories closely scrutinize the quality control data following such events. Additionally, improved quality control recording is imperative. This longitudinal study provides evidence that mass spectrometry-based proteomics is reproducible. When quality control measures are strictly adhered to, such reproducibility is comparable among many disparate groups. Data from the study are available via ProteomeXchange under the accession code PXD002114.
Affiliation(s)
- Keiryn L Bennett
  - CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Vienna, Austria
- Xia Wang
  - University of Cincinnati, Department of Mathematical Sciences, University of Cincinnati, Cincinnati, Ohio, 45221-0025
- Cory E Bystrom
  - Cleveland HeartLab, Inc., Research and Development, Cleveland HeartLab, Inc., Cleveland, Ohio, 44103
- Matthew C Chambers
  - Vanderbilt University, Department of Biomedical Informatics, Vanderbilt University, Nashville, Tennessee, 37232
- Tracy M Andacht
  - Centers for Disease Control and Prevention, Emergency Response Branch, Division of Laboratory Sciences, National Center for Environmental Health, Centers for Disease Control and Prevention, Atlanta, Georgia, 30341
- Larry J Dangott
  - Texas A&M University, Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, 77843
- Félix Elortza
  - CIC bioGUNE, Centro de Investigacion Cooperativa en Biociencias, ProteoRed-ISCIII, Bilbao, Spain
- John Leszyk
  - University of Massachusetts, Department of Biochemistry and Molecular Pharmacology Proteomics and Mass Spectrometry Facility, University of Massachusetts Medical School, Shrewsbury, Massachusetts, 01545
- Henrik Molina
  - The Rockefeller University, Proteomics Resource Center, The Rockefeller University, New York, New York, 10065
- Brett S Phinney
  - University of California, Davis, Proteomics Core, University of California-Davis Genome Center, Davis, California, 95616
- J Will Thompson
  - Duke University, Proteomics and Metabolomics Core Facility, Duke University Medical Center, Durham, North Carolina, 27708
- David L Tabb
  - Vanderbilt University, Department of Biomedical Informatics, Vanderbilt University, Nashville, Tennessee, 37232
23
Arul AB, Byambadorj M, Han NY, Park JM, Lee H. Development of an Automated, High-throughput Sample Preparation Protocol for Proteomics Analysis. Bull Korean Chem Soc 2015. [DOI: 10.1002/bkcs.10338] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 01/02/2023]
Affiliation(s)
- Albert-Baskar Arul
  - Lee Gil Ya Cancer and Diabetes Institute; Gachon University; Incheon Republic of Korea
- Na-Young Han
  - Lee Gil Ya Cancer and Diabetes Institute; Gachon University; Incheon Republic of Korea
- Jong Moon Park
  - Lee Gil Ya Cancer and Diabetes Institute; Gachon University; Incheon Republic of Korea
- Hookeun Lee
  - Lee Gil Ya Cancer and Diabetes Institute; Gachon University; Incheon Republic of Korea
  - Gachon Institute of Pharmaceutical Sciences, Gachon College of Pharmacy; Gachon University; Incheon 406-799 Republic of Korea
  - Gachon Medical Research Institute; Gil Medical Center; Incheon 405-760 Republic of Korea
24
Szabo Z, Janaky T. Challenges and developments in protein identification using mass spectrometry. Trends Analyt Chem 2015. [DOI: 10.1016/j.trac.2015.03.007] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Indexed: 01/13/2023]
25
Quandt A, Espona L, Balasko A, Weisser H, Brusniak MY, Kunszt P, Aebersold R, Malmström L. Using synthetic peptides to benchmark peptide identification software and search parameters for MS/MS data analysis. EuPA Open Proteomics 2014. [DOI: 10.1016/j.euprot.2014.10.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 10/24/2022]
26
Wilhelm T, Jones AME. Identification of related peptides through the analysis of fragment ion mass shifts. J Proteome Res 2014; 13:4002-11. [PMID: 25058668 DOI: 10.1021/pr500347e] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 11/30/2022]
Abstract
Mass spectrometry (MS) has become the method of choice to identify and quantify proteins, typically by fragmenting peptides and inferring protein identification by reference to sequence databases. Well-established programs have largely solved the problem of identifying peptides in complex mixtures. However, to prevent the search space from becoming prohibitively large, most search engines need a list of expected modifications. Therefore, unexpected modifications limit both the identification of proteins and peptide-based quantification. We developed mass spectrometry-peak shift analysis (MS-PSA) to rapidly identify related spectra in large data sets without reference to databases or specified modifications. Peptide identifications from established tools, such as MASCOT or SEQUEST, may be propagated onto MS-PSA results. Modification of a peptide alters the mass of the precursor ion and some of the fragmentation ions. MS-PSA identifies characteristic fragmentation masses from MS/MS spectra. Related spectra are identified by pattern matching of unchanged and mass-shifted fragment ions. We illustrate the use of MS-PSA with simple and complex mixtures with both high and low mass accuracy data sets. MS-PSA is not limited to the analysis of peptides but can be used for the identification of related groups of spectra in any set of fragmentation patterns.
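The idea of matching unchanged and mass-shifted fragment ions between two spectra can be illustrated with a toy comparison. The sketch below is not the MS-PSA algorithm: the function name, the fixed absolute tolerance, the use of a single known mass difference `delta`, and the treatment of peaks as singly charged m/z values are all assumptions made for the example.

```python
def count_shifted_matches(peaks_a, peaks_b, delta, tol=0.02):
    """Count peaks of spectrum A that reappear in spectrum B.

    A peak counts as 'unchanged' if B has a peak within `tol` of the same
    m/z, and as 'shifted' if B has a peak within `tol` of m/z + delta.
    Each peak in A is counted at most once, with 'unchanged' taking priority.
    """
    unchanged = 0
    shifted = 0
    for mz in peaks_a:
        if any(abs(mz - other) <= tol for other in peaks_b):
            unchanged += 1
        elif any(abs(mz + delta - other) <= tol for other in peaks_b):
            shifted += 1
    return unchanged, shifted
```

For instance, with `delta` set to the precursor mass difference between two spectra, a high combined count relative to the number of peaks would suggest that the spectra come from related peptides, one of them carrying a modification.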
Affiliation(s)
- Thomas Wilhelm
  - Institute of Food Research, Norwich Research Park, Norwich NR4 7UA, United Kingdom
27
Dong NP, Liang YZ, Xu QS, Mok DKW, Yi LZ, Lu HM, He M, Fan W. Prediction of Peptide Fragment Ion Mass Spectra by Data Mining Techniques. Anal Chem 2014; 86:7446-54. [DOI: 10.1021/ac501094m] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Indexed: 12/19/2022]
Affiliation(s)
- Daniel K. W. Mok
  - Department of Applied Biology and Chemical Technology, The Hong Kong Polytechnic University, Hung Hom, Hong Kong
  - State Key Laboratory of Chinese Medicine and Molecular Pharmacology (Incubation), Shenzhen, 518000, P. R. China
- Lun-zhao Yi
  - Yunnan Food Safety Research Institute, Kunming University of Science and Technology, Kunming, 650500, P. R. China
- Min He
  - Department of Pharmaceutical Engineering, School of Chemical Engineering, Xiangtan University, Xiangtan, 411105, P. R. China
- Wei Fan
  - College of Bioscience and Biotechnology, Hunan Agricultural University, Changsha, 410083, P. R. China
28
Ma CWM, Lam H. Hunting for unexpected post-translational modifications by spectral library searching with tier-wise scoring. J Proteome Res 2014; 13:2262-71. [PMID: 24661115 DOI: 10.1021/pr401006g] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Indexed: 12/22/2022]
Abstract
Discovering novel post-translational modifications (PTMs) to proteins and detecting specific modification sites on proteins is one of the last frontiers of proteomics. At present, hunting for post-translational modifications remains challenging in widely practiced shotgun proteomics workflows due to the typically low abundance of modified peptides and the greatly inflated search space as more potential mass shifts are considered by the search engines. Moreover, most popular search methods require that the user specifies the modification(s) for which to search; therefore, unexpected and novel PTMs will not be detected. Here a new algorithm is proposed to apply spectral library searching to the problem of open modification searches, namely, hunting for PTMs without prior knowledge of what PTMs are in the sample. The proposed tier-wise scoring method intelligently looks for unexpected PTMs by allowing mass-shifted peak matches but only when the number of matches found is deemed statistically significant. This allows the search engine to search for unexpected modifications while maintaining its ability to identify unmodified peptides effectively at the same time. The utility of the method is demonstrated using three different data sets, in which the numbers of spectrum identifications to both unmodified and modified peptides were substantially increased relative to a regular spectral library search as well as to another open modification spectral search method, pMatch.
Affiliation(s)
- Chun Wai Manson Ma
  - Division of Biomedical Engineering and Department of Chemical and Biomolecular Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
29
Nunes-Miranda JD, Núñez C, Santos HM, Vale G, Reboiro-Jato M, Fdez-Riverola F, Lodeiro C, Miró M, Capelo JL. A mesofluidic platform integrating on-chip probe ultrasonication for multiple sample pretreatment involving denaturation, reduction, and digestion in protein identification assays by mass spectrometry. Analyst 2014; 139:992-5. [DOI: 10.1039/c3an02178e] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/21/2022]
Abstract
A novel mesofluidic platform integrating on-chip probe ultrasonication for automated high-throughput shotgun proteomic assays.
Collapse
Affiliation(s)
- J. D. Nunes-Miranda
- Department of Genetics and Biotechnology
- University of Trás-os-Montes and Alto Douro
- Vila Real, Portugal
- Institute for Biotechnology and Bioengineering
- Centre of Genomics and Biotechnology

- Cristina Núñez
- REQUIMTE
- Departamento de Química
- Faculdade de Ciencias e Tecnologia
- FCT
- Universidade Nova de Lisboa

- Hugo M. Santos
- Institute for Biotechnology and Bioengineering
- Centre of Genomics and Biotechnology
- University of Trás-os-Montes and Alto Douro
- Vila Real, Portugal
- REQUIMTE

- G. Vale
- REQUIMTE
- Departamento de Química
- Faculdade de Ciencias e Tecnologia
- FCT
- Universidade Nova de Lisboa

- Miguel Reboiro-Jato
- SING Group
- Informatics Department
- Higher Technical School of Computer Engineering
- University of Vigo
- Ourense, Spain

- Florentino Fdez-Riverola
- SING Group
- Informatics Department
- Higher Technical School of Computer Engineering
- University of Vigo
- Ourense, Spain

- Carlos Lodeiro
- REQUIMTE
- Departamento de Química
- Faculdade de Ciencias e Tecnologia
- FCT
- Universidade Nova de Lisboa

- Manuel Miró
- FI-TRACE Group
- Department of Chemistry
- University of the Balearic Islands
- Palma de Mallorca, Spain

- J. L. Capelo
- REQUIMTE
- Departamento de Química
- Faculdade de Ciencias e Tecnologia
- FCT
- Universidade Nova de Lisboa
30
Granholm V, Kim S, Navarro JCF, Sjölund E, Smith RD, Käll L. Fast and accurate database searches with MS-GF+Percolator. J Proteome Res 2013; 13:890-7. [PMID: 24344789 DOI: 10.1021/pr400937n] [Citation(s) in RCA: 69] [Impact Index Per Article: 6.3] [Indexed: 01/14/2023]
Abstract
One can interpret fragmentation spectra stemming from peptides in mass-spectrometry-based proteomics experiments using so-called database search engines. Frequently, one also runs post-processors such as Percolator to assess the confidence, infer unique peptides, and increase the number of identifications. A recent search engine, MS-GF+, has shown promising results, due to a new and efficient scoring algorithm. However, MS-GF+ provides few statistical estimates about the peptide-spectrum matches, hence limiting the biological interpretation. Here, we enabled Percolator processing for MS-GF+ output and observed an increased number of identified peptides for a wide variety of data sets. In addition, Percolator directly reports p values and false discovery rate estimates, such as q values and posterior error probabilities, for peptide-spectrum matches, peptides, and proteins, functions that are useful for the whole proteomics community.
Affiliation(s)
- Viktor Granholm
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Solna, Sweden
31
Fu Y, Qian X. Transferred subgroup false discovery rate for rare post-translational modifications detected by mass spectrometry. Mol Cell Proteomics 2013; 13:1359-68. [PMID: 24200586 DOI: 10.1074/mcp.o113.030189] [Citation(s) in RCA: 60] [Impact Index Per Article: 5.5] [Indexed: 01/15/2023] Open
Abstract
In shotgun proteomics, high-throughput mass spectrometry experiments and the subsequent data analysis produce thousands to millions of hypothetical peptide identifications. The common way to estimate the false discovery rate (FDR) of peptide identifications is the target-decoy database search strategy, which is efficient and accurate for large datasets. However, the legitimacy of the target-decoy strategy for protein-modification-centric studies has rarely been rigorously validated. It is often the case that a global FDR is estimated for all peptide identifications including both modified and unmodified peptides, but that only a subgroup of identifications with a certain type of modification is focused on. As revealed recently, the subgroup FDR of modified peptide identifications can differ dramatically from the global FDR at the same score threshold, and thus the former, when it is of interest, should be separately estimated. However, rare modifications often result in a very small number of modified peptide identifications, which makes the direct separate FDR estimation inaccurate because of the inadequate sample size. This paper presents a method called the transferred FDR for accurately estimating the FDR of an arbitrary number of modified peptide identifications. Through flexible use of the empirical data from a target-decoy database search, a theoretical relationship between the subgroup FDR and the global FDR is made computable. Through this relationship, the subgroup FDR can be predicted from the global FDR, allowing one to avoid an inaccurate direct estimation from a limited amount of data. The effectiveness of the method is demonstrated with both simulated and real mass spectra.
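A minimal Python sketch of the target-decoy estimate mentioned in this abstract may help make the global versus subgroup distinction concrete. It assumes each identification carries a score, a decoy flag, and a modification flag; the tuple layout, function names, and toy values are illustrative assumptions, not taken from the paper. It shows only the direct separate estimate that the abstract describes as unreliable for small subgroups, not the transferred FDR itself.

```python
from typing import List, Tuple

# Hypothetical record layout: (score, is_decoy, is_modified).
PSM = Tuple[float, bool, bool]


def target_decoy_fdr(psms: List[PSM], threshold: float) -> float:
    """Estimate FDR as (#decoys / #targets) among PSMs scoring >= threshold."""
    targets = sum(1 for score, is_decoy, _ in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy, _ in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)


def subgroup_fdr(psms: List[PSM], threshold: float) -> float:
    """Direct separate estimate restricted to modified identifications."""
    modified = [psm for psm in psms if psm[2]]
    return target_decoy_fdr(modified, threshold)


# Toy data (invented for illustration only).
toy_psms: List[PSM] = [
    (9.0, False, False),
    (8.5, False, False),
    (8.0, False, True),
    (7.5, False, False),
    (7.0, True, True),
    (6.5, False, False),
    (6.0, False, True),
    (5.5, True, False),
]

global_estimate = target_decoy_fdr(toy_psms, threshold=6.0)
modified_estimate = subgroup_fdr(toy_psms, threshold=6.0)
```

With these toy values the subgroup estimate rests on a single decoy among three modified identifications, which illustrates why a direct estimate becomes unstable when the subgroup is small.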
Affiliation(s)
- Yan Fu
- National Center for Mathematics and Interdisciplinary Sciences, Key Laboratory of Random Complex Structures and Data Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
32
Yang C, He Z, Yu W. A combinatorial perspective of the protein inference problem. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2013; 10:1542-1547. [PMID: 24407311 DOI: 10.1109/tcbb.2013.110] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 06/03/2023]
Abstract
In a shotgun proteomics experiment, proteins are the most biologically meaningful output. The success of proteomics studies depends on the ability to accurately and efficiently identify proteins. Many methods have been proposed to facilitate the identification of proteins from peptide identification results. However, the relationship between protein identification and peptide identification has not been thoroughly explained before. In this paper, we devote ourselves to a combinatorial perspective of the protein inference problem. We employ combinatorial mathematics to calculate the conditional protein probabilities (protein probability means the probability that a protein is correctly identified) under three assumptions, which lead to a lower bound, an upper bound, and an empirical estimation of protein probabilities, respectively. The combinatorial perspective enables us to obtain an analytical expression for protein inference. Our method achieves comparable results with ProteinProphet in a more efficient manner in experiments on two data sets of standard protein mixtures and two data sets of real samples. Based on our model, we study the impact of unique peptides and degenerate peptides (degenerate peptides are peptides shared by at least two proteins) on protein probabilities. Meanwhile, we also study the relationship between our model and ProteinProphet. We name our program ProteinInfer. Its Java source code, our supplementary document and experimental results are available at: http://bioinformatics.ust.hk/proteininfer
Affiliation(s)
- Chao Yang
- The Hong Kong University of Science and Technology, Hong Kong
- Weichuan Yu
- The Hong Kong University of Science and Technology, Hong Kong
33
Zhang J, Ma J, Zhang W, Xu C, Zhu Y, Xie H. FTDR 2.0: A Tool To Achieve Sub-ppm Level Recalibrated Accuracy in Routine LC–MS Analysis. J Proteome Res 2013; 12:3857-64. [DOI: 10.1021/pr400003a] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 02/04/2023]
Affiliation(s)
- Jiyang Zhang
- College of Mechatronic Engineering and Automatic Control, National University of Defense Technology, Changsha, 410073, China

- Jie Ma
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, Beijing, 102206, China
- National Engineering Research Center for Protein Drugs, Beijing 102206, China

- Wei Zhang
- College of Mechatronic Engineering and Automatic Control, National University of Defense Technology, Changsha, 410073, China

- Changming Xu
- College of Mechatronic Engineering and Automatic Control, National University of Defense Technology, Changsha, 410073, China

- Yunping Zhu
- State Key Laboratory of Proteomics, Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, Beijing, 102206, China
- National Engineering Research Center for Protein Drugs, Beijing 102206, China

- Hongwei Xie
- College of Mechatronic Engineering and Automatic Control, National University of Defense Technology, Changsha, 410073, China
34
Huang X, Huang L, Peng H, Guru A, Xue W, Hong SY, Liu M, Sharma S, Fu K, Caprez AP, Swanson DR, Zhang Z, Ding SJ. ISPTM: an iterative search algorithm for systematic identification of post-translational modifications from complex proteome mixtures. J Proteome Res 2013; 12:3831-42. [PMID: 23919725 DOI: 10.1021/pr4003883] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Indexed: 12/24/2022]
Abstract
Identifying protein post-translational modifications (PTMs) from tandem mass spectrometry data of complex proteome mixtures is a highly challenging task. Here we present a new strategy, named iterative search for identifying PTMs (ISPTM), for tackling this challenge. The ISPTM approach consists of a basic search with no variable modification, followed by iterative searches of many PTMs using a small number of them (usually two) in each search. The performance of the ISPTM approach was evaluated on mixtures of 70 synthetic peptides with known modifications, on an 18-protein standard mixture with unknown modifications and on real, complex biological samples of mouse nuclear matrix proteins with unknown modifications. ISPTM revealed that many chemical PTMs were introduced by urea and iodoacetamide during sample preparation and many biological PTMs, including dimethylation of arginine and lysine, were significantly activated by Adriamycin treatment in nuclear matrix associated proteins. ISPTM increased the MS/MS spectral identification rate substantially, displayed significantly better sensitivity for systematic PTM identification compared with that of the conventional all-in-one search approach, and offered PTM identification results that were complementary to InsPecT and MODa, both of which are established PTM identification algorithms. In summary, ISPTM is a new and powerful tool for unbiased identification of many different PTMs with high confidence from complex proteome mixtures.
Affiliation(s)
- Xin Huang
- Department of Pathology and Microbiology, University of Nebraska Medical Center, Omaha, Nebraska 68198, United States
35
Abstract
MOTIVATION: Statistical validation of protein identifications is an important issue in shotgun proteomics. The false discovery rate (FDR) is a powerful statistical tool for evaluating the protein identification result. Several research efforts have been made for FDR estimation at the protein level. However, there are still certain drawbacks in the existing FDR estimation methods based on the target-decoy strategy. RESULTS: In this article, we propose a decoy-free protein-level FDR estimation method. Under the null hypothesis that each candidate protein matches an identified peptide totally at random, we assign statistical significance to protein identifications in terms of the permutation P-value and use these P-values to calculate the FDR. Our method consists of three key steps: (i) generating random bipartite graphs with the same structure; (ii) calculating the protein scores on these random graphs; and (iii) calculating the permutation P-value and final FDR. As it is time-consuming or prohibitive to execute the protein inference algorithms thousands of times in step (ii), we first train a linear regression model using the original bipartite graph and identification scores provided by the target inference algorithm. Then we use the learned regression model as a substitute for the original protein inference method to predict protein scores on shuffled graphs. We test our method on six publicly available datasets. The results show that our method is comparable with state-of-the-art algorithms in terms of estimation accuracy. AVAILABILITY: The source code of our algorithm is available at: https://sourceforge.net/projects/plfdr/
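Step (iii) of this abstract can be illustrated with a short Python sketch. It assumes that null scores for a protein have already been obtained from the shuffled graphs, and it converts the resulting P-values to FDR estimates with the Benjamini-Hochberg procedure. The abstract does not say which P-value-to-FDR conversion the authors use, so Benjamini-Hochberg is an assumption here, as are the function names.

```python
from typing import List, Sequence


def permutation_p_value(observed: float, null_scores: Sequence[float]) -> float:
    """Fraction of null scores at least as large as the observed score.

    The +1 in numerator and denominator keeps the P-value strictly positive.
    """
    exceed = sum(1 for score in null_scores if score >= observed)
    return (1 + exceed) / (1 + len(null_scores))


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted values (assumed conversion step)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Toy example: one observed protein score against four shuffled-graph scores.
p_example = permutation_p_value(5.0, [1.0, 2.0, 3.0, 6.0])
q_example = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

The regression surrogate that the abstract uses to predict scores on shuffled graphs is not reproduced here; the sketch starts from the point where null scores are available.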
Affiliation(s)
- Ben Teng
- School of Software, Dalian University of Technology, Dalian 116621, China
36
Wang J, Lam H. Graph-based peak alignment algorithms for multiple liquid chromatography-mass spectrometry datasets. Bioinformatics 2013; 29:2469-76. [PMID: 23904508 DOI: 10.1093/bioinformatics/btt435] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/12/2022]
Abstract
Liquid chromatography coupled to mass spectrometry (LC-MS) is the dominant technological platform for proteomics. An LC-MS analysis of a complex biological sample can be visualized as a 'map' of which the positional coordinates are the mass-to-charge ratio (m/z) and chromatographic retention time (RT) of the chemical species profiled. Label-free quantitative proteomics requires the alignment and comparison of multiple LC-MS maps to ascertain the reproducibility of experiments or reveal proteome changes under different conditions. The main challenge in this task lies in correcting inevitable RT shifts. Similar, but not identical, LC instruments and settings can cause peptides to elute at very different times and sometimes in a different order, violating the assumptions of many state-of-the-art alignment tools. To meet this challenge, we developed LWBMatch, a new algorithm based on weighted bipartite matching. Unlike existing tools, which search for accurate warping functions to correct RT shifts, we directly seek a peak-to-peak mapping by maximizing a global similarity function between two LC-MS maps. For alignment tasks with large RT shifts (>500 s), an approximate warping function is determined by locally weighted scatterplot smoothing of potential matched features, detected using a novel voting scheme based on co-elution. For validation, we defined the ground truth for alignment success based on tandem mass spectrometry identifications from sequence searching. We showed that our method outperforms several existing tools in terms of precision and recall, and is capable of aligning maps from different instruments and settings. AVAILABILITY: https://sourceforge.net/projects/rt-alignment/
Affiliation(s)
- Jijie Wang
- Division of Biomedical Engineering and Department of Chemical and Biomolecular Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
37
He L, Han X, Ma B. De novo sequencing with limited number of post-translational modifications per peptide. J Bioinform Comput Biol 2013; 11:1350007. [DOI: 10.1142/s0219720013500078] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
Abstract
De novo sequencing derives the peptide sequence from a tandem mass spectrum without the assistance of protein databases. This analysis has been indispensable for the identification of novel or modified peptides in a biological sample. Currently, the speed of de novo sequencing algorithms is not heavily affected by the number of post-translational modification (PTM) types under consideration. However, the accuracy of the algorithms can degrade because of the increased search space. Most peptides in proteomics research contain only a small number of PTMs, yet the types of PTMs can come from a large number of choices. Therefore, it is desirable to include a large number of PTM types in a de novo sequencing algorithm while limiting the number of PTM occurrences in each peptide to increase accuracy. In this paper, we present an efficient de novo sequencing algorithm, DeNovoPTM, for this purpose. The implemented software is downloadable from http://www.cs.uwaterloo.ca/~l22he/denovo_ptm
Affiliation(s)
- Lin He
- David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada

- Xi Han
- David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada

- Bin Ma
- David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
38
Dong NP, Liang YZ, Yi LZ, Lu HM. Investigation of scrambled ions in tandem mass spectra, part 2. On the influence of the ions on peptide identification. Journal of the American Society for Mass Spectrometry 2013; 24:857-867. [PMID: 23504644 DOI: 10.1007/s13361-013-0591-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2012] [Revised: 01/19/2013] [Accepted: 01/20/2013] [Indexed: 06/01/2023]
Abstract
A comprehensive investigation was performed to understand the influence of sequence scrambling in peptide ions on peptide identification results. To achieve this, four tandem mass spectrometry datasets, each with scrambled ions included and excluded, were analyzed by Crux, X!Tandem, SpectraST, Lutefisk, and PepNovo. While the algorithms differed in their performance, an increase in the number of correctly identified peptides was generally observed when scrambled ions were removed, with the exception of the SpectraST algorithm. However, the variation of the match scores upon removal was unpredictable. Following these investigations, an interpretation was given of how the scrambled ions affect peptide identification. Lastly, a simulated theoretical mass spectral library derived from the NIST peptide libraries was constructed and searched by SpectraST to study whether scrambled ions in predicted mass spectra could affect peptide identification. Consistent with the peptide library search results, no significant variations in dot product scores or peptide identification results were observed when these ions were included in the theoretical MS/MS spectra. Of the five algorithms used, SpectraST and Crux provided the most robust results, whereas X!Tandem, PepNovo, and Lutefisk were sensitive to the presence of the scrambled ions, especially the latter two de novo sequencing algorithms.
Affiliation(s)
- Nai-ping Dong
- College of Chemistry and Chemical Engineering, Central South University, Changsha, People's Republic of China
39
Vaudel M, Sickmann A, Martens L. Current methods for global proteome identification. Expert Rev Proteomics 2013. [PMID: 23194269 DOI: 10.1586/epr.12.51] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.5] [Indexed: 12/21/2022]
Abstract
In a time frame of a few decades, protein identification went from laborious single protein identification to automated identification of entire proteomes. This shift was enabled by the emergence of peptide-centric, gel-free analyses, in particular the so-called shotgun approaches, which not only rely on extensive experiments, but also on cutting-edge data processing methods. The present review therefore provides an overview of a shotgun proteomics identification workflow, listing the state-of-the-art methods involved and software that implement these. The authors focus on freely available tools where possible. Finally, data analysis in the context of emerging across-omics studies will also be discussed briefly, where proteomics goes beyond merely delivering a list of protein accession numbers.
Affiliation(s)
- Marc Vaudel
- Leibniz-Institut für Analytische Wissenschaften-ISAS-e.V., Dortmund, Germany
40
A large synthetic peptide and phosphopeptide reference library for mass spectrometry–based proteomics. Nat Biotechnol 2013; 31:557-64. [DOI: 10.1038/nbt.2585] [Citation(s) in RCA: 146] [Impact Index Per Article: 13.3] [Received: 10/02/2012] [Accepted: 04/15/2013] [Indexed: 01/24/2023]
41
Xiao CL, Chen XZ, Du YL, Li ZF, Wei L, Zhang G, He QY. Dispec: a novel peptide scoring algorithm based on peptide matching discriminability. PLoS One 2013; 8:e62724. [PMID: 23675420 PMCID: PMC3652849 DOI: 10.1371/journal.pone.0062724] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 12/18/2012] [Accepted: 03/25/2013] [Indexed: 11/20/2022] Open
Abstract
Identifying peptides from fragmentation spectra is a fundamental step in mass spectrometry (MS) data processing. The significance (discriminability) of every peak varies, providing additional information for potentially enhancing the identification sensitivity and the correct match rate. However, this important information was not considered in previous algorithms. Here we present a novel method based on Peptide Matching Discriminability (PMD), in which the PMD information of every peak reflects the discriminability of candidate peptides. In addition, we developed a novel peptide scoring algorithm, Dispec, based on PMD, by taking three aspects of discriminability into consideration: PMD, intensity discriminability and m/z error discriminability. Compared with Mascot and Sequest, Dispec identified remarkably more peptides from three experimental datasets with the same confidence at 1% PSM-level FDR. Dispec is also robust and versatile for various datasets obtained on different instruments. The concept of discriminability enhances peptide identification and thus may contribute substantially to proteome studies. As an open-source program, Dispec is freely available at http://bioinformatics.jnu.edu.cn/software/dispec/.
Affiliation(s)
- Chuan-Le Xiao
- Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou, China

- Xiao-Zhou Chen
- School of Mathematics and Computer Science, Yunnan University of Nationalities, Kunming, China

- Yang-Li Du
- School of Mathematics and Computer Science, Yunnan University of Nationalities, Kunming, China

- Zhe-Fu Li
- Jinan University Network and Educational Technology Center, Guangzhou, China

- Li Wei
- School of Mathematics and Computer Science, Yunnan University of Nationalities, Kunming, China

- Gong Zhang
- Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou, China

- Qing-Yu He
- Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou, China
42
Mohimani H, Kim S, Pevzner PA. A new approach to evaluating statistical significance of spectral identifications. J Proteome Res 2013; 12:1560-8. [PMID: 23343606 DOI: 10.1021/pr300453t] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Indexed: 11/29/2022]
Abstract
While nonlinear peptide natural products such as Vancomycin and Daptomycin are among the most effective antibiotics, the computational techniques for sequencing such peptides are still in their infancy. Previous methods for sequencing peptide natural products are based on Nuclear Magnetic Resonance spectroscopy and require large amounts (milligrams) of purified materials. Recently, development of mass spectrometry-based methods has enabled accurate sequencing of nonlinear peptide natural products using picograms of material, but the question of evaluating statistical significance of Peptide Spectrum Matches (PSM) for these peptides remains open. Moreover, it is unclear how to decide whether a given spectrum is produced by a linear, cyclic, or branch-cyclic peptide. Surprisingly, all previous mass spectrometry studies overlooked the fact that a very similar problem has been successfully addressed in particle physics in 1951. In this work, we develop a method for estimating statistical significance of PSMs defined by any peptide (including linear and nonlinear). This method enables us to identify whether a peptide is linear, cyclic, or branch-cyclic, an important step toward identification of peptide natural products.
Affiliation(s)
- Hosein Mohimani
- Department of Electrical and Computer Engineering and Department of Computer Science and Engineering, University of California-San Diego, San Diego, California 92093
43
Vaudel M, Breiter D, Beck F, Rahnenführer J, Martens L, Zahedi RP. D-score: a search engine independent MD-score. Proteomics 2013; 13:1036-41. [PMID: 23307401 DOI: 10.1002/pmic.201200408] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.8] [Received: 09/04/2012] [Revised: 11/11/2012] [Accepted: 12/04/2012] [Indexed: 01/29/2023]
Abstract
While peptides carrying PTMs are routinely identified in gel-free MS, the localization of the PTMs onto the peptide sequences remains challenging. Search engine scores of secondary peptide matches have been used in different approaches in order to infer the quality of site inference, by penalizing the localization whenever the search engine similarly scored two candidate peptides with different site assignments. In the present work, we show how the estimation of posterior error probabilities for peptide candidates allows the estimation of a PTM score called the D-score, for multiple search engine studies. We demonstrate the applicability of this score to three popular search engines: Mascot, OMSSA, and X!Tandem, and evaluate its performance using an already published high resolution data set of synthetic phosphopeptides. For those peptides with phosphorylation site inference uncertainty, the number of spectrum matches with correctly localized phosphorylation increased by up to 25.7% when compared to using Mascot alone, although the actual increase depended on the fragmentation method used. Since this method relies only on search engine scores, it can be readily applied to the scoring of the localization of virtually any modification at no additional experimental or in silico cost.
Affiliation(s)
- Marc Vaudel
- Leibniz-Institut für Analytische Wissenschaften - ISAS - e.V., Dortmund, Germany
44
Wang P, Wilson SR. Mass spectrometry-based protein identification by integrating de novo sequencing with database searching. BMC Bioinformatics 2013; 14 Suppl 2:S24. [PMID: 23369017 PMCID: PMC3549845 DOI: 10.1186/1471-2105-14-s2-s24] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND: Mass spectrometry-based protein identification is a very challenging task. The main identification approaches include de novo sequencing and database searching. Both approaches have shortcomings, so an integrative approach has been developed. The integrative approach first infers partial peptide sequences, known as tags, directly from tandem spectra through de novo sequencing, and then puts these sequences into a database search to see if a close peptide match can be found. However, the current implementation of this integrative approach has several limitations. Firstly, simplistic de novo sequencing is applied and only very short sequence tags are used. Secondly, most integrative methods apply an algorithm similar to BLAST to search for exact sequence matches and do not accommodate sequence errors well. Thirdly, in these methods the integrated de novo sequencing makes a limited contribution to the scoring model, which is still largely based on database searching. RESULTS: We have developed a new integrative protein identification method which can integrate de novo sequencing more efficiently into database searching. Evaluated on large real datasets, our method outperforms popular identification methods.
Affiliation(s)
- Penghao Wang
- Prince of Wales Clinical School, University of New South Wales, Australia.
45
Huang T, Gong H, Yang C, He Z. ProteinLasso: A Lasso regression approach to protein inference problem in shotgun proteomics. Comput Biol Chem 2013; 43:46-54. [PMID: 23385215 DOI: 10.1016/j.compbiolchem.2012.12.008] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 08/26/2012] [Revised: 12/30/2012] [Accepted: 12/30/2012] [Indexed: 11/28/2022]
Abstract
Protein inference is an important issue in proteomics research. Its main objective is to select a proper subset of candidate proteins that best explain the observed peptides. Although many methods have been proposed for solving this problem, several issues such as peptide degeneracy and one-hit wonders remain unsolved. Therefore, the accurate identification of proteins that are truly present in the sample continues to be a challenging task. Based on the concept of peptide detectability, we formulate the protein inference problem as a constrained Lasso regression problem, which can be solved very efficiently through a coordinate descent procedure. The new inference algorithm, named ProteinLasso, uses an ensemble learning strategy to address the sparsity parameter selection problem in the Lasso model. We test the performance of ProteinLasso on three datasets. As shown in the experimental results, ProteinLasso outperforms state-of-the-art protein inference algorithms in terms of both identification accuracy and running efficiency. In addition, we show that ProteinLasso is stable under different parameter specifications. The source code of our algorithm is available at: http://sourceforge.net/projects/proteinlasso.
Affiliation(s)
- Ting Huang
- School of Software, Dalian University of Technology, China
46
Granholm V, Navarro JF, Noble WS, Käll L. Determining the calibration of confidence estimation procedures for unique peptides in shotgun proteomics. J Proteomics 2012; 80:123-31. [PMID: 23268117 DOI: 10.1016/j.jprot.2012.12.007] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Received: 07/14/2012] [Revised: 11/30/2012] [Accepted: 12/11/2012] [Indexed: 01/10/2023]
Abstract
The analysis of a shotgun proteomics experiment results in a list of peptide-spectrum matches (PSMs) in which each fragmentation spectrum has been matched to a peptide in a database. Subsequently, most protein inference algorithms rank peptides according to the best-scoring PSM for each peptide. However, there is disagreement in the scientific literature on the best method to assess the statistical significance of the resulting peptide identifications. Here, we use a previously described calibration protocol to evaluate the accuracy of three different peptide-level statistical confidence estimation procedures: the classical Fisher's method, and two complementary procedures that estimate significance, respectively, before and after selecting the top-scoring PSM for each spectrum. Our experiments show that the latter method, which is employed by MaxQuant and Percolator, produces the most accurate, well-calibrated results.
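For readers unfamiliar with the first of the three procedures named in this abstract, the following Python sketch shows the classical Fisher's method for combining the P-values of several PSMs that map to one peptide. It relies on the standard result that minus twice the sum of log P-values follows a chi-square distribution with 2k degrees of freedom under independence, and uses the closed-form survival function for even degrees of freedom. The function name and example values are illustrative assumptions; the calibration protocol and the two other procedures from the paper are not reproduced.

```python
import math
from typing import Sequence


def fisher_combined_p(p_values: Sequence[float]) -> float:
    """Combine k independent P-values (each in (0, 1]) with Fisher's method.

    The statistic x = -2 * sum(ln p_i) is chi-square with 2k degrees of
    freedom, whose survival function is exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("at least one P-value is required")
    half_x = -sum(math.log(p) for p in p_values)  # equals x / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


# Two PSMs for the same peptide, each with P-value 0.1 (toy values).
combined = fisher_combined_p([0.1, 0.1])
```

With a single P-value the function returns that value unchanged, which is a convenient sanity check on the formula.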
Affiliation(s)
- Viktor Granholm
- Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Solna, Sweden
- José Fernández Navarro
- Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology (KTH), Solna, Sweden
- William Stafford Noble
- Department of Genome Sciences, Department of Computer Science and Engineering, University of Washington, USA
- Lukas Käll
- Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology (KTH), Solna, Sweden
47
Shi J, Chen B, Wu FX. Unifying protein inference and peptide identification with feedback to update consistency between peptides. Proteomics 2012; 13:239-47. [PMID: 23111981 DOI: 10.1002/pmic.201200338] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 08/01/2012] [Revised: 10/07/2012] [Accepted: 10/11/2012] [Indexed: 11/11/2022]
Abstract
We first propose a new method to process peptide identification reports from database search engines. Building on it, we then develop a method for unifying protein inference and peptide identification by adding feedback from protein inference to peptide identification. The feedback information is a list of high-confidence proteins, which is used to update an adjacency matrix between peptides. The adjacency matrix is used in the regularization of peptide scores. Logistic regression (LR) is used to compute the probability of peptide identification from the regularized scores. Protein scores are then calculated from the LR probabilities of the peptides. Instead of selecting the best peptide match for each MS/MS spectrum, we select multiple peptides. Tests on two datasets show that the proposed method can robustly assign accurate probabilities to peptides and has higher discrimination power than PeptideProphet in distinguishing correctly from incorrectly identified peptides. Additionally, our method infers more true positive proteins and fewer false positive proteins than ProteinProphet at the same false positive rate. The coverage of inferred proteins is also significantly increased, owing to the selection of multiple peptides for each MS/MS spectrum and the improvement of their scores by the feedback from the inferred proteins.
Affiliation(s)
- Jinhong Shi
- Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
48
Xiao CL, Chen XZ, Du YL, Sun X, Zhang G, He QY. Binomial Probability Distribution Model-Based Protein Identification Algorithm for Tandem Mass Spectrometry Utilizing Peak Intensity Information. J Proteome Res 2012; 12:328-35. [DOI: 10.1021/pr300781t] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 11/28/2022]
Affiliation(s)
- Chuan-Le Xiao
- Institute of Life and Health Engineering, Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Jinan University, Guangzhou 510632, China
- Xiao-Zhou Chen
- School of Mathematics and Computer Science, Yunnan University of Nationalities, Kunming 650031, China
- Yang-Li Du
- School of Mathematics and Computer Science, Yunnan University of Nationalities, Kunming 650031, China
- Xuesong Sun
- Institute of Life and Health Engineering, Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Jinan University, Guangzhou 510632, China
- Gong Zhang
- Institute of Life and Health Engineering, Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Jinan University, Guangzhou 510632, China
- Qing-Yu He
- Institute of Life and Health Engineering, Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Jinan University, Guangzhou 510632, China
49
Yadav AK, Kumar D, Dash D. Learning from decoys to improve the sensitivity and specificity of proteomics database search results. PLoS One 2012. [PMID: 23189209 PMCID: PMC3506577 DOI: 10.1371/journal.pone.0050651] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 01/07/2023] Open
Abstract
The statistical validation of database search results is a complex issue in bottom-up proteomics. The scores of correct and incorrect peptide-spectrum matches (PSMs) overlap significantly, making an accurate assessment of true peptide matches challenging. Since complete separation between true and false hits is practically never achieved, there is a need for better methods and rescoring algorithms to improve upon the primary database search results. Here we describe the calibration and False Discovery Rate (FDR) estimation of database search scores through a dynamic FDR calculation method, FlexiFDR, which increases both the sensitivity and specificity of search results. By modelling a simple linear regression on the decoy hits for different charge states, the method maximized the number of true positives and reduced the number of false negatives in several standard datasets of varying complexity (18-mix, 49-mix, 200-mix) and a few complex datasets (E. coli and Yeast) obtained from a wide variety of MS platforms. The net positive gain for correct spectral and peptide identifications was up to 14.81% and 6.2%, respectively. The approach is applicable to different search methodologies: separate as well as concatenated database searches, high mass accuracy searches, and semi-tryptic and modification searches. FlexiFDR was also applied to Mascot results and showed better performance than before. We have shown that an appropriate threshold learnt from decoys can be very effective in improving database search results. FlexiFDR adapts itself to different instruments, data types and MS platforms. It learns from the decoy hits and sets a flexible threshold that automatically aligns itself to the underlying variables of data quality and size.
Affiliation(s)
- Amit Kumar Yadav
- GNR Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, Delhi, India
- Dhirendra Kumar
- GNR Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, Delhi, India
- Debasis Dash
- GNR Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, Delhi, India
50
Shi J, Wu FX. A feedback framework for protein inference with peptides identified from tandem mass spectra. Proteome Sci 2012; 10:68. [PMID: 23164319 PMCID: PMC3776439 DOI: 10.1186/1477-5956-10-68] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 07/09/2012] [Accepted: 11/02/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Protein inference is an important computational step in proteomics. There is a natural nested relationship between protein inference and peptide identification, but existing methods usually perform these two steps separately. We believe that both peptide identification and protein inference can be improved by exploiting this nested relationship. RESULTS: In this study, a feedback framework is proposed to process peptide identification reports from search engines, and an iterative method is implemented to exemplify the processing of Sequest peptide identification reports according to the framework. The iterative method is verified on two datasets with known validity of proteins and peptides, and compared with ProteinProphet and PeptideProphet. The results show that the iterative method not only infers more true positive and fewer false positive proteins than ProteinProphet, but also identifies more true positive and fewer false positive peptides than PeptideProphet. CONCLUSIONS: The proposed iterative method, implemented according to the feedback framework, can unify and improve the results of peptide identification and protein inference.
Affiliation(s)
- Jinhong Shi
- Division of Biomedical Engineering, University of Saskatchewan, 57 Campus Dr, Saskatoon, Canada.