1. Henke AN, Chilukuri S, Langan LM, Brooks BW. Reporting and reproducibility: Proteomics of fish models in environmental toxicology and ecotoxicology. The Science of the Total Environment 2024; 912:168455. [PMID: 37979845 DOI: 10.1016/j.scitotenv.2023.168455] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2023] [Revised: 11/06/2023] [Accepted: 11/07/2023] [Indexed: 11/20/2023]
Abstract
Environmental toxicology and ecotoxicology research efforts are employing proteomics with fish models as New Approach Methodologies, along with in silico, in vitro and other omics techniques to elucidate hazards of toxicants and toxins. We performed a critical review of toxicology studies with fish models using proteomics and reported fundamental parameters across experimental design, sample preparation, mass spectrometry, and bioinformatics of fish, which represent alternative vertebrate models in environmental toxicology, and routinely studied animals in ecotoxicology. We observed inconsistencies in reporting and methodologies among experimental designs, sample preparations, data acquisitions and bioinformatics, which can affect reproducibility of experimental results. We identified a distinct need to develop reporting guidelines for proteomics use in environmental toxicology and ecotoxicology, increased QA/QC throughout studies, and method optimization with an emphasis on reducing inconsistencies among studies. Several recommendations are offered as logical steps to advance development and application of this emerging research area to understand chemical hazards to public health and the environment.
Affiliation(s)
- Abigail N Henke
  - Department of Biology, Baylor University, Waco, TX, USA; Center for Reservoir and Aquatic Systems Research (CRASR), Baylor University, Waco, TX, USA
- Laura M Langan
  - Department of Environmental Science, Baylor University, Waco, TX, USA; Center for Reservoir and Aquatic Systems Research (CRASR), Baylor University, Waco, TX, USA
- Bryan W Brooks
  - Department of Environmental Science, Baylor University, Waco, TX, USA; Center for Reservoir and Aquatic Systems Research (CRASR), Baylor University, Waco, TX, USA
2. Reanalysis of ProteomicsDB Using an Accurate, Sensitive, and Scalable False Discovery Rate Estimation Approach for Protein Groups. Mol Cell Proteomics 2022; 21:100437. [PMID: 36328188 PMCID: PMC9718969 DOI: 10.1016/j.mcpro.2022.100437] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 06/01/2022] [Revised: 10/16/2022] [Accepted: 10/28/2022] [Indexed: 11/07/2022]
Abstract
Estimating false discovery rates (FDRs) of protein identification continues to be an important topic in mass spectrometry-based proteomics, particularly when analyzing very large datasets. One performant method for this purpose is the Picked Protein FDR approach which is based on a target-decoy competition strategy on the protein level that ensures that FDRs scale to large datasets. Here, we present an extension to this method that can also deal with protein groups, that is, proteins that share common peptides such as protein isoforms of the same gene. To obtain well-calibrated FDR estimates that preserve protein identification sensitivity, we introduce two novel ideas. First, the picked group target-decoy and second, the rescued subset grouping strategies. Using entrapment searches and simulated data for validation, we demonstrate that the new Picked Protein Group FDR method produces accurate protein group-level FDR estimates regardless of the size of the data set. The validation analysis also uncovered that applying the commonly used Occam's razor principle leads to anticonservative FDR estimates for large datasets. This is not the case for the Picked Protein Group FDR method. Reanalysis of deep proteomes of 29 human tissues showed that the new method identified up to 4% more protein groups than MaxQuant. Applying the method to the reanalysis of the entire human section of ProteomicsDB led to the identification of 18,000 protein groups at 1% protein group-level FDR. The analysis also showed that about 1250 genes were represented by ≥2 identified protein groups. To make the method accessible to the proteomics community, we provide a software tool including a graphical user interface that enables merging results from multiple MaxQuant searches into a single list of identified and quantified protein groups.
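The protein-level target-decoy competition that the abstract builds on can be sketched as follows. This is a minimal illustration of a basic "picked" competition followed by a running decoy-based FDR, not the authors' implementation, and it does not cover the protein-group extension or the rescued subset grouping. The function name `picked_protein_fdr`, the score dictionaries, and the tie-breaking rule are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def picked_protein_fdr(
    target_scores: Dict[str, float],
    decoy_scores: Dict[str, float],
) -> List[Tuple[str, bool, float]]:
    """Return (protein, is_decoy, running FDR) sorted by descending score.

    Each protein's target and decoy entries compete pairwise and only the
    higher-scoring one is kept (assumption: ties go to the decoy, which is
    the conservative choice).
    """
    winners = []
    for protein in set(target_scores) | set(decoy_scores):
        t = target_scores.get(protein, float("-inf"))
        d = decoy_scores.get(protein, float("-inf"))
        if t > d:
            winners.append((protein, False, t))
        else:
            winners.append((protein, True, d))

    # Rank the surviving entries from best to worst score.
    winners.sort(key=lambda w: w[2], reverse=True)

    results = []
    targets = decoys = 0
    for protein, is_decoy, _score in winners:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Running estimate: decoys accepted so far / targets accepted so far.
        results.append((protein, is_decoy, decoys / max(targets, 1)))
    return results


if __name__ == "__main__":
    # Hypothetical scores; higher is better.
    targets = {"A": 10.0, "B": 8.0, "C": 3.0}
    decoys = {"A": 2.0, "B": 9.0, "D": 1.0}
    for row in picked_protein_fdr(targets, decoys):
        print(row)
```

Because each target-decoy pair contributes at most one entry to the ranked list, the number of decoys cannot grow faster than the number of surviving pairs, which is the intuition behind the claim that the estimate scales to large data sets.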
3. Perez-Riverol Y. Proteomic repository data submission, dissemination, and reuse: key messages. Expert Rev Proteomics 2022; 19:297-310. [PMID: 36529941 PMCID: PMC7614296 DOI: 10.1080/14789450.2022.2160324] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]
Abstract
INTRODUCTION: The creation of ProteomeXchange data workflows in 2012 transformed the field of proteomics, consisting of the standardization of data submission and dissemination and enabling the widespread reanalysis of public MS proteomics data worldwide. ProteomeXchange has triggered a growing trend toward public dissemination of proteomics data, facilitating the assessment, reuse, comparative analyses, and extraction of new findings from public datasets. By 2022, the consortium is integrated by PRIDE, PeptideAtlas, MassIVE, jPOST, iProX, and Panorama Public. AREAS COVERED: Here, we review and discuss the current ecosystem of resources, guidelines, and file formats for proteomics data dissemination and reanalysis. Special attention is drawn to new exciting quantitative and post-translational modification-oriented resources. The challenges and future directions on data depositions including the lack of metadata and cloud-based and high-performance software solutions for fast and reproducible reanalysis of the available data are discussed. EXPERT OPINION: The success of ProteomeXchange and the amount of proteomics data available in the public domain have triggered the creation and/or growth of other protein knowledgebase resources. Data reuse is a leading, active, and evolving field; supporting the creation of new formats, tools, and workflows to rediscover and reshape the public proteomics data.
Affiliation(s)
- Yasset Perez-Riverol
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
4. Aggarwal S, Raj A, Kumar D, Dash D, Yadav AK. False discovery rate: the Achilles' heel of proteogenomics. Brief Bioinform 2022; 23:6582880. [PMID: 35534181 DOI: 10.1093/bib/bbac163] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 12/30/2021] [Revised: 03/14/2022] [Accepted: 04/12/2022] [Indexed: 12/25/2022]
Abstract
Proteogenomics refers to the integrated analysis of the genome and proteome that leverages mass-spectrometry (MS)-based proteomics data to improve genome annotations, understand gene expression control through proteoforms and find sequence variants to develop novel insights for disease classification and therapeutic strategies. However, proteogenomic studies often suffer from reduced sensitivity and specificity due to inflated database size. To control the error rates, proteogenomics depends on the target-decoy search strategy, the de-facto method for false discovery rate (FDR) estimation in proteomics. The proteogenomic databases constructed from three- or six-frame nucleotide database translation not only increase the search space and compute-time but also violate the equivalence of target and decoy databases. These searches result in poorer separation between target and decoy scores, leading to stringent FDR thresholds. Understanding these factors and applying modified strategies such as two-pass database search or peptide-class-specific FDR can result in a better interpretation of MS data without introducing additional statistical biases. Based on these considerations, a user can interpret the proteogenomics results appropriately and control false positives and negatives in a more informed manner. In this review, first, we briefly discuss the proteogenomic workflows and limitations in database construction, followed by various considerations that can influence potential novel discoveries in a proteogenomic study. We conclude with suggestions to counter these challenges for better proteogenomic data interpretation.
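The peptide-class-specific FDR mentioned in the abstract can be illustrated with a small sketch. It assumes each peptide-spectrum match carries a class label (for example "known" or "novel") and a decoy flag, and it estimates FDR as decoys divided by targets within each class. The function name, the labels, and the counts are hypothetical and are not taken from the review.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def class_specific_fdr(psms: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Estimate decoy-based FDR separately for each peptide class.

    psms: iterable of (peptide_class, is_decoy) pairs that already passed
    some score threshold.
    """
    counts = defaultdict(lambda: [0, 0])  # class -> [targets, decoys]
    for peptide_class, is_decoy in psms:
        counts[peptide_class][1 if is_decoy else 0] += 1
    return {
        cls: (decoys / targets if targets else float("inf"))
        for cls, (targets, decoys) in counts.items()
    }


if __name__ == "__main__":
    # Hypothetical accepted PSMs: many known peptides, few novel ones.
    psms = (
        [("known", False)] * 990 + [("known", True)] * 10
        + [("novel", False)] * 20 + [("novel", True)] * 5
    )
    global_fdr = sum(d for _, d in psms) / sum(not d for _, d in psms)
    print("global:", round(global_fdr, 4))
    print("per class:", class_specific_fdr(psms))
```

In this made-up example the pooled estimate stays near 1.5% while the novel class alone sits at 25%, which is the kind of hidden error that a pooled threshold can mask when novel peptides are a small fraction of the identifications.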
Affiliation(s)
- Suruchi Aggarwal
  - Translational Health Science and Technology Institute, NCR Biotech Science Cluster, 3rd milestone, PO Box No. 04, Faridabad-Gurgaon Expressway, Faridabad-121001, Haryana, India
- Anurag Raj
  - GN Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics & Integrative Biology, South Campus, Mathura Road, New Delhi 110025, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad-201002, India
- Dhirendra Kumar
  - GN Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics & Integrative Biology, South Campus, Mathura Road, New Delhi 110025, India
- Debasis Dash
  - GN Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics & Integrative Biology, South Campus, Mathura Road, New Delhi 110025, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad-201002, India
- Amit Kumar Yadav
  - Translational Health Science and Technology Institute, NCR Biotech Science Cluster, 3rd milestone, PO Box No. 04, Faridabad-Gurgaon Expressway, Faridabad-121001, Haryana, India
5. Proteome Discoverer-A Community Enhanced Data Processing Suite for Protein Informatics. Proteomes 2021; 9:proteomes9010015. [PMID: 33806881 PMCID: PMC8006021 DOI: 10.3390/proteomes9010015] [Citation(s) in RCA: 93] [Impact Index Per Article: 31.0] [Received: 01/25/2021] [Revised: 03/18/2021] [Accepted: 03/20/2021] [Indexed: 01/01/2023]
Abstract
Proteomics researchers today face an interesting challenge: how to choose among the dozens of data processing and analysis pipelines available for converting tandem mass spectrometry files to protein identifications. Due to the dominance of Orbitrap technology in proteomics in recent history, many researchers have defaulted to the vendor software Proteome Discoverer. Over the fourteen years since the initial release of the software, it has evolved in parallel with the increasingly complex demands faced by proteomics researchers. Today, Proteome Discoverer exists in two distinct forms with both powerful commercial versions and fully functional free versions in use in many labs today. Throughout the 11 main versions released to date, a central theme of the software has always been the ability to easily view and verify the spectra from which identifications are made. This ability is, even today, a key differentiator from other data analysis solutions. In this review I will attempt to summarize the history and evolution of Proteome Discoverer from its first launch to the versions in use today.
6. Sperk M, van Domselaar R, Rodriguez JE, Mikaeloff F, Sá Vinhas B, Saccon E, Sönnerborg A, Singh K, Gupta S, Végvári Á, Neogi U. Utility of Proteomics in Emerging and Re-Emerging Infectious Diseases Caused by RNA Viruses. J Proteome Res 2020; 19:4259-4274. [PMID: 33095583 PMCID: PMC7640957 DOI: 10.1021/acs.jproteome.0c00380] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 06/01/2020] [Indexed: 12/21/2022]
Abstract
Emerging and re-emerging infectious diseases due to RNA viruses cause major negative consequences for the quality of life, public health, and overall economic development. Most of the RNA viruses causing illnesses in humans are of zoonotic origin. Zoonotic viruses can directly be transferred from animals to humans through adaptation, followed by human-to-human transmission, such as in human immunodeficiency virus (HIV), severe acute respiratory syndrome coronavirus (SARS-CoV), Middle East respiratory syndrome coronavirus (MERS-CoV), and, more recently, SARS coronavirus 2 (SARS-CoV-2), or they can be transferred through insects or vectors, as in the case of Crimean-Congo hemorrhagic fever virus (CCHFV), Zika virus (ZIKV), and dengue virus (DENV). At the present, there are no vaccines or antiviral compounds against most of these viruses. Because proteins possess a vast array of functions in all known biological systems, proteomics-based strategies can provide important insights into the investigation of disease pathogenesis and the identification of promising antiviral drug targets during an epidemic or pandemic. Mass spectrometry technology has provided the capacity required for the precise identification and the sensitive and high-throughput analysis of proteins on a large scale and has contributed greatly to unravelling key protein-protein interactions, discovering signaling networks, and understanding disease mechanisms. In this Review, we present an account of quantitative proteomics and its application in some prominent recent examples of emerging and re-emerging RNA virus diseases like HIV-1, CCHFV, ZIKV, and DENV, with more detail with respect to coronaviruses (MERS-CoV and SARS-CoV) as well as the recent SARS-CoV-2 pandemic.
Affiliation(s)
- Maike Sperk
  - Division of Clinical Microbiology, Department of Laboratory Medicine, Karolinska Institute, ANA Futura, Campus Flemingsberg, Stockholm 14152, Sweden
- Robert van Domselaar
  - Division of Infectious Diseases, Department of Medicine Huddinge, Karolinska Institute, ANA Futura, Campus Flemingsberg, Stockholm 14152, Sweden
- Jimmy Esneider Rodriguez
  - Division of Chemistry I, Department of Medical Biochemistry and Biophysics, Karolinska Institute, Stockholm 14152, Sweden
- Flora Mikaeloff
  - Division of Clinical Microbiology, Department of Laboratory Medicine, Karolinska Institute, ANA Futura, Campus Flemingsberg, Stockholm 14152, Sweden
- Beatriz Sá Vinhas
  - Division of Clinical Microbiology, Department of Laboratory Medicine, Karolinska Institute, ANA Futura, Campus Flemingsberg, Stockholm 14152, Sweden
- Elisa Saccon
  - Division of Clinical Microbiology, Department of Laboratory Medicine, Karolinska Institute, ANA Futura, Campus Flemingsberg, Stockholm 14152, Sweden
- Anders Sönnerborg
  - Division of Clinical Microbiology, Department of Laboratory Medicine, Karolinska Institute, ANA Futura, Campus Flemingsberg, Stockholm 14152, Sweden
  - Division of Infectious Diseases, Department of Medicine Huddinge, Karolinska Institute, ANA Futura, Campus Flemingsberg, Stockholm 14152, Sweden
- Kamal Singh
  - Department of Molecular Microbiology and Immunology and the Bond Life Science Center, University of Missouri, Columbia, Missouri 65211, United States
- Soham Gupta
  - Division of Clinical Microbiology, Department of Laboratory Medicine, Karolinska Institute, ANA Futura, Campus Flemingsberg, Stockholm 14152, Sweden
- Ákos Végvári
  - Division of Chemistry I, Department of Medical Biochemistry and Biophysics, Karolinska Institute, Stockholm 14152, Sweden
- Ujjwal Neogi
  - Division of Clinical Microbiology, Department of Laboratory Medicine, Karolinska Institute, ANA Futura, Campus Flemingsberg, Stockholm 14152, Sweden
  - Department of Molecular Microbiology and Immunology and the Bond Life Science Center, University of Missouri, Columbia, Missouri 65211, United States
7. Agten A, Van Houtven J, Askenazi M, Burzykowski T, Laukens K, Valkenborg D. Visualizing the agreement of peptide assignments between different search engines. Journal of Mass Spectrometry: JMS 2020; 55:e4471. [PMID: 31713933 DOI: 10.1002/jms.4471] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/04/2019] [Revised: 10/23/2019] [Accepted: 10/28/2019] [Indexed: 06/10/2023]
Abstract
There is a trend in the analysis of shotgun proteomics data that aims to combine information from multiple search engines to increase the number of peptide annotations in an experiment. Typically, the degree of search engine complementarity and search engine agreement is visually illustrated by means of Venn diagrams that present the findings of a database search on the level of the nonredundant peptide annotations. We argue this practice to be not fit-for-purpose since the diagrams do not take into account and often conceal the information on complementarity and agreement at the level of the spectrum identification. We promote a new type of visualization that provides insight on the peptide sequence agreement at the level of the peptide-spectrum match (PSM) as a measure of consensus between two search engines with nominal outcomes. We applied the visualizations and percentage sequence agreement to an in-house data set of our benchmark organism, Caenorhabditis elegans, and illustrated that when assessing the agreement between search engine, one should disentangle the notion of PSM confidence and PSM identity. The visualizations presented in this manuscript provide a more informative assessment of pairs of search engines and are made available as an R function in the Supporting Information.
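The contrast between a peptide-level Venn overlap and spectrum-level agreement can be made concrete with a short sketch. It assumes each search engine's output is reduced to a mapping from spectrum identifier to assigned peptide sequence. The function names and the toy data are hypothetical, and the paper's exact definition of percentage sequence agreement may differ from the simple ratio used here.

```python
from typing import Dict, Tuple


def psm_agreement(engine_a: Dict[str, str], engine_b: Dict[str, str]) -> Dict[str, float]:
    """Compare two engines spectrum by spectrum.

    Only spectra identified by both engines are considered; agreement is
    the share of those spectra that received the same peptide sequence.
    """
    shared = engine_a.keys() & engine_b.keys()
    identical = sum(1 for s in shared if engine_a[s] == engine_b[s])
    percent = 100.0 * identical / len(shared) if shared else 0.0
    return {
        "shared_spectra": len(shared),
        "identical": identical,
        "percent_agreement": percent,
    }


def peptide_overlap(engine_a: Dict[str, str], engine_b: Dict[str, str]) -> Tuple[int, int, int]:
    """Venn-style counts of nonredundant peptides: (only A, both, only B)."""
    pa, pb = set(engine_a.values()), set(engine_b.values())
    return len(pa - pb), len(pa & pb), len(pb - pa)


if __name__ == "__main__":
    # Hypothetical assignments: both engines report the same peptides,
    # but attach two of them to different spectra.
    a = {"s1": "PEPTIDEK", "s2": "LVNELTEFAK", "s3": "AAAK"}
    b = {"s1": "PEPTIDEK", "s2": "AAAK", "s3": "LVNELTEFAK", "s4": "GGGR"}
    print(peptide_overlap(a, b))
    print(psm_agreement(a, b))
```

In this toy case every peptide reported by the first engine also appears in the second engine's list, so a Venn diagram would suggest near-complete agreement, while only one of the three shared spectra actually received the same sequence.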
Affiliation(s)
- Annelies Agten
  - Interuniversity Institute of Biostatistics and Statistical Bioinformatics, Hasselt University, Hasselt, Belgium
- Joris Van Houtven
  - Interuniversity Institute of Biostatistics and Statistical Bioinformatics, Hasselt University, Hasselt, Belgium
  - UA-VITO Center for Proteomics, University of Antwerp, Antwerp, Belgium
  - Applied Bio and Molecular Systems, Flemish Institute for Technological Research (VITO), Mol, Belgium
- Tomasz Burzykowski
  - Interuniversity Institute of Biostatistics and Statistical Bioinformatics, Hasselt University, Hasselt, Belgium
- Kris Laukens
  - Adrem Data Lab, Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
  - Biomedical Informatics Network Antwerp (biomina), University of Antwerp, Antwerp, Belgium
- Dirk Valkenborg
  - Interuniversity Institute of Biostatistics and Statistical Bioinformatics, Hasselt University, Hasselt, Belgium
  - UA-VITO Center for Proteomics, University of Antwerp, Antwerp, Belgium
  - Applied Bio and Molecular Systems, Flemish Institute for Technological Research (VITO), Mol, Belgium
8. Handler DCL, Haynes PA. Statistics in Proteomics: A Meta-analysis of 100 Proteomics Papers Published in 2019. Journal of the American Society for Mass Spectrometry 2020; 31:1337-1343. [PMID: 32324388 DOI: 10.1021/jasms.9b00142] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 06/11/2023]
Abstract
We randomly selected 100 journal articles published in five proteomics journals in 2019 and manually examined each of them against a set of 13 criteria concerning the statistical analyses used, all of which were based on items mentioned in the journals' instructions to authors. This included questions such as whether a pilot study was conducted and whether false discovery rate calculation was employed at either the quantitation or identification stage. These data were then transformed to binary inputs, analyzed via machine learning algorithms, and classified accordingly, with the aim of determining if clusters of data existed for specific journals or if certain statistical measures correlated with each other. We applied a variety of classification methods including principal component analysis decomposition, agglomerative clustering, and multinomial and Bernoulli naïve Bayes classification and found that none of these could readily determine journal identity given extracted statistical features. Logistic regression was useful in determining high correlative potential between statistical features such as false discovery rate criteria and multiple testing corrections methods, but was similarly ineffective at determining correlations between statistical features and specific journals. This meta-analysis highlights that there is a very wide variety of approaches being used in statistical analysis of proteomics data, many of which do not conform to published journal guidelines, and that contrary to implicit assumptions in the field there are no clear correlations between statistical methods and specific journals.
Affiliation(s)
- David C L Handler
  - Department of Molecular Sciences, Faculty of Science and Engineering, Macquarie University, Sydney, NSW 2109, Australia
- Paul A Haynes
  - Department of Molecular Sciences, Faculty of Science and Engineering, Macquarie University, Sydney, NSW 2109, Australia
9. Thomas SP, Haws SA, Borth LE, Denu JM. A practical guide for analysis of histone post-translational modifications by mass spectrometry: Best practices and pitfalls. Methods 2019; 184:53-60. [PMID: 31816396 DOI: 10.1016/j.ymeth.2019.12.001] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 09/14/2019] [Revised: 11/23/2019] [Accepted: 12/02/2019] [Indexed: 02/06/2023]
Abstract
Advances in mass spectrometry (MS) have revolutionized the ability to measure global changes in histone post-translational modifications (PTMs). The method routinely quantifies over 60 modification states in a single sample, far exceeding the capabilities of traditional western blotting. Thus, MS-based histone analysis has become an increasingly popular tool for understanding how genetic and environmental factors influence epigenetic states. However, histone proteomics experiments exhibit unique challenges, such as batch-to-batch reproducibility, accurate peak integration, and noisy data. Here, we discuss the steps of histone PTM analysis, from sample preparation and peak integration to data analysis and validation. We outline a set of best practices for ensuring data quality, accurate normalization, and robust statistics. Using these practices, we quantify histone modifications in 5 human cell lines, revealing that each cell line exhibits a unique epigenetic signature. We also provide a reproducible workflow for histone PTM analysis in the form of an R script, which is freely available at https://github.com/DenuLab/HistoneAnalysisWorkflow.
Affiliation(s)
- Sydney P Thomas
  - Wisconsin Institute for Discovery, 330 N. Orchard Street, Madison, WI, USA; Department of Biomolecular Chemistry, University of Wisconsin, Madison, 420 Henry Mall, Madison, WI, USA
- Spencer A Haws
  - Wisconsin Institute for Discovery, 330 N. Orchard Street, Madison, WI, USA; Department of Biomolecular Chemistry, University of Wisconsin, Madison, 420 Henry Mall, Madison, WI, USA
- Laura E Borth
  - Wisconsin Institute for Discovery, 330 N. Orchard Street, Madison, WI, USA; Department of Biomolecular Chemistry, University of Wisconsin, Madison, 420 Henry Mall, Madison, WI, USA
- John M Denu
  - Wisconsin Institute for Discovery, 330 N. Orchard Street, Madison, WI, USA; Department of Biomolecular Chemistry, University of Wisconsin, Madison, 420 Henry Mall, Madison, WI, USA
10. LeDuc RD, Fellers RT, Early BP, Greer JB, Shams DP, Thomas PM, Kelleher NL. Accurate Estimation of Context-Dependent False Discovery Rates in Top-Down Proteomics. Mol Cell Proteomics 2019; 18:796-805. [PMID: 30647073 PMCID: PMC6442365 DOI: 10.1074/mcp.ra118.000993] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 07/27/2018] [Revised: 01/04/2019] [Indexed: 11/06/2022]
Abstract
Within the last several years, top-down proteomics has emerged as a high throughput technique for protein and proteoform identification. This technique has the potential to identify and characterize thousands of proteoforms within a single study, but the absence of accurate false discovery rate (FDR) estimation could hinder the adoption and consistency of top-down proteomics in the future. In automated identification and characterization of proteoforms, FDR calculation strongly depends on the context of the search. The context includes MS data quality, the database being interrogated, the search engine, and the parameters of the search. Particular to top-down proteomics, there are four molecular levels of study: proteoform spectral match (PrSM), protein, isoform, and proteoform. Here, a context-dependent framework for calculating an accurate FDR at each level was designed, implemented, and validated against a manually curated training set with 546 confirmed proteoforms. We examined several search contexts and found that an FDR calculated at the PrSM level under-reported the true FDR at the protein level by an average of 24-fold. We present a new open-source tool, the TDCD_FDR_Calculator, which provides a scalable, context-dependent FDR calculation that can be applied post-search to enhance the quality of results in top-down proteomics from any search engine.
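Why an FDR estimated at the PrSM level can under-report the protein-level FDR is easy to see with a toy calculation. The sketch below is not the TDCD_FDR_Calculator. It assumes a simple decoy-based estimate (decoys divided by targets) and hypothetical counts in which correct matches pile up on a few proteins while incorrect matches scatter across many.

```python
from typing import Dict, Iterable, List, Tuple


def decoy_fdr(is_decoy_flags: Iterable[bool]) -> float:
    """Decoy-based FDR estimate: accepted decoys / accepted targets."""
    flags = list(is_decoy_flags)
    decoys = sum(flags)
    targets = len(flags) - decoys
    return decoys / targets if targets else 0.0


def collapse_to_proteins(prsms: List[Tuple[str, bool]]) -> Dict[str, bool]:
    """Collapse PrSMs to one entry per protein accession.

    prsms: (protein_id, is_decoy) pairs; decoy proteins are assumed to
    have their own accessions, so a protein is either target or decoy.
    """
    return {protein: is_decoy for protein, is_decoy in prsms}


if __name__ == "__main__":
    # Hypothetical search result: 96 target PrSMs spread over 8 proteins,
    # plus 4 decoy PrSMs that each hit a different decoy protein.
    prsms = [(f"T{i % 8}", False) for i in range(96)]
    prsms += [(f"DECOY_{i}", True) for i in range(4)]

    prsm_level = decoy_fdr(flag for _, flag in prsms)
    protein_level = decoy_fdr(collapse_to_proteins(prsms).values())
    print(f"PrSM-level FDR:    {prsm_level:.3f}")
    print(f"protein-level FDR: {protein_level:.3f}")
```

With these made-up numbers the PrSM-level estimate is about 4% while the protein-level estimate is 50%, a 12-fold gap. The exact ratio depends entirely on the assumed counts, but the direction matches the under-reporting the abstract describes.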
Affiliation(s)
- Richard D LeDuc
  - Proteomics Center of Excellence, Northwestern University, Evanston, Illinois
- Ryan T Fellers
  - Proteomics Center of Excellence, Northwestern University, Evanston, Illinois
- Bryan P Early
  - Proteomics Center of Excellence, Northwestern University, Evanston, Illinois; Department of Molecular Biosciences, Northwestern University, Evanston, Illinois
- Joseph B Greer
  - Proteomics Center of Excellence, Northwestern University, Evanston, Illinois
- Daniel P Shams
  - Interdisciplinary Biological Sciences, Northwestern University, Evanston, Illinois
- Paul M Thomas
  - Proteomics Center of Excellence, Northwestern University, Evanston, Illinois; Department of Molecular Biosciences, Northwestern University, Evanston, Illinois
- Neil L Kelleher
  - Proteomics Center of Excellence, Northwestern University, Evanston, Illinois; Department of Molecular Biosciences, Northwestern University, Evanston, Illinois; Department of Chemistry and the Feinberg School of Medicine, Northwestern University, Evanston, Illinois
11. Łącki MK, Lermyte F, Miasojedow B, Startek MP, Sobott F, Valkenborg D, Gambin A. masstodon: A Tool for Assigning Peaks and Modeling Electron Transfer Reactions in Top-Down Mass Spectrometry. Anal Chem 2019; 91:1801-1807. [DOI: 10.1021/acs.analchem.8b01479] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/28/2022]
Affiliation(s)
- Mateusz K. Łącki
  - University Medical Center, Johannes Gutenberg University, Mainz D-55131, Germany
- Frederik Lermyte
  - Biomolecular and Analytical Mass Spectrometry Group, Department of Chemistry, University of Antwerp, Antwerp 2020, Belgium
  - Centre for Proteomics, University of Antwerp, Antwerp 2000, Belgium
  - School of Engineering, University of Warwick, Coventry CV4 7AL, United Kingdom
- Błażej Miasojedow
  - Department of Mathematics, Informatics, and Mechanics, University of Warsaw, Warsaw 02-097, Poland
- Michał P. Startek
  - Department of Mathematics, Informatics, and Mechanics, University of Warsaw, Warsaw 02-097, Poland
- Frank Sobott
  - Biomolecular and Analytical Mass Spectrometry Group, Department of Chemistry, University of Antwerp, Antwerp 2020, Belgium
  - Astbury Centre for Structural Molecular Biology, University of Leeds, Leeds LS2 9JT, United Kingdom
  - School of Molecular and Cellular Biology, University of Leeds, Leeds LS2 9JT, United Kingdom
- Dirk Valkenborg
  - Centre for Proteomics, University of Antwerp, Antwerp 2000, Belgium
  - Flemish Institute for Technological Research (VITO), Mol 2400, Belgium
  - Interuniversity Institute for Biostatistics and Statistical Bioinformatics, Hasselt University, Hasselt 3500, Belgium
- Anna Gambin
  - Department of Mathematics, Informatics, and Mechanics, University of Warsaw, Warsaw 02-097, Poland
12. Henning J, Tostengard A, Smith R. A Peptide-Level Fully Annotated Data Set for Quantitative Evaluation of Precursor-Aware Mass Spectrometry Data Processing Algorithms. J Proteome Res 2018; 18:392-398. [PMID: 30394759 DOI: 10.1021/acs.jproteome.8b00659] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 12/31/2022]
Abstract
Modern label-free quantitative mass spectrometry workflows are complex experimental chains for devising the composition of biological samples. With benchtop and in silico experimental steps that each have a significant effect on the accuracy, coverage, and statistical significance of the study result, it is crucial to understand the efficacy and biases of each protocol decision. Although many studies have been conducted on wet lab experimental protocols, postacquisition data processing methods have not been adequately evaluated in large part due to a lack of available ground truth data. In this study, we provide a novel ground truth data set for mass spectrometry data analysis at the precursor (MS1) signal level comprised of isolated peptide signals from UPS2, a popular complex standard for proteomics analysis, requiring more than 1000 h of manual curation. The data set consists of more than 62 million points with 1,294,008 grouped into 57,518 extracted ion chromatograms and those grouped into 14,111 isotopic envelopes. This data set can be used to evaluate many aspects of mass spectrometry data processing, including precursor mapping and signal extraction algorithms.
Affiliation(s)
- Jessica Henning
  - Department of Computer Science, University of Montana, Missoula, Montana 59812, United States
- Annika Tostengard
  - Department of Computer Science, University of Montana, Missoula, Montana 59812, United States
- Rob Smith
  - Department of Computer Science, University of Montana, Missoula, Montana 59812, United States; Prime Laboratories, Inc., Missoula, Montana, United States
13. Borges H, Guibert R, Permiakova O, Burger T. Distinguishing between Spectral Clustering and Cluster Analysis of Mass Spectra. J Proteome Res 2018; 18:571-573. [PMID: 30394750 DOI: 10.1021/acs.jproteome.8b00516] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022]
Abstract
The term "spectral clustering" is sometimes used to refer to the clustering of mass spectrometry data. However, it also classically refers to a family of popular clustering algorithms. To avoid confusion, a more specific term could advantageously be coined.
Affiliation(s)
- Hélène Borges
  - Univ. Grenoble Alpes, CEA, INSERM, BIG-BGE, 38000 Grenoble, France
- Romain Guibert
  - Univ. Grenoble Alpes, CEA, INSERM, BIG-BGE, 38000 Grenoble, France; CNRS, BIG-BGE, F-38000 Grenoble, France
- Olga Permiakova
  - Univ. Grenoble Alpes, CEA, INSERM, BIG-BGE, 38000 Grenoble, France
- Thomas Burger
  - Univ. Grenoble Alpes, CEA, INSERM, BIG-BGE, 38000 Grenoble, France; CNRS, BIG-BGE, F-38000 Grenoble, France
14
|
Bittremieux W, Tabb DL, Impens F, Staes A, Timmerman E, Martens L, Laukens K. Quality control in mass spectrometry-based proteomics. MASS SPECTROMETRY REVIEWS 2018; 37:697-711. [PMID: 28802010 DOI: 10.1002/mas.21544] [Citation(s) in RCA: 67] [Impact Index Per Article: 11.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/04/2017] [Revised: 07/24/2017] [Accepted: 07/24/2017] [Indexed: 05/21/2023]
Abstract
Mass spectrometry is a highly complex analytical technique and mass spectrometry-based proteomics experiments can be subject to a large variability, which forms an obstacle to obtaining accurate and reproducible results. Therefore, a comprehensive and systematic approach to quality control is an essential requirement to inspire confidence in the generated results. A typical mass spectrometry experiment consists of multiple different phases including the sample preparation, liquid chromatography, mass spectrometry, and bioinformatics stages. We review potential sources of variability that can impact the results of a mass spectrometry experiment occurring in all of these steps, and we discuss how to monitor and remedy the negative influences on the experimental results. Furthermore, we describe how specialized quality control samples of varying sample complexity can be incorporated into the experimental workflow and how they can be used to rigorously assess detailed aspects of the instrument performance.
Affiliation(s)
- Wout Bittremieux
  - Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
  - Biomedical Informatics Research Center Antwerp (Biomina), University of Antwerp/Antwerp University Hospital, Edegem, Belgium
- David L Tabb
  - Division of Molecular Biology and Human Genetics, Stellenbosch University Faculty of Medicine and Health Sciences, Tygerberg Hospital, Cape Town, South Africa
- Francis Impens
  - VIB Proteomics Core, Ghent, Belgium
  - VIB-UGent Center for Medical Biotechnology, Ghent, Belgium
  - Faculty of Medicine and Health Sciences, Department of Biochemistry, Ghent University, Ghent, Belgium
- An Staes
  - VIB Proteomics Core, Ghent, Belgium
  - VIB-UGent Center for Medical Biotechnology, Ghent, Belgium
  - Faculty of Medicine and Health Sciences, Department of Biochemistry, Ghent University, Ghent, Belgium
- Evy Timmerman
  - VIB Proteomics Core, Ghent, Belgium
  - VIB-UGent Center for Medical Biotechnology, Ghent, Belgium
  - Faculty of Medicine and Health Sciences, Department of Biochemistry, Ghent University, Ghent, Belgium
- Lennart Martens
  - VIB-UGent Center for Medical Biotechnology, Ghent, Belgium
  - Faculty of Medicine and Health Sciences, Department of Biochemistry, Ghent University, Ghent, Belgium
  - Bioinformatics Institute Ghent, Ghent University, Zwijnaarde, Belgium
- Kris Laukens
  - Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
  - Biomedical Informatics Research Center Antwerp (Biomina), University of Antwerp/Antwerp University Hospital, Edegem, Belgium

15
The M, Edfors F, Perez-Riverol Y, Payne SH, Hoopmann MR, Palmblad M, Forsström B, Käll L. A Protein Standard That Emulates Homology for the Characterization of Protein Inference Algorithms. J Proteome Res 2018; 17:1879-1886. [PMID: 29631402 DOI: 10.1021/acs.jproteome.7b00899] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Indexed: 11/28/2022]
Abstract
A natural way to benchmark the performance of an analytical experimental setup is to use samples of known composition and see to what degree one can correctly infer the content of such a sample from the data. For shotgun proteomics, one of the inherent problems of interpreting data is that the measured analytes are peptides and not the actual proteins themselves. As some proteins share proteolytic peptides, there might be more than one possible causative set of proteins resulting in a given set of peptides and there is a need for mechanisms that infer proteins from lists of detected peptides. A weakness of commercially available samples of known content is that they consist of proteins that are deliberately selected for producing tryptic peptides that are unique to a single protein. Unfortunately, such samples do not expose any complications in protein inference. Hence, for a realistic benchmark of protein inference procedures, there is a need for samples of known content where the present proteins share peptides with known absent proteins. Here, we present such a standard, that is based on E. coli expressed human protein fragments. To illustrate the application of this standard, we benchmark a set of different protein inference procedures on the data. We observe that inference procedures excluding shared peptides provide more accurate estimates of errors compared to methods that include information from shared peptides, while still giving a reasonable performance in terms of the number of identified proteins. We also demonstrate that using a sample of known protein content without proteins with shared tryptic peptides can give a false sense of accuracy for many protein inference methods.
Affiliation(s)
- Matthew The
  - Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH - Royal Institute of Technology, Box 1031, 17121 Solna, Sweden
- Fredrik Edfors
  - Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH - Royal Institute of Technology, Box 1031, 17121 Solna, Sweden
- Yasset Perez-Riverol
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- Samuel H Payne
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Michael R Hoopmann
  - Institute for Systems Biology, Seattle, Washington 98109, United States
- Magnus Palmblad
  - Center for Proteomics and Metabolomics, Leiden University Medical Center, 2300 RC Leiden, The Netherlands
- Björn Forsström
  - Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH - Royal Institute of Technology, Box 1031, 17121 Solna, Sweden
- Lukas Käll
  - Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH - Royal Institute of Technology, Box 1031, 17121 Solna, Sweden

16
Burger T. Gentle Introduction to the Statistical Foundations of False Discovery Rate in Quantitative Proteomics. J Proteome Res 2017; 17:12-22. [DOI: 10.1021/acs.jproteome.7b00170] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Indexed: 11/29/2022]
Affiliation(s)
- Thomas Burger
  - BIG-BGE (Université Grenoble-Alpes, CNRS, CEA, INSERM), Grenoble 38000, France

17
Dowsey AW. The need for statistical contributions to bioinformatics at scale, with illustration to mass spectrometry. STAT MODEL 2017. [DOI: 10.1177/1471082x17708519] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
In their article, Morris and Baladandayuthapani clearly evidence the influence of statisticians in recent methodological advances throughout the bioinformatics pipeline and advocate for the expansion of this role. The latest acquisition platforms, such as next generation sequencing (genomics/transcriptomics) and hyphenated mass spectrometry (proteomics/metabolomics), output raw datasets in the order of gigabytes; it is not unusual to acquire a terabyte or more of data per study. The increasing computational burden this brings is a further impediment against the use of statistically rigorous methodology in the pre-processing stages of the bioinformatics pipeline. In this discussion I describe the mass spectrometry pipeline and use it as an example to show that beneath this challenge lies a two-fold opportunity: (a) Biological complexity and dynamic range is still well beyond what is captured by current processing methodology; hence, potential biomarkers and mechanistic insights are consistently missed; (b) Statistical science could play a larger role in optimizing the acquisition process itself. Data rates will continue to increase as routine clinical omics analysis moves to large-scale facilities with systematic, standardized protocols. Key inferential gains will be achieved by borrowing strength across the sum total of all analyzed studies, a task best underpinned by appropriate statistical modelling.
Affiliation(s)
- Andrew W Dowsey
  - School of Social & Community Medicine and School of Veterinary Sciences, Faculty of Health Sciences, University of Bristol, United Kingdom

18
Rosenberger G, Bludau I, Schmitt U, Heusel M, Hunter CL, Liu Y, MacCoss MJ, MacLean BX, Nesvizhskii AI, Pedrioli PGA, Reiter L, Röst HL, Tate S, Ting YS, Collins BC, Aebersold R. Statistical control of peptide and protein error rates in large-scale targeted data-independent acquisition analyses. Nat Methods 2017; 14:921-927. [PMID: 28825704 PMCID: PMC5581544 DOI: 10.1038/nmeth.4398] [Citation(s) in RCA: 145] [Impact Index Per Article: 20.7] [Received: 09/12/2016] [Accepted: 07/07/2017] [Indexed: 12/18/2022]
Abstract
Liquid chromatography coupled to tandem mass spectrometry is the main method for high-throughput identification and quantification of peptides and inferred proteins. Within this field, data-independent acquisition (DIA) combined with peptide-centric scoring, exemplified by SWATH-MS, emerged as a scalable method to achieve deep and consistent proteome coverage across large-scale datasets. Here we discuss the adaptation of statistical concepts developed for discovery proteomics based on spectrum-centric scoring to large-scale DIA experiments analyzed with peptide-centric scoring strategies and provide guidance on their application. We show that optimal tradeoffs between sensitivity and specificity require careful considerations of the relationship between proteins in the samples and proteins represented in the spectral library. We propose the application of a global analyte constraint to prevent accumulation of false positives across large-scale datasets. Furthermore, to increase the quality and reproducibility of published proteomic results, well-established confidence criteria should be reported for detected peptide queries, peptides and inferred proteins.
Affiliation(s)
- George Rosenberger
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland; PhD Program in Systems Biology, University of Zurich and ETH Zurich, Zurich, Switzerland
- Isabell Bludau
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland; PhD Program in Systems Biology, University of Zurich and ETH Zurich, Zurich, Switzerland
- Uwe Schmitt
  - ID Scientific IT Services, ETH Zurich, Zurich, Switzerland
- Moritz Heusel
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland; PhD program in Molecular and Translational Biomedicine, Competence Center Personalized Medicine (CC-PM), ETH Zurich and University of Zurich, Zurich, Switzerland
- Yansheng Liu
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
- Michael J MacCoss
  - Department of Genome Sciences, University of Washington, Seattle, Washington, USA
- Brendan X MacLean
  - Department of Genome Sciences, University of Washington, Seattle, Washington, USA
- Alexey I Nesvizhskii
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA; Department of Pathology, University of Michigan, Ann Arbor, Michigan, USA
- Patrick G A Pedrioli
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
- Hannes L Röst
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
- Ying S Ting
  - Department of Genome Sciences, University of Washington, Seattle, Washington, USA
- Ben C Collins
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
- Ruedi Aebersold
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland; Faculty of Science, University of Zurich, Zurich, Switzerland

19
Liu X, Guo Z, Sun H, Li W, Sun W. Comprehensive Map and Functional Annotation of Human Pituitary and Thyroid Proteome. J Proteome Res 2017; 16:2680-2691. [PMID: 28678506 DOI: 10.1021/acs.jproteome.6b00914] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Indexed: 01/31/2023]
Abstract
Knowledge about the human tissue proteome will provide insights into healthy organ physiology. To construct a comprehensive data set of human pituitary and thyroid proteins, post-mortem pituitaries and thyroids from 10 normal individuals were used. The pooled samples were prepared using two methods. One part of the sample was processed using an immunoaffinity column targeting 14 high-abundance proteins. The other part was directly subjected to digestion. Finally, a total of 7596 proteins in pituitary and 5602 proteins in thyroid were identified with high confidence, with 6623 and 4368 quantified, respectively. A total of 5781 pituitary and 3178 thyroid proteins have not been previously reported in the normal pituitary and thyroid proteome. Comparison of the pituitary and thyroid proteomes indicated that thyroid prefers to be involved in nervous system regeneration and metabolic regulation, while pituitary mainly performs functions of signal transduction and cancer modulation. Our results, for the first time, comprehensively profiled and functionally annotated the largest high-confidence proteome data set of two important endocrine glands, pituitary and thyroid, which is important for further studies on biomarker identification and molecular mechanisms of pituitary and thyroid disorders. The mapping results can be freely downloaded at http://www.urimarker.com/pituitary/ and http://www.urimarker.com/thyroid/. The raw data are available via ProteomeXchange with identifier PXD006471.
Affiliation(s)
- Xiaoyan Liu
  - Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences, School of Basic Medicine, Peking Union Medical College, Beijing 100005, China
- Zhengguang Guo
  - Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences, School of Basic Medicine, Peking Union Medical College, Beijing 100005, China
- Haidan Sun
  - Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences, School of Basic Medicine, Peking Union Medical College, Beijing 100005, China
- Wenting Li
  - Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences, School of Basic Medicine, Peking Union Medical College, Beijing 100005, China
- Wei Sun
  - Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences, School of Basic Medicine, Peking Union Medical College, Beijing 100005, China

20

21
van Ooijen MP, Jong VL, Eijkemans MJC, Heck AJR, Andeweg AC, Binai NA, van den Ham HJ. Identification of differentially expressed peptides in high-throughput proteomics data. Brief Bioinform 2017; 19:971-981. [DOI: 10.1093/bib/bbx031] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 12/06/2016] [Indexed: 12/25/2022]
Affiliation(s)
- Victor L Jong
  - Department of Biostatistics and Research Support, Julius Center, UMC Utrecht, Netherlands
- Marinus J C Eijkemans
  - Julius Center for Health Sciences and Primary Care of the University Medical Center Utrecht, Netherlands
- Albert J R Heck
  - Biomolecular Mass Spectrometry and Proteomics, Utrecht University, Netherlands
- Arno C Andeweg
  - Department of Viroscience, Erasmus MC, CA Rotterdam, Netherlands
- Nadine A Binai
  - Biomolecular Mass Spectrometry Group, Utrecht University, Netherlands

22
Zhang B, Pirmoradian M, Zubarev R, Käll L. Covariation of Peptide Abundances Accurately Reflects Protein Concentration Differences. Mol Cell Proteomics 2017; 16:936-948. [PMID: 28302922 PMCID: PMC5417831 DOI: 10.1074/mcp.o117.067728] [Citation(s) in RCA: 50] [Impact Index Per Article: 7.1] [Received: 02/06/2017] [Revised: 03/13/2017] [Indexed: 12/29/2022]
Abstract
Most implementations of mass spectrometry-based proteomics involve enzymatic digestion of proteins, expanding the analysis to multiple proteolytic peptides for each protein. Currently, there is no consensus on how to summarize peptides' abundances to protein concentrations, and such efforts are complicated by the fact that error control normally is applied to the identification process and does not directly control errors linking peptide abundance measures to protein concentration. Peptides resulting from suboptimal digestion or being partially modified are not representative of the protein concentration. Without a mechanism to remove such unrepresentative peptides, their abundance adversely impacts the estimation of their protein's concentration. Here, we present a relative quantification approach, Diffacto, that applies factor analysis to extract the covariation of peptides' abundances. The method enables a weighted geometrical average summarization and automatic elimination of incoherent peptides. We demonstrate, based on a set of controlled label-free experiments using standard mixtures of proteins, that the covariation structure extracted by the factor analysis accurately reflects protein concentrations. In the 1% peptide-spectrum match-level FDR data set, as many as 11% of the peptides have abundance differences incoherent with the other peptides attributed to the same protein. If not controlled, such contradictory peptide abundances have a severe impact on protein quantifications. When adding the quantities of each protein's three most abundant peptides, we note as many as 14% of the proteins being estimated as having a negative correlation with their actual concentration differences between samples. Diffacto reduced the proportion of such obviously incorrectly quantified proteins to 1.6%. Furthermore, by analyzing clinical data sets from two breast cancer studies, our method revealed the persistent proteomic signatures linked to three subtypes of breast cancer. We conclude that Diffacto can facilitate the interpretation and enhance the utility of most types of proteomics data.
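As a rough illustration of the weighted geometric-average summarization mentioned in this abstract, the sketch below combines peptide abundances into a single protein-level value and ignores peptides whose weight does not exceed a cutoff. The weights are simply taken as given (Diffacto derives them by factor analysis, which is not reproduced here), and the function name `summarize_protein` and the `min_weight` parameter are illustrative assumptions, not part of the published software.

```python
import math


def summarize_protein(abundances, weights, min_weight=0.0):
    """Weighted geometric mean of peptide abundances for one protein.

    abundances: positive peptide abundances in one sample.
    weights: one non-negative weight per peptide (assumed to be given).
    min_weight: hypothetical cutoff; peptides with weight <= cutoff are dropped.
    """
    if len(abundances) != len(weights):
        raise ValueError("abundances and weights must have the same length")
    kept = [(a, w) for a, w in zip(abundances, weights) if w > min_weight]
    if not kept:
        raise ValueError("no peptide passes the weight cutoff")
    total_weight = sum(w for _, w in kept)
    # Average in log space, then transform back: exp(sum(w * ln a) / sum(w)).
    log_mean = sum(w * math.log(a) for a, w in kept) / total_weight
    return math.exp(log_mean)


# Two coherent peptides and one low-weight outlier that is excluded.
print(summarize_protein([2.0, 8.0, 1000.0], [1.0, 1.0, 0.1], min_weight=0.5))
```

Averaging in log space keeps a single very intense peptide from dominating the protein estimate, which is one reason a geometric rather than arithmetic mean is a natural choice for abundance data.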
Affiliation(s)
- Bo Zhang
  - Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Scheeles väg 2, SE-17177 Solna, Sweden
- Mohammad Pirmoradian
  - Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Scheeles väg 2, SE-17177 Solna, Sweden; Department of Laboratory Medicine, Karolinska University Hospital Huddinge, SE-14186 Huddinge, Sweden
- Roman Zubarev
  - Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Scheeles väg 2, SE-17177 Solna, Sweden
- Lukas Käll
  - Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology-KTH, SE-17165 Solna, Sweden

23
Audain E, Uszkoreit J, Sachsenberg T, Pfeuffer J, Liang X, Hermjakob H, Sanchez A, Eisenacher M, Reinert K, Tabb DL, Kohlbacher O, Perez-Riverol Y. In-depth analysis of protein inference algorithms using multiple search engines and well-defined metrics. J Proteomics 2017; 150:170-182. [DOI: 10.1016/j.jprot.2016.08.002] [Citation(s) in RCA: 47] [Impact Index Per Article: 6.7] [Received: 05/20/2016] [Revised: 07/30/2016] [Accepted: 08/02/2016] [Indexed: 12/24/2022]

24
The M, MacCoss MJ, Noble WS, Käll L. Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0. JOURNAL OF THE AMERICAN SOCIETY FOR MASS SPECTROMETRY 2016; 27:1719-1727. [PMID: 27572102 PMCID: PMC5059416 DOI: 10.1007/s13361-016-1460-7] [Citation(s) in RCA: 240] [Impact Index Per Article: 30.0] [Received: 04/01/2016] [Revised: 06/15/2016] [Accepted: 07/20/2016] [Indexed: 05/21/2023]
Abstract
Percolator is a widely used software tool that increases yield in shotgun proteomics experiments and assigns reliable statistical confidence measures, such as q values and posterior error probabilities, to peptides and peptide-spectrum matches (PSMs) from such experiments. Percolator's processing speed has been sufficient for typical data sets consisting of hundreds of thousands of PSMs. With our new scalable approach, we can now also analyze millions of PSMs in a matter of minutes on a commodity computer. Furthermore, with the increasing awareness of the need for reliable statistics on the protein level, we compared several easy-to-understand protein inference methods and implemented the best-performing method (grouping proteins by their corresponding sets of theoretical peptides and then considering only the best-scoring peptide for each protein) in the Percolator package. We used Percolator 3.0 to analyze the data from a recent study of the draft human proteome containing 25 million spectra (PM:24870542). The source code and Ubuntu, Windows, MacOS, and Fedora binary packages are available from http://percolator.ms/ under an Apache 2.0 license.
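The protein-level idea described in this abstract (score each protein by its best-scoring peptide, then control errors with decoys) can be sketched as follows. This is a minimal illustration under stated assumptions, not Percolator's implementation: the grouping of proteins by shared theoretical peptide sets is omitted, the input layout and the function name `protein_q_values` are hypothetical, and the FDR at each threshold is estimated with the simple decoys-over-targets ratio.

```python
def protein_q_values(proteins):
    """Estimate protein-level q-values from target and decoy proteins.

    proteins: list of (name, is_decoy, peptide_scores) tuples, where a
    higher peptide score means a more confident identification.
    Returns a dict mapping each target protein name to its q-value.
    """
    # Protein score = score of its best peptide; rank best first.
    ranked = sorted(
        ((max(scores), name, is_decoy) for name, is_decoy, scores in proteins),
        key=lambda item: item[0],
        reverse=True,
    )
    # Running FDR estimate at each rank: decoys seen / targets seen.
    fdrs, targets, decoys = [], 0, 0
    for _, _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    # q-value = smallest FDR at this rank or any more permissive threshold.
    q_values, running_min = [], float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()
    return {
        name: q
        for (_, name, is_decoy), q in zip(ranked, q_values)
        if not is_decoy
    }
```

Taking only the best peptide per protein means that many weak peptide matches cannot accumulate into a high protein score, which is the property the abstract highlights for large data sets.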
Affiliation(s)
- Matthew The
  - Science for Life Laboratory, School of Biotechnology, KTH - Royal Institute of Technology, Box 1031, 17121, Solna, Sweden
- Michael J MacCoss
  - Department of Genome Sciences, School of Medicine, University of Washington, Seattle, WA, 98195, USA
- William S Noble
  - Department of Genome Sciences, School of Medicine, University of Washington, Seattle, WA, 98195, USA
  - Department of Computer Science and Engineering, University of Washington, Seattle, WA, 98195, USA
- Lukas Käll
  - Science for Life Laboratory, School of Biotechnology, KTH - Royal Institute of Technology, Box 1031, 17121, Solna, Sweden

25
The M, Tasnim A, Käll L. How to talk about protein-level false discovery rates in shotgun proteomics. Proteomics 2016; 16:2461-9. [PMID: 27503675 PMCID: PMC5096025 DOI: 10.1002/pmic.201500431] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 11/12/2015] [Revised: 05/12/2016] [Accepted: 07/20/2016] [Indexed: 12/04/2022]
Abstract
A frequently sought output from a shotgun proteomics experiment is a list of proteins that we believe to have been present in the analyzed sample before proteolytic digestion. The standard technique to control for errors in such lists is to enforce a preset threshold for the false discovery rate (FDR). Many consider protein-level FDRs a difficult and vague concept, as the measurement entities, spectra, are manifestations of peptides and not proteins. Here, we argue that this confusion is unnecessary and provide a framework on how to think about protein-level FDRs, starting from its basic principle: the null hypothesis. Specifically, we point out that two competing null hypotheses are used concurrently in today's protein inference methods, which has gone unnoticed by many. Using simulations of a shotgun proteomics experiment, we show how confusing one null hypothesis for the other can lead to serious discrepancies in the FDR. Furthermore, we demonstrate how the same simulations can be used to verify FDR estimates of protein inference methods. In particular, we show that, for a simple protein inference method, decoy models can be used to accurately estimate protein-level FDRs for both competing null hypotheses.
Affiliation(s)
- Matthew The
  - Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology - KTH, Solna, Sweden
- Ayesha Tasnim
  - Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology - KTH, Solna, Sweden
- Lukas Käll
  - Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology - KTH, Solna, Sweden

26
Kumar D, Bansal G, Narang A, Basak T, Abbas T, Dash D. Integrating transcriptome and proteome profiling: Strategies and applications. Proteomics 2016; 16:2533-2544. [PMID: 27343053 DOI: 10.1002/pmic.201600140] [Citation(s) in RCA: 106] [Impact Index Per Article: 13.3] [Received: 03/16/2016] [Revised: 06/12/2016] [Accepted: 06/23/2016] [Indexed: 12/17/2022]
Abstract
Discovering the gene expression signature associated with a cellular state is one of the basic quests in majority of biological studies. For most of the clinical and cellular manifestations, these molecular differences may be exhibited across multiple layers of gene regulation like genomic variations, gene expression, protein translation and post-translational modifications. These system wide variations are dynamic in nature and their crosstalk is overwhelmingly complex, thus analyzing them separately may not be very informative. This necessitates the integrative analysis of such multiple layers of information to understand the interplay of the individual components of the biological system. Recent developments in high throughput RNA sequencing and mass spectrometric (MS) technologies to probe transcripts and proteins made these as preferred methods for understanding global gene regulation. Subsequently, improvements in "big-data" analysis techniques enable novel conclusions to be drawn from integrative transcriptomic-proteomic analysis. The unified analyses of both these data types have been rewarding for several biological objectives like improving genome annotation, predicting RNA-protein quantities, deciphering gene regulations, discovering disease markers and drug targets. There are different ways in which transcriptomics and proteomics data can be integrated; each aiming for different research objectives. Here, we review various studies, approaches and computational tools targeted for integrative analysis of these two high-throughput omics methods.
Affiliation(s)
- Dhirendra Kumar
  - G.N. Ramachandran Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, New Delhi, India
- Gourja Bansal
  - G.N. Ramachandran Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, New Delhi, India
- Ankita Narang
  - G.N. Ramachandran Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, New Delhi, India
- Trayambak Basak
  - G.N. Ramachandran Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, New Delhi, India; Academy of Scientific & Innovative Research (AcSIR), CSIR-IGIB South Campus, New Delhi, India
- Tahseen Abbas
  - G.N. Ramachandran Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, New Delhi, India; Academy of Scientific & Innovative Research (AcSIR), CSIR-IGIB South Campus, New Delhi, India
- Debasis Dash
  - G.N. Ramachandran Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, New Delhi, India; Academy of Scientific & Innovative Research (AcSIR), CSIR-IGIB South Campus, New Delhi, India

27
Wright JC, Choudhary JS. DecoyPyrat: Fast Non-redundant Hybrid Decoy Sequence Generation for Large Scale Proteomics. JOURNAL OF PROTEOMICS & BIOINFORMATICS 2016; 9:176-180. [PMID: 27418748 PMCID: PMC4941923 DOI: 10.4172/jpb.1000404] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Indexed: 11/09/2022]
Abstract
Accurate statistical evaluation of sequence database peptide identifications from tandem mass spectra is essential in mass spectrometry based proteomics experiments. These statistics are dependent on accurately modelling random identifications. The target-decoy approach has risen to become the de facto approach to calculating FDR in proteomic datasets. The main principle of this approach is to search a set of decoy protein sequences that emulate the size and composition of the target protein sequences searched whilst not matching real proteins in the sample. To do this, it is commonplace to reverse or shuffle the proteins and peptides in the target database. However, these approaches have their drawbacks and limitations. A key confounding issue is the peptide redundancy between target and decoy databases leading to inaccurate FDR estimation. This inaccuracy is further amplified at the protein level and when searching large sequence databases such as those used for proteogenomics. Here, we present a unifying hybrid method to quickly and efficiently generate decoy sequences with minimal overlap between target and decoy peptides. We show that applying a reversed decoy approach can produce up to 5% peptide redundancy and many more additional peptides will have the exact same precursor mass as a target peptide. Our hybrid method addresses both these issues by first switching proteolytic cleavage sites with preceding amino acid, reversing the database and then shuffling any redundant sequences. This flexible hybrid method reduces the peptide overlap between target and decoy peptides to about 1% of peptides, making a more robust decoy model suitable for large search spaces. We also demonstrate the anti-conservative effect of redundant peptides on the calculation of q-values in mouse brain tissue data.
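The three steps named in this abstract (switch each cleavage site with the preceding amino acid, reverse the sequence, then shuffle any decoy peptide that still matches a target peptide) can be sketched as below. This is a simplified reading of the abstract, not the DecoyPyrat source: cleavage is assumed to be after K or R with the proline rule ignored, and the function names, the `max_tries` limit, and the choice to keep each peptide's last residue fixed while shuffling are illustrative assumptions.

```python
import random

CLEAVAGE_RESIDUES = frozenset("KR")  # trypsin-like sites; proline rule ignored


def switch_and_reverse(protein):
    """Swap each cleavage residue with its preceding residue, then reverse."""
    residues = list(protein)
    for i in range(1, len(residues)):
        if residues[i] in CLEAVAGE_RESIDUES:
            residues[i - 1], residues[i] = residues[i], residues[i - 1]
    return "".join(reversed(residues))


def digest(protein):
    """Split a sequence after every cleavage residue (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        if residue in CLEAVAGE_RESIDUES:
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def make_decoy(protein, target_peptides, rng, max_tries=20):
    """Build a decoy protein, shuffling peptides that collide with targets."""
    decoy_peptides = []
    for peptide in digest(switch_and_reverse(protein)):
        body, last = list(peptide[:-1]), peptide[-1]
        tries = 0
        # Reshuffle the peptide body until it no longer equals a target
        # peptide, giving up after max_tries (e.g. for very short peptides).
        while peptide in target_peptides and tries < max_tries:
            rng.shuffle(body)
            peptide = "".join(body) + last
            tries += 1
        decoy_peptides.append(peptide)
    return "".join(decoy_peptides)


targets = set(digest("ABCKDEFR"))
print(make_decoy("ABCKDEFR", targets, random.Random(0)))
```

Because the cleavage residue is moved before the reversal, the reversed sequence still yields peptides that end in K or R, so decoy peptides keep the same general shape as target peptides.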
Affiliation(s)
- James C Wright
  - Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- Jyoti S Choudhary
  - Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK

28
Bogdanow B, Zauber H, Selbach M. Systematic Errors in Peptide and Protein Identification and Quantification by Modified Peptides. Mol Cell Proteomics 2016; 15:2791-801. [PMID: 27215553 DOI: 10.1074/mcp.m115.055103] [Citation(s) in RCA: 49] [Impact Index Per Article: 6.1] [Received: 08/28/2015] [Indexed: 01/17/2023]
Abstract
The principle of shotgun proteomics is to use peptide mass spectra in order to identify corresponding sequences in a protein database. The quality of peptide and protein identification and quantification critically depends on the sensitivity and specificity of this assignment process. Many peptides in proteomic samples carry biochemical modifications, and a large fraction of unassigned spectra arise from modified peptides. Spectra derived from modified peptides can erroneously be assigned to wrong amino acid sequences. However, the impact of this problem on proteomic data has not yet been investigated systematically. Here we use combinations of different database searches to show that modified peptides can be responsible for 20-50% of false positive identifications in deep proteomic data sets. These false positive hits are particularly problematic as they have significantly higher scores and higher intensities than other false positive matches. Furthermore, these wrong peptide assignments lead to hundreds of false protein identifications and systematic biases in protein quantification. We devise a "cleaned search" strategy to address this problem and show that this considerably improves the sensitivity and specificity of proteomic data. In summary, we show that modified peptides cause systematic errors in peptide and protein identification and quantification and should therefore be considered to further improve the quality of proteomic data annotation.
Affiliation(s)
- Boris Bogdanow
- Proteome Dynamics lab, Max Delbrück Center for Molecular Medicine, Robert-Rössle-Str. 13, 13092 Berlin, Germany
- Henrik Zauber
- Proteome Dynamics lab, Max Delbrück Center for Molecular Medicine, Robert-Rössle-Str. 13, 13092 Berlin, Germany
- Matthias Selbach
- Proteome Dynamics lab, Max Delbrück Center for Molecular Medicine, Robert-Rössle-Str. 13, 13092 Berlin, Germany
|
29
|
Maes E, Kelchtermans P, Bittremieux W, De Grave K, Degroeve S, Hooyberghs J, Mertens I, Baggerman G, Ramon J, Laukens K, Martens L, Valkenborg D. Designing biomedical proteomics experiments: state-of-the-art and future perspectives. Expert Rev Proteomics 2016; 13:495-511. [PMID: 27031651 DOI: 10.1586/14789450.2016.1172967] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 11/08/2022]
Abstract
With the current expanded technical capabilities to perform mass spectrometry-based biomedical proteomics experiments, an improved focus on the design of experiments is crucial. As it is clear that ignoring the importance of a good design leads to an unprecedented rate of false discoveries which would poison our results, more and more tools are being developed to help researchers design proteomic experiments. In this review, we apply statistical thinking to go through the entire proteomics workflow for biomarker discovery and validation and relate the considerations that should be made at the level of hypothesis building, technology selection, experimental design and the optimization of the experimental parameters.
Affiliation(s)
- Evelyne Maes
- a Applied Bio & molecular systems, VITO, Mol, Belgium; b CFP, University of Antwerp, Antwerp, Belgium
- Pieter Kelchtermans
- b CFP, University of Antwerp, Antwerp, Belgium; c Medical Biotechnology Center, VIB, Ghent, Belgium; d Department of Biochemistry, Ghent University, Ghent, Belgium; e Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium
- Wout Bittremieux
- f Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium; g Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp/Antwerp University Hospital, Antwerp, Belgium
- Kurt De Grave
- h Department of Computer Science, KU Leuven, Leuven, Belgium
- Sven Degroeve
- c Medical Biotechnology Center, VIB, Ghent, Belgium; d Department of Biochemistry, Ghent University, Ghent, Belgium; e Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium
- Jef Hooyberghs
- a Applied Bio & molecular systems, VITO, Mol, Belgium
- Inge Mertens
- a Applied Bio & molecular systems, VITO, Mol, Belgium; b CFP, University of Antwerp, Antwerp, Belgium
- Geert Baggerman
- a Applied Bio & molecular systems, VITO, Mol, Belgium; b CFP, University of Antwerp, Antwerp, Belgium
- Jan Ramon
- h Department of Computer Science, KU Leuven, Leuven, Belgium; i INRIA, Lille, France
- Kris Laukens
- f Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium; g Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp/Antwerp University Hospital, Antwerp, Belgium
- Lennart Martens
- c Medical Biotechnology Center, VIB, Ghent, Belgium; d Department of Biochemistry, Ghent University, Ghent, Belgium; e Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium
- Dirk Valkenborg
- a Applied Bio & molecular systems, VITO, Mol, Belgium; b CFP, University of Antwerp, Antwerp, Belgium; j Interuniversity Institute for Biostatistics and statistical Bioinformatics, Hasselt University, Hasselt, Belgium
|
30
|
Blattmann P, Heusel M, Aebersold R. SWATH2stats: An R/Bioconductor Package to Process and Convert Quantitative SWATH-MS Proteomics Data for Downstream Analysis Tools. PLoS One 2016; 11:e0153160. [PMID: 27054327 PMCID: PMC4824525 DOI: 10.1371/journal.pone.0153160] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Received: 02/07/2016] [Accepted: 03/24/2016] [Indexed: 11/19/2022]
Abstract
SWATH-MS is an acquisition and analysis technique of targeted proteomics that enables measuring several thousand proteins with high reproducibility and accuracy across many samples. OpenSWATH is popular open-source software for peptide identification and quantification from SWATH-MS data. For downstream statistical and quantitative analysis there exist different tools such as MSstats, mapDIA and aLFQ. However, the transfer of data from OpenSWATH to the downstream statistical tools is currently technically challenging. Here we introduce the R/Bioconductor package SWATH2stats, which allows convenient processing of the data into a format directly readable by the downstream analysis tools. In addition, SWATH2stats allows annotation, analyzing the variation and the reproducibility of the measurements, FDR estimation, and advanced filtering before submitting the processed data to downstream tools. These functionalities are important to quickly analyze the quality of the SWATH-MS data. Hence, SWATH2stats is a new open-source tool that summarizes several practical functionalities for analyzing, processing, and converting SWATH-MS data and thus facilitates the efficient analysis of large-scale SWATH/DIA datasets.
Affiliation(s)
- Peter Blattmann
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, 8093, Zurich, Switzerland
- Moritz Heusel
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, 8093, Zurich, Switzerland
- PhD program in Molecular and Translational Biomedicine, Competence Center Personalized Medicine UZH/ETH & Life Science Zurich Graduate School, ETH Zurich and University of Zurich, 8044, Zurich, Switzerland
- Ruedi Aebersold
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, 8093, Zurich, Switzerland
- Faculty of Science, University of Zurich, 8057, Zurich, Switzerland
|
31
|
The M, Käll L. MaRaCluster: A Fragment Rarity Metric for Clustering Fragment Spectra in Shotgun Proteomics. J Proteome Res 2016; 15:713-20. [PMID: 26653874 DOI: 10.1021/acs.jproteome.5b00749] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Indexed: 11/28/2022]
Abstract
Shotgun proteomics experiments generate large amounts of fragment spectra as primary data, normally with high redundancy between and within experiments. Here, we have devised a clustering technique to identify fragment spectra stemming from the same species of peptide. This is a powerful alternative method to traditional search engines for analyzing spectra, specifically useful for larger scale mass spectrometry studies. As an aid in this process, we propose a distance calculation relying on the rarity of experimental fragment peaks, following the intuition that peaks shared by only a few spectra offer more evidence than peaks shared by a large number of spectra. We used this distance calculation and a complete-linkage scheme to cluster data from a recent large-scale mass spectrometry-based study. The clusterings produced by our method have up to 40% more identified peptides for their consensus spectra compared to those produced by the previous state-of-the-art method. We see that our method would advance the construction of spectral libraries as well as serve as a tool for mining large sets of fragment spectra. The source code and Ubuntu binary packages are available at https://github.com/statisticalbiotechnology/maracluster (under an Apache 2.0 license).
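The intuition stated in this abstract, that peaks shared by few spectra carry more evidence than peaks shared by many, can be sketched in Python. This is not the MaRaCluster metric itself: it is a stand-in that weights each binned peak by the negative log of the fraction of spectra containing it and computes a weighted Jaccard distance. Spectra are assumed to be sets of integer m/z bins, and the names `rarity_weights` and `rarity_distance` are hypothetical.

```python
import math
from collections import Counter


def rarity_weights(spectra):
    """Weight each peak bin by -log(fraction of spectra that contain it)."""
    n = len(spectra)
    counts = Counter(bin_ for spectrum in spectra for bin_ in set(spectrum))
    return {bin_: -math.log(count / n) for bin_, count in counts.items()}


def rarity_distance(spectrum_a, spectrum_b, weights):
    """Weighted Jaccard distance: shared rare peaks pull two spectra together."""
    a, b = set(spectrum_a), set(spectrum_b)
    union_weight = sum(weights.get(bin_, 0.0) for bin_ in a | b)
    if union_weight == 0.0:
        return 1.0  # no informative peaks, so treat the pair as unrelated
    shared_weight = sum(weights.get(bin_, 0.0) for bin_ in a & b)
    return 1.0 - shared_weight / union_weight


# Toy data: bin 100 occurs in every spectrum, so it carries zero weight.
spectra = [
    {100, 200, 300},
    {100, 200, 400},
    {100, 500, 600},
    {100, 700, 800},
]
weights = rarity_weights(spectra)
d_rare = rarity_distance(spectra[0], spectra[1], weights)    # share bins 100 and 200
d_common = rarity_distance(spectra[0], spectra[2], weights)  # share only bin 100
```

In this toy set, the pair sharing the rarer bin 200 ends up closer than the pair sharing only the ubiquitous bin 100. A pairwise distance matrix built this way could be passed to a complete-linkage clustering routine, which is the linkage scheme the abstract mentions.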
Affiliation(s)
- Matthew The
- Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology - KTH, Box 1031, 17121 Solna, Sweden
- Lukas Käll
- Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology - KTH, Box 1031, 17121 Solna, Sweden
|
32
|
Zhang B, Käll L, Zubarev RA. DeMix-Q: Quantification-Centered Data Processing Workflow. Mol Cell Proteomics 2016; 15:1467-78. [PMID: 26729709 DOI: 10.1074/mcp.o115.055475] [Citation(s) in RCA: 55] [Impact Index Per Article: 6.9] [Received: 09/16/2015] [Indexed: 12/31/2022]
Abstract
For historical reasons, most proteomics workflows focus on MS/MS identification but consider quantification as the end point of a comparative study. The stochastic data-dependent MS/MS acquisition (DDA) gives low reproducibility of peptide identifications from one run to another, which inevitably results in problems with missing values when quantifying the same peptide across a series of label-free experiments. However, the signal from the molecular ion is almost always present among the MS1 spectra. Contrary to what is frequently claimed, missing values do not have to be an intrinsic problem of DDA approaches that perform quantification at the MS1 level. The challenge is to perform sound peptide identity propagation across multiple high-resolution LC-MS/MS experiments, from runs with MS/MS-based identifications to runs where such information is absent. Here, we present a new analytical workflow DeMix-Q (https://github.com/userbz/DeMix-Q), which performs such propagation that recovers missing values reliably by using a novel scoring scheme for quality control. Compared with traditional workflows for DDA as well as previous DIA studies, DeMix-Q achieves deeper proteome coverage, fewer missing values, and lower quantification variance on a benchmark dataset. This quantification-centered workflow also enables flexible and robust proteome characterization based on covariation of peptide abundances.
Affiliation(s)
- Bo Zhang
- Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Scheeles väg 2, SE-17177 Solna, Sweden
- Lukas Käll
- Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology-KTH, 17165 Solna, Sweden
- Roman A Zubarev
- Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Scheeles väg 2, SE-17177 Solna, Sweden
|
33
|
Gatto L, Hansen KD, Hoopmann MR, Hermjakob H, Kohlbacher O, Beyer A. Testing and Validation of Computational Methods for Mass Spectrometry. J Proteome Res 2015; 15:809-14. [PMID: 26549429 DOI: 10.1021/acs.jproteome.5b00852] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Indexed: 12/15/2022]
Abstract
High-throughput methods based on mass spectrometry (proteomics, metabolomics, lipidomics, etc.) produce a wealth of data that cannot be analyzed without computational methods. The impact of the choice of method on the overall result of a biological study is often underappreciated, but different methods can result in very different biological findings. It is thus essential to evaluate and compare the correctness and relative performance of computational methods. The volume of the data as well as the complexity of the algorithms render unbiased comparisons challenging. This paper discusses some problems and challenges in testing and validation of computational methods. We discuss the different types of data (simulated and experimental validation data) as well as different metrics to compare methods. We also introduce a new public repository for mass spectrometric reference data sets (http://compms.org/RefData) that contains a collection of publicly available data sets for performance evaluation for a wide range of different methods.
Affiliation(s)
- Laurent Gatto
- Computational Proteomics Unit and Cambridge Centre for Proteomics, University of Cambridge, Cambridge CB2 1QR, United Kingdom
- Kasper D Hansen
- Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland 21205, United States; Institute of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland 21205, United States
- Michael R Hoopmann
- Institute for Systems Biology, Seattle, Washington 98109, United States
- Henning Hermjakob
- European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom; National Center for Protein Sciences, Beijing, China
- Oliver Kohlbacher
- Quantitative Biology Center, Universität Tübingen, Auf der Morgenstelle 10, 72076 Tübingen, Germany; Center for Bioinformatics, Universität Tübingen, Sand 14, 72076 Tübingen, Germany; Dept. of Computer Science, Universität Tübingen, Sand 14, 72076 Tübingen, Germany; Biomolecular Interactions, Max Planck Institute for Developmental Biology, Spemannstr. 35, 72076 Tübingen, Germany
- Andreas Beyer
- CECAD, University of Cologne, 50931 Cologne, Germany
|
34
|
Ezkurdia I, Calvo E, Del Pozo A, Vázquez J, Valencia A, Tress ML. The potential clinical impact of the release of two drafts of the human proteome. Expert Rev Proteomics 2015; 12:579-93. [PMID: 26496066 PMCID: PMC4732427 DOI: 10.1586/14789450.2015.1103186] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Indexed: 12/20/2022]
Abstract
The authors have carried out an investigation of the two "draft maps of the human proteome" published in 2014 in Nature. The findings include an abundance of poor spectra, low-scoring peptide-spectrum matches and incorrectly identified proteins in both these studies, highlighting clear issues with the application of false discovery rates. This noise means that the claims made by the two papers - the identification of high numbers of protein coding genes, the detection of novel coding regions and the draft tissue maps themselves - should be treated with considerable caution. The authors recommend that clinicians and researchers do not use the unfiltered data from these studies. Despite this, these studies will inspire further investigation into tissue-based proteomics. As long as this future work has proper quality controls, it could help produce a consensus map of the human proteome and improve our understanding of the processes that underlie health and disease.
Affiliation(s)
- Iakes Ezkurdia
- Unidad de Proteómica, Centro Nacional de Investigaciones Cardiovasculares, CNIC, Madrid, Spain
- Enrique Calvo
- Unidad de Proteómica, Centro Nacional de Investigaciones Cardiovasculares, CNIC, Madrid, Spain
- Angela Del Pozo
- Instituto de Genetica Medica y Molecular, Hospital Universitario La Paz, Madrid, Spain
- Jesús Vázquez
- Laboratorio de Proteómica Cardiovascular, Centro Nacional de Investigaciones Cardiovasculares, CNIC, Madrid, Spain
- Alfonso Valencia
- Structural Biology and Bioinformatics Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- National Bioinformatics Institute (INB), Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Michael L. Tress
- Structural Biology and Bioinformatics Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
|