1
|
Feng S, Ji HL, Wang H, Zhang B, Sterzenbach R, Pan C, Guo X. MetaLP: An integrative linear programming method for protein inference in metaproteomics. PLoS Comput Biol 2022; 18:e1010603. [PMID: 36269761 PMCID: PMC9629623 DOI: 10.1371/journal.pcbi.1010603] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2022] [Revised: 11/02/2022] [Accepted: 09/26/2022] [Indexed: 11/07/2022] Open
Abstract
Metaproteomics based on high-throughput tandem mass spectrometry (MS/MS) plays a crucial role in characterizing microbiome functions. The acquired MS/MS data is searched against a protein sequence database to identify peptides, which are then used to infer a list of proteins present in a metaproteome sample. While the problem of protein inference has been well-studied for proteomics of single organisms, it remains a major challenge for metaproteomics of complex microbial communities because of the large number of degenerate peptides shared among homologous proteins in different organisms. This challenge calls for improved discrimination of true protein identifications from false protein identifications given a set of unique and degenerate peptides identified in metaproteomics. MetaLP was developed here for protein inference in metaproteomics using an integrative linear programming method. Taxonomic abundance information extracted from metagenomics shotgun sequencing or 16s rRNA gene amplicon sequencing, was incorporated as prior information in MetaLP. Benchmarking with mock, human gut, soil, and marine microbial communities demonstrated significantly higher numbers of protein identifications by MetaLP than ProteinLP, PeptideProphet, DeepPep, PIPQ, and Sipros Ensemble. In conclusion, MetaLP could substantially improve protein inference for complex metaproteomes by incorporating taxonomic abundance information in a linear programming model.
Collapse
Affiliation(s)
- Shichao Feng
- Department of Computer Science and Engineering, University of North Texas, Denton, Texas, United States of America
| | - Hong-Long Ji
- Department of Cellular and Molecular Biology, University of Texas at Tyler, Tyler, Texas, United States of America
- Texas Lung Injury Institute, University of Texas at Tyler, Tyler, Texas, United States of America
| | - Huan Wang
- College of Informatics, Huazhong Agricultural University, Wuhan, Hubei, CHINA
| | - Bailu Zhang
- Department of Computer Science and Engineering, University of North Texas, Denton, Texas, United States of America
| | - Ryan Sterzenbach
- Department of Computer Science and Engineering, University of North Texas, Denton, Texas, United States of America
- Department of Biomedical Engineering, University of North Texas, Denton, Texas, United States of America
| | - Chongle Pan
- School of Computer Science, University of Oklahoma, Norman, Oklahoma, United States of America
| | - Xuan Guo
- Department of Computer Science and Engineering, University of North Texas, Denton, Texas, United States of America
| |
Collapse
|
2
|
Fancello L, Burger T. An analysis of proteogenomics and how and when transcriptome-informed reduction of protein databases can enhance eukaryotic proteomics. Genome Biol 2022; 23:132. [PMID: 35725496 PMCID: PMC9208142 DOI: 10.1186/s13059-022-02701-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2021] [Accepted: 06/09/2022] [Indexed: 12/03/2022] Open
Abstract
BACKGROUND Proteogenomics aims to identify variant or unknown proteins in bottom-up proteomics, by searching transcriptome- or genome-derived custom protein databases. However, empirical observations reveal that these large proteogenomic databases produce lower-sensitivity peptide identifications. Various strategies have been proposed to avoid this, including the generation of reduced transcriptome-informed protein databases, which only contain proteins whose transcripts are detected in the sample-matched transcriptome. These were found to increase peptide identification sensitivity. Here, we present a detailed evaluation of this approach. RESULTS We establish that the increased sensitivity in peptide identification is in fact a statistical artifact, directly resulting from the limited capability of target-decoy competition to accurately model incorrect target matches when using excessively small databases. As anti-conservative false discovery rates (FDRs) are likely to hamper the robustness of the resulting biological conclusions, we advocate for alternative FDR control methods that are less sensitive to database size. Nevertheless, reduced transcriptome-informed databases are useful, as they reduce the ambiguity of protein identifications, yielding fewer shared peptides. Furthermore, searching the reference database and subsequently filtering proteins whose transcripts are not expressed reduces protein identification ambiguity to a similar extent, but is more transparent and reproducible. CONCLUSIONS In summary, using transcriptome information is an interesting strategy that has not been promoted for the right reasons. While the increase in peptide identifications from searching reduced transcriptome-informed databases is an artifact caused by the use of an FDR control method unsuitable to excessively small databases, transcriptome information can reduce the ambiguity of protein identifications.
Collapse
Affiliation(s)
- Laura Fancello
- CNRS, CEA, Inserm, BioSanté U1292, Profi FR2048, Université Grenoble Alpes, Grenoble, France
| | - Thomas Burger
- CNRS, CEA, Inserm, BioSanté U1292, Profi FR2048, Université Grenoble Alpes, Grenoble, France.
| |
Collapse
|
3
|
Winkler R. ProtyQuant: Comparing label-free shotgun proteomics datasets using accumulated peptide probabilities. J Proteomics 2020; 230:103985. [PMID: 32956841 DOI: 10.1016/j.jprot.2020.103985] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2020] [Revised: 08/07/2020] [Accepted: 09/10/2020] [Indexed: 11/20/2022]
Abstract
Comparing multiple label-free shotgun proteomics datasets requires various data processing and formatting steps, including peptide-spectrum matching, protein inference, and quantification. Finally, the compilation of results files into a format that allows for downstream analyses. ProtyQuant performs protein inference and quantification calculations, and combines the results of individual datasets into plain text tables. These are lightweight, human-readable, and easy to import into databases or statistical software. ProtyQuant reads validated pepXML from proteomic workflows such as the Trans-Proteomic Pipeline (TPP), which makes it compatible with many commercial and free search engines. For protein inference and quantification, a modified version of the PIPQ program (He et al. 2016) was integrated. In contrast to simple spectral-counting, PIPQ sums up peptide probabilities. For assigning peptides to proteins, three algorithms are available: Multiple Counting, Equal Division, and Linear Programming. The accumulated peptide probabilities (app) are used for both tasks, protein probability estimation, and quantification. ProtyQuant was tested using a reference dataset for label-free shotgun proteomics, obtained from different concentrations of 48 human UPS proteins spiked into yeast lysate. Compared to ProteinProphet, ProtyQuant detected up to 126 (15%) more proteins in the mixture, applying an equal false positive rate (FPR). Using the app values for label-free quantification showed suitable sensitivity and linearity. Strikingly, the app values represent a realistic measure of 'Protein Presence,' an integral concept of protein probability and quantity. ProtyQuant provides a graphical user interface (GUI) and scripts for console-based processing. It is available (GNU GLP v3) for Windows, Linux, and Docker from https://bitbucket.org/lababi/protyquant/. SIGNIFICANCE: Integrating data from multiple shot-gun proteomics experiments overwhelms non-expert researchers. ProtyQuant complements well-established workflows by aiding the comparison of proteins across samples. Importantly, the probability and abundance of proteins are seen from a holistic point of view. The accumulated peptide probability (app) as an integral measure of 'Protein Presence' demonstrated reliable performance for both protein identification and quantification. Using the app as a single measure facilitates the compilation of reports in comparative proteomics.
Collapse
Affiliation(s)
- Robert Winkler
- Center for Research and Advanced Studies (CINVESTAV) Irapuato, Department of Biochemistry and Biotechnology, Km. 9.6 Libramiento Norte Carr. Irapuato-León, 36824 Irapuato, GTO, Mexico.
| |
Collapse
|
4
|
Prieto G, Vázquez J. Protein Probability Model for High-Throughput Protein Identification by Mass Spectrometry-Based Proteomics. J Proteome Res 2020; 19:1285-1297. [PMID: 32037837 DOI: 10.1021/acs.jproteome.9b00819] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
Abstract
Shotgun proteomics is the method of choice for high-throughput protein identification; however, robust statistical methods are essential to automatize this task while minimizing the number of false identifications. The standard method for estimating the false discovery rate (FDR) of individual identifications and keeping it below a threshold (typically 1%) is the target-decoy approach. However, numerous works have shown that FDR at the protein level may become much larger than FDR at the peptide level. The development of an appropriate scoring model to identify proteins from their peptides using high-throughput shotgun proteomics is highly needed. In this study, we present a novel protein-level scoring algorithm that uses the scores of the identified peptides and maintains all of the properties expected for a true protein probability. We also present a refinement of the picked method to calculate FDR at the protein level. These algorithms can be used together as a robust identification workflow suitable for large-scale proteomics, and we show that the identification performance of this workflow is superior to that of other widely used methods in several samples and using different search engines. Our protein probability model offers the scientific community an algorithm that is easy to integrate into protein identification workflows for the automated analysis of shotgun proteomics data.
Collapse
Affiliation(s)
- Gorka Prieto
- Department of Communications Engineering, University of the Basque Country (UPV/EHU), 48013 Bilbao, Spain
| | - Jesús Vázquez
- Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), 28049 Madrid, Spain
| |
Collapse
|
5
|
Zhong J, Wang J, Ding X, Zhang Z, Li M, Wu FX, Pan Y. Protein Inference from the Integration of Tandem MS Data and Interactome Networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 14:1399-1409. [PMID: 28113634 DOI: 10.1109/tcbb.2016.2601618] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Since proteins are digested into a mixture of peptides in the preprocessing step of tandem mass spectrometry (MS), it is difficult to determine which specific protein a shared peptide belongs to. In recent studies, besides tandem MS data and peptide identification information, some other information is exploited to infer proteins. Different from the methods which first use only tandem MS data to infer proteins and then use network information to refine them, this study proposes a protein inference method named TMSIN, which uses interactome networks directly. As two interacting proteins should co-exist, it is reasonable to assume that if one of the interacting proteins is confidently inferred in a sample, its interacting partners should have a high probability in the same sample, too. Therefore, we can use the neighborhood information of a protein in an interactome network to adjust the probability that the shared peptide belongs to the protein. In TMSIN, a multi-weighted graph is constructed by incorporating the bipartite graph with interactome network information, where the bipartite graph is built with the peptide identification information. Based on multi-weighted graphs, TMSIN adopts an iterative workflow to infer proteins. At each iterative step, the probability that a shared peptide belongs to a specific protein is calculated by using the Bayes' law based on the neighbor protein support scores of each protein which are mapped by the shared peptides. We carried out experiments on yeast data and human data to evaluate the performance of TMSIN in terms of ROC, q-value, and accuracy. The experimental results show that AUC scores yielded by TMSIN are 0.742 and 0.874 in yeast dataset and human dataset, respectively, and TMSIN yields the maximum number of true positives when q-value less than or equal to 0.05. The overlap analysis shows that TMSIN is an effective complementary approach for protein inference.
Collapse
|
6
|
Kim M, Eetemadi A, Tagkopoulos I. DeepPep: Deep proteome inference from peptide profiles. PLoS Comput Biol 2017; 13:e1005661. [PMID: 28873403 PMCID: PMC5600403 DOI: 10.1371/journal.pcbi.1005661] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/25/2017] [Revised: 09/15/2017] [Accepted: 06/27/2017] [Indexed: 11/24/2022] Open
Abstract
Protein inference, the identification of the protein set that is the origin of a given peptide profile, is a fundamental challenge in proteomics. We present DeepPep, a deep-convolutional neural network framework that predicts the protein set from a proteomics mixture, given the sequence universe of possible proteins and a target peptide profile. In its core, DeepPep quantifies the change in probabilistic score of peptide-spectrum matches in the presence or absence of a specific protein, hence selecting as candidate proteins with the largest impact to the peptide profile. Application of the method across datasets argues for its competitive predictive ability (AUC of 0.80±0.18, AUPR of 0.84±0.28) in inferring proteins without need of peptide detectability on which the most competitive methods rely. We find that the convolutional neural network architecture outperforms the traditional artificial neural network architectures without convolution layers in protein inference. We expect that similar deep learning architectures that allow learning nonlinear patterns can be further extended to problems in metagenome profiling and cell type inference. The source code of DeepPep and the benchmark datasets used in this study are available at https://deeppep.github.io/DeepPep/. The accurate identification of proteins in a proteomics sample, called the protein inference problem, is a fundamental challenge in biomedical sciences. Current approaches are based on applications of traditional neural networks, linear optimization and Bayesian techniques. We here present DeepPep, a deep-convolutional neural network framework that predicts the protein set from a standard proteomics mixture, given all protein sequences and a peptide profile. Comparison to leading methods shows that DeepPep has most robust performance with various instruments and datasets. Our results provide evidence that using sequence-level location information of a peptide in the context of proteome sequence can result in more accurate and robust protein inference. We conclude that Deep Learning on protein sequence leads to superior platforms for protein inference that can be further refined with additional features and extended for far reaching applications.
Collapse
Affiliation(s)
- Minseung Kim
- Department of Computer Science, University of California, Davis, Davis, California, United States of America
- Genome Center, University of California, Davis, Davis, California, United States of America
| | - Ameen Eetemadi
- Department of Computer Science, University of California, Davis, Davis, California, United States of America
- Genome Center, University of California, Davis, Davis, California, United States of America
| | - Ilias Tagkopoulos
- Department of Computer Science, University of California, Davis, Davis, California, United States of America
- Genome Center, University of California, Davis, Davis, California, United States of America
- * E-mail:
| |
Collapse
|
7
|
Kepplinger D, Takhar M, Sasaki M, Hollander Z, Smith D, McManus B, McMaster WR, Ng RT, Cohen Freue GV. PGCA: An algorithm to link protein groups created from MS/MS data. PLoS One 2017; 12:e0177569. [PMID: 28562641 PMCID: PMC5451011 DOI: 10.1371/journal.pone.0177569] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2016] [Accepted: 04/28/2017] [Indexed: 11/19/2022] Open
Abstract
The quantitation of proteins using shotgun proteomics has gained popularity in the last decades, simplifying sample handling procedures, removing extensive protein separation steps and achieving a relatively high throughput readout. The process starts with the digestion of the protein mixture into peptides, which are then separated by liquid chromatography and sequenced by tandem mass spectrometry (MS/MS). At the end of the workflow, recovering the identity of the proteins originally present in the sample is often a difficult and ambiguous process, because more than one protein identifier may match a set of peptides identified from the MS/MS spectra. To address this identification problem, many MS/MS data processing software tools combine all plausible protein identifiers matching a common set of peptides into a protein group. However, this solution introduces new challenges in studies with multiple experimental runs, which can be characterized by three main factors: i) protein groups' identifiers are local, i.e., they vary run to run, ii) the composition of each group may change across runs, and iii) the supporting evidence of proteins within each group may also change across runs. Since in general there is no conclusive evidence about the absence of proteins in the groups, protein groups need to be linked across different runs in subsequent statistical analyses. We propose an algorithm, called Protein Group Code Algorithm (PGCA), to link groups from multiple experimental runs by forming global protein groups from connected local groups. The algorithm is computationally inexpensive and enables the connection and analysis of lists of protein groups across runs needed in biomarkers studies. We illustrate the identification problem and the stability of the PGCA mapping using 65 iTRAQ experimental runs. Further, we use two biomarker studies to show how PGCA enables the discovery of relevant candidate protein group markers with similar but non-identical compositions in different runs.
Collapse
Affiliation(s)
- David Kepplinger
- Department of Statistics, University of British Columbia, Vancouver, British Columbia, Canada
| | - Mandeep Takhar
- NCE CECR PROOF Centre of Excellence, Vancouver, British Columbia, Canada
| | - Mayu Sasaki
- NCE CECR PROOF Centre of Excellence, Vancouver, British Columbia, Canada
| | | | - Derek Smith
- University of Victoria - Genome BC Proteomics Centre, Victoria, British Columbia, Canada
| | - Bruce McManus
- NCE CECR PROOF Centre of Excellence, Vancouver, British Columbia, Canada
| | - W. Robert McMaster
- Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada
| | - Raymond T. Ng
- NCE CECR PROOF Centre of Excellence, Vancouver, British Columbia, Canada
- Department of Computer Science, University of British Columbia, Vancouver, British Columbia, Canada
| | - Gabriela V. Cohen Freue
- Department of Statistics, University of British Columbia, Vancouver, British Columbia, Canada
- * E-mail:
| |
Collapse
|
8
|
Audain E, Uszkoreit J, Sachsenberg T, Pfeuffer J, Liang X, Hermjakob H, Sanchez A, Eisenacher M, Reinert K, Tabb DL, Kohlbacher O, Perez-Riverol Y. In-depth analysis of protein inference algorithms using multiple search engines and well-defined metrics. J Proteomics 2017; 150:170-182. [DOI: 10.1016/j.jprot.2016.08.002] [Citation(s) in RCA: 47] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2016] [Revised: 07/30/2016] [Accepted: 08/02/2016] [Indexed: 12/24/2022]
|
9
|
Protein inference: A protein quantification perspective. Comput Biol Chem 2016; 63:21-29. [DOI: 10.1016/j.compbiolchem.2016.02.006] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2016] [Accepted: 02/01/2016] [Indexed: 01/04/2023]
|
10
|
Computational Methods in Mass Spectrometry-Based Proteomics. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2016; 939:63-89. [PMID: 27807744 DOI: 10.1007/978-981-10-1503-8_4] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
Abstract
This chapter introduces computational methods used in mass spectrometry-based proteomics, including those for addressing the critical problems such as peptide identification and protein inference, peptide and protein quantification, characterization of posttranslational modifications (PTMs), and data-independent acquisitions (DIA). The chapter concludes with emerging applications of proteomic techniques, such as metaproteomics, glycoproteomics, and proteogenomics.
Collapse
|
11
|
He Z, Huang T, Zhao C, Teng B. Protein Inference. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2016; 919:237-242. [PMID: 27975221 DOI: 10.1007/978-3-319-41448-5_12] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/02/2023]
Abstract
Protein inference is one of the most important steps in protein identification, which transforms peptides identified from tandem mass spectra into a list of proteins. In this chapter, we provide a brief introduction on this problem and present a short summary on the existing protein inference methods in the literature.
Collapse
Affiliation(s)
- Zengyou He
- School of Software, Dalian University of Technology, Dalian, China.
| | - Ting Huang
- College of Computer and Information Science, Northeastern University, Boston, MA, USA
| | - Can Zhao
- School of Software, Dalian University of Technology, Dalian, China
| | - Ben Teng
- School of Software, Dalian University of Technology, Dalian, China
| |
Collapse
|
12
|
Zhao C, Liu D, Teng B, He Z. BagReg: Protein inference through machine learning. Comput Biol Chem 2015; 57:12-20. [DOI: 10.1016/j.compbiolchem.2015.02.009] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2014] [Accepted: 02/03/2015] [Indexed: 10/24/2022]
|
13
|
Kelchtermans P, Bittremieux W, De Grave K, Degroeve S, Ramon J, Laukens K, Valkenborg D, Barsnes H, Martens L. Machine learning applications in proteomics research: how the past can boost the future. Proteomics 2014; 14:353-66. [PMID: 24323524 DOI: 10.1002/pmic.201300289] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2013] [Revised: 09/24/2013] [Accepted: 10/14/2013] [Indexed: 01/22/2023]
Abstract
Machine learning is a subdiscipline within artificial intelligence that focuses on algorithms that allow computers to learn solving a (complex) problem from existing data. This ability can be used to generate a solution to a particularly intractable problem, given that enough data are available to train and subsequently evaluate an algorithm on. Since MS-based proteomics has no shortage of complex problems, and since publicly available data are becoming available in ever growing amounts, machine learning is fast becoming a very popular tool in the field. We here therefore present an overview of the different applications of machine learning in proteomics that together cover nearly the entire wet- and dry-lab workflow, and that address key bottlenecks in experiment planning and design, as well as in data processing and analysis.
Collapse
Affiliation(s)
- Pieter Kelchtermans
- Department of Medical Protein Research, VIB, Ghent, Belgium; Faculty of Medicine and Health Sciences, Department of Biochemistry, Ghent University, Ghent, Belgium; Flemish Institute for Technological Research (VITO), Boeretang, Mol, Belgium
| | | | | | | | | | | | | | | | | |
Collapse
|
14
|
Abstract
MOTIVATION Statistical validation of protein identifications is an important issue in shotgun proteomics. The false discovery rate (FDR) is a powerful statistical tool for evaluating the protein identification result. Several research efforts have been made for FDR estimation at the protein level. However, there are still certain drawbacks in the existing FDR estimation methods based on the target-decoy strategy. RESULTS In this article, we propose a decoy-free protein-level FDR estimation method. Under the null hypothesis that each candidate protein matches an identified peptide totally at random, we assign statistical significance to protein identifications in terms of the permutation P-value and use these P-values to calculate the FDR. Our method consists of three key steps: (i) generating random bipartite graphs with the same structure; (ii) calculating the protein scores on these random graphs; and (iii) calculating the permutation P value and final FDR. As it is time-consuming or prohibitive to execute the protein inference algorithms for thousands of times in step ii, we first train a linear regression model using the original bipartite graph and identification scores provided by the target inference algorithm. Then we use the learned regression model as a substitute of original protein inference method to predict protein scores on shuffled graphs. We test our method on six public available datasets. The results show that our method is comparable with those state-of-the-art algorithms in terms of estimation accuracy. AVAILABILITY The source code of our algorithm is available at: https://sourceforge.net/projects/plfdr/
Collapse
Affiliation(s)
- Ben Teng
- School of Software, Dalian University of Technology, Dalian 116621, China
| | | | | |
Collapse
|
15
|
|