1
Feng S, Ji HL, Wang H, Zhang B, Sterzenbach R, Pan C, Guo X. MetaLP: An integrative linear programming method for protein inference in metaproteomics. PLoS Comput Biol 2022; 18:e1010603. [PMID: 36269761 PMCID: PMC9629623 DOI: 10.1371/journal.pcbi.1010603] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2022] [Revised: 11/02/2022] [Accepted: 09/26/2022] [Indexed: 11/07/2022] [Open access]
Abstract
Metaproteomics based on high-throughput tandem mass spectrometry (MS/MS) plays a crucial role in characterizing microbiome functions. The acquired MS/MS data are searched against a protein sequence database to identify peptides, which are then used to infer a list of proteins present in a metaproteome sample. While the problem of protein inference has been well studied for proteomics of single organisms, it remains a major challenge for metaproteomics of complex microbial communities because of the large number of degenerate peptides shared among homologous proteins in different organisms. This challenge calls for improved discrimination of true protein identifications from false protein identifications given a set of unique and degenerate peptides identified in metaproteomics. MetaLP was developed here for protein inference in metaproteomics using an integrative linear programming method. Taxonomic abundance information extracted from metagenomics shotgun sequencing or 16S rRNA gene amplicon sequencing was incorporated as prior information in MetaLP. Benchmarking with mock, human gut, soil, and marine microbial communities demonstrated significantly higher numbers of protein identifications by MetaLP than by ProteinLP, PeptideProphet, DeepPep, PIPQ, and Sipros Ensemble. In conclusion, MetaLP could substantially improve protein inference for complex metaproteomes by incorporating taxonomic abundance information in a linear programming model.
Affiliation(s)
- Shichao Feng
  - Department of Computer Science and Engineering, University of North Texas, Denton, Texas, United States of America
- Hong-Long Ji
  - Department of Cellular and Molecular Biology, University of Texas at Tyler, Tyler, Texas, United States of America
  - Texas Lung Injury Institute, University of Texas at Tyler, Tyler, Texas, United States of America
- Huan Wang
  - College of Informatics, Huazhong Agricultural University, Wuhan, Hubei, China
- Bailu Zhang
  - Department of Computer Science and Engineering, University of North Texas, Denton, Texas, United States of America
- Ryan Sterzenbach
  - Department of Computer Science and Engineering, University of North Texas, Denton, Texas, United States of America
  - Department of Biomedical Engineering, University of North Texas, Denton, Texas, United States of America
- Chongle Pan
  - School of Computer Science, University of Oklahoma, Norman, Oklahoma, United States of America
- Xuan Guo
  - Department of Computer Science and Engineering, University of North Texas, Denton, Texas, United States of America
2
Hayati M, Chindelevitch L. Computing the distribution of the Robinson-Foulds distance. Comput Biol Chem 2020; 87:107284. [PMID: 32599459 DOI: 10.1016/j.compbiolchem.2020.107284] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/17/2020] [Accepted: 05/09/2020] [Indexed: 11/22/2022]
Abstract
With the exponential growth of genome databases, the importance of phylogenetics has increased dramatically over the past years. Studying phylogenetic trees not only enables us to understand how genes, genomes, and species evolve, but also helps us predict how they might change in the future. One of the crucial aspects of phylogenetics is the comparison of two or more phylogenetic trees. There are different metrics for computing the dissimilarity between a pair of trees. The Robinson-Foulds (RF) distance is one of the most widely used metrics on the space of labeled trees. The distribution of the RF distance from a given tree has been studied before, but the fastest known algorithm for computing this distribution is a slow, albeit polynomial-time, O(l^5) algorithm. In this paper, we modify the dynamic programming algorithm for computing the distribution of this distance for a given tree by leveraging the number-theoretic transform (NTT), and improve the running time from O(l^5) to O(l^3 log l), where l is the number of tips of the tree. In addition to its practical usefulness, our method represents a theoretical novelty, as it is, to our knowledge, one of the rare applications of the number-theoretic transform for solving a computational biology problem.
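The abstract credits the number-theoretic transform for the speedup but does not show it. As a rough illustration only, the sketch below multiplies two integer polynomials with an NTT, which is the kind of exact convolution primitive such a dynamic program could rely on. It is not the authors' algorithm; the modulus 998244353, its primitive root 3, and the function names `ntt` and `multiply` are assumptions chosen for this example.

```python
# Illustrative sketch, not the paper's method: exact polynomial
# multiplication via the number-theoretic transform (NTT).
MOD = 998244353  # assumed prime of the form c * 2^k + 1
ROOT = 3         # a primitive root modulo MOD


def ntt(a, invert=False):
    """In-place iterative NTT; len(a) must be a power of two."""
    n = len(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly passes over blocks of doubling length.
    length = 2
    while length <= n:
        w_len = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u = a[k]
                v = a[k + half] * w % MOD
                a[k] = (u + v) % MOD
                a[k + half] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        for i in range(n):
            a[i] = a[i] * n_inv % MOD


def multiply(p, q):
    """Coefficients of p * q modulo MOD, in O(n log n) time."""
    out_len = len(p) + len(q) - 1
    size = 1
    while size < out_len:
        size <<= 1
    fa = list(p) + [0] * (size - len(p))
    fb = list(q) + [0] * (size - len(q))
    ntt(fa)
    ntt(fb)
    fc = [x * y % MOD for x, y in zip(fa, fb)]
    ntt(fc, invert=True)
    return fc[:out_len]


# (1 + 2x + 3x^2)(4 + 5x) = 4 + 13x + 22x^2 + 15x^3
print(multiply([1, 2, 3], [4, 5]))
```

Because all arithmetic is modular, the result is exact as long as the true coefficients stay below the modulus, which is one reason an NTT can be preferred over a floating-point FFT when counting combinatorial objects.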
Affiliation(s)
- Maryam Hayati
  - Simon Fraser University, Department of Computing Science, 8888 University Avenue, Burnaby, BC V5A 1S6, Canada
- Leonid Chindelevitch
  - Simon Fraser University, Department of Computing Science, 8888 University Avenue, Burnaby, BC V5A 1S6, Canada
3
Pfeuffer J, Sachsenberg T, Dijkstra TMH, Serang O, Reinert K, Kohlbacher O. EPIFANY: A Method for Efficient High-Confidence Protein Inference. J Proteome Res 2020; 19:1060-1072. [PMID: 31975601 DOI: 10.1021/acs.jproteome.9b00566] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Indexed: 02/07/2023]
Abstract
Accurate protein inference in the presence of shared peptides is still one of the key problems in bottom-up proteomics. Most protein inference tools employing simple heuristic inference strategies are efficient but exhibit reduced accuracy. More advanced probabilistic methods often exhibit better inference quality but tend to be too slow for large data sets. Here, we present a novel protein inference method, EPIFANY, combining a loopy belief propagation algorithm with convolution trees for efficient processing of Bayesian networks. We demonstrate that EPIFANY combines the reliable protein inference of Bayesian methods with significantly shorter runtimes. On the 2016 iPRG protein inference benchmark data, EPIFANY is the only tested method that finds all true-positive proteins at a 5% protein false discovery rate (FDR) without strict prefiltering on the peptide-spectrum match (PSM) level, yielding an increase in identification performance (+10% in the number of true positives and +14% in partial AUC) compared to previous approaches. Even very large data sets with hundreds of thousands of spectra (which are intractable with other Bayesian and some non-Bayesian tools) can be processed with EPIFANY within minutes. The increased inference quality including shared peptides results in better protein inference results and thus increased robustness of the biological hypotheses generated. EPIFANY is available as open-source software for all major platforms at https://OpenMS.de/epifany.
Affiliation(s)
- Julianus Pfeuffer
  - Applied Bioinformatics, Department of Computer Science, University of Tübingen, 72076 Tübingen, Germany
  - Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany
  - Algorithmic Bioinformatics, Department of Bioinformatics, Freie Universität Berlin, 14195 Berlin, Germany
- Timo Sachsenberg
  - Applied Bioinformatics, Department of Computer Science, University of Tübingen, 72076 Tübingen, Germany
  - Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany
- Tjeerd M H Dijkstra
  - Biomolecular Interactions, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
- Oliver Serang
  - Department of Computer Science, University of Montana, Missoula, Montana 59812, United States
- Knut Reinert
  - Algorithmic Bioinformatics, Department of Bioinformatics, Freie Universität Berlin, 14195 Berlin, Germany
- Oliver Kohlbacher
  - Applied Bioinformatics, Department of Computer Science, University of Tübingen, 72076 Tübingen, Germany
  - Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany
  - Biomolecular Interactions, Max Planck Institute for Developmental Biology, 72076 Tübingen, Germany
  - Institute for Translational Bioinformatics, University Hospital Tübingen, 72076 Tübingen, Germany
  - Quantitative Biology Center, University of Tübingen, 72076 Tübingen, Germany
4
Alves G, Yu YK. Robust Accurate Identification and Biomass Estimates of Microorganisms via Tandem Mass Spectrometry. J Am Soc Mass Spectrom 2020; 31:85-102. [PMID: 32881514 PMCID: PMC10501333 DOI: 10.1021/jasms.9b00035] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 06/11/2023]
Abstract
Rapid and accurate identification of microorganisms and estimation of their biomasses are of extreme importance to public health. Mass spectrometry has become an important technique for these purposes. Previously we published a workflow named Microorganism Classification and Identification (MiCId v.12.26.2017) that was shown to perform no worse than other workflows. This manuscript presents MiCId v.12.13.2018 that, in comparison with the earlier version v.12.26.2017, allows for biomass estimates, provides more accurate microorganism identifications (better controls the number of false positives), and is robust against database size increase. This significant advance is made possible by several new ingredients: first, we apply a modified expectation-maximization method to compute, for each taxon considered, a prior probability that can be used for biomass estimation; second, we introduce a new concept called ownership, through which the participation ratio is computed and used as the number of taxa to be kept within a cluster of closely related taxa; third, based on confidently identified peptides, we calculate for each taxon its degree of independence from the rest of the taxa considered to determine whether or not to split this taxon off the cluster. Using 270 data files, each containing a large number of MS/MS spectra, we show that, in comparison with v.12.26.2017, version v.12.13.2018 yields superior retrieval results. We also show that MiCId v.12.13.2018 can estimate species biomass reasonably well. The new MiCId v.12.13.2018, designed to run in a Linux environment, is freely available for download at https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads.html.
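The abstract mentions a modified expectation-maximization step that assigns each taxon a prior probability, without giving details. The sketch below is a plain, textbook EM for mixture proportions when peptides may be shared between taxa. It is offered only to make the idea concrete and is not MiCId's modified procedure; the function name `em_taxon_priors` and the toy peptide-to-taxon assignments are hypothetical.

```python
# Illustrative sketch only: standard EM for taxon proportions,
# NOT the modified EM used by MiCId.
def em_taxon_priors(peptide_taxa, n_iter=200):
    """peptide_taxa: list of sets; each set holds the taxa a peptide maps to."""
    taxa = sorted({t for ts in peptide_taxa for t in ts})
    prior = {t: 1.0 / len(taxa) for t in taxa}  # uniform start
    n = len(peptide_taxa)
    for _ in range(n_iter):
        counts = {t: 0.0 for t in taxa}
        # E-step: split each peptide among its candidate taxa
        # in proportion to the current priors.
        for ts in peptide_taxa:
            total = sum(prior[t] for t in ts)
            if total == 0.0:
                continue
            for t in ts:
                counts[t] += prior[t] / total
        # M-step: the new prior is the expected share of peptides.
        prior = {t: counts[t] / n for t in taxa}
    return prior


# Hypothetical toy data: three peptides unique to A, one unique to B,
# and one shared between A and B.
peptides = [{"A"}, {"A"}, {"A"}, {"A", "B"}, {"B"}]
print(em_taxon_priors(peptides))
```

In this toy case the shared peptide is credited mostly to taxon A, because A already has more unique evidence; the priors settle at roughly 0.75 for A and 0.25 for B.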
Affiliation(s)
- Gelio Alves
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, United States
- Yi-Kuo Yu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, United States
5
Peyrard N, Cros M, Givry S, Franc A, Robin S, Sabbadin R, Schiex T, Vignes M. Exact or approximate inference in graphical models: why the choice is dictated by the treewidth, and how variable elimination can be exploited. Aust N Z J Stat 2019. [DOI: 10.1111/anzs.12257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Affiliation(s)
- N. Peyrard
  - INRA UR 875 MIAT, Chemin de Borde Rouge, 31326 Castanet-Tolosan, France
- M.-J. Cros
  - INRA UR 875 MIAT, Chemin de Borde Rouge, 31326 Castanet-Tolosan, France
- S. Givry
  - INRA UR 875 MIAT, Chemin de Borde Rouge, 31326 Castanet-Tolosan, France
- A. Franc
  - INRA UMR 1202 Biodiversité, Gènes et Communautés, 69 route d'Arcachon, Pierroton, 33612 Cestas Cedex, France
- S. Robin
  - AgroParisTech UMR 518 MIA, 16 rue Claude Bernard, Paris 5e, France
  - INRA UMR 518 MIA, 16 rue Claude Bernard, Paris 5e, France
- R. Sabbadin
  - INRA UR 875 MIAT, Chemin de Borde Rouge, 31326 Castanet-Tolosan, France
- T. Schiex
  - INRA UR 875 MIAT, Chemin de Borde Rouge, 31326 Castanet-Tolosan, France
- M. Vignes
  - Institute of Fundamental Sciences, Massey University, Palmerston North, New Zealand
6
Assessing species biomass contributions in microbial communities via metaproteomics. Nat Commun 2017; 8:1558. [PMID: 29146960 PMCID: PMC5691128 DOI: 10.1038/s41467-017-01544-x] [Citation(s) in RCA: 139] [Impact Index Per Article: 19.9] [Received: 05/22/2017] [Accepted: 09/26/2017] [Indexed: 12/13/2022] [Open access]
Abstract
Microbial community structure can be analyzed by quantifying cell numbers or by quantifying biomass for individual populations. Methods for quantifying cell numbers are already available (e.g., fluorescence in situ hybridization, 16S rRNA gene amplicon sequencing), yet high-throughput methods for assessing community structure in terms of biomass are lacking. Here we present metaproteomics-based methods for assessing microbial community structure using protein abundance as a measure for biomass contributions of individual populations. We optimize the accuracy and sensitivity of the method using artificially assembled microbial communities and show that it is less prone to some of the biases found in sequencing-based methods. We apply the method to communities from two different environments, microbial mats from two alkaline soda lakes, and saliva from multiple individuals. We show that assessment of species biomass contributions adds an important dimension to the analysis of microbial community structure. Convenient methods for assessing microbial community structure in terms of biomass are lacking. Here, the authors present a metaproteomics-based approach for assessing microbial community structure using protein abundance as a measure for biomass contributions of individual populations.
7
Kim M, Eetemadi A, Tagkopoulos I. DeepPep: Deep proteome inference from peptide profiles. PLoS Comput Biol 2017; 13:e1005661. [PMID: 28873403 PMCID: PMC5600403 DOI: 10.1371/journal.pcbi.1005661] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 01/25/2017] [Revised: 09/15/2017] [Accepted: 06/27/2017] [Indexed: 11/24/2022] [Open access]
Abstract
Protein inference, the identification of the protein set that is the origin of a given peptide profile, is a fundamental challenge in proteomics. We present DeepPep, a deep-convolutional neural network framework that predicts the protein set from a proteomics mixture, given the sequence universe of possible proteins and a target peptide profile. At its core, DeepPep quantifies the change in the probabilistic score of peptide-spectrum matches in the presence or absence of a specific protein, and selects as candidates the proteins with the largest impact on the peptide profile. Application of the method across datasets argues for its competitive predictive ability (AUC of 0.80±0.18, AUPR of 0.84±0.28) in inferring proteins without the need for peptide detectability, on which the most competitive methods rely. We find that the convolutional neural network architecture outperforms the traditional artificial neural network architectures without convolution layers in protein inference. We expect that similar deep learning architectures that allow learning nonlinear patterns can be further extended to problems in metagenome profiling and cell type inference. The source code of DeepPep and the benchmark datasets used in this study are available at https://deeppep.github.io/DeepPep/.
The accurate identification of proteins in a proteomics sample, called the protein inference problem, is a fundamental challenge in biomedical sciences. Current approaches are based on applications of traditional neural networks, linear optimization and Bayesian techniques. We here present DeepPep, a deep-convolutional neural network framework that predicts the protein set from a standard proteomics mixture, given all protein sequences and a peptide profile. Comparison to leading methods shows that DeepPep has the most robust performance across various instruments and datasets. Our results provide evidence that using sequence-level location information of a peptide in the context of the proteome sequence can result in more accurate and robust protein inference. We conclude that deep learning on protein sequence leads to superior platforms for protein inference that can be further refined with additional features and extended for far-reaching applications.
Affiliation(s)
- Minseung Kim
  - Department of Computer Science, University of California, Davis, Davis, California, United States of America
  - Genome Center, University of California, Davis, Davis, California, United States of America
- Ameen Eetemadi
  - Department of Computer Science, University of California, Davis, Davis, California, United States of America
  - Genome Center, University of California, Davis, Davis, California, United States of America
- Ilias Tagkopoulos
  - Department of Computer Science, University of California, Davis, Davis, California, United States of America
  - Genome Center, University of California, Davis, Davis, California, United States of America
8
Serang O, Käll L. Solution to Statistical Challenges in Proteomics Is More Statistics, Not Less. J Proteome Res 2015; 14:4099-103. [PMID: 26257019 DOI: 10.1021/acs.jproteome.5b00568] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Indexed: 11/29/2022]
Abstract
In any high-throughput scientific study, it is often essential to estimate the percent of findings that are actually incorrect. This percentage is called the false discovery rate (abbreviated "FDR"), and it is an invariant (albeit, often unknown) quantity for any well-formed study. In proteomics, it has become common practice to incorrectly conflate the protein FDR (the percent of identified proteins that are actually absent) with protein-level target-decoy, a particular method for estimating the protein-level FDR. In this manner, the challenges of one approach have been used as the basis for an argument that the field should abstain from protein-level FDR analysis altogether or even the suggestion that the very notion of a protein FDR is flawed. As we demonstrate in simple but accurate simulations, not only is the protein-level FDR an invariant concept, when analyzing large data sets, the failure to properly acknowledge it or to correct for multiple testing can result in large, unrecognized errors, whereby thousands of absent proteins (and, potentially every protein in the FASTA database being considered) can be incorrectly identified.
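The abstract separates the protein FDR as a quantity (the fraction of reported proteins that are actually absent) from target-decoy as one way of estimating it. A minimal sketch of that distinction follows, assuming a simulation where the truly present proteins are known. The function names, the toy protein labels, and the simple decoys-over-targets estimator are illustrative assumptions, not the simulations from the paper.

```python
# Illustrative sketch: the FDR as a definition versus one common estimator.
def true_fdr(identified, present):
    """Fraction of identified proteins that are actually absent."""
    identified = set(identified)
    if not identified:
        return 0.0
    false_hits = identified - set(present)
    return len(false_hits) / len(identified)


def target_decoy_fdr(scored_hits, threshold):
    """Estimate the FDR as decoys / targets among hits at or above threshold.

    scored_hits: list of (score, is_decoy) pairs.
    """
    targets = sum(1 for s, d in scored_hits if s >= threshold and not d)
    decoys = sum(1 for s, d in scored_hits if s >= threshold and d)
    return decoys / targets if targets else 0.0


# Hypothetical toy data: four proteins reported, three truly present.
print(true_fdr({"P1", "P2", "P3", "P4"}, {"P1", "P2", "P3"}))

hits = [(0.9, False), (0.8, False), (0.7, True), (0.6, False), (0.2, True)]
print(target_decoy_fdr(hits, threshold=0.5))
```

The first function can only be evaluated when ground truth is known, as in a simulation; the second is an estimate that can be computed on real data, and the two need not agree.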
Affiliation(s)
- Oliver Serang
  - Department of Informatik, Freie Universität Berlin, Takustr. 9, Berlin 14195, Germany
  - Leibniz-Institute for Freshwater Ecology and Inland Fisheries (IGB), Müggelseedamm 310, Berlin 12587, Germany
- Lukas Käll
  - Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology - KTH, Tomtebodavägen 23A, Solna SE-171 21, Sweden
9
Serang O. A Fast Numerical Method for Max-Convolution and the Application to Efficient Max-Product Inference in Bayesian Networks. J Comput Biol 2015; 22:770-83. [PMID: 26161499 DOI: 10.1089/cmb.2015.0013] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/12/2022] [Open access]
Abstract
Observations depending on sums of random variables are common throughout many fields; however, no efficient solution is currently known for performing max-product inference on these sums of general discrete distributions (max-product inference can be used to obtain maximum a posteriori estimates). The limiting step to max-product inference is the max-convolution problem (sometimes presented in log-transformed form and denoted as "infimal convolution," "min-convolution," or "convolution on the tropical semiring"), for which no O(k log(k)) method is currently known. Presented here is an O(k log(k)) numerical method for estimating the max-convolution of two nonnegative vectors (e.g., two probability mass functions), where k is the length of the larger vector. This numerical max-convolution method is then demonstrated by performing fast max-product inference on a convolution tree, a data structure for performing fast inference given information on the sum of n discrete random variables in O(nk log(nk) log(n)) steps (where each random variable has an arbitrary prior distribution on k contiguous possible states). The numerical max-convolution method can be applied to specialized classes of hidden Markov models to reduce the runtime of computing the Viterbi path from nk^2 to nk log(k), and has potential application to the all-pairs shortest paths problem.
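To make the max-convolution problem concrete, the sketch below gives the exact quadratic-time definition and, next to it, a p-norm approximation in which the maximum over terms is replaced by a large-p norm so that the inner operation becomes an ordinary convolution. The abstract does not spell out the numerical method, so treating it as a p-norm approximation is an assumption here. The ordinary convolution is written naively for clarity, so this sketch does not achieve O(k log(k)); a fast version would compute that step with an FFT. The function names and the choice p = 64 are illustrative.

```python
# Illustrative sketch: exact max-convolution and a p-norm approximation.
def max_convolve(x, y):
    """Exact max-convolution: out[m] = max over i + j = m of x[i] * y[j]."""
    out = [0.0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] = max(out[i + j], xi * yj)
    return out


def pnorm_max_convolve(x, y, p=64.0):
    """Approximate the max by a p-norm: max_i u_i ~ (sum_i u_i^p)^(1/p).

    The sum is an ordinary convolution of x^p and y^p, done naively here;
    an FFT could replace the double loop (assumption, not shown).
    """
    xp = [v ** p for v in x]
    yp = [v ** p for v in y]
    out = [0.0] * (len(x) + len(y) - 1)
    for i, a in enumerate(xp):
        for j, b in enumerate(yp):
            out[i + j] += a * b
    return [s ** (1.0 / p) for s in out]


x = [0.2, 0.5, 0.3]
y = [0.6, 0.4]
print(max_convolve(x, y))        # approximately [0.12, 0.3, 0.2, 0.12]
print(pnorm_max_convolve(x, y))  # close to the exact values
```

Larger p tightens the approximation but pushes small entries toward floating-point underflow, which is the practical trade-off any such numerical scheme has to manage.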
Affiliation(s)
- Oliver Serang
  - Department of Informatik, Freie Universität Berlin, Berlin, Germany
  - Leibniz-Institute of Freshwater Ecology and Inland Fisheries, Berlin, Germany
10
Pineda AL, Gopalakrishnan V. Novel Application of Junction Trees to the Interpretation of Epigenetic Differences among Lung Cancer Subtypes. AMIA Jt Summits Transl Sci Proc 2015; 2015:31-5. [PMID: 26306226 PMCID: PMC4525224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
Abstract
In this era of precision medicine, understanding the epigenetic differences in lung cancer subtypes could lead to personalized therapies by possibly reversing these alterations. Traditional methods for analyzing microarray data rely on the use of known pathways. We propose a novel workflow, called the Junction trees to Knowledge (J2K) framework, for creating interpretable graphical representations that can be derived directly from in silico analysis of microarray data. Our workflow has three steps: preprocessing (discretization and feature selection), construction of a Bayesian network, and its subsequent transformation into a junction tree. We used data from the Cancer Genome Atlas to perform preliminary analyses of this J2K framework. We found relevant cliques of methylated sites that are junctions of the network, along with potential methylation biomarkers in lung cancer pathogenesis.
Affiliation(s)
- Arturo Lopez Pineda
  - The PRoBE Lab, Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA
- Vanathi Gopalakrishnan
  - The PRoBE Lab, Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA