1
Tariq U, Saeed F. Predicting peptide properties from mass spectrometry data using deep attention-based multitask network and uncertainty quantification. bioRxiv 2024:2024.08.21.609035. [PMID: 39229185 PMCID: PMC11370541 DOI: 10.1101/2024.08.21.609035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2024]
Abstract
Database search algorithms reduce the number of potential candidate peptides against which scoring needs to be performed using a single (i.e. mass) property for filtering. While useful, filtering based on one property may lead to exclusion of non-abundant spectra and uncharacterized peptides - potentially exacerbating the streetlight effect. Here we present ProteoRift, a novel attention and multitask deep-network, which can predict multiple peptide properties (length, missed cleavages, and modification status) directly from spectra. We demonstrate that ProteoRift can predict these properties with up to 97% accuracy resulting in search-space reduction by more than 90%. As a result, our end-to-end pipeline is shown to exhibit 8x to 12x speedups with peptide deduction accuracy comparable to algorithmic techniques. We also formulate two uncertainty estimation metrics, which can distinguish between in-distribution and out-of-distribution data (ROC-AUC 0.99) and predict high-scoring mass spectra against correct peptide (ROC-AUC 0.94). These models and metrics are integrated in an end-to-end ML pipeline available at https://github.com/pcdslab/ProteoRift.
Affiliation(s)
- Usman Tariq
  - Knight Foundation School of Computing and Information Sciences, Florida International University (FIU), Miami, FL, USA
- Fahad Saeed
  - Knight Foundation School of Computing and Information Sciences, Florida International University (FIU), Miami, FL, USA
  - Biomolecular Sciences Institute (BSI), Florida International University, Miami, FL, USA
  - Department of Human and Molecular Genetics, Herbert Wertheim School of Medicine, Florida International University, Miami, FL, USA
2
Brechtmann F, Bechtler T, Londhe S, Mertes C, Gagneur J. Evaluation of input data modality choices on functional gene embeddings. NAR Genom Bioinform 2023; 5:lqad095. [PMID: 37942285 PMCID: PMC10629286 DOI: 10.1093/nargab/lqad095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Revised: 09/07/2023] [Accepted: 09/28/2023] [Indexed: 11/10/2023]
Abstract
Functional gene embeddings, numerical vectors capturing gene function, provide a promising way to integrate functional gene information into machine learning models. These embeddings are learnt by applying self-supervised machine-learning algorithms on various data types including quantitative omics measurements, protein-protein interaction networks and literature. However, downstream evaluations comparing alternative data modalities used to construct functional gene embeddings have been lacking. Here we benchmarked functional gene embeddings obtained from various data modalities for predicting disease-gene lists, cancer drivers, phenotype-gene associations and scores from genome-wide association studies. Off-the-shelf predictors trained on precomputed embeddings matched or outperformed dedicated state-of-the-art predictors, demonstrating their high utility. Embeddings based on literature and protein-protein interactions inferred from low-throughput experiments outperformed embeddings derived from genome-wide experimental data (transcriptomics, deletion screens and protein sequence) when predicting curated gene lists. In contrast, they did not perform better when predicting genome-wide association signals and were biased towards highly-studied genes. These results indicate that embeddings derived from literature and low-throughput experiments appear favourable in many existing benchmarks because they are biased towards well-studied genes and should therefore be considered with caution. Altogether, our study and precomputed embeddings will facilitate the development of machine-learning models in genetics and related fields.
Affiliation(s)
- Felix Brechtmann
  - TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Munich Center for Machine Learning, Munich, Germany
- Thibault Bechtler
  - TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Shubhankar Londhe
  - TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Christian Mertes
  - TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Munich Data Science Institute, Technical University of Munich, Garching, Germany
  - Institute of Human Genetics, School of Medicine, Technical University of Munich, Munich, Germany
- Julien Gagneur
  - TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Institute of Human Genetics, School of Medicine, Technical University of Munich, Munich, Germany
  - Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany
3
Liu L, Jones BF, Uzzi B, Wang D. Data, measurement and empirical methods in the science of science. Nat Hum Behav 2023:10.1038/s41562-023-01562-4. [PMID: 37264084 DOI: 10.1038/s41562-023-01562-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/30/2022] [Accepted: 02/17/2023] [Indexed: 06/03/2023]
Abstract
The advent of large-scale datasets that trace the workings of science has encouraged researchers from many different disciplinary backgrounds to turn scientific methods into science itself, cultivating a rapidly expanding 'science of science'. This Review considers this growing, multidisciplinary literature through the lens of data, measurement and empirical methods. We discuss the purposes, strengths and limitations of major empirical approaches, seeking to increase understanding of the field's diverse methodologies and expand researchers' toolkits. Overall, new empirical developments provide enormous capacity to test traditional beliefs and conceptual frameworks about science, discover factors associated with scientific productivity, predict scientific outcomes and design policies that facilitate scientific progress.
Affiliation(s)
- Lu Liu
  - Center for Science of Science and Innovation, Northwestern University, Evanston, IL, USA
  - Northwestern Institute on Complex Systems, Northwestern University, Evanston, IL, USA
  - Kellogg School of Management, Northwestern University, Evanston, IL, USA
  - College of Information Sciences and Technology, Pennsylvania State University, University Park, PA, USA
- Benjamin F Jones
  - Center for Science of Science and Innovation, Northwestern University, Evanston, IL, USA
  - Northwestern Institute on Complex Systems, Northwestern University, Evanston, IL, USA
  - Kellogg School of Management, Northwestern University, Evanston, IL, USA
  - National Bureau of Economic Research, Cambridge, MA, USA
  - Brookings Institution, Washington, DC, USA
- Brian Uzzi
  - Center for Science of Science and Innovation, Northwestern University, Evanston, IL, USA
  - Northwestern Institute on Complex Systems, Northwestern University, Evanston, IL, USA
  - Kellogg School of Management, Northwestern University, Evanston, IL, USA
- Dashun Wang
  - Center for Science of Science and Innovation, Northwestern University, Evanston, IL, USA
  - Northwestern Institute on Complex Systems, Northwestern University, Evanston, IL, USA
  - Kellogg School of Management, Northwestern University, Evanston, IL, USA
  - McCormick School of Engineering, Northwestern University, Evanston, IL, USA
5
Kustatscher G, Collins T, Gingras AC, Guo T, Hermjakob H, Ideker T, Lilley KS, Lundberg E, Marcotte EM, Ralser M, Rappsilber J. An open invitation to the Understudied Proteins Initiative. Nat Biotechnol 2022; 40:815-817. [PMID: 35534555 DOI: 10.1038/s41587-022-01316-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Indexed: 01/01/2023]
Affiliation(s)
- Georg Kustatscher
  - Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, UK
- Anne-Claude Gingras
  - Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Sinai Health System, Toronto, Ontario, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- Tiannan Guo
  - Zhejiang Provincial Laboratory of Life Sciences and Biomedicine, Key Laboratory of Structural Biology of Zhejiang Province, School of Life Sciences, Westlake University, Hangzhou, China
  - Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, China
- Henning Hermjakob
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
- Trey Ideker
  - Division of Genetics, Department of Medicine, University of California San Diego, La Jolla, CA, USA
- Kathryn S Lilley
  - Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, Cambridge, UK
- Emma Lundberg
  - Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH-Royal Institute of Technology, Stockholm, Sweden
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
  - Department of Pathology, Stanford University, Stanford, CA, USA
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
- Edward M Marcotte
  - Department of Molecular Biosciences, Center for Systems and Synthetic Biology, University of Texas at Austin, Austin, TX, USA
- Markus Ralser
  - Department of Biochemistry, Charité University Medicine, Berlin, Germany
  - The Molecular Biology of Metabolism Laboratory, The Francis Crick Institute, London, UK
- Juri Rappsilber
  - Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, UK
  - Bioanalytics, Institute of Biotechnology, Technische Universität Berlin, Berlin, Germany
  - Wellcome Centre for Cell Biology, University of Edinburgh, Edinburgh, UK
6
Jeon M, Jagodnik KM, Kropiwnicki E, Stein DJ, Ma'ayan A. Prioritizing Pain-Associated Targets with Machine Learning. Biochemistry 2021; 60:1430-1446. [PMID: 33606503 DOI: 10.1021/acs.biochem.0c00930] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 02/08/2023]
Abstract
While hundreds of genes have been associated with pain, much of the molecular mechanisms of pain remain unknown. As a result, current analgesics are limited to few clinically validated targets. Here, we trained a machine learning (ML) ensemble model to predict new targets for 17 categories of pain. The model utilizes features from transcriptomics, proteomics, and gene ontology to prioritize targets for modulating pain. We focused on identifying novel G-protein-coupled receptors (GPCRs), ion channels, and protein kinases because these proteins represent the most successful drug target families. The performance of the model to predict novel pain targets is 0.839 on average based on AUROC, while the predictions for arthritis had the highest accuracy (AUROC = 0.929). The model predicts hundreds of novel targets for pain; for example, GPR132 and GPR109B are highly ranked GPCRs for rheumatoid arthritis. Overall, gene-pain association predictions cluster into three groups that are enriched for cytokine, calcium, and GABA-related cell signaling pathways. These predictions can serve as a foundation for future experimental exploration to advance the development of safer and more effective analgesics.
Affiliation(s)
- Minji Jeon
  - Department of Pharmacological Sciences, Knowledge Management Center for Illuminating the Druggable Genome (KMC-IDG), Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, P.O. Box 1603, New York, New York 10029, United States
- Kathleen M Jagodnik
  - Department of Pharmacological Sciences, Knowledge Management Center for Illuminating the Druggable Genome (KMC-IDG), Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, P.O. Box 1603, New York, New York 10029, United States
- Eryk Kropiwnicki
  - Department of Pharmacological Sciences, Knowledge Management Center for Illuminating the Druggable Genome (KMC-IDG), Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, P.O. Box 1603, New York, New York 10029, United States
- Daniel J Stein
  - Department of Pharmacological Sciences, Knowledge Management Center for Illuminating the Druggable Genome (KMC-IDG), Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, P.O. Box 1603, New York, New York 10029, United States
- Avi Ma'ayan
  - Department of Pharmacological Sciences, Knowledge Management Center for Illuminating the Druggable Genome (KMC-IDG), Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, P.O. Box 1603, New York, New York 10029, United States
7
Qin H, Niu T, Zhao J. Identifying Multi-Omics Causers and Causal Pathways for Complex Traits. Front Genet 2019; 10:110. [PMID: 30847004 PMCID: PMC6393387 DOI: 10.3389/fgene.2019.00110] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 04/17/2018] [Accepted: 01/30/2019] [Indexed: 12/23/2022]
Abstract
The central dogma of molecular biology delineates a unidirectional causal flow, i.e., DNA → RNA → protein → trait. Genome-wide association studies, next-generation sequencing association studies, and their meta-analyses have successfully identified ~12,000 susceptibility genetic variants that are associated with a broad array of human physiological traits. However, such conventional association studies ignore the mediate causers (i.e., RNA, protein) and the unidirectional causal pathway. Such studies may not be ideally powerful; and the genetic variants identified may not necessarily be genuine causal variants. In this article, we model the central dogma by a mediate causal model and analytically prove that the more remote an omics level is from a physiological trait, the smaller the magnitude of their correlation is. Under both random and extreme sampling schemes, we numerically demonstrate that the proteome-trait correlation test is more powerful than the transcriptome-trait correlation test, which in turn is more powerful than the genotype-trait association test. In conclusion, integrating RNA and protein expressions with DNA data and causal inference are necessary to gain a full understanding of how genetic causal variants contribute to phenotype variations.
Affiliation(s)
- Huaizhen Qin
  - Department of Epidemiology, College of Public Health and Health Professions and College of Medicine, University of Florida, Gainesville, FL, United States
  - Department of Global Biostatistics and Data Science, Tulane University, New Orleans, LA, United States
- Tianhua Niu
  - Department of Global Biostatistics and Data Science, Tulane University, New Orleans, LA, United States
  - Department of Biochemistry and Molecular Biology, Tulane University School of Medicine, New Orleans, LA, United States
- Jinying Zhao
  - Department of Epidemiology, College of Public Health and Health Professions and College of Medicine, University of Florida, Gainesville, FL, United States