1. Onigbinde S, Gutierrez Reyes CD, Sandilya V, Chukwubueze F, Oluokun O, Sahioun S, Oluokun A, Mechref Y. Optimization of glycopeptide enrichment techniques for the identification of clinical biomarkers. Expert Rev Proteomics 2024:1-32. [PMID: 39439029 DOI: 10.1080/14789450.2024.2418491] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2024] [Revised: 07/28/2024] [Accepted: 10/11/2024] [Indexed: 10/25/2024]
Abstract
INTRODUCTION: The identification and characterization of glycopeptides through LC-MS/MS and advanced enrichment techniques are crucial for advancing clinical glycoproteomics, significantly impacting the discovery of disease biomarkers and therapeutic targets. Despite progress in enrichment methods like Lectin Affinity Chromatography (LAC), Hydrophilic Interaction Liquid Chromatography (HILIC), and Electrostatic Repulsion Hydrophilic Interaction Chromatography (ERLIC), issues with specificity, efficiency, and scalability remain, impeding thorough analysis of complex glycosylation patterns crucial for disease understanding.
AREAS COVERED: This review explores the current challenges and innovative solutions in glycopeptide enrichment and mass spectrometry analysis, highlighting the importance of novel materials and computational advances for improving sensitivity and specificity. It outlines the potential future directions of these technologies in clinical glycoproteomics, emphasizing their transformative impact on medical diagnostics and therapeutic strategies.
EXPERT OPINION: The application of innovative materials such as Metal-Organic Frameworks (MOFs), Covalent Organic Frameworks (COFs), functional nanomaterials, and online enrichment shows promise in addressing challenges associated with glycoproteomics analysis by providing more selective and robust enrichment platforms. Moreover, the integration of artificial intelligence and machine learning is revolutionizing glycoproteomics by enhancing the processing and interpretation of extensive data from LC-MS/MS, boosting biomarker discovery, and improving predictive accuracy, thus supporting personalized medicine.
Affiliation(s)
- Sherifdeen Onigbinde
  - Department of Chemistry and Biochemistry, Texas Tech University, Lubbock, TX, USA
- Vishal Sandilya
  - Department of Chemistry and Biochemistry, Texas Tech University, Lubbock, TX, USA
- Favour Chukwubueze
  - Department of Chemistry and Biochemistry, Texas Tech University, Lubbock, TX, USA
- Odunayo Oluokun
  - Department of Chemistry and Biochemistry, Texas Tech University, Lubbock, TX, USA
- Sarah Sahioun
  - Department of Chemistry and Biochemistry, Texas Tech University, Lubbock, TX, USA
- Ayobami Oluokun
  - Department of Chemistry and Biochemistry, Texas Tech University, Lubbock, TX, USA
- Yehia Mechref
  - Department of Chemistry and Biochemistry, Texas Tech University, Lubbock, TX, USA
2. Asediya VS, Anjaria PA, Mathakiya RA, Koringa PG, Nayak JB, Bisht D, Fulmali D, Patel VA, Desai DN. Vaccine development using artificial intelligence and machine learning: A review. Int J Biol Macromol 2024; 282:136643. [PMID: 39426778 DOI: 10.1016/j.ijbiomac.2024.136643] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2024] [Revised: 09/30/2024] [Accepted: 10/15/2024] [Indexed: 10/21/2024]
Abstract
The COVID-19 pandemic has underscored the critical importance of effective vaccines, yet their development is a challenging and demanding process. It requires identifying antigens that elicit protective immunity, selecting adjuvants that enhance immunogenicity, and designing delivery systems that ensure optimal efficacy. Artificial intelligence (AI) can facilitate this process by using machine learning methods to analyze large and diverse datasets, suggest novel vaccine candidates, refine their design, and predict their performance. This review explores how AI can be applied to various aspects of vaccine development, such as predicting immune response from protein sequences, discovering adjuvants, optimizing vaccine doses, modeling vaccine supply chains, and predicting protein structures. We also address the challenges and ethical issues that emerge from the use of AI in vaccine development, such as data privacy, algorithmic bias, and health data sensitivity. We contend that AI has immense potential to accelerate vaccine development and respond to future pandemics, but it also requires careful attention to the quality and validity of the data and methods used.
Affiliation(s)
- Deepanker Bisht
  - Indian Veterinary Research Institute, Izatnagar, U.P., India
3. He G, He Q, Cheng J, Yu R, Shuai J, Cao Y. ProPept-MT: A Multi-Task Learning Model for Peptide Feature Prediction. Int J Mol Sci 2024; 25:7237. [PMID: 39000344 PMCID: PMC11241495 DOI: 10.3390/ijms25137237] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/28/2024] [Revised: 06/26/2024] [Accepted: 06/28/2024] [Indexed: 07/16/2024]
Abstract
In the realm of quantitative proteomics, data-independent acquisition (DIA) has emerged as a promising approach, offering enhanced reproducibility and quantitative accuracy compared to traditional data-dependent acquisition (DDA) methods. However, the analysis of DIA data is currently hindered by its reliance on project-specific spectral libraries derived from DDA analyses, which not only limits proteome coverage but also proves to be a time-intensive process. To overcome these challenges, we propose ProPept-MT, a novel deep learning-based multi-task prediction model designed to accurately forecast key features such as retention time (RT), ion intensity, and ion mobility (IM). Leveraging advanced techniques such as multi-head attention and BiLSTM for feature extraction, coupled with Nash-MTL for gradient coordination, ProPept-MT demonstrates superior prediction performance. Integrating ion mobility alongside RT, mass-to-charge ratio (m/z), and ion intensity forms 4D proteomics. Then, we outline a comprehensive workflow tailored for 4D DIA proteomics research, integrating the use of 4D in silico libraries predicted by ProPept-MT. Evaluation on a benchmark dataset showcases ProPept-MT's exceptional predictive capabilities, with impressive results including a 99.9% Pearson correlation coefficient (PCC) for RT prediction, a median dot product (DP) of 96.0% for fragment ion intensity prediction, and a 99.3% PCC for IM prediction on the test set. Notably, ProPept-MT manifests efficacy in predicting both unmodified and phosphorylated peptides, underscoring its potential as a valuable tool for constructing high-quality 4D DIA in silico libraries.
Affiliation(s)
- Guoqiang He
  - Postgraduate Training Base Alliance, Wenzhou Medical University, Wenzhou 325000, China
  - Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou 325000, China
- Qingzu He
  - Department of Physics, and Fujian Provincial Key Laboratory for Soft Functional Materials Research, Xiamen University, Xiamen 361005, China
- Jinyan Cheng
  - Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou 325000, China
- Rongwen Yu
  - Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou 325000, China
- Jianwei Shuai
  - Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou 325000, China
- Yi Cao
  - Postgraduate Training Base Alliance, Wenzhou Medical University, Wenzhou 325000, China
  - Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou 325000, China
4. Hamaneh M, Ogurtsov AY, Obolensky OI, Yu YK. Systematic Assessment of Deep Learning-Based Predictors of Fragmentation Intensity Profiles. J Proteome Res 2024; 23:1983-1999. [PMID: 38728051 PMCID: PMC11165591 DOI: 10.1021/acs.jproteome.3c00857] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2023] [Revised: 03/05/2024] [Accepted: 04/16/2024] [Indexed: 06/13/2024]
Abstract
In recent years, several deep learning-based methods have been proposed for predicting peptide fragment intensities. This study aims to provide a comprehensive assessment of six such methods, namely Prosit, DeepMass:Prism, pDeep3, AlphaPeptDeep, Prosit Transformer, and the method proposed by Guan et al. To this end, we evaluated the accuracy of the predicted intensity profiles for close to 1.7 million precursors (including both tryptic and HLA peptides) corresponding to more than 18 million experimental spectra procured from 40 independent submissions to the PRIDE repository that were acquired for different species using a variety of instruments and different dissociation types/energies. Specifically, for each method, distributions of similarity (measured by Pearson's correlation and normalized angle) between the predicted and the corresponding experimental b and y fragment intensities were generated. These distributions were used to ascertain the prediction accuracy and rank the prediction methods for particular types of experimental conditions. The effect of variables like precursor charge, length, and collision energy on the prediction accuracy was also investigated. In addition to prediction accuracy, the methods were evaluated in terms of prediction speed. The systematic assessment of these six methods may help in choosing the right method for MS/MS spectra prediction for particular needs.
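The two similarity measures named in this abstract can be sketched in a few lines. This is an illustration only, not the authors' evaluation code: it assumes predicted and experimental intensities are equal-length lists over the same b and y fragment ions, and it assumes the commonly used definition of the normalized spectral angle, 1 - 2·arccos(cos θ)/π. The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length intensity vectors.

    Assumes neither vector is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def normalized_spectral_angle(x, y):
    """Normalized spectral angle: 1 for identical direction, 0 for orthogonal.

    Assumed definition: 1 - 2 * arccos(cosine similarity) / pi.
    Assumes both vectors contain at least one non-zero intensity.
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


if __name__ == "__main__":
    predicted = [0.10, 0.50, 1.00, 0.00, 0.30]      # made-up intensities
    experimental = [0.12, 0.45, 1.00, 0.05, 0.25]   # made-up intensities
    print("Pearson:", pearson(predicted, experimental))
    print("Normalized angle:", normalized_spectral_angle(predicted, experimental))
```

Computing one such score per precursor and collecting the scores yields the kind of similarity distribution the abstract describes; how fragment ions are matched and filtered before scoring is left open here.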
Affiliation(s)
- Mehdi B. Hamaneh
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, United States
- Aleksey Y. Ogurtsov
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, United States
- Yi-Kuo Yu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, United States
5. Geer LY, Lapin J, Slotta DJ, Mak TD, Stein SE. AIomics: Exploring More of the Proteome Using Mass Spectral Libraries Extended by Artificial Intelligence. J Proteome Res 2023; 22:2246-2255. [PMID: 37232537 PMCID: PMC10542943 DOI: 10.1021/acs.jproteome.2c00807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/27/2023]
Abstract
The unbounded permutations of biological molecules, including proteins and their constituent peptides, present a dilemma in identifying the components of complex biosamples. Sequence search algorithms used to identify peptide spectra can be expanded to cover larger classes of molecules, including more modifications, isoforms, and atypical cleavage, but at the cost of false positives or false negatives due to the simplified spectra they compute from sequence records. Spectral library searching can help solve this issue by precisely matching experimental spectra to library spectra with excellent sensitivity and specificity. However, compiling spectral libraries that span entire proteomes is pragmatically difficult. Neural networks that predict complete spectra containing a full range of annotated and unannotated ions can be used to replace these simplified spectra with libraries of fully predicted spectra, including modified peptides. Using such a network, we created predicted spectral libraries that were used to rescore matches from a sequence search done over a large search space, including a large number of modifications. Rescoring improved the separation of true and false hits by 82%, yielding an 8% increase in peptide identifications, including a 21% increase in nonspecifically cleaved peptides and a 17% increase in phosphopeptides.
Affiliation(s)
- Lewis Y. Geer
  - Mass Spectrometry Data Center, National Institute of Standards and Technology, Biomolecular Measurement Division, 100 Bureau Dr., Gaithersburg, Maryland 20899, United States
- Joel Lapin
  - Department of Physics, Georgetown University, Washington, DC 20057, United States
  - Associate, Mass Spectrometry Data Center, National Institute of Standards and Technology, Biomolecular Measurement Division, 100 Bureau Dr., Gaithersburg, Maryland 20899, United States
- Douglas J. Slotta
  - Mass Spectrometry Data Center, National Institute of Standards and Technology, Biomolecular Measurement Division, 100 Bureau Dr., Gaithersburg, Maryland 20899, United States
- Tytus D. Mak
  - Mass Spectrometry Data Center, National Institute of Standards and Technology, Biomolecular Measurement Division, 100 Bureau Dr., Gaithersburg, Maryland 20899, United States
- Stephen E. Stein
  - Mass Spectrometry Data Center, National Institute of Standards and Technology, Biomolecular Measurement Division, 100 Bureau Dr., Gaithersburg, Maryland 20899, United States
6. Litsa EE, Chenthamarakshan V, Das P, Kavraki LE. An end-to-end deep learning framework for translating mass spectra to de-novo molecules. Commun Chem 2023; 6:132. [PMID: 37353554 PMCID: PMC10290119 DOI: 10.1038/s42004-023-00932-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2021] [Accepted: 06/13/2023] [Indexed: 06/25/2023]
Abstract
Elucidating the structure of a chemical compound is a fundamental task in chemistry with applications in multiple domains including drug discovery, precision medicine, and biomarker discovery. The common practice for elucidating the structure of a compound is to obtain a mass spectrum and subsequently retrieve its structure from spectral databases. However, these methods fail for novel molecules that are not present in the reference database. We propose Spec2Mol, a deep learning architecture for molecular structure recommendation given mass spectra alone. Spec2Mol is inspired by the Speech2Text deep learning architectures for translating audio signals into text. Our approach is based on an encoder-decoder architecture. The encoder learns the spectra embeddings, while the decoder, pre-trained on a massive dataset of chemical structures for translating between different molecular representations, reconstructs SMILES sequences of the recommended chemical structures. We have evaluated Spec2Mol by assessing the molecular similarity between the recommended structures and the original structure. Our analysis showed that Spec2Mol is able to identify the presence of key molecular substructures from a compound's mass spectrum and performs on par with existing fragmentation tree methods, particularly when the test structure is neither available during training nor present in the reference database.
Affiliation(s)
- Eleni E Litsa
  - Department of Computer Science, Rice University, Houston, TX, USA
- Payel Das
  - IBM Research, IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA
- Lydia E Kavraki
  - Department of Computer Science, Rice University, Houston, TX, USA
7. Joiret M, Leclercq M, Lambrechts G, Rapino F, Close P, Louppe G, Geris L. Cracking the genetic code with neural networks. Front Artif Intell 2023; 6:1128153. [PMID: 37091301 PMCID: PMC10117997 DOI: 10.3389/frai.2023.1128153] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2022] [Accepted: 03/21/2023] [Indexed: 04/09/2023]
Abstract
The genetic code is textbook scientific knowledge that was soundly established without resorting to Artificial Intelligence (AI). The goal of our study was to check whether a neural network could re-discover, on its own, the mapping between codons and amino acids and build the complete deciphering dictionary when presented with training pairs of transcript and protein data. We compared different Deep Learning neural network architectures and quantitatively estimated the size of the human transcriptomic training set required to achieve the best possible accuracy in the codon-to-amino-acid mapping. We also investigated the effect of a codon embedding layer, which assesses the semantic similarity between codons, on the rate of increase of the training accuracy. We further investigated the benefit of quantifying and using the unbalanced representation of amino acids within real human proteins for faster deciphering of the codons of rare amino acids. Deep neural networks require huge amounts of data to train. Deciphering the genetic code with a neural network is no exception. A test accuracy of 100% and the unequivocal deciphering of rare codons, such as the tryptophan codon or the stop codons, require a training dataset on the order of 4–22 million cumulative codon/amino-acid pairs presented to the neural network over around 7–40 training epochs, depending on the architecture and settings. We confirm that the wide generic capacities and modularity of deep neural networks allow them to be customized easily to learn the deciphering task of the genetic code efficiently.
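For orientation, the target mapping that such a network has to rediscover is the standard genetic code, a fixed lookup table of 64 codons. The sketch below builds that table and turns an in-frame coding sequence into (codon, amino acid) training pairs. It is an illustration under stated assumptions, not the authors' data pipeline: the names `CODON_TABLE` and `training_pairs` are hypothetical, and the DNA alphabet is used with U converted to T.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G base order;
# "*" marks the three stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# product() varies the last base fastest, which matches the string layout above.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def training_pairs(coding_sequence):
    """Yield (codon, amino_acid) pairs from an in-frame coding sequence.

    Trailing bases that do not fill a complete codon are ignored.
    Characters outside A, C, G, T/U raise a KeyError.
    """
    sequence = coding_sequence.upper().replace("U", "T")
    usable_length = len(sequence) - len(sequence) % 3
    for start in range(0, usable_length, 3):
        codon = sequence[start:start + 3]
        yield codon, CODON_TABLE[codon]


if __name__ == "__main__":
    for codon, amino_acid in training_pairs("AUGUGGGCUUAA"):
        print(codon, amino_acid)
```

Tryptophan (TGG) and methionine (ATG) each have a single codon, which fits the abstract's observation that rare codons need the most training examples before they are deciphered unequivocally.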
Affiliation(s)
- Marc Joiret
  - Biomechanics Research Unit, GIGA in Silico Medicine, Liège University, Liège, Belgium
  - *Correspondence: Marc Joiret
- Marine Leclercq
  - Cancer Signaling, GIGA Stem Cells, Liège University, Liège, Belgium
- Gaspard Lambrechts
  - Department of Electrical Engineering and Computer Science, Artificial Intelligence and Deep Learning, Montefiore Institute, Liège University, Liège, Belgium
- Francesca Rapino
  - Cancer Signaling, GIGA Stem Cells, Liège University, Liège, Belgium
- Pierre Close
  - Cancer Signaling, GIGA Stem Cells, Liège University, Liège, Belgium
- Gilles Louppe
  - Department of Electrical Engineering and Computer Science, Artificial Intelligence and Deep Learning, Montefiore Institute, Liège University, Liège, Belgium
- Liesbet Geris
  - Biomechanics Research Unit, GIGA in Silico Medicine, Liège University, Liège, Belgium
  - Skeletal Biology and Engineering Research Center, KU Leuven, Leuven, Belgium
  - Biomechanics Section, KU Leuven, Heverlee, Belgium
8. Cox J. Prediction of peptide mass spectral libraries with machine learning. Nat Biotechnol 2023; 41:33-43. [PMID: 36008611 DOI: 10.1038/s41587-022-01424-w] [Citation(s) in RCA: 28] [Impact Index Per Article: 28.0] [Received: 03/01/2022] [Accepted: 07/11/2022] [Indexed: 01/21/2023]
Abstract
The recent development of machine learning methods to identify peptides in complex mass spectrometric data constitutes a major breakthrough in proteomics. Longstanding methods for peptide identification, such as search engines and experimental spectral libraries, are being superseded by deep learning models that allow the fragmentation spectra of peptides to be predicted from their amino acid sequence. These new approaches, including recurrent neural networks and convolutional neural networks, use predicted in silico spectral libraries rather than experimental libraries to achieve higher sensitivity and/or specificity in the analysis of proteomics data. Machine learning is galvanizing applications that involve large search spaces, such as immunopeptidomics and proteogenomics. Current challenges in the field include the prediction of spectra for peptides with post-translational modifications and for cross-linked pairs of peptides. Permeation of machine-learning-based spectral prediction into search engines and spectrum-centric data-independent acquisition workflows for diverse peptide classes and measurement conditions will continue to push sensitivity and dynamic range in proteomics applications in the coming years.
Affiliation(s)
- Jürgen Cox
  - Computational Systems Biochemistry Research Group, Max-Planck Institute of Biochemistry, Martinsried, Germany
  - Department of Biological and Medical Psychology, University of Bergen, Bergen, Norway
9. Dickinson Q, Meyer JG. Positional SHAP (PoSHAP) for Interpretation of machine learning models trained from biological sequences. PLoS Comput Biol 2022; 18:e1009736. [PMID: 35089914 PMCID: PMC8797255 DOI: 10.1371/journal.pcbi.1009736] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/16/2021] [Accepted: 12/09/2021] [Indexed: 11/29/2022]
Abstract
Machine learning with multi-layered artificial neural networks, also known as "deep learning," is effective for making biological predictions. However, model interpretation is challenging, especially for sequential input data used with recurrent neural network architectures. Here, we introduce a framework called "Positional SHAP" (PoSHAP) to interpret models trained from biological sequences by utilizing SHapley Additive exPlanations (SHAP) to generate positional model interpretations. We demonstrate this using three long short-term memory (LSTM) regression models that predict peptide properties, including binding affinity to major histocompatibility complexes (MHC), and collisional cross section (CCS) measured by ion mobility spectrometry. Interpretation of these models with PoSHAP reproduced MHC class I (rhesus macaque Mamu-A1*001 and human A*11:01) peptide binding motifs, reflected known properties of peptide CCS, and provided new insights into interpositional dependencies of amino acid interactions. PoSHAP should have widespread utility for interpreting a variety of models trained from biological sequences.
Affiliation(s)
- Quinn Dickinson
  - Department of Biochemistry, Medical College of Wisconsin, Milwaukee, Wisconsin
- Jesse G. Meyer
  - Department of Biochemistry, Medical College of Wisconsin, Milwaukee, Wisconsin
10. Yang Y, Lin L, Qiao L. Deep learning approaches for data-independent acquisition proteomics. Expert Rev Proteomics 2021; 18:1031-1043. [PMID: 34918987 DOI: 10.1080/14789450.2021.2020654] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 01/30/2023]
Abstract
INTRODUCTION: Data-independent acquisition (DIA) is an emerging technology for large-scale proteomic studies. DIA data analysis methods are evolving rapidly, and deep learning has cut a conspicuous figure in this field.
AREAS COVERED: This review discusses and provides an overview of the deep learning methods that are used for DIA data analysis, including spectral library prediction, feature scoring, and statistical control in peptide-centric analysis, as well as de novo peptide sequencing. Literature searches were performed for articles, including preprints, up to December 2021 from PubMed, Scopus, and Web of Science databases.
EXPERT OPINION: While spectral library prediction has broken through the limitation on proteome coverage of experimental libraries, the statistical burden due to the large query space is the remaining challenge of utilizing proteome-wide predicted libraries. Analysis of post-translational modifications is another promising direction of deep learning-based DIA methods.
Affiliation(s)
- Yi Yang
  - Department of Chemistry, Shanghai Stomatological Hospital, and Minhang Hospital, Fudan University, Shanghai, China
- Ling Lin
  - Department of Chemistry, Shanghai Stomatological Hospital, and Minhang Hospital, Fudan University, Shanghai, China
- Liang Qiao
  - Department of Chemistry, Shanghai Stomatological Hospital, and Minhang Hospital, Fudan University, Shanghai, China
11. Peeters MKR, Baggerman G, Gabriels R, Pepermans E, Menschaert G, Boonen K. Ion Mobility Coupled to a Time-of-Flight Mass Analyzer Combined With Fragment Intensity Predictions Improves Identification of Classical Bioactive Peptides and Small Open Reading Frame-Encoded Peptides. Front Cell Dev Biol 2021; 9:720570. [PMID: 34604223 PMCID: PMC8484717 DOI: 10.3389/fcell.2021.720570] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/04/2021] [Accepted: 08/25/2021] [Indexed: 12/29/2022]
Abstract
Bioactive peptides exhibit key roles in a wide variety of complex processes, such as regulation of body weight, learning, aging, and innate immune response. Next to the classical bioactive peptides, emerging from larger precursor proteins by specific proteolytic processing, a new class of peptides originating from small open reading frames (sORFs) have been recognized as important biological regulators. But their intrinsic properties, specific expression pattern and location on presumed non-coding regions have hindered the full characterization of the repertoire of bioactive peptides, despite their predominant role in various pathways. Although the development of peptidomics has offered the opportunity to study these peptides in vivo, it remains challenging to identify the full peptidome as the lack of cleavage enzyme specification and large search space complicates conventional database search approaches. In this study, we introduce a proteogenomics methodology using a new type of mass spectrometry instrument and the implementation of machine learning tools toward improved identification of potential bioactive peptides in the mouse brain. The application of trapped ion mobility spectrometry (tims) coupled to a time-of-flight mass analyzer (TOF) offers improved sensitivity, an enhanced peptide coverage, reduction in chemical noise and the reduced occurrence of chimeric spectra. Subsequent machine learning tools MS2PIP, predicting fragment ion intensities and DeepLC, predicting retention times, improve the database searching based on a large and comprehensive custom database containing both sORFs and alternative ORFs. Finally, the identification of peptides is further enhanced by applying the post-processing semi-supervised learning tool Percolator. 
Applying this workflow, the first peptidomics workflow combined with spectral intensity and retention time predictions, we identified a total of 167 predicted sORF-encoded peptides (SEPs), of which 48 originate from presumed non-coding locations, next to 401 peptides from known neuropeptide precursors, linked to 66 annotated bioactive neuropeptides from within 22 different families. Additional PEAKS analysis expanded the pool of SEPs on presumed non-coding locations to 84, while an additional 204 peptides completed the list of peptides from neuropeptide precursors. Altogether, this study provides insights into a new robust pipeline that fuses technological advancements from different fields, ensuring an improved coverage of the neuropeptidome in the mouse brain.
Affiliation(s)
- Marlies K. R. Peeters
  - BioBix, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Geert Baggerman
  - Centre for Proteomics, University of Antwerp, Antwerp, Belgium
  - Unit Environmental Risk and Health, Flemish Institute for Technological Research, Mol, Belgium
- Ralf Gabriels
  - Department of Biomolecular Medicine, Ghent University, Ghent, Belgium
  - VIB-UGent Center for Medical Biotechnology, Flanders Institute for Biotechnology, Ghent, Belgium
- Elise Pepermans
  - Centre for Proteomics, University of Antwerp, Antwerp, Belgium
  - Unit Environmental Risk and Health, Flemish Institute for Technological Research, Mol, Belgium
- Gerben Menschaert
  - BioBix, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
  - OHMX.bio, Ghent, Belgium
- Kurt Boonen
  - Centre for Proteomics, University of Antwerp, Antwerp, Belgium
  - Unit Environmental Risk and Health, Flemish Institute for Technological Research, Mol, Belgium
12. Haseeb M, Saeed F. High Performance Computing Framework for Tera-Scale Database Search of Mass Spectrometry Data. Nat Comput Sci 2021; 1:550-561. [PMID: 34723198 PMCID: PMC8554525 DOI: 10.1038/s43588-021-00113-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2020] [Accepted: 07/16/2021] [Indexed: 05/09/2023]
Abstract
Database peptide search algorithms deduce peptides from mass spectrometry (MS) data. There has been substantial effort in improving their computational efficiency to enable larger and more complex systems biology studies. However, modern serial and high-performance computing (HPC) algorithms exhibit sub-optimal performance, mainly due to their ineffective parallel designs (low resource utilization) and high overhead costs. We present an HPC framework, called HiCOPS, for efficient acceleration of database peptide search algorithms on distributed-memory supercomputers. HiCOPS provides, on average, more than 10-fold improvement in speed, and superior parallel performance over several existing HPC database search software tools. We also formulate a mathematical model for performance analysis and optimization, and report near-optimal results for several key metrics including strong-scale efficiency, hardware utilization, load-balance, inter-process communication and I/O overheads. The core parallel design, techniques, and optimizations presented in HiCOPS are search-algorithm independent and can be extended to efficiently accelerate existing and future algorithms and software.
Affiliation(s)
- Muhammad Haseeb
  - Knight Foundation School of Computing and Information Sciences, Florida International University, Miami, FL, USA
- Fahad Saeed
  - Knight Foundation School of Computing and Information Sciences, Florida International University, Miami, FL, USA
  - Biomolecular Sciences Institute (BSI), Florida International University, Miami, FL, USA
  - Department of Human and Molecular Genetics, Herbert Wertheim School of Medicine, Florida International University, Miami, FL, USA
13
Abstract
Mass-spectrometry-based proteomics enables quantitative analysis of thousands of human proteins. However, experimental and computational challenges restrict progress in the field. This review summarizes the recent flurry of machine-learning strategies using artificial deep neural networks (or "deep learning") that have started to break barriers and accelerate progress in the field of shotgun proteomics. Deep learning now accurately predicts physicochemical properties of peptides from their sequence, including tandem mass spectra and retention time. Furthermore, deep learning methods exist for nearly every aspect of the modern proteomics workflow, enabling improved feature selection, peptide identification, and protein inference.
Affiliation(s)
- Jesse G. Meyer
  - Department of Biochemistry, Medical College of Wisconsin, Milwaukee, WI 53226, USA
14. Odenkirk MT, Reif DM, Baker ES. Multiomic Big Data Analysis Challenges: Increasing Confidence in the Interpretation of Artificial Intelligence Assessments. Anal Chem 2021; 93:7763-7773. [PMID: 34029068 PMCID: PMC8465926 DOI: 10.1021/acs.analchem.0c04850] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Indexed: 02/08/2023]
Abstract
The need for holistic molecular measurements to better understand disease initiation, development, diagnosis, and therapy has led to an increasing number of multiomic analyses. The wealth of information available from multiomic assessments, however, requires both the evaluation and interpretation of extremely large data sets, limiting analysis throughput and ease of adoption. Computational methods utilizing artificial intelligence (AI) provide the most promising way to address these challenges, yet despite the conceptual benefits of AI and its successful application in singular omic studies, the widespread use of AI in multiomic studies remains limited. Here, we discuss present and future capabilities of AI techniques in multiomic studies while introducing analytical checks and balances to validate the computational conclusions.
Affiliation(s)
- Melanie T Odenkirk
  - Department of Chemistry, North Carolina State University, Raleigh, North Carolina 27606, United States
- David M Reif
  - Department of Biological Sciences, North Carolina State University, Raleigh, North Carolina 27606, United States
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina 27606, United States
- Erin S Baker
  - Department of Chemistry, North Carolina State University, Raleigh, North Carolina 27606, United States
15
Tarn C, Zeng WF. pDeep3: Toward More Accurate Spectrum Prediction with Fast Few-Shot Learning. Anal Chem 2021; 93:5815-5822. [PMID: 33797898 DOI: 10.1021/acs.analchem.0c05427] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Indexed: 12/31/2022]
Abstract
Spectrum prediction using deep learning has attracted a lot of attention in recent years. Although existing deep learning methods have dramatically increased the prediction accuracy, there is still considerable room for improvement, which is presently limited by differences in fragmentation types or instrument settings. In this work, we use the few-shot learning method to fit the data online to compensate for this shortcoming. The method is evaluated using ten data sets, where the instruments include Velos, QE, Lumos, and Sciex, with different collision energy settings. Experimental results show that few-shot learning can achieve higher prediction accuracy with almost negligible computing resources. For example, on the data set from an untrained instrument, Sciex-6600, within about 10 s, the prediction accuracy is increased from 69.7% to 86.4%; on the CID (collision-induced dissociation) data set, the prediction accuracy of the model trained on HCD (higher energy collision dissociation) spectra is increased from 48.0% to 83.9%. It is also shown that the method is not demanding in terms of data quality and is efficient enough to close the accuracy gap. The source code of pDeep3 is available at http://pfind.ict.ac.cn/software/pdeep3.
Affiliation(s)
- Ching Tarn
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, 100190, Beijing, China
  - University of Chinese Academy of Sciences, 100049, Beijing, China
- Wen-Feng Zeng
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, 100190, Beijing, China
  - University of Chinese Academy of Sciences, 100049, Beijing, China
16
Chen ZL, Mao PZ, Zeng WF, Chi H, He SM. pDeepXL: MS/MS Spectrum Prediction for Cross-Linked Peptide Pairs by Deep Learning. J Proteome Res 2021; 20:2570-2582. [PMID: 33821641 DOI: 10.1021/acs.jproteome.0c01004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/29/2022]
Abstract
In cross-linking mass spectrometry, the identification of cross-linked peptide pairs relies heavily on the ability of a database search engine to measure the similarities between experimental and theoretical MS/MS spectra. However, the lack of accurate ion intensities in theoretical spectra impairs the performance of search engines, particularly at proteome scale. Here we introduce pDeepXL, a deep neural network to predict MS/MS spectra of cross-linked peptide pairs. To train pDeepXL, we used the transfer-learning technique because it facilitated training with limited benchmark data of cross-linked peptide pairs. Test results on more than ten data sets showed that pDeepXL accurately predicted the spectra of both noncleavable DSS/BS3/Leiker cross-linked peptide pairs (>80% of predicted spectra have Pearson's r values higher than 0.9) and cleavable DSSO/DSBU cross-linked peptide pairs (>75% of predicted spectra have Pearson's r values higher than 0.9). pDeepXL also achieved accurate prediction on unseen data sets using an online fine-tuning technique. Lastly, integrating pDeepXL into a database search engine increased the number of identified cross-link spectra by 18% on average.
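The accuracy figures in this abstract are reported as the share of predicted spectra whose Pearson's r against the experimental spectrum exceeds 0.9. The sketch below shows how such a summary could be computed, assuming each spectrum is represented as a list of intensities over the same matched fragment ions. The function names and intensity values are hypothetical, and this is not the pDeepXL evaluation code.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length intensity vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def fraction_above(pairs, threshold=0.9):
    """Share of (predicted, experimental) spectrum pairs with r above the threshold."""
    scores = [pearson_r(pred, exp) for pred, exp in pairs]
    return sum(r > threshold for r in scores) / len(scores)


# Hypothetical normalized fragment-ion intensities: (predicted, experimental).
pairs = [
    ([1.0, 0.5, 0.2, 0.0], [0.9, 0.6, 0.1, 0.0]),  # close match
    ([1.0, 0.5, 0.2, 0.0], [0.0, 0.2, 0.5, 1.0]),  # poor match
]
share = fraction_above(pairs)
```

A statement such as ">80% of predicted spectra have Pearson's r values higher than 0.9" corresponds to `fraction_above(pairs, 0.9) > 0.8` evaluated over a full test set.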
Affiliation(s)
- Zhen-Lin Chen
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Peng-Zhi Mao
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Wen-Feng Zeng
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Hao Chi
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Si-Min He
  - Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
17
Carrà A, Spezia R. In Silico Tandem Mass Spectrometer: an Analytical and Fundamental Tool. Chemistry-Methods 2021. [DOI: 10.1002/cmtd.202000071] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 12/28/2022]
Affiliation(s)
- Andrea Carrà
  - Agilent Technologies Italia, Via Piero Gobetti 2/C, 20063 Cernusco SN, Milano, Italy
- Riccardo Spezia
  - Laboratoire de Chimie Théorique, Sorbonne Université, UMR 7616 CNRS, 4 Place Jussieu, 75005 Paris, France
18
Yang M, Zhu Z, Zhuang Z, Bai Y, Wang S, Ge F. Proteogenomic Characterization of the Pathogenic Fungus Aspergillus flavus Reveals Novel Genes Involved in Aflatoxin Production. Mol Cell Proteomics 2020; 20:100013. [PMID: 33568340 PMCID: PMC7950108 DOI: 10.1074/mcp.ra120.002144] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/24/2020] [Revised: 10/06/2020] [Accepted: 11/24/2020] [Indexed: 12/20/2022]
Abstract
Aspergillus flavus (A. flavus), a pathogenic fungus, can produce carcinogenic and toxic aflatoxins that are a serious agricultural and medical threat worldwide. Attempts to decipher the aflatoxin biosynthetic pathway have been hampered by the lack of a high-quality genome annotation for A. flavus. To address this gap, we performed a comprehensive proteogenomic analysis using high-accuracy mass spectrometry data for this pathogen. The resulting high-quality data set confirmed the translation of 8724 previously predicted genes and identified 732 novel proteins, 269 splice variants, 447 single amino acid variants, and 188 revised genes. A subset of novel proteins was experimentally validated by RT-PCR and synthetic peptides. Further functional annotation suggested that a number of the identified novel proteins may play roles in aflatoxin biosynthesis and stress responses in A. flavus. This comprehensive strategy also identified a wide range of posttranslational modifications (PTMs), including 3461 modification sites from 1765 proteins. Functional analysis suggested the involvement of these modified proteins in the regulation of cellular metabolic and aflatoxin biosynthetic pathways. Together, we provided a high-quality annotation of the A. flavus genome and revealed novel insights into the mechanisms of aflatoxin production and pathogenicity in this pathogen.
Affiliation(s)
- Mingkun Yang
  - School of Life Sciences, and Key Laboratory of Pathogenic Fungi and Mycotoxins of Fujian Province, Fujian Agriculture and Forestry University, Fuzhou, China
  - State Key Laboratory of Freshwater Ecology and Biotechnology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, China
- Zhuo Zhu
  - School of Life Sciences, and Key Laboratory of Pathogenic Fungi and Mycotoxins of Fujian Province, Fujian Agriculture and Forestry University, Fuzhou, China
- Zhenhong Zhuang
  - School of Life Sciences, and Key Laboratory of Pathogenic Fungi and Mycotoxins of Fujian Province, Fujian Agriculture and Forestry University, Fuzhou, China
- Youhuang Bai
  - School of Life Sciences, and Key Laboratory of Pathogenic Fungi and Mycotoxins of Fujian Province, Fujian Agriculture and Forestry University, Fuzhou, China
- Shihua Wang
  - School of Life Sciences, and Key Laboratory of Pathogenic Fungi and Mycotoxins of Fujian Province, Fujian Agriculture and Forestry University, Fuzhou, China
- Feng Ge
  - State Key Laboratory of Freshwater Ecology and Biotechnology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan, China
19
Wen B, Zeng W, Liao Y, Shi Z, Savage SR, Jiang W, Zhang B. Deep Learning in Proteomics. Proteomics 2020; 20:e1900335. [PMID: 32939979 PMCID: PMC7757195 DOI: 10.1002/pmic.201900335] [Citation(s) in RCA: 76] [Impact Index Per Article: 19.0] [Received: 05/27/2020] [Revised: 09/14/2020] [Indexed: 12/17/2022]
Abstract
Proteomics, the study of all the proteins in biological systems, is becoming a data-rich science. Protein sequences and structures are comprehensively catalogued in online databases. With recent advancements in tandem mass spectrometry (MS) technology, protein expression and post-translational modifications (PTMs) can be studied in a variety of biological systems at the global scale. Sophisticated computational algorithms are needed to translate the vast amount of data into novel biological insights. Deep learning automatically extracts representations at high levels of abstraction from data, and it thrives in data-rich scientific research domains. Here, a comprehensive overview of deep learning applications in proteomics, including retention time prediction, MS/MS spectrum prediction, de novo peptide sequencing, PTM prediction, major histocompatibility complex-peptide binding prediction, and protein structure prediction, is provided. Limitations and the future directions of deep learning in proteomics are also discussed. This review will provide readers with an overview of deep learning and how it can be used to analyze proteomics data.
Affiliation(s)
- Bo Wen
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Wen-Feng Zeng
  - Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Chinese Academy of Sciences, Institute of Computing Technology, Beijing 100190, China
- Yuxing Liao
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Zhiao Shi
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Sara R. Savage
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Wen Jiang
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
- Bing Zhang
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA