1
Onigbinde S, Gutierrez Reyes CD, Sandilya V, Chukwubueze F, Oluokun O, Sahioun S, Oluokun A, Mechref Y. Optimization of glycopeptide enrichment techniques for the identification of clinical biomarkers. Expert Rev Proteomics 2024:1-32. [PMID: 39439029 DOI: 10.1080/14789450.2024.2418491] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2024] [Revised: 07/28/2024] [Accepted: 10/11/2024] [Indexed: 10/25/2024]
Abstract
INTRODUCTION: The identification and characterization of glycopeptides through LC-MS/MS and advanced enrichment techniques are crucial for advancing clinical glycoproteomics, significantly impacting the discovery of disease biomarkers and therapeutic targets. Despite progress in enrichment methods like Lectin Affinity Chromatography (LAC), Hydrophilic Interaction Liquid Chromatography (HILIC), and Electrostatic Repulsion Hydrophilic Interaction Chromatography (ERLIC), issues with specificity, efficiency, and scalability remain, impeding thorough analysis of complex glycosylation patterns crucial for disease understanding.
AREAS COVERED: This review explores the current challenges and innovative solutions in glycopeptide enrichment and mass spectrometry analysis, highlighting the importance of novel materials and computational advances for improving sensitivity and specificity. It outlines the potential future directions of these technologies in clinical glycoproteomics, emphasizing their transformative impact on medical diagnostics and therapeutic strategies.
EXPERT OPINION: The application of innovative materials such as Metal-Organic Frameworks (MOFs), Covalent Organic Frameworks (COFs), functional nanomaterials, and online enrichment shows promise in addressing challenges associated with glycoproteomics analysis by providing more selective and robust enrichment platforms. Moreover, the integration of artificial intelligence and machine learning is revolutionizing glycoproteomics by enhancing the processing and interpretation of extensive data from LC-MS/MS, boosting biomarker discovery, and improving predictive accuracy, thus supporting personalized medicine.
Affiliation(s)
- Sherifdeen Onigbinde
- Department of Chemistry and Biochemistry, Texas Tech University, Lubbock, TX, USA
- Vishal Sandilya
- Department of Chemistry and Biochemistry, Texas Tech University, Lubbock, TX, USA
- Favour Chukwubueze
- Department of Chemistry and Biochemistry, Texas Tech University, Lubbock, TX, USA
- Odunayo Oluokun
- Department of Chemistry and Biochemistry, Texas Tech University, Lubbock, TX, USA
- Sarah Sahioun
- Department of Chemistry and Biochemistry, Texas Tech University, Lubbock, TX, USA
- Ayobami Oluokun
- Department of Chemistry and Biochemistry, Texas Tech University, Lubbock, TX, USA
- Yehia Mechref
- Department of Chemistry and Biochemistry, Texas Tech University, Lubbock, TX, USA
2
Wen B, Hsu C, Zeng WF, Riffle M, Chang A, Mudge M, Nunn B, Berg MD, Villén J, MacCoss MJ, Noble WS. Carafe enables high quality in silico spectral library generation for data-independent acquisition proteomics. bioRxiv 2024:2024.10.15.618504. [PMID: 39463980 PMCID: PMC11507862 DOI: 10.1101/2024.10.15.618504] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/29/2024]
Abstract
Data-independent acquisition (DIA)-based mass spectrometry is becoming an increasingly popular mass spectrometry acquisition strategy for carrying out quantitative proteomics experiments. Most of the popular DIA search engines make use of in silico generated spectral libraries. However, the generation of high-quality spectral libraries for DIA data analysis remains a challenge, particularly because most such libraries are generated directly from data-dependent acquisition (DDA) data or are from in silico prediction using models trained on DDA data. In this study, we developed Carafe, a tool that generates high-quality experiment-specific in silico spectral libraries by training deep learning models directly on DIA data. We demonstrate the performance of Carafe on a wide range of DIA datasets, where we observe improved fragment ion intensity prediction and peptide detection relative to existing pretrained DDA models.
Affiliation(s)
- Bo Wen
- Department of Genome Sciences, University of Washington
- Chris Hsu
- Department of Genome Sciences, University of Washington
- Wen-Feng Zeng
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Germany
- Alexis Chang
- Department of Genome Sciences, University of Washington
- Miranda Mudge
- Department of Genome Sciences, University of Washington
- Brook Nunn
- Department of Genome Sciences, University of Washington
- Judit Villén
- Department of Genome Sciences, University of Washington
- William S. Noble
- Department of Genome Sciences, University of Washington
- Paul G. Allen School of Computer Science and Engineering, University of Washington
3
Zhang L, Deng T, Pan S, Zhang M, Zhang Y, Yang C, Yang X, Tian G, Mi J. DeepO-GlcNAc: a web server for prediction of protein O-GlcNAcylation sites using deep learning combined with attention mechanism. Front Cell Dev Biol 2024; 12:1456728. [PMID: 39450274 PMCID: PMC11500328 DOI: 10.3389/fcell.2024.1456728] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2024] [Accepted: 09/26/2024] [Indexed: 10/26/2024] [Open Access]
Abstract
Introduction: Protein O-GlcNAcylation is a dynamic post-translational modification involved in major cellular processes and associated with many human diseases. Bioinformatic prediction of O-GlcNAc sites before experimental validation is a challenging task in O-GlcNAc research. Recent advancements in deep learning algorithms and the availability of O-GlcNAc proteomics data present an opportunity to improve O-GlcNAc site prediction.
Objectives: This study aims to develop a deep learning-based tool to improve O-GlcNAcylation site prediction.
Methods: We construct an annotated unbalanced O-GlcNAcylation data set and propose a new deep learning framework, DeepO-GlcNAc, using Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN) combined with an attention mechanism.
Results: The ablation study confirms that the additional model components in DeepO-GlcNAc, such as attention mechanisms and LSTM, contribute positively to improving prediction performance. Our model demonstrates strong robustness across five cross-species datasets, excluding humans. We also compare our model with three external predictors using an independent dataset. Our results demonstrate that DeepO-GlcNAc outperforms the external predictors, achieving an accuracy of 92%, an average precision of 72%, an MCC of 0.60, and an AUC of 92% in ROC analysis. Moreover, we have implemented DeepO-GlcNAc as a web server to facilitate further investigation and usage by the scientific community.
Conclusion: Our work demonstrates the feasibility of utilizing deep learning for O-GlcNAc site prediction and provides a novel tool for O-GlcNAc investigation.
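The abstract above reports both accuracy and the Matthews correlation coefficient (MCC) on an unbalanced data set. As a reading aid, the following minimal sketch shows how the two metrics are computed from a confusion matrix and why they can diverge when positive sites are rare. The counts are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def accuracy(tp, fp, tn, fn):
    """Fraction of all candidate sites classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, report 0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts (NOT from the paper): 100 true sites among 1000 candidates.
counts = dict(tp=60, fp=40, tn=860, fn=40)
print(f"accuracy = {accuracy(**counts):.3f}")
print(f"MCC      = {mcc(**counts):.3f}")
```

With these illustrative counts the accuracy is 0.92 while the MCC is about 0.56, which is why MCC is commonly reported alongside accuracy for imbalanced site-prediction tasks.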
Affiliation(s)
- Liyuan Zhang
- Shandong Technology Innovation Center of Molecular Targeting and Intelligent Diagnosis and Treatment, Binzhou Medical University, Yantai, Shandong, China
- Tingzhi Deng
- Shandong Technology Innovation Center of Molecular Targeting and Intelligent Diagnosis and Treatment, Binzhou Medical University, Yantai, Shandong, China
- National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, Fujian, China
- Shuijing Pan
- Shandong Technology Innovation Center of Molecular Targeting and Intelligent Diagnosis and Treatment, Binzhou Medical University, Yantai, Shandong, China
- Minghui Zhang
- Shandong Technology Innovation Center of Molecular Targeting and Intelligent Diagnosis and Treatment, Binzhou Medical University, Yantai, Shandong, China
- Yusen Zhang
- School of Mathematics and Statistics, Shandong University, Weihai, Shandong, China
- Chunhua Yang
- Shandong Technology Innovation Center of Molecular Targeting and Intelligent Diagnosis and Treatment, Binzhou Medical University, Yantai, Shandong, China
- Xiaoyong Yang
- Department of Comparative Medicine, Department of Cellular and Molecular Physiology, Yale University, New Haven, CT, United States
- Geng Tian
- Shandong Technology Innovation Center of Molecular Targeting and Intelligent Diagnosis and Treatment, Binzhou Medical University, Yantai, Shandong, China
- Jia Mi
- Shandong Technology Innovation Center of Molecular Targeting and Intelligent Diagnosis and Treatment, Binzhou Medical University, Yantai, Shandong, China
4
Qin Z, Ren H, Zhao P, Wang K, Liu H, Miao C, Du Y, Li J, Wu L, Chen Z. Current computational tools for protein lysine acylation site prediction. Brief Bioinform 2024; 25:bbae469. [PMID: 39316944 PMCID: PMC11421846 DOI: 10.1093/bib/bbae469] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2024] [Revised: 08/20/2024] [Accepted: 09/07/2024] [Indexed: 09/26/2024] [Open Access]
Abstract
As a main subtype of post-translational modification (PTM), protein lysine acylations (PLAs) play crucial roles in regulating diverse functions of proteins. With recent advancements in proteomics technology, the identification of PTMs is becoming a data-rich field. The large amount of experimentally verified data urgently needs to be translated into valuable biological insights. With computational approaches, PLAs can be accurately detected across the whole proteome, even for organisms with small-scale datasets. Herein, a comprehensive summary of 166 in silico PLA prediction methods is presented, covering predictors for a single type of PLA site as well as predictors for multiple types of PLA sites. This recapitulation covers important aspects that are critical for the development of a robust predictor, including data collection and preparation, sample selection, feature representation, classification algorithm design, model evaluation, and method availability. Notably, we discuss the application of protein language models and transfer learning to solve the small-sample learning issue. We also highlight the prediction methods developed for functionally relevant PLA sites and species/substrate/cell-type-specific PLA sites. In conclusion, this systematic review could potentially facilitate the development of novel PLA predictors and offer useful insights to researchers from various disciplines.
Affiliation(s)
- Zhaohui Qin
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Haoran Ren
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Pei Zhao
- State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang 455000, China
- Kaiyuan Wang
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Huixia Liu
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Chunbo Miao
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Yanxiu Du
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Junzhou Li
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Liuji Wu
- National Key Laboratory of Wheat and Maize Crop Science, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
- Zhen Chen
- Collaborative Innovation Center of Henan Grain Crops, Henan Key Laboratory of Rice Molecular Breeding and High Efficiency Production, College of Agronomy, Henan Agricultural University, Zhengzhou 450046, China
5
Kim HS, Kim YI, Cho JY. ARID3C Acts as a Regulator of Monocyte-to-Macrophage Differentiation Interacting with NPM1. J Proteome Res 2024; 23:2882-2892. [PMID: 38231884 PMCID: PMC11302414 DOI: 10.1021/acs.jproteome.3c00509] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2023] [Revised: 12/15/2023] [Accepted: 12/21/2023] [Indexed: 01/19/2024]
Abstract
ARID3C is a protein encoded on human chromosome 9 and expressed at low levels in various organs, yet its biological function has not been elucidated. In this study, we investigated both the cellular localization and function of ARID3C. Employing a combination of LC-MS/MS and deep learning techniques, we identified NPM1 as a binding partner for ARID3C's nuclear shuttling. ARID3C was found to localize predominantly to the nucleus, where it functioned as a transcription factor for the genes STAT3, STAT1, and JUNB, thereby facilitating monocyte-to-macrophage differentiation. The precise binding sites between ARID3C and NPM1 were predicted by AlphaFold2. Mutating this binding site prevented ARID3C from interacting with NPM1, resulting in its retention in the cytoplasm instead of translocation to the nucleus. Consequently, ARID3C lost its ability to bind to the promoters of target genes, leading to a loss of monocyte-to-macrophage differentiation. Collectively, our findings indicate that ARID3C forms a complex with NPM1 to translocate to the nucleus, acting as a transcription factor that promotes the expression of the genes involved in monocyte-to-macrophage differentiation.
Affiliation(s)
- Hui-Su Kim
- Department of Biochemistry, College of Veterinary Medicine, Research Institute for Veterinary Science, and BK21 FOUR Future Veterinary Medicine Leading Education and Research Center, Seoul National University, Seoul 08826, Republic of Korea
- Comparative Medicine Disease Research Center (CDRC), Science Research Center (SRC), Seoul National University, Seoul 08826, Republic of Korea
- Yong-In Kim
- Department of Biochemistry, College of Veterinary Medicine, Research Institute for Veterinary Science, and BK21 FOUR Future Veterinary Medicine Leading Education and Research Center, Seoul National University, Seoul 08826, Republic of Korea
- Comparative Medicine Disease Research Center (CDRC), Science Research Center (SRC), Seoul National University, Seoul 08826, Republic of Korea
- Je-Yoel Cho
- Department of Biochemistry, College of Veterinary Medicine, Research Institute for Veterinary Science, and BK21 FOUR Future Veterinary Medicine Leading Education and Research Center, Seoul National University, Seoul 08826, Republic of Korea
- Comparative Medicine Disease Research Center (CDRC), Science Research Center (SRC), Seoul National University, Seoul 08826, Republic of Korea
6
Webel H, Niu L, Nielsen AB, Locard-Paulet M, Mann M, Jensen LJ, Rasmussen S. Imputation of label-free quantitative mass spectrometry-based proteomics data using self-supervised deep learning. Nat Commun 2024; 15:5405. [PMID: 38926340 PMCID: PMC11208500 DOI: 10.1038/s41467-024-48711-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2023] [Accepted: 05/13/2024] [Indexed: 06/28/2024] [Open Access]
Abstract
Imputation techniques provide means to replace missing measurements with a value and are used in almost all downstream analysis of mass spectrometry (MS) based proteomics data using label-free quantification (LFQ). Here we demonstrate how collaborative filtering, denoising autoencoders, and variational autoencoders can impute missing values in the context of LFQ at different levels. We applied our method, proteomics imputation modeling mass spectrometry (PIMMS), to an alcohol-related liver disease (ALD) cohort with blood plasma proteomics data available for 358 individuals. Removing 20 percent of the intensities we were able to recover 15 out of 17 significant abundant protein groups using PIMMS-VAE imputations. When analyzing the full dataset we identified 30 additional proteins (+13.2%) that were significantly differentially abundant across disease stages compared to no imputation and found that some of these were predictive of ALD progression in machine learning models. We, therefore, suggest the use of deep learning approaches for imputing missing values in MS-based proteomics on larger datasets and provide workflows for these.
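The evaluation idea mentioned in the abstract above (remove a share of the measured intensities, impute them, and compare against the held-out values) can be sketched as follows. This is a minimal illustration under stated assumptions: the column-median imputer is a simple stand-in and not one of the PIMMS models, the function names are hypothetical, and the small intensity matrix is made up.

```python
import random
import statistics


def mask_observed(matrix, fraction, seed=0):
    """Hide a fraction of the observed (non-None) values; return masked copy and truth."""
    rng = random.Random(seed)
    observed = [(i, j) for i, row in enumerate(matrix)
                for j, value in enumerate(row) if value is not None]
    held_out = rng.sample(observed, int(len(observed) * fraction))
    masked = [list(row) for row in matrix]
    truth = {}
    for i, j in held_out:
        truth[(i, j)] = masked[i][j]
        masked[i][j] = None
    return masked, truth


def impute_column_median(matrix):
    """Stand-in imputer: fill each missing value with its column (feature) median."""
    medians = []
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix if row[j] is not None]
        medians.append(statistics.median(column) if column else 0.0)
    return [[medians[j] if value is None else value
             for j, value in enumerate(row)] for row in matrix]


def mean_absolute_error(imputed, truth):
    """Average absolute difference between imputed and held-out true values."""
    if not truth:
        return 0.0
    return sum(abs(imputed[i][j] - value) for (i, j), value in truth.items()) / len(truth)


# Made-up log-intensities: 5 samples x 4 protein groups, fully observed.
data = [[20.1, 25.3, 18.7, 22.0],
        [19.8, 25.9, 18.2, 21.5],
        [20.5, 24.8, 19.1, 22.4],
        [20.0, 25.1, 18.9, 21.9],
        [19.6, 25.6, 18.4, 22.2]]
masked, truth = mask_observed(data, fraction=0.2)
imputed = impute_column_median(masked)
print("held-out values:", len(truth))
print("MAE on held-out values:", round(mean_absolute_error(imputed, truth), 3))
```

Swapping `impute_column_median` for another imputer while keeping the same mask gives a like-for-like comparison on the held-out intensities.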
Affiliation(s)
- Henry Webel
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen N, Denmark
- Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen N, Denmark
- Lili Niu
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen N, Denmark
- Annelaura Bach Nielsen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen N, Denmark
- Marie Locard-Paulet
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen N, Denmark
- Institut de Pharmacologie et de Biologie Structurale (IPBS), Université de Toulouse, CNRS, Université Toulouse III - Paul Sabatier (UT3), Toulouse, France
- Matthias Mann
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen N, Denmark
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
- Lars Juhl Jensen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen N, Denmark
- Simon Rasmussen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen N, Denmark.
- Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen N, Denmark.
- The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA.
7
Beck A, Muhoberac M, Randolph CE, Beveridge CH, Wijewardhane PR, Kenttämaa HI, Chopra G. Recent Developments in Machine Learning for Mass Spectrometry. ACS MEASUREMENT SCIENCE AU 2024; 4:233-246. [PMID: 38910862 PMCID: PMC11191731 DOI: 10.1021/acsmeasuresciau.3c00060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2023] [Revised: 12/27/2023] [Accepted: 01/22/2024] [Indexed: 06/25/2024]
Abstract
Statistical analysis and modeling of mass spectrometry (MS) data have a long and rich history, with several modern MS-based applications using statistical and chemometric methods. Recently, machine learning (ML) has experienced a renaissance due to advances in computational hardware and the development of new algorithms for artificial neural networks (ANN) and deep learning architectures. Moreover, recent successes of new ANN and deep learning architectures in several areas of science, engineering, and society have further strengthened the ML field. Importantly, modern ML methods and architectures have enabled new approaches for tasks related to MS that are now widely adopted in several popular MS-based subdisciplines, such as mass spectrometry imaging and proteomics. Herein, we aim to provide an introductory summary of the practical aspects of ML methodology relevant to MS. Additionally, we seek to provide an up-to-date review of the most recent developments in ML integration with MS-based techniques while also providing critical insights into the future direction of the field.
Affiliation(s)
- Armen G. Beck
- Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana 47907, United States
- Matthew Muhoberac
- Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana 47907, United States
- Caitlin E. Randolph
- Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana 47907, United States
- Connor H. Beveridge
- Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana 47907, United States
- Prageeth R. Wijewardhane
- Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana 47907, United States
- Hilkka I. Kenttämaa
- Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana 47907, United States
- Gaurav Chopra
- Department of Chemistry, Purdue University, 560 Oval Drive, West Lafayette, Indiana 47907, United States
- Department of Computer Science (by courtesy), Purdue University, West Lafayette, Indiana 47907, United States
- Purdue Institute for Drug Discovery, Purdue Institute for Cancer Research, Regenstrief Center for Healthcare Engineering, Purdue Institute for Inflammation, Immunology and Infectious Disease, Purdue Institute for Integrative Neuroscience, West Lafayette, Indiana 47907 United States
8
Algahmadi A, Mohammed AE, Alfadda AA, Alanazi IO, Alwehaibi MA, Scaria Joy S, Al-shaye D, Benabdelkamel H. Proteomics of Penicillium chrysogenum for a Deeper Understanding of Lead (Pb) Metal Bioremediation. ACS OMEGA 2024; 9:26245-26256. [PMID: 38911750 PMCID: PMC11190926 DOI: 10.1021/acsomega.4c02006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Revised: 05/09/2024] [Accepted: 05/27/2024] [Indexed: 06/25/2024]
Abstract
Penicillium chrysogenum (P. chrysogenum), a ubiquitous filamentous fungus, has demonstrated remarkable potential in the bioremediation of lead-contaminated environments. Its inherent tolerance and bioaccumulation capacity for lead (Pb), coupled with its relatively rapid growth rate, make it an attractive candidate for bioremediation applications. This study aims to identify the proteomic changes in P. chrysogenum induced by Pb metal stress and unravel the roles of identified proteins in molecular mechanisms and cellular responses. Untargeted proteomic analysis was carried out using two-dimensional difference gel electrophoresis (2D-DIGE) coupled with matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF-MS). This study reported the identification of 43 statistically significant proteins (24 upregulated and 19 downregulated, ANOVA, p ≤ 0.05; fold change ≥1.5) in P. chrysogenum as a consequence of Pb treatment. Proteins were grouped according to their function into 18 groups, from which 13 proteins were related to metabolism, 11 were related to cellular process and signaling, and 19 proteins were related to information storage and processing. The current study is considered the first proteomic report on P. chrysogenum under Pb stress conditions, where upregulated proteins could better explain the mechanism of tolerance and Pb toxicity removal. Our research has provided a thorough understanding of the molecular and cellular processes involved in fungal-metal interactions, paving the way for the development of innovative molecular markers for heavy metal myco-remediation. To the best of our knowledge, this study of P. chrysogenum provides valuable insights toward growing research in comprehending metal-microbe interactions. This will facilitate the development of novel molecular markers for metal bioremediation.
Affiliation(s)
- Amjad Algahmadi
- Department of Biology, College of Science, Princess Nourah bint Abdulrahman University, Riyadh 11671, Saudi Arabia
- Afrah E. Mohammed
- Department of Biology, College of Science, Princess Nourah bint Abdulrahman University, Riyadh 11671, Saudi Arabia
- Assim A Alfadda
- Proteomics Resource Unit, Obesity Research Center and the Department of Medicine, College of Medicine, King Saud University, P O Box 2925 98 Riyadh 11461, Saudi Arabia
- Ibrahim O Alanazi
- Healthy Aging Research Institute Health Sector, King Abdulaziz City for Science and Technology (KACST), P O Box 6086 Riyadh 11442, Saudi Arabia
- Moudi A. Alwehaibi
- Proteomics Resource Unit, Obesity Research Center, College of Medicine, King Saud University, P O Box 2925 98 Riyadh 11461, Saudi Arabia
- Salini Scaria Joy
- Strategic Center for Diabetes Research, College of Medicine, King Saud University, Riyadh 12211, Saudi Arabia
- Dalal Al-shaye
- Department of Biology, College of Science, Princess Nourah bint Abdulrahman University, Riyadh 11671, Saudi Arabia
- Hicham Benabdelkamel
- Proteomics Resource Unit, Obesity Research Center, College of Medicine, King Saud University, P O Box 2925 98 Riyadh 11461, Saudi Arabia
9
Bulashevska A, Nacsa Z, Lang F, Braun M, Machyna M, Diken M, Childs L, König R. Artificial intelligence and neoantigens: paving the path for precision cancer immunotherapy. Front Immunol 2024; 15:1394003. [PMID: 38868767 PMCID: PMC11167095 DOI: 10.3389/fimmu.2024.1394003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Accepted: 05/13/2024] [Indexed: 06/14/2024] [Open Access]
Abstract
Cancer immunotherapy has witnessed rapid advancement in recent years, with a particular focus on neoantigens as promising targets for personalized treatments. The convergence of immunogenomics, bioinformatics, and artificial intelligence (AI) has propelled the development of innovative neoantigen discovery tools and pipelines. These tools have revolutionized our ability to identify tumor-specific antigens, providing the foundation for precision cancer immunotherapy. AI-driven algorithms can process extensive amounts of data, identify patterns, and make predictions that were once challenging to achieve. However, the integration of AI comes with its own set of challenges, leaving space for further research. In this article, with a particular focus on computational approaches, we explore the current landscape of neoantigen prediction, the fundamental concepts behind it, and the challenges and their potential solutions, providing a comprehensive overview of this rapidly evolving field.
Affiliation(s)
- Alla Bulashevska
- Host-Pathogen-Interactions, Paul-Ehrlich-Institut, Langen, Germany
- Zsófia Nacsa
- Host-Pathogen-Interactions, Paul-Ehrlich-Institut, Langen, Germany
- Franziska Lang
- TRON - Translational Oncology at the University Medical Center of the Johannes Gutenberg University gGmbH, Mainz, Germany
- Markus Braun
- Host-Pathogen-Interactions, Paul-Ehrlich-Institut, Langen, Germany
- Martin Machyna
- Host-Pathogen-Interactions, Paul-Ehrlich-Institut, Langen, Germany
- Mustafa Diken
- TRON - Translational Oncology at the University Medical Center of the Johannes Gutenberg University gGmbH, Mainz, Germany
- Liam Childs
- Host-Pathogen-Interactions, Paul-Ehrlich-Institut, Langen, Germany
- Renate König
- Host-Pathogen-Interactions, Paul-Ehrlich-Institut, Langen, Germany
10
Peters-Clarke TM, Coon JJ, Riley NM. Instrumentation at the Leading Edge of Proteomics. Anal Chem 2024; 96:7976-8010. [PMID: 38738990 DOI: 10.1021/acs.analchem.3c04497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/14/2024]
Affiliation(s)
- Trenton M Peters-Clarke
- Department of Chemistry, University of Wisconsin–Madison, Madison, Wisconsin 53706, United States
- Department of Biomolecular Chemistry, University of Wisconsin–Madison, Madison, Wisconsin 53706, United States
- Joshua J Coon
- Department of Chemistry, University of Wisconsin–Madison, Madison, Wisconsin 53706, United States
- Department of Biomolecular Chemistry, University of Wisconsin–Madison, Madison, Wisconsin 53706, United States
- Morgridge Institute for Research, Madison, Wisconsin 53715, United States
- Nicholas M Riley
- Department of Chemistry, University of Washington, Seattle, Washington 98195, United States
11
Siraj A, Bouwmeester R, Declercq A, Welp L, Chernev A, Wulf A, Urlaub H, Martens L, Degroeve S, Kohlbacher O, Sachsenberg T. Intensity and retention time prediction improves the rescoring of protein-nucleic acid cross-links. Proteomics 2024; 24:e2300144. [PMID: 38629965 DOI: 10.1002/pmic.202300144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2023] [Revised: 12/29/2023] [Accepted: 01/05/2024] [Indexed: 04/19/2024]
Abstract
In protein-RNA cross-linking mass spectrometry, UV or chemical cross-linking introduces stable bonds between amino acids and nucleic acids in protein-RNA complexes that are then analyzed and detected in mass spectra. This analytical tool delivers valuable information about RNA-protein interactions and RNA docking sites in proteins, both in vitro and in vivo. The identification of cross-linked peptides with oligonucleotides of different length leads to a combinatorial increase in search space. We demonstrate that the peptide retention time prediction tasks can be transferred to the task of cross-linked peptide retention time prediction using a simple amino acid composition encoding, yielding improved identification rates when the prediction error is included in rescoring. For the more challenging task of including fragment intensity prediction of cross-linked peptides in the rescoring, we obtain, on average, a similar improvement. Further improvement in the encoding and fine-tuning of retention time and intensity prediction models might lead to further gains, and merit further research.
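The general idea in the abstract above, using the retention time prediction error as extra evidence when rescoring identifications, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the linear composition-based predictor, the per-residue weights, the penalty factor, and the `PSM` record are hypothetical and do not reproduce the authors' models or rescoring pipeline.

```python
from dataclasses import dataclass


@dataclass
class PSM:
    """A peptide-spectrum match with its search-engine score and observed retention time."""
    peptide: str
    search_score: float
    observed_rt: float


def composition_rt(peptide, weights, intercept=0.0):
    """Toy predictor: retention time as a sum of per-residue contributions."""
    return intercept + sum(weights.get(residue, 0.0) for residue in peptide)


def rescore(psms, weights, penalty=1.0):
    """Subtract a penalty proportional to the absolute RT prediction error, then rank."""
    rescored = []
    for psm in psms:
        rt_error = abs(psm.observed_rt - composition_rt(psm.peptide, weights))
        rescored.append((psm, psm.search_score - penalty * rt_error))
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored


# Hypothetical per-residue weights and matches, for illustration only.
toy_weights = {"A": 1.0, "K": 2.0}
matches = [PSM("AAK", search_score=10.5, observed_rt=10.0),
           PSM("AK", search_score=10.0, observed_rt=3.0)]
for psm, score in rescore(matches, toy_weights):
    print(psm.peptide, score)
```

In this toy case the match whose observed retention time agrees with the prediction overtakes a match with a slightly higher search score but a large prediction error. In practice the error would more likely enter a learned rescoring model as a feature than be applied as a fixed penalty.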
Affiliation(s)
- Arslan Siraj
- Department of Computer Science, Applied Bioinformatics, University of Tübingen, Tübingen, Germany
- Institute for Biological and Medical Informatics, University of Tübingen, Tübingen, Germany
- Robbin Bouwmeester
- Department of Biomolecular Medicine, Ghent University, Gent, Belgium
- VIB-UGent Center for Medical Biotechnology, VIB, Gent, Belgium
| | - Arthur Declercq
- Department of Biomolecular Medicine, Ghent University, Gent, Belgium
- VIB-UGent Center for Medical Biotechnology, VIB, Gent, Belgium
| | - Luisa Welp
- Bioanalytical Mass Spectrometry, Max Planck Institute for Multidisciplinary Sciences, Göttingen, Germany
- Bioanalytics, Institute of Clinical Chemistry, University Medical Center Göttingen, Göttingen, Germany
| | - Aleksandar Chernev
- Bioanalytical Mass Spectrometry, Max Planck Institute for Multidisciplinary Sciences, Göttingen, Germany
| | - Alexander Wulf
- Bioanalytical Mass Spectrometry, Max Planck Institute for Multidisciplinary Sciences, Göttingen, Germany
| | - Henning Urlaub
- Bioanalytical Mass Spectrometry, Max Planck Institute for Multidisciplinary Sciences, Göttingen, Germany
- Bioanalytics, Institute of Clinical Chemistry, University Medical Center Göttingen, Göttingen, Germany
| | - Lennart Martens
- Department of Biomolecular Medicine, Ghent University, Gent, Belgium
- VIB-UGent Center for Medical Biotechnology, VIB, Gent, Belgium
| | - Sven Degroeve
- Department of Biomolecular Medicine, Ghent University, Gent, Belgium
- VIB-UGent Center for Medical Biotechnology, VIB, Gent, Belgium
| | - Oliver Kohlbacher
- Department of Computer Science, Applied Bioinformatics, University of Tübingen, Tübingen, Germany
- Institute for Biological and Medical Informatics, University of Tübingen, Tübingen, Germany
| | - Timo Sachsenberg
- Department of Computer Science, Applied Bioinformatics, University of Tübingen, Tübingen, Germany
- Institute for Biological and Medical Informatics, University of Tübingen, Tübingen, Germany
| |
Collapse
|
12
|
Abrego L, Zaikin A, Marino IP, Krivonosov MI, Jacobs I, Menon U, Gentry‐Maharaj A, Blyuss O. Bayesian and deep-learning models applied to the early detection of ovarian cancer using multiple longitudinal biomarkers. Cancer Med 2024; 13:e7163. [PMID: 38597129 PMCID: PMC11004913 DOI: 10.1002/cam4.7163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2023] [Revised: 03/16/2024] [Accepted: 03/26/2024] [Indexed: 04/11/2024] Open
Abstract
BACKGROUND Ovarian cancer is the most lethal of all gynecological cancers. Cancer Antigen 125 (CA125) is the best-performing ovarian cancer biomarker, but it is still not effective as a screening test in the general population. Recent literature reports additional biomarkers with the potential to improve on CA125 for early detection when using longitudinal multimarker models. METHODS Our data comprised 180 controls and 44 cases with serum samples sourced from the multimodal arm of the UK Collaborative Trial of Ovarian Cancer Screening (UKCTOCS). Our models were based on Bayesian change-point detection and recurrent neural networks. RESULTS With both the Bayesian change-point (BCP) and recurrent neural network (RNN) approaches, we obtained a significantly higher performance for the CA125-HE4 model (BCP: AUC 0.971, sensitivity 96.7%; RNN: AUC 0.987, sensitivity 96.7%) than for CA125 alone (BCP: AUC 0.949, sensitivity 90.8%; RNN: AUC 0.953, sensitivity 92.1%). One year before diagnosis, the CA125-HE4 model also ranked as the best, whereas at 2 years before diagnosis no multimarker model outperformed CA125. CONCLUSIONS Our study identified and tested different combinations of biomarkers using longitudinal multivariable models that outperformed CA125 alone. We showed the potential of multivariable models and candidate biomarkers to increase the detection rate of ovarian cancer.
Affiliation(s)
- Luis Abrego
  - Department of Women's Cancer, EGA Institute for Women's Health, University College London, London, UK
  - Department of Mathematics, University College London, London, UK
- Alexey Zaikin
  - Department of Women's Cancer, EGA Institute for Women's Health, University College London, London, UK
  - Department of Mathematics, University College London, London, UK
- Ines P. Marino
  - Department of Biology and Geology, Physics and Inorganic Chemistry, Universidad Rey Juan Carlos, Madrid, Spain
- Mikhail I. Krivonosov
  - Research Center for Trusted Artificial Intelligence, Ivannikov Institute for System Programming of the Russian Academy of Sciences, Moscow, Russia
  - Institute of Biogerontology, Lobachevsky State University, Nizhny Novgorod, Russia
- Ian Jacobs
  - Department of Women's Cancer, EGA Institute for Women's Health, University College London, London, UK
- Usha Menon
  - MRC Clinical Trials Unit, University College London, London, UK
- Aleksandra Gentry‐Maharaj
  - Department of Women's Cancer, EGA Institute for Women's Health, University College London, London, UK
  - MRC Clinical Trials Unit, University College London, London, UK
- Oleg Blyuss
  - Department of Women's Cancer, EGA Institute for Women's Health, University College London, London, UK
  - Wolfson Institute of Population Health, Queen Mary University of London, London, UK
  - Department of Pediatrics and Pediatric Infectious Diseases, Institute of Child's Health, Sechenov First Moscow State Medical University (Sechenov University), Moscow, Russia
|
13
|
Yang Y, Fang Q. Prediction of glycopeptide fragment mass spectra by deep learning. Nat Commun 2024; 15:2448. [PMID: 38503734 PMCID: PMC10951270 DOI: 10.1038/s41467-024-46771-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2023] [Accepted: 03/11/2024] [Indexed: 03/21/2024] Open
Abstract
Deep learning has achieved notable success in mass spectrometry-based proteomics and is now emerging in glycoproteomics. While various deep learning models can predict fragment mass spectra of peptides with good accuracy, they cannot cope with the non-linear glycan structure in an intact glycopeptide. Herein, we present DeepGlyco, a deep learning-based approach for the prediction of fragment spectra of intact glycopeptides. Our model adopts tree-structured long short-term memory networks to process the glycan moiety and a graph neural network architecture to incorporate potential fragmentation pathways of a specific glycan structure. This feature benefits model explainability and the ability to differentiate glycan structural isomers. We further demonstrate that predicted spectral libraries can be used for data-independent acquisition glycoproteomics as a supplement for library completeness. We expect that this work will provide a valuable deep learning resource for glycoproteomics.
Affiliation(s)
- Yi Yang
  - ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University, Hangzhou, 311200, China.
- Qun Fang
  - ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University, Hangzhou, 311200, China.
  - Department of Chemistry, Zhejiang University, Hangzhou, 310058, China.
|
14
|
Strauss MT, Bludau I, Zeng WF, Voytik E, Ammar C, Schessner JP, Ilango R, Gill M, Meier F, Willems S, Mann M. AlphaPept: a modern and open framework for MS-based proteomics. Nat Commun 2024; 15:2168. [PMID: 38461149 PMCID: PMC10924963 DOI: 10.1038/s41467-024-46485-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2022] [Accepted: 02/20/2024] [Indexed: 03/11/2024] Open
Abstract
In common with other omics technologies, mass spectrometry (MS)-based proteomics produces ever-increasing amounts of raw data, making efficient analysis a principal challenge. A plethora of different computational tools can process the MS data to derive peptide and protein identification and quantification. However, in recent years there has been dramatic progress in computer science, including collaboration tools that have transformed research and industry. To leverage these advances, we develop AlphaPept, a Python-based open-source framework for efficient processing of large high-resolution MS data sets. Numba for just-in-time compilation on CPU and GPU achieves hundred-fold speed improvements. AlphaPept uses the Python scientific stack of highly optimized packages, reducing the code base to domain-specific tasks while accessing the latest advances. We provide an easy on-ramp for community contributions through the concept of literate programming, implemented in Jupyter Notebooks. Large datasets can rapidly be processed as shown by the analysis of hundreds of proteomes in minutes per file, many-fold faster than acquisition. AlphaPept can be used to build automated processing pipelines with web-serving functionality and compatibility with downstream analysis tools. It provides easy access via one-click installation, via a modular Python library for advanced users, and via an open GitHub repository for developers.
Affiliation(s)
- Maximilian T Strauss
  - Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany.
  - NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark.
- Isabell Bludau
  - Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
- Wen-Feng Zeng
  - Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
- Eugenia Voytik
  - Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
- Constantin Ammar
  - Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
- Julia P Schessner
  - Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
- Florian Meier
  - Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
  - Functional Proteomics, Jena University Hospital, Jena, Germany
- Sander Willems
  - Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
- Matthias Mann
  - Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany.
  - NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark.
|
15
|
Jia J, Lei R, Qin L, Wei X. i5mC-DCGA: an improved hybrid network framework based on the CBAM attention mechanism for identifying promoter 5mC sites. BMC Genomics 2024; 25:242. [PMID: 38443802 PMCID: PMC10913688 DOI: 10.1186/s12864-024-10154-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2023] [Accepted: 02/22/2024] [Indexed: 03/07/2024] Open
Abstract
BACKGROUND 5-Methylcytosine (5mC) plays a very important role in gene stability, transcription, and development. Therefore, accurate identification of 5mC sites is of key importance in genetic and pathological studies. However, traditional experimental methods for identifying 5mC sites are time-consuming and costly, so there is an urgent need to develop computational methods to automatically detect and identify these sites. RESULTS Deep learning methods have shown great potential for identifying 5mC sites, so we developed a deep learning combinatorial model called i5mC-DCGA. The model uses the Convolutional Block Attention Module (CBAM) to improve the Dense Convolutional Network (DenseNet) so that it extracts advanced local feature information. Subsequently, we combined a Bidirectional Gated Recurrent Unit (BiGRU) and a self-attention mechanism to extract global feature information. Our model can learn abstract and complex feature representations from simple sequence encoding, while also handling the sample imbalance problem in benchmark datasets. The experimental results show that the i5mC-DCGA model achieves 97.02%, 96.52%, 96.58% and 85.58% in sensitivity (Sn), specificity (Sp), accuracy (Acc) and Matthews correlation coefficient (MCC), respectively. CONCLUSIONS The i5mC-DCGA model outperforms other existing prediction tools in predicting 5mC sites, and it is currently the most representative promoter 5mC site prediction tool. The benchmark dataset and source code for the i5mC-DCGA model can be found at https://github.com/leirufeng/i5mC-DCGA.
Grants
- Nos. 61761023, 62162032, and 31760315 National Natural Science Foundation of China
- Nos. 20202BABL202004 and 20202BAB202007 Natural Science Foundation of Jiangxi Province
- GJJ190695 and GJJ212419 Scientific Research Plan of the Department of Education of Jiangxi Province, China
Affiliation(s)
- Jianhua Jia
  - School of Information Engineering, Jingdezhen Ceramic University, 333403, Jingdezhen, China.
- Rufeng Lei
  - School of Information Engineering, Jingdezhen Ceramic University, 333403, Jingdezhen, China.
- Lulu Qin
  - School of Information Engineering, Jingdezhen Ceramic University, 333403, Jingdezhen, China
- Xin Wei
  - Business School, Jiangxi Institute of Fashion Technology, 330044, Nanchang, China
|
16
|
Lapin J, Yan X, Dong Q. UniSpec: Deep Learning for Predicting the Full Range of Peptide Fragment Ion Series to Enhance the Proteomics Data Analysis Workflow. Anal Chem 2024. [PMID: 38329031 DOI: 10.1021/acs.analchem.3c02321] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/09/2024]
Abstract
We present UniSpec, an attention-driven deep neural network designed to predict comprehensive collision-induced fragmentation spectra, thereby improving peptide identification in shotgun proteomics. Utilizing a training data set of 1.8 million unique high-quality tandem mass spectra (MS2) from 0.8 million unique peptide ions, UniSpec learned with a peptide fragmentation dictionary encompassing 7919 fragment peaks. Among these, 5712 are neutral loss peaks, with 2310 corresponding to modification-specific neutral losses. Remarkably, UniSpec can predict 73%-77% of fragment intensities based on our NIST reference library spectra, a significant leap from the 35%-45% coverage of only b and y ions. Comparative studies with Prosit elucidate that while both models are strong at predicting their respective fragment ion series, UniSpec particularly shines in generating more complex MS2 spectra with diverse ion annotations. The integration of UniSpec's predictions into shotgun proteomics data analysis boosts the identification rate of tryptic peptides by 48% at a 1% false discovery rate (FDR) and 60% at a more confident 0.1% FDR. Using UniSpec's predicted in-silico spectral library, the search results closely matched those from search engines and experimental spectral libraries used in peptide identification, highlighting its potential as a stand-alone identification tool. The source code and Python scripts are available on GitHub (https://github.com/usnistgov/UniSpec) and Zenodo (https://zenodo.org/records/10452792), and all data sets and analysis results generated in this work were deposited in Zenodo (https://zenodo.org/records/10052268).
Affiliation(s)
- Joel Lapin
  - Department of Physics, Georgetown University, Washington, D.C. 20057, United States
  - Associate, Mass Spectrometry Data Center, Biomolecular Measurement Division, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, Maryland 20899, United States
- Xinjian Yan
  - Mass Spectrometry Data Center, Biomolecular Measurement Division, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, Maryland 20899, United States
- Qian Dong
  - Mass Spectrometry Data Center, Biomolecular Measurement Division, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, Maryland 20899, United States
|
17
|
Mu D, Sun D, Qian X, Ma X, Qiu L, Cheng X, Yu S. Steroid profiling in adrenal disease. Clin Chim Acta 2024; 553:117749. [PMID: 38169194 DOI: 10.1016/j.cca.2023.117749] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2023] [Revised: 12/26/2023] [Accepted: 12/27/2023] [Indexed: 01/05/2024]
Abstract
The measurement of steroid hormones in blood and urine, which reflects steroid biosynthesis and metabolism, has been recognized as a valuable tool for identifying and distinguishing steroidogenic disorders. The application of mass spectrometry enables the reliable and simultaneous analysis of large panels of steroids, ushering in a new era for diagnosing adrenal diseases. However, the interpretation of complex hormone results necessitates the expertise and experience of skilled clinicians. In this scenario, machine learning techniques are gaining worldwide attention within healthcare fields. The clinical values of combining mass spectrometry-based steroid profiles analysis with machine learning models, also known as steroid metabolomics, have been investigated for identifying and discriminating adrenal disorders such as adrenocortical carcinomas, adrenocortical adenomas, and congenital adrenal hyperplasia. This promising approach is expected to lead to enhanced clinical decision-making in the field of adrenal diseases. This review will focus on the clinical performances of steroid profiling, which is measured using mass spectrometry and analyzed by machine learning techniques, in the realm of decision-making for adrenal diseases.
Affiliation(s)
- Danni Mu
  - Department of Laboratory Medicine, Peking Union Medical College Hospital, Peking Union Medical College & Chinese Academy of Medical Science, Beijing 100730, China
- Dandan Sun
  - Department of Laboratory Medicine, Peking Union Medical College Hospital, Peking Union Medical College & Chinese Academy of Medical Science, Beijing 100730, China
- Xia Qian
  - Department of Laboratory Medicine, Peking Union Medical College Hospital, Peking Union Medical College & Chinese Academy of Medical Science, Beijing 100730, China
- Xiaoli Ma
  - Department of Laboratory Medicine, Peking Union Medical College Hospital, Peking Union Medical College & Chinese Academy of Medical Science, Beijing 100730, China
- Ling Qiu
  - Department of Laboratory Medicine, Peking Union Medical College Hospital, Peking Union Medical College & Chinese Academy of Medical Science, Beijing 100730, China; State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Chinese Academy of Medical Science and Peking Union Medical College, Beijing, China.
- Xinqi Cheng
  - Department of Laboratory Medicine, Peking Union Medical College Hospital, Peking Union Medical College & Chinese Academy of Medical Science, Beijing 100730, China.
- Songlin Yu
  - Department of Laboratory Medicine, Peking Union Medical College Hospital, Peking Union Medical College & Chinese Academy of Medical Science, Beijing 100730, China.
|
18
|
Abdullah N, Husin NF, Goh YX, Kamaruddin MA, Abdullah MS, Yusri AF, Kamalul Arifin AS, Jamal R. Development of digital health management systems in longitudinal study: The Malaysian cohort experience. Digit Health 2024; 10:20552076241277481. [PMID: 39281044 PMCID: PMC11402075 DOI: 10.1177/20552076241277481] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/12/2024] [Accepted: 08/07/2024] [Indexed: 09/18/2024] Open
Abstract
Background The management of extensive longitudinal data in cohort studies presents significant challenges, particularly in middle-income countries like Malaysia where technological resources may be limited. These challenges include ensuring data integrity, security, and scalability of storage solutions over extended periods. Objective This article outlines innovative methods developed and implemented by The Malaysian Cohort project to effectively manage and maintain large-scale databases from project inception through the follow-up phase, ensuring robust data privacy and security. Methods We describe the comprehensive strategies employed to develop and sustain the database infrastructure necessary for handling large volumes of data collected during the study. This includes the integration of advanced information management systems and adherence to stringent data security protocols. Outcomes Key achievements include the establishment of a scalable database architecture and an effective data privacy framework that together support the dynamic requirements of longitudinal healthcare research. The solutions implemented serve as a model for similar cohort studies in resource-limited settings. The article also explores the broader implications of these methodologies for public health and personalized medicine, addressing both the challenges posed by big data in healthcare and the opportunities it offers for enhancing disease prevention and management strategies. Conclusion By sharing these insights, we aim to contribute to the global discourse on improving data management practices in cohort studies and to assist other researchers in overcoming the complexities associated with longitudinal health data.
Affiliation(s)
- Noraidatulakma Abdullah
  - UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, Kuala Lumpur, Malaysia
- Nurul Faeizah Husin
  - UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, Kuala Lumpur, Malaysia
- Ying-Xian Goh
  - UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, Kuala Lumpur, Malaysia
- Mohd Arman Kamaruddin
  - UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, Kuala Lumpur, Malaysia
- Mohd Shaharom Abdullah
  - UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, Kuala Lumpur, Malaysia
- Aiman Fitri Yusri
  - UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, Kuala Lumpur, Malaysia
- Rahman Jamal
  - UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, Kuala Lumpur, Malaysia
|
19
|
Olaya‐Abril A, Biełło K, Rodríguez‐Caballero G, Cabello P, Sáez LP, Moreno‐Vivián C, Luque‐Almagro VM, Roldán MD. Bacterial tolerance and detoxification of cyanide, arsenic and heavy metals: Holistic approaches applied to bioremediation of industrial complex wastes. Microb Biotechnol 2024; 17:e14399. [PMID: 38206076 PMCID: PMC10832572 DOI: 10.1111/1751-7915.14399] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Revised: 12/19/2023] [Accepted: 12/22/2023] [Indexed: 01/12/2024] Open
Abstract
Cyanide is a highly toxic compound that is found in wastewaters generated from different industrial activities, such as mining or jewellery. These residues usually contain high concentrations of other toxic pollutants like arsenic and heavy metals that may form different complexes with cyanide. To develop bioremediation strategies, it is necessary to know the metabolic processes involved in the tolerance and detoxification of these pollutants, but most current studies focus on characterizing the microbial responses to each of these environmental hazards individually, and the effect of co-contaminated wastes on microbial metabolism has hardly been addressed. This work summarizes the main strategies developed by bacteria to alleviate the effects of cyanide, arsenic and heavy metals, analysing interactions among these toxic chemicals. Additionally, the role of systems biology and synthetic biology as tools for the development of bioremediation strategies for complex industrial wastes and co-contaminated sites is discussed, emphasizing the importance of and progress derived from meta-omic studies.
Affiliation(s)
- Alfonso Olaya‐Abril
  - Departamento de Bioquímica y Biología Molecular, Edificio Severo Ochoa, Campus de Rabanales, Universidad de Córdoba, Córdoba, Spain
- Karolina Biełło
  - Departamento de Bioquímica y Biología Molecular, Edificio Severo Ochoa, Campus de Rabanales, Universidad de Córdoba, Córdoba, Spain
- Gema Rodríguez‐Caballero
  - Departamento de Bioquímica y Biología Molecular, Edificio Severo Ochoa, Campus de Rabanales, Universidad de Córdoba, Córdoba, Spain
- Purificación Cabello
  - Departamento de Botánica, Ecología y Fisiología Vegetal, Edificio Celestino Mutis, Campus de Rabanales, Universidad de Córdoba, Córdoba, Spain
- Lara P. Sáez
  - Departamento de Bioquímica y Biología Molecular, Edificio Severo Ochoa, Campus de Rabanales, Universidad de Córdoba, Córdoba, Spain
- Conrado Moreno‐Vivián
  - Departamento de Bioquímica y Biología Molecular, Edificio Severo Ochoa, Campus de Rabanales, Universidad de Córdoba, Córdoba, Spain
- Víctor Manuel Luque‐Almagro
  - Departamento de Bioquímica y Biología Molecular, Edificio Severo Ochoa, Campus de Rabanales, Universidad de Córdoba, Córdoba, Spain
- María Dolores Roldán
  - Departamento de Bioquímica y Biología Molecular, Edificio Severo Ochoa, Campus de Rabanales, Universidad de Córdoba, Córdoba, Spain
|
20
|
Fuchs S, Engelmann S. Small proteins in bacteria - Big challenges in prediction and identification. Proteomics 2023; 23:e2200421. [PMID: 37609810 DOI: 10.1002/pmic.202200421] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2023] [Revised: 08/03/2023] [Accepted: 08/10/2023] [Indexed: 08/24/2023]
Abstract
Proteins with up to 100 amino acids have been largely overlooked due to the challenges associated with predicting and identifying them using traditional methods. Recent advances in bioinformatics and machine learning, DNA sequencing, RNA and Ribo-seq technologies, and mass spectrometry (MS) have greatly facilitated the detection and characterisation of these elusive proteins in recent years. This has revealed their crucial role in various cellular processes including regulation, signalling and transport, as toxins and as folding helpers for protein complexes. Consequently, the systematic identification and characterisation of these proteins in bacteria have emerged as a prominent field of interest within the microbial research community. This review provides an overview of different strategies for predicting and identifying these proteins on a large scale, leveraging the power of these advanced technologies. Furthermore, the review offers insights into the future developments that may be expected in this field.
Affiliation(s)
- Stephan Fuchs
  - Genome Competence Center (MF1), Department MFI, Robert-Koch-Institut, Berlin, Germany
- Susanne Engelmann
  - Institute for Microbiology, Technische Universität Braunschweig, Braunschweig, Germany
  - Microbial Proteomics, Helmholtzzentrum für Infektionsforschung GmbH, Braunschweig, Germany
|
21
|
Chandra A, Sharma A, Dehzangi I, Tsunoda T, Sattar A. PepCNN deep learning tool for predicting peptide binding residues in proteins using sequence, structural, and language model features. Sci Rep 2023; 13:20882. [PMID: 38016996 PMCID: PMC10684570 DOI: 10.1038/s41598-023-47624-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/07/2023] [Accepted: 11/16/2023] [Indexed: 11/30/2023] Open
Abstract
Protein-peptide interactions play a crucial role in various cellular processes and are implicated in abnormal cellular behaviors leading to diseases such as cancer. Therefore, understanding these interactions is vital for both functional genomics and drug discovery efforts. Despite a significant increase in the availability of protein-peptide complexes, experimental methods for studying these interactions remain laborious, time-consuming, and expensive. Computational methods offer a complementary approach but often fall short in terms of prediction accuracy. To address these challenges, we introduce PepCNN, a deep learning-based prediction model that incorporates structural and sequence-based information from primary protein sequences. By utilizing a combination of half-sphere exposure, position-specific scoring matrices from a multiple-sequence alignment tool, and embeddings from a pre-trained protein language model, PepCNN outperforms state-of-the-art methods in terms of specificity, precision, and AUC. The PepCNN software and datasets are publicly available at https://github.com/abelavit/PepCNN.git.
Affiliation(s)
- Abel Chandra
  - Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia.
- Alok Sharma
  - Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia.
  - Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan.
  - Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.
- Iman Dehzangi
  - Department of Computer Science, Rutgers University, Camden, NJ, USA
  - Center for Computational and Integrative Biology, Rutgers University, Camden, USA
- Tatsuhiko Tsunoda
  - Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan
  - Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
  - Laboratory for Medical Science Mathematics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan
- Abdul Sattar
  - Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia
|
22
|
Fan KT, Hsu CW, Chen YR. Mass spectrometry in the discovery of peptides involved in intercellular communication: From targeted to untargeted peptidomics approaches. Mass Spectrom Rev 2023; 42:2404-2425. [PMID: 35765846 DOI: 10.1002/mas.21789] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/23/2021] [Revised: 03/17/2022] [Accepted: 04/08/2022] [Indexed: 06/15/2023]
Abstract
Endogenous peptide hormones represent an essential class of biomolecules, which regulate cell-cell communications in diverse physiological processes of organisms. Mass spectrometry (MS) has been developed to be a powerful technology for identifying and quantifying peptides in a highly efficient manner. However, it is difficult to directly identify these peptide hormones due to their diverse characteristics, dynamic regulations, low abundance, and existence in a complicated biological matrix. Here, we summarize and discuss the roles of targeted and untargeted MS in discovering peptide hormones using bioassay-guided purification, bioinformatics screening, or the peptidomics-based approach. Although the peptidomics approach is expected to discover novel peptide hormones unbiasedly, only a limited number of successful cases have been reported. The critical challenges and corresponding measures for peptidomics from the steps of sample preparation, peptide extraction, and separation to the MS data acquisition and analysis are also discussed. We also identify emerging technologies and methods that can be integrated into the discovery platform toward the comprehensive study of endogenous peptide hormones.
Affiliation(s)
- Kai-Ting Fan
- Agricultural Biotechnology Research Center, Academia Sinica, Taipei, Taiwan
| | - Chia-Wei Hsu
- Agricultural Biotechnology Research Center, Academia Sinica, Taipei, Taiwan
| | - Yet-Ran Chen
- Agricultural Biotechnology Research Center, Academia Sinica, Taipei, Taiwan
|
23
|
Kitata RB, Yang JC, Chen YJ. Advances in data-independent acquisition mass spectrometry towards comprehensive digital proteome landscape. Mass Spectrometry Reviews 2023; 42:2324-2348. [PMID: 35645145 DOI: 10.1002/mas.21781] [Citation(s) in RCA: 37] [Impact Index Per Article: 37.0] [Received: 10/09/2021] [Revised: 12/17/2021] [Accepted: 01/21/2022] [Indexed: 06/15/2023]
Abstract
Data-independent acquisition mass spectrometry (DIA-MS) has rapidly evolved as a powerful alternative for highly reproducible proteome profiling, with the unique strength of generating permanent digital maps for retrospective analysis of biological systems. Recent advancements in data analysis software tools for the complex DIA-MS/MS spectra, coupled with fast MS scanning speed and high mass accuracy, have greatly expanded the sensitivity and coverage of DIA-based proteomics profiling. Here, we review the evolution of DIA-MS techniques, from the earlier proof-of-principle of parallel fragmentation of all ions or ions in a selected m/z range, through sequential window acquisition of all theoretical mass spectra (SWATH-MS), to the latest innovations, recent developments in computational algorithms for data informatics, and auxiliary tools and advanced instrumentation to enhance the performance of DIA-MS. We further summarize recent applications of DIA-MS and experimentally derived as well as in silico spectral library resources for large-scale profiling to facilitate biomarker discovery and drug development in human diseases, with emphasis on proteomic profiling coverage. Toward next-generation DIA-MS for clinical proteomics, we outline the challenges in processing multi-dimensional DIA data sets and large-scale clinical proteomics, and the continuing need for higher profiling coverage and sensitivity.
Affiliation(s)
| | - Jhih-Ci Yang
- Institute of Chemistry, Academia Sinica, Taipei, Taiwan
- Sustainable Chemical Science and Technology, Taiwan International Graduate Program, Academia Sinica and National Yang Ming Chiao Tung University, Taipei, Taiwan
- Department of Applied Chemistry, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
| | - Yu-Ju Chen
- Institute of Chemistry, Academia Sinica, Taipei, Taiwan
- Sustainable Chemical Science and Technology, Taiwan International Graduate Program, Academia Sinica and National Yang Ming Chiao Tung University, Taipei, Taiwan
- Department of Chemistry, National Taiwan University, Taipei, Taiwan
|
24
|
Zhang B, Bassani-Sternberg M. Current perspectives on mass spectrometry-based immunopeptidomics: the computational angle to tumor antigen discovery. J Immunother Cancer 2023; 11:e007073. [PMID: 37899131 PMCID: PMC10619091 DOI: 10.1136/jitc-2023-007073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 07/21/2023] [Indexed: 10/31/2023] Open
Abstract
Identification of tumor antigens presented by the human leucocyte antigen (HLA) molecules is essential for the design of effective and safe cancer immunotherapies that rely on T cell recognition and killing of tumor cells. Mass spectrometry (MS)-based immunopeptidomics enables high-throughput, direct identification of HLA-bound peptides from a variety of cell lines, tumor tissues, and healthy tissues. It involves immunoaffinity purification of HLA complexes followed by MS profiling of the extracted peptides using data-dependent acquisition, data-independent acquisition, or targeted approaches. By incorporating DNA, RNA, and ribosome sequencing data into immunopeptidomics data analysis, the proteogenomic approach provides a powerful means for identifying tumor antigens encoded within the canonical open reading frames of annotated coding genes and non-canonical tumor antigens derived from presumably non-coding regions of our genome. We discuss emerging computational challenges in immunopeptidomics data analysis and tumor antigen identification, highlighting key considerations in the proteogenomics-based approach, including accurate DNA, RNA and ribosomal sequencing data analysis, careful incorporation of predicted novel protein sequences into the reference protein database, special quality control in MS data analysis due to the expanded and heterogeneous search space, cancer-specificity determination, and immunogenicity prediction. Advancements in technology and computation are continually enabling us to identify tumor antigens with higher sensitivity and accuracy, paving the way toward the development of more effective cancer immunotherapies.
Affiliation(s)
- Bing Zhang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas, USA
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
| | - Michal Bassani-Sternberg
- Ludwig Institute for Cancer Research, University of Lausanne, Lausanne, Switzerland
- Department of Oncology, Centre Hospitalier Universitaire Vaudois, Lausanne, Switzerland
- Agora Cancer Research Centre, Lausanne, Switzerland
|
25
|
Seddiki K, Precioso F, Sanabria M, Salzet M, Fournier I, Droit A. Early Diagnosis: End-to-End CNN-LSTM Models for Mass Spectrometry Data Classification. Anal Chem 2023; 95:13431-13437. [PMID: 37624777 PMCID: PMC10501374 DOI: 10.1021/acs.analchem.3c00613] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/09/2023] [Accepted: 08/09/2023] [Indexed: 08/27/2023]
Abstract
Liquid chromatography-mass spectrometry (LC-MS) is a powerful method for cell profiling. LC-MS technology is a tool of choice for cancer research since it provides molecular fingerprints of analyzed tissues. However, the ubiquitous presence of noise, peak shifts between acquisitions, and the huge amount of information owing to the high dimensionality of the data make rapid and accurate cancer diagnosis a challenging task. Deep learning (DL) models are not only effective classifiers but are also well suited to jointly learn feature representation and classification tasks. This is particularly relevant when they are applied to raw LC-MS data, as it avoids the need for costly preprocessing and complicated feature selection. In this study, we propose a new end-to-end DL methodology that addresses all of the above challenges at once, while preserving the high potential of LC-MS data. Our DL model is designed for early discrimination between tumoral and normal tissues. It is a combination of a convolutional neural network (CNN) and a long short-term memory (LSTM) network. The CNN significantly reduces the high dimensionality of the data while learning spatially relevant features. The LSTM network enables our model to capture temporal patterns. We show that our model outperforms not only benchmark models but also state-of-the-art models developed on the same data. Our framework is a promising strategy for improving early cancer detection during a diagnostic process.
Affiliation(s)
- Khawla Seddiki
- Centre de Recherche du CHU de Québec-Université Laval, Québec City, Québec G1V 4G2, Canada
- Univ. Lille, Inserm, CHU Lille, U1192-Protéomique Réponse Inflammatoire Spectrométrie de Masse-PRISM, Lille F-59000, France
| | - Frédéric Precioso
- Université Côte d’Azur, CNRS, INRIA, I3S, Sophia Antipolis 06900, France
| | - Melissa Sanabria
- Université Côte d’Azur, CNRS, INRIA, I3S, Sophia Antipolis 06900, France
| | - Michel Salzet
- Univ. Lille, Inserm, CHU Lille, U1192-Protéomique Réponse Inflammatoire Spectrométrie de Masse-PRISM, Lille F-59000, France
| | - Isabelle Fournier
- Univ. Lille, Inserm, CHU Lille, U1192-Protéomique Réponse Inflammatoire Spectrométrie de Masse-PRISM, Lille F-59000, France
| | - Arnaud Droit
- Centre de Recherche du CHU de Québec-Université Laval, Québec City, Québec G1V 4G2, Canada
|
26
|
Abdul-Khalek N, Wimmer R, Overgaard MT, Gregersen Echers S. Insight on physicochemical properties governing peptide MS1 response in HPLC-ESI-MS/MS: A deep learning approach. Comput Struct Biotechnol J 2023; 21:3715-3727. [PMID: 37560124 PMCID: PMC10407266 DOI: 10.1016/j.csbj.2023.07.027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2023] [Revised: 07/13/2023] [Accepted: 07/19/2023] [Indexed: 08/11/2023] Open
Abstract
Accurate and absolute quantification of peptides in complex mixtures using quantitative mass spectrometry (MS)-based methods requires foreground knowledge and isotopically labeled standards, thereby increasing analytical expenses, time consumption, and labor, thus limiting the number of peptides that can be accurately quantified. This originates from differential ionization efficiency between peptides and thus, understanding the physicochemical properties that influence the ionization and response in MS analysis is essential for developing less restrictive label-free quantitative methods. Here, we used equimolar peptide pool repository data to develop a deep learning model capable of identifying amino acids influencing the MS1 response. By using an encoder-decoder with an attention mechanism and correlating attention weights with amino acid physicochemical properties, we obtain insight on properties governing the peptide-level MS1 response within the datasets. While the problem cannot be described by one single set of amino acids and properties, distinct patterns were reproducibly obtained. Properties are grouped in three main categories related to peptide hydrophobicity, charge, and structural propensities. Moreover, our model can predict MS1 intensity output under defined conditions based solely on peptide sequence input. Using a refined training dataset, the model predicted log-transformed peptide MS1 intensities with an average error of 9.7 ± 0.5% based on 5-fold cross validation, and outperformed random forest and ridge regression models on both log-transformed and real scale data. This work demonstrates how deep learning can facilitate identification of physicochemical properties influencing peptide MS1 responses, but also illustrates how sequence-based response prediction and label-free peptide-level quantification may impact future workflows within quantitative proteomics.
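The abstract reports its error as a mean ± spread over 5-fold cross validation. As a minimal sketch of how such a figure is typically produced, the Python below runs k-fold cross validation and summarizes the per-fold mean absolute percentage error. Everything in it is an assumption for illustration: the peptide strings and log-intensity values are invented toy data, and `fit_mean_baseline` is a hypothetical stand-in that predicts the training mean, not the encoder-decoder model described by the authors.

```python
import statistics


def k_fold_splits(n_samples, k=5):
    """Yield (train_idx, test_idx) for k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def mean_abs_pct_error(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * statistics.fmean(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))


def fit_mean_baseline(train_pairs):
    """Hypothetical placeholder model: always predict the mean training target."""
    mean_y = statistics.fmean(y for _, y in train_pairs)
    return lambda peptide: mean_y


def cross_validate(pairs, fit, k=5):
    """Return (mean, standard deviation) of the per-fold percentage error."""
    fold_errors = []
    for train_idx, test_idx in k_fold_splits(len(pairs), k):
        predict = fit([pairs[i] for i in train_idx])
        y_true = [pairs[i][1] for i in test_idx]
        y_pred = [predict(pairs[i][0]) for i in test_idx]
        fold_errors.append(mean_abs_pct_error(y_true, y_pred))
    return statistics.fmean(fold_errors), statistics.stdev(fold_errors)


# Invented toy data: (peptide sequence, log-transformed MS1 intensity).
toy_pairs = [
    ("PEPTIDEK", 6.1), ("LSSPATLR", 7.3), ("AGFAGDDR", 5.8), ("VATVSLPR", 6.9),
    ("ELVISK", 6.4), ("TASEFDSR", 7.0), ("GILAADESR", 6.2), ("YLGYLEQR", 7.5),
    ("DLGEEHFK", 6.6), ("HLVDEPQR", 6.8),
]

mean_err, sd_err = cross_validate(toy_pairs, fit_mean_baseline, k=5)
print(f"error: {mean_err:.1f} ± {sd_err:.1f}%")
```

The folds here are contiguous for readability; with real data the samples would normally be shuffled (or grouped by peptide or run) before splitting so that each fold is representative.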
Affiliation(s)
- Naim Abdul-Khalek
- Department of Chemistry and Bioscience, Aalborg University, Aalborg 9220, Denmark
| | - Reinhard Wimmer
- Department of Chemistry and Bioscience, Aalborg University, Aalborg 9220, Denmark
|
27
|
Affiliation(s)
- Bruna Gomes
- From the Departments of Medicine, Genetics, and Biomedical Data Science, Stanford University, Stanford, CA (B.G., E.A.A.); and the Department of Cardiology, Pneumology, and Angiology, Heidelberg University Hospital, Heidelberg, Germany (B.G.)
| | - Euan A Ashley
- From the Departments of Medicine, Genetics, and Biomedical Data Science, Stanford University, Stanford, CA (B.G., E.A.A.); and the Department of Cardiology, Pneumology, and Angiology, Heidelberg University Hospital, Heidelberg, Germany (B.G.)
|
28
|
Qu Z, Yao T, Liu X, Wang G. A Graph Convolutional Network Based on Univariate Neurodegeneration Biomarker for Alzheimer's Disease Diagnosis. IEEE Journal of Translational Engineering in Health and Medicine 2023; 11:405-416. [PMID: 37492469 PMCID: PMC10365071 DOI: 10.1109/jtehm.2023.3285723] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2022] [Revised: 01/20/2023] [Accepted: 06/05/2023] [Indexed: 07/27/2023]
Abstract
OBJECTIVE Alzheimer's disease (AD) is a progressive and irreversible neurodegenerative disease that is not easily detectable in the early stage. This study proposed an efficient method of applying a graph convolutional network (GCN) to the early prediction of AD. METHODS We proposed a univariate neurodegeneration biomarker (UNB) based GCN semi-supervised classification framework. We generated the UNB by comparing the similarity of an individual's morphological atrophy pattern with the atrophy pattern of the [Formula: see text] AD group, according to the brain morphological abnormalities induced by AD. For the GCN semi-supervised classification model, we took the UNBs of individuals as the features of nodes and constructed the weight of edges according to the similarity of phenotypic information between individuals, which explored the essential features of individuals through spectral graph convolution. An attention module was constructed and embedded into the GCN framework, which may refine the input morphological features to highlight the main impact of AD on the cerebral cortex and weaken the instability caused by individual diversity, thereby identifying the significant ROIs affected by AD and improving the classification accuracy. RESULTS We tested the UNB-GCN framework on the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. The estimated minimum sample sizes were 156, 349 and 423 for the longitudinal [Formula: see text] AD, [Formula: see text] mild cognitive impairment (MCI) and [Formula: see text] cognitively unimpaired (CU) groups, respectively. The proposed UNB-GCN framework combined with the attention module effectively improved classification performance, with 93.90% classification accuracy for AD vs. CU and 82.05% for AD vs. MCI on the validation set. CONCLUSION The proposed UNB measures were superior to the conventional volume measures in describing the AD-induced cerebral cortex morphological changes. The UNB-GCN framework combined with the attention module may effectively improve the classification performance between MCI subjects and AD patients. Clinical and Translational Impact Statement: This study aims to identify patients with early AD, so as to help clinicians develop effective interventions to delay the deterioration of AD symptoms.
Affiliation(s)
- Zongshuai Qu
- School of Information and Electrical Engineering, Ludong University, Yantai 264025, China
| | - Tao Yao
- School of Information and Electrical Engineering, Ludong University, Yantai 264025, China
| | - Xinghui Liu
- Shandong Vheng Data Technology Company Ltd., Yantai 264003, China
| | - Gang Wang
- School of Ulsan Ship and Ocean College, Ludong University, Yantai 264025, China
|
29
|
Hill AC, Guo C, Litkowski EM, Manichaikul AW, Yu B, Konigsberg IR, Gorbet BA, Lange LA, Pratte KA, Kechris KJ, DeCamp M, Coors M, Ortega VE, Rich SS, Rotter JI, Gerzsten RE, Clish CB, Curtis JL, Hu X, Obeidat ME, Morris M, Loureiro J, Ngo D, O'Neal WK, Meyers DA, Bleecker ER, Hobbs BD, Cho MH, Banaei-Kashani F, Bowler RP. Large scale proteomic studies create novel privacy considerations. Sci Rep 2023; 13:9254. [PMID: 37286633 PMCID: PMC10247808 DOI: 10.1038/s41598-023-34866-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2022] [Accepted: 05/09/2023] [Indexed: 06/09/2023] Open
Abstract
Privacy protection is a core principle of genomic but not proteomic research. We identified independent single nucleotide polymorphism (SNP) protein quantitative trait loci (pQTL) from COPDGene and the Jackson Heart Study (JHS), calculated continuous protein level genotype probabilities, and then applied a naïve Bayesian approach to link SomaScan 1.3K proteomes to genomes for 2812 independent subjects from COPDGene, JHS, the SubPopulations and InteRmediate Outcome Measures In COPD Study (SPIROMICS) and the Multi-Ethnic Study of Atherosclerosis (MESA). We linked 90-95% of proteomes to their correct genome, and for 95-99% the correct genome was among the 1% most likely links. The linking accuracy in subjects with African ancestry was lower (~60%) unless training included diverse subjects. With larger profiling (SomaScan 5K) in the Atherosclerosis Risk in Communities (ARIC) study, correct identification was >99% even in mixed ancestry populations. We also linked proteomes to proteomes and used the proteome alone to determine features such as sex, ancestry, and first-degree relatives. When serial proteomes are available, the linking algorithm can be used to identify and correct mislabeled samples. This work also demonstrates the importance of including diverse populations in omics research and shows that large proteomic datasets (>1000 proteins) can be accurately linked to a specific genome through pQTL knowledge and should not be considered unidentifiable.
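The linking step described in this abstract can be illustrated with a minimal naïve Bayes sketch in Python. It is not the authors' implementation: it assumes, purely for illustration, that each pQTL protein level is Gaussian given the genotype (0, 1 or 2 copies of the effect allele), that proteins are conditionally independent, and that all candidate genomes have equal prior probability. The protein names, distribution parameters and genome identifiers are invented.

```python
import math


def gaussian_log_pdf(x, mean, sd):
    """Log density of a normal distribution with the given mean and sd."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def link_log_likelihood(proteome, genome, pqtl_params):
    """Naive Bayes score: sum over proteins of log P(level | genotype)."""
    total = 0.0
    for protein, level in proteome.items():
        mean, sd = pqtl_params[protein][genome[protein]]
        total += gaussian_log_pdf(level, mean, sd)
    return total


def best_link(proteome, genomes, pqtl_params):
    """Return the most likely genome id and all scores (uniform prior assumed)."""
    scores = {
        genome_id: link_log_likelihood(proteome, genome, pqtl_params)
        for genome_id, genome in genomes.items()
    }
    return max(scores, key=scores.get), scores


# Invented pQTL model: protein -> genotype -> (mean level, sd).
pqtl_params = {
    "P1": {0: (0.0, 1.0), 1: (1.0, 1.0), 2: (2.0, 1.0)},
    "P2": {0: (5.0, 0.5), 1: (4.0, 0.5), 2: (3.0, 0.5)},
}
# Invented candidate genomes: genotype at the pQTL SNP for each protein.
genomes = {
    "genome_A": {"P1": 0, "P2": 2},
    "genome_B": {"P1": 2, "P2": 0},
}
# Invented measured proteome to be linked.
proteome = {"P1": 1.9, "P2": 5.1}

winner, scores = best_link(proteome, genomes, pqtl_params)
print(winner, scores)
```

With only two proteins the scores are close to uninformative in practice; the abstract's point is that summing such evidence over more than a thousand proteins makes the correct genome stand out.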
Affiliation(s)
| | | | | | - Ani W Manichaikul
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
| | - Bing Yu
- Department of Epidemiology and Human Genetics Center, UTHealth School of Public Health, Houston, TX, USA
| | | | - Betty A Gorbet
- Department of Epidemiology and Human Genetics Center, UTHealth School of Public Health, Houston, TX, USA
| | - Leslie A Lange
- University of Colorado - Anschutz Medical Campus, Aurora, CO, USA
| | | | | | - Matthew DeCamp
- University of Colorado - Anschutz Medical Campus, Aurora, CO, USA
| | - Marilyn Coors
- University of Colorado - Anschutz Medical Campus, Aurora, CO, USA
| | | | - Stephen S Rich
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
| | - Jerome I Rotter
- Department of Pediatrics, The Institute for Translational Genomics and Population Sciences, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
| | - Robert E Gerzsten
- Division of Cardiovascular Medicine, Cardiovascular Research Center, Beth Israel Deaconess Medical Center, Boston, MA, USA
| | - Clary B Clish
- Metabolomics Platform, Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, MA, USA
| | | | - Xiaowei Hu
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
| | | | | | | | | | - Wanda K O'Neal
- University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | | | | | - Brian D Hobbs
- Harvard Medical School, Boston, MA, USA
- Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA
| | - Michael H Cho
- Harvard Medical School, Boston, MA, USA
- Division of Pulmonary and Critical Care Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA
|
30
|
Wilburn DB, Shannon AE, Spicer V, Richards AL, Yeung D, Swaney DL, Krokhin OV, Searle BC. Deep learning from harmonized peptide libraries enables retention time prediction of diverse post translational modifications. bioRxiv 2023:2023.05.30.542978. [PMID: 37398395 PMCID: PMC10312522 DOI: 10.1101/2023.05.30.542978] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2023]
Abstract
In proteomics experiments, peptide retention time (RT) is an orthogonal property to fragmentation when assessing detection confidence. Advances in deep learning enable accurate RT prediction for any peptide from sequence alone, including those yet to be experimentally observed. Here we present Chronologer, an open-source software tool for rapid and accurate peptide RT prediction. Using new approaches to harmonize and false-discovery correct across independently collected datasets, Chronologer is built on a massive database with >2.2 million peptides including 10 common post-translational modification (PTM) types. By linking knowledge learned across diverse peptide chemistries, Chronologer predicts RTs with less than two-thirds the error of other deep learning tools. We show how RT for rare PTMs, such as O-GlcNAc, can be learned with high accuracy using as few as 10-100 example peptides in newly harmonized datasets. This iteratively updatable workflow enables Chronologer to comprehensively predict RTs for PTM-marked peptides across entire proteomes.
|
31
|
Cai P, Liu S, Zhang D, Xing H, Han M, Liu D, Gong L, Hu QN. SynBioTools: a one-stop facility for searching and selecting synthetic biology tools. BMC Bioinformatics 2023; 24:152. [PMID: 37069545 PMCID: PMC10111727 DOI: 10.1186/s12859-023-05281-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2023] [Accepted: 04/11/2023] [Indexed: 04/19/2023] Open
Abstract
BACKGROUND The rapid development of synthetic biology relies heavily on the use of databases and computational tools, which are also developing rapidly. While many tool registries have been created to facilitate tool retrieval, sharing, and reuse, no relatively comprehensive tool registry or catalog addresses all aspects of synthetic biology. RESULTS We constructed SynBioTools, a comprehensive collection of synthetic biology databases, computational tools, and experimental methods, as a one-stop facility for searching and selecting synthetic biology tools. SynBioTools includes databases, computational tools, and methods extracted from reviews via SCIentific Table Extraction, a scientific table-extraction tool that we built. Approximately 57% of the resources that we located and included in SynBioTools are not mentioned in bio.tools, the dominant tool registry. To improve users' understanding of the tools and to enable them to make better choices, the tools are grouped into nine modules (each with subdivisions) based on their potential biosynthetic applications. Detailed comparisons of similar tools in every classification are included. The URLs, descriptions, source references, and the number of citations of the tools are also integrated into the system. CONCLUSIONS SynBioTools is freely available at https://synbiotools.lifesynther.com/ . It provides end-users and developers with a useful resource of categorized synthetic biology databases, tools, and methods to facilitate tool retrieval and selection.
Affiliation(s)
- Pengli Cai
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Sheng Liu
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Dachuan Zhang
- Ecological Systems Design, Institute of Environmental Engineering, ETH Zurich, 8093, Zurich, Switzerland
| | - Huadong Xing
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Mengying Han
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Dongliang Liu
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Linlin Gong
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China
| | - Qian-Nan Hu
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, 200031, China.
|
32
|
Ibtehaz N, Sourav SMSH, Bayzid MS, Rahman MS. Align-gram: Rethinking the Skip-gram Model for Protein Sequence Analysis. Protein J 2023; 42:135-146. [PMID: 36977849 DOI: 10.1007/s10930-023-10096-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Accepted: 02/13/2023] [Indexed: 03/29/2023]
Abstract
The inception of next-generation sequencing technologies has exponentially increased the volume of biological sequence data. Protein sequences, often described as the 'language of life', have been analyzed for a multitude of applications and inferences. Owing to the rapid development of deep learning, in recent years there have been a number of breakthroughs in the domain of Natural Language Processing. Since these methods are capable of performing different tasks when trained with a sufficient amount of data, off-the-shelf models are used to perform various biological applications. In this study, we investigated the applicability of the popular Skip-gram model for protein sequence analysis and made an attempt to incorporate some biological insights into it. We propose a novel k-mer embedding scheme, Align-gram, which is capable of mapping similar k-mers close to each other in a vector space. Furthermore, we experiment with other sequence-based protein representations and observe that the embeddings derived from Align-gram aid modeling and training deep learning models better. Our experiments with a simple baseline LSTM model and the much more complex CNN model of DeepGoPlus show the potential of Align-gram in performing different types of deep learning applications for protein sequence analysis.
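For readers unfamiliar with how a Skip-gram model is applied to protein sequences, the following Python sketch shows the usual preprocessing: a sequence is cut into overlapping k-mers, and (center, context) pairs are generated within a sliding window for training. This is the standard Skip-gram setup only; it does not reproduce Align-gram's modifications, and the example sequence, `k` and `window` values are arbitrary assumptions.

```python
def kmers(sequence, k=3):
    """Split a sequence into overlapping k-mers (stride 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = kmers("MKTAY", k=3)
print(tokens)  # ['MKT', 'KTA', 'TAY']
print(skipgram_pairs(tokens, window=1))
```

In a full pipeline these pairs would feed an embedding model that learns one vector per k-mer; that training step is omitted here.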
|
33
|
Letunica N, McCafferty C, Swaney E, Cai T, Monagle P, Ignjatovic V, Attard C. Proteomic Applications and Considerations: From Research to Patient Care. Methods Mol Biol 2023; 2628:181-192. [PMID: 36781786 DOI: 10.1007/978-1-0716-2978-9_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/15/2023]
Abstract
Despite technological advancements in the field of proteomics, the rate at which serum and plasma biomarkers identified using proteomic approaches are translated into clinical use remains extremely low. In this chapter, we describe recent technological advancements and analytical strategies in proteomic methods. We also describe the progress of proteomic blood-based biomarkers to date and discuss what the future of proteomics might entail with the use of multi-omic approaches and implementing machine learning on large proteomic datasets. Lastly, we provide several key considerations for biomarker studies, ranging from sample type to the use of reference samples, in order to achieve progress from bench to bedside, ultimately improving patient diagnosis, disease, and/or therapeutic monitoring and care.
Affiliation(s)
- Natasha Letunica
- Haematology Research, Murdoch Children's Research Institute, Melbourne, VIC, Australia
| | - Conor McCafferty
- Haematology Research, Murdoch Children's Research Institute, Melbourne, VIC, Australia
- Department of Paediatrics, The University of Melbourne, Melbourne, VIC, Australia
| | - Ella Swaney
- Haematology Research, Murdoch Children's Research Institute, Melbourne, VIC, Australia
- Department of Paediatrics, The University of Melbourne, Melbourne, VIC, Australia
| | - Tengyi Cai
- Haematology Research, Murdoch Children's Research Institute, Melbourne, VIC, Australia
- Department of Paediatrics, The University of Melbourne, Melbourne, VIC, Australia
| | - Paul Monagle
- Haematology Research, Murdoch Children's Research Institute, Melbourne, VIC, Australia
- Department of Paediatrics, The University of Melbourne, Melbourne, VIC, Australia
- Department of Clinical Haematology, Royal Children's Hospital, Melbourne, VIC, Australia
- Kids Cancer Centre, Sydney Children's Hospital, Randwick, NSW, Australia
| | - Vera Ignjatovic
- Department of Paediatrics, The University of Melbourne, Melbourne, VIC, Australia
- Institute for Clinical and Translational Research, Johns Hopkins All Children's Hospital, St. Petersburg, USA
- Department of Pediatrics, Johns Hopkins University, Baltimore, USA
| | - Chantal Attard
- Haematology Research, Murdoch Children's Research Institute, Melbourne, VIC, Australia
- Department of Paediatrics, The University of Melbourne, Melbourne, VIC, Australia
- The Royal Children's Hospital, Parkville, VIC, Australia
|
34
|
Rehfeldt T, Gabriels R, Bouwmeester R, Gessulat S, Neely BA, Palmblad M, Perez-Riverol Y, Schmidt T, Vizcaíno JA, Deutsch EW. ProteomicsML: An Online Platform for Community-Curated Data sets and Tutorials for Machine Learning in Proteomics. J Proteome Res 2023; 22:632-636. [PMID: 36693629 PMCID: PMC9903315 DOI: 10.1021/acs.jproteome.2c00629] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 10/04/2022] [Indexed: 01/26/2023]
Abstract
Data set acquisition and curation are often the most difficult and time-consuming parts of a machine learning endeavor. This is especially true for proteomics-based liquid chromatography (LC) coupled to mass spectrometry (MS) data sets, due to the high levels of data reduction that occur between raw data and machine learning-ready data. Since predictive proteomics is an emerging field, when predicting peptide behavior in LC-MS setups, each lab often uses unique and complex data processing pipelines in order to maximize performance, at the cost of accessibility and reproducibility. For this reason we introduce ProteomicsML, an online resource for proteomics-based data sets and tutorials across most of the currently explored physicochemical peptide properties. This community-driven resource makes it simple to access data in easy-to-process formats, and contains easy-to-follow tutorials that allow new users to interact with even the most advanced algorithms in the field. ProteomicsML provides data sets that are useful for comparing state-of-the-art machine learning algorithms, as well as providing introductory material for teachers and newcomers to the field alike. The platform is freely available at https://www.proteomicsml.org/, and we welcome the entire proteomics community to contribute to the project at https://github.com/ProteomicsML/ProteomicsML.
Affiliation(s)
- Tobias G. Rehfeldt
- Institute for Mathematics and Computer Science, University of Southern Denmark, 5000 Odense, Denmark
| | - Ralf Gabriels
- VIB-UGent Center for Medical Biotechnology, VIB, Ghent 9052, Belgium
- Department of Biomolecular Medicine, Ghent University, Ghent 9052, Belgium
| | - Robbin Bouwmeester
- VIB-UGent Center for Medical Biotechnology, VIB, Ghent 9052, Belgium
- Department of Biomolecular Medicine, Ghent University, Ghent 9052, Belgium
| | | | - Benjamin A. Neely
- National Institute of Standards and Technology, Charleston, South Carolina 29412, United States
| | - Magnus Palmblad
- Center for Proteomics and Metabolomics, Leiden University Medical Center, 2300 RC Leiden, The Netherlands
| | - Yasset Perez-Riverol
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | | | - Juan Antonio Vizcaíno
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
| | - Eric W. Deutsch
- Institute for Systems Biology, Seattle, Washington 98109, United States
| |
|
35
|
Nabeel Asim M, Ali Ibrahim M, Fazeel A, Dengel A, Ahmed S. DNA-MP: a generalized DNA modifications predictor for multiple species based on powerful sequence encoding method. Brief Bioinform 2023; 24:6931721. [PMID: 36528802 DOI: 10.1093/bib/bbac546] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2022] [Revised: 11/06/2022] [Accepted: 11/12/2022] [Indexed: 12/23/2022] Open
Abstract
Accurate prediction of deoxyribonucleic acid (DNA) modifications is essential to explore and discern the process of cell differentiation, gene expression and epigenetic regulation. Several computational approaches have been proposed for particular type-specific DNA modification prediction. Two recent generalized computational predictors are capable of detecting three different types of DNA modifications; however, type-specific and generalized modification predictors produce limited performance across multiple species, mainly due to the use of ineffective sequence encoding methods. This paper presents a generalized computational approach, "DNA-MP", that more precisely predicts three different DNA modifications across multiple species. The proposed DNA-MP approach makes use of a powerful encoding method, "position specific nucleotides occurrence based on modification and non-modification class densities normalized difference" (POCD-ND), to generate statistical representations of DNA sequences, and a deep forest classifier for modification prediction. The POCD-ND encoder generates statistical representations by extracting position-specific distributional information of nucleotides in the DNA sequences. We perform a comprehensive intrinsic and extrinsic evaluation of the proposed encoder and compare its performance with the 32 most widely used encoding methods on 17 benchmark DNA modification prediction datasets of 12 different species using 10 different machine learning classifiers. Overall, with all classifiers, the proposed POCD-ND encoder outperforms the 32 existing encoders. Furthermore, combined over 5-fold cross-validation benchmark datasets and independent test sets, the proposed DNA-MP predictor outperforms state-of-the-art type-specific and generalized modification predictors by an average accuracy of 7% across 4mC datasets, 1.35% across 5hmC datasets and 10% for 6mA datasets.
To facilitate the scientific community, the DNA-MP web application is available at https://sds_genetic_analysis.opendfki.de/DNA_Modifications/.
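The abstract names the POCD-ND encoder but does not give its formula. The following is therefore only a minimal sketch of one plausible reading of "class densities normalized difference": for each position and nucleotide, take the relative frequency in the modified class minus that in the unmodified class, divided by their sum. The function names, the `eps` smoothing term, and the toy sequences are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def position_densities(seqs):
    """Per-position relative frequency of each nucleotide (equal-length sequences)."""
    length = len(seqs[0])
    densities = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        densities.append({n: counts.get(n, 0) / len(seqs) for n in NUCLEOTIDES})
    return densities


def fit_scores(pos_seqs, neg_seqs, eps=1e-9):
    """Assumed score: (d_pos - d_neg) / (d_pos + d_neg) per (position, nucleotide)."""
    pos_d = position_densities(pos_seqs)
    neg_d = position_densities(neg_seqs)
    return [
        {n: (p[n] - q[n]) / (p[n] + q[n] + eps) for n in NUCLEOTIDES}
        for p, q in zip(pos_d, neg_d)
    ]


def encode(seq, scores):
    """Map a sequence to one score per position; unknown symbols map to 0.0."""
    return [scores[i].get(base, 0.0) for i, base in enumerate(seq)]


# Toy data (hypothetical): "modified" vs "unmodified" windows of length 4.
pos = ["ACGT", "ACGA", "ACTT"]
neg = ["TCGT", "GGGA", "TCAT"]
scores = fit_scores(pos, neg)
vec = encode("ACGT", scores)
print(vec)
```

Under this reading, a value near +1 marks a nucleotide seen at that position almost only in modified sequences, a value near -1 the opposite, and 0 a position that does not discriminate. The resulting vectors could then be passed to any classifier; the paper reports using a deep forest.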
Affiliation(s)
- Muhammad Nabeel Asim
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany.,German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
| | - Muhammad Ali Ibrahim
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany.,German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
| | - Ahtisham Fazeel
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany.,German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
| | - Andreas Dengel
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany.,German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
| | - Sheraz Ahmed
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
| |
|
36
|
Liu J, Tang X, Guan X. Grain protein function prediction based on self-attention mechanism and bidirectional LSTM. Brief Bioinform 2023; 24:6886418. [PMID: 36567619 DOI: 10.1093/bib/bbac493] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2022] [Revised: 10/13/2022] [Accepted: 10/18/2022] [Indexed: 12/27/2022] Open
Abstract
With the development of genome sequencing technology, using computing technology to predict grain protein function has become one of the important tasks of bioinformatics. The protein data of four grains, soybean, maize, indica and japonica are selected in this experimental dataset. In this paper, a novel neural network algorithm Chemical-SA-BiLSTM is proposed for grain protein function prediction. The Chemical-SA-BiLSTM algorithm fuses the chemical properties of proteins on the basis of amino acid sequences, and combines the self-attention mechanism with the bidirectional Long Short-Term Memory network. The experimental results show that the Chemical-SA-BiLSTM algorithm is superior to other classical neural network algorithms, and can more accurately predict the protein function, which proves the effectiveness of the Chemical-SA-BiLSTM algorithm in the prediction of grain protein function. The source code of our method is available at https://github.com/HwaTong/Chemical-SA-BiLSTM.
Affiliation(s)
- Jing Liu
- College of Information Engineering, Shanghai Maritime University, 201306, Shanghai, China
| | - Xinghua Tang
- College of Information Engineering, Shanghai Maritime University, 201306, Shanghai, China
| | - Xiao Guan
- School of Health Science and Engineering, University of Shanghai for Science and Technology, 200093, Shanghai, China
| |
|
37
|
Yi X, Wen B, Ji S, Saltzman A, Jaehnig EJ, Lei JT, Gao Q, Zhang B. Deep learning prediction boosts phosphoproteomics-based discoveries through improved phosphopeptide identification. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.01.11.523329. [PMID: 36711982 PMCID: PMC9882090 DOI: 10.1101/2023.01.11.523329] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/15/2023]
Abstract
Shotgun phosphoproteomics enables high-throughput analysis of phosphopeptides in biological samples, but low phosphopeptide identification rate in data analysis limits the potential of this technology. Here we present DeepRescore2, a computational workflow that leverages deep learning-based retention time and fragment ion intensity predictions to improve phosphopeptide identification and phosphosite localization. Using a state-of-the-art computational workflow as a benchmark, DeepRescore2 increases the number of correctly identified peptide-spectrum matches by 17% in a synthetic dataset and identifies 19%-46% more phosphopeptides in biological datasets. In a liver cancer dataset, 30% of the significantly altered phosphosites between tumor and normal tissues and 60% of the prognosis-associated phosphosites identified from DeepRescore2-processed data could not be identified based on the state-of-the-art workflow. Notably, DeepRescore2-processed data uniquely identifies EGFR hyperactivation as a new target in poor-prognosis liver cancer, which is validated experimentally. Integration of deep learning prediction in DeepRescore2 improves phosphopeptide identification and facilitates biological discoveries.
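The idea of rescoring with predicted retention time and fragment intensities can be illustrated with a small feature-building sketch. This is not the DeepRescore2 code: the dictionary keys, the choice of absolute retention-time error, and the use of cosine similarity as the spectral match measure are assumptions picked for illustration, and the predicted values would in practice come from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rescoring_features(psm):
    """Combine the search-engine score with prediction-based features for one PSM."""
    return {
        "search_score": psm["search_score"],
        # Large deviations from the predicted retention time suggest a wrong match.
        "abs_rt_error": abs(psm["observed_rt"] - psm["predicted_rt"]),
        # Agreement between observed and predicted fragment ion intensities.
        "spectral_similarity": cosine_similarity(
            psm["observed_intensities"], psm["predicted_intensities"]
        ),
    }


# Hypothetical peptide-spectrum match (all values invented for the example).
example_psm = {
    "search_score": 42.0,
    "observed_rt": 35.2,
    "predicted_rt": 34.0,
    "observed_intensities": [3.0, 4.0, 0.0],
    "predicted_intensities": [4.0, 3.0, 0.0],
}
features = rescoring_features(example_psm)
print(features)
```

Feature vectors of this kind are typically fed to a semi-supervised rescoring step that separates target from decoy matches; which tool and which exact features DeepRescore2 uses is not stated in the abstract.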
|
38
|
Gutiérrez-Mondragón MA, König C, Vellido A. Layer-Wise Relevance Analysis for Motif Recognition in the Activation Pathway of the β2-Adrenergic GPCR Receptor. Int J Mol Sci 2023; 24:ijms24021155. [PMID: 36674669 PMCID: PMC9865744 DOI: 10.3390/ijms24021155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2022] [Revised: 12/22/2022] [Accepted: 12/30/2022] [Indexed: 01/11/2023] Open
Abstract
G-protein-coupled receptors (GPCRs) are cell membrane proteins of relevance as therapeutic targets, and are associated with the development of treatments for illnesses such as diabetes, Alzheimer's, or even cancer. Therefore, comprehending the underlying mechanisms of the receptor functional properties is of particular interest in pharmacoproteomics and in disease therapy at large. Their interaction with ligands elicits multiple molecular rearrangements all along their structure, inducing activation pathways that distinctly influence the cell response. In this work, we studied GPCR signaling pathways from molecular dynamics simulations, as they provide rich information about the dynamic nature of the receptors. We focused on studying the molecular properties of the receptors using deep-learning-based methods. In particular, we designed and trained a one-dimensional convolutional neural network and illustrated its use in classifying the conformational states (active, intermediate, or inactive) of the β2-adrenergic receptor when bound to the full agonist BI-167107. Through a novel explainability-oriented investigation of the prediction results, we were able to identify and assess the contribution of individual motifs (residues) influencing a particular activation pathway. Consequently, we contribute a methodology that assists in the elucidation of the underlying mechanisms of receptor activation-deactivation.
Affiliation(s)
- Mario A. Gutiérrez-Mondragón
- Computer Science Department, Universitat Politècnica de Catalunya—UPC BarcelonaTech, 08034 Barcelona, Spain
- Intelligent Data Science and Artificial Intelligence (IDEAI-UPC) Research Center, Universitat Politècnica de Catalunya—UPC BarcelonaTech, 08034 Barcelona, Spain
| | - Caroline König
- Computer Science Department, Universitat Politècnica de Catalunya—UPC BarcelonaTech, 08034 Barcelona, Spain
- Intelligent Data Science and Artificial Intelligence (IDEAI-UPC) Research Center, Universitat Politècnica de Catalunya—UPC BarcelonaTech, 08034 Barcelona, Spain
- Correspondence:
| | - Alfredo Vellido
- Computer Science Department, Universitat Politècnica de Catalunya—UPC BarcelonaTech, 08034 Barcelona, Spain
- Intelligent Data Science and Artificial Intelligence (IDEAI-UPC) Research Center, Universitat Politècnica de Catalunya—UPC BarcelonaTech, 08034 Barcelona, Spain
| |
|
39
|
Cox J. Prediction of peptide mass spectral libraries with machine learning. Nat Biotechnol 2023; 41:33-43. [PMID: 36008611 DOI: 10.1038/s41587-022-01424-w] [Citation(s) in RCA: 28] [Impact Index Per Article: 28.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2022] [Accepted: 07/11/2022] [Indexed: 01/21/2023]
Abstract
The recent development of machine learning methods to identify peptides in complex mass spectrometric data constitutes a major breakthrough in proteomics. Longstanding methods for peptide identification, such as search engines and experimental spectral libraries, are being superseded by deep learning models that allow the fragmentation spectra of peptides to be predicted from their amino acid sequence. These new approaches, including recurrent neural networks and convolutional neural networks, use predicted in silico spectral libraries rather than experimental libraries to achieve higher sensitivity and/or specificity in the analysis of proteomics data. Machine learning is galvanizing applications that involve large search spaces, such as immunopeptidomics and proteogenomics. Current challenges in the field include the prediction of spectra for peptides with post-translational modifications and for cross-linked pairs of peptides. Permeation of machine-learning-based spectral prediction into search engines and spectrum-centric data-independent acquisition workflows for diverse peptide classes and measurement conditions will continue to push sensitivity and dynamic range in proteomics applications in the coming years.
Affiliation(s)
- Jürgen Cox
- Computational Systems Biochemistry Research Group, Max-Planck Institute of Biochemistry, Martinsried, Germany.
- Department of Biological and Medical Psychology, University of Bergen, Bergen, Norway.
| |
|
40
|
Jia J, Sun M, Wu G, Qiu W. DeepDN_iGlu: prediction of lysine glutarylation sites based on attention residual learning method and DenseNet. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:2815-2830. [PMID: 36899559 DOI: 10.3934/mbe.2023132] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/18/2023]
Abstract
As a key factor in orchestrating various biological processes and functions, protein post-translational modification (PTM) occurs widely in the protein function mechanisms of animals and plants. Glutarylation is a type of post-translational modification that occurs at active ε-amino groups of specific lysine residues in proteins and is associated with various human diseases, including diabetes, cancer, and glutaric aciduria type I. Therefore, the prediction of glutarylation sites is particularly important. This study developed a new deep learning-based prediction model for glutarylation sites, named DeepDN_iGlu, by adopting an attention residual learning method and DenseNet. The focal loss function is used in place of the traditional cross-entropy loss function to address the substantial imbalance between the numbers of positive and negative samples. Using the straightforward one-hot encoding method, DeepDN_iGlu achieves Sensitivity (Sn), Specificity (Sp), Accuracy (ACC), Matthews Correlation Coefficient (MCC), and Area Under Curve (AUC) of 89.29%, 61.97%, 65.15%, 0.33 and 0.80, respectively, on the independent test set. To the best of the authors' knowledge, this is the first time that DenseNet has been used for the prediction of glutarylation sites. DeepDN_iGlu has been deployed as a web server (https://bioinfo.wugenqiang.top/~smw/DeepDN_iGlu/) to make glutarylation site prediction more accessible.
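To make the contrast between cross-entropy and focal loss concrete, here is a minimal sketch of the standard binary focal loss, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), in plain Python. The values `gamma=2.0` and `alpha=0.25` are commonly quoted defaults and are assumptions here; the abstract does not state which settings DeepDN_iGlu uses, and the function names are illustrative.

```python
import math

EPS = 1e-12  # guards against log(0)


def _prob_of_true_class(p, y):
    """p is the predicted probability of the positive class; y is 0 or 1."""
    p_t = p if y == 1 else 1.0 - p
    return min(max(p_t, EPS), 1.0)


def binary_cross_entropy(p, y):
    return -math.log(_prob_of_true_class(p, y))


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Cross-entropy scaled by alpha_t * (1 - p_t)**gamma."""
    p_t = _prob_of_true_class(p, y)
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical predictions: an easy, correctly classified negative
# and a hard, misclassified positive.
for name, (p, y) in {"easy negative": (0.05, 0), "hard positive": (0.10, 1)}.items():
    ce = binary_cross_entropy(p, y)
    fl = binary_focal_loss(p, y)
    print(f"{name}: CE={ce:.4f} FL={fl:.4f} FL/CE={fl / ce:.4f}")
```

The modulating factor shrinks the loss of well-classified examples far more than that of misclassified ones, so the many easy negatives in an imbalanced site-prediction dataset contribute less to the gradient than under plain cross-entropy.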
Affiliation(s)
- Jianhua Jia
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333403, China
| | - Mingwei Sun
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333403, China
| | - Genqiang Wu
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333403, China
| | - Wangren Qiu
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333403, China
| |
|
41
|
Gueto-Tettay C, Tang D, Happonen L, Heusel M, Khakzad H, Malmström J, Malmström L. Multienzyme deep learning models improve peptide de novo sequencing by mass spectrometry proteomics. PLoS Comput Biol 2023; 19:e1010457. [PMID: 36668672 PMCID: PMC9891523 DOI: 10.1371/journal.pcbi.1010457] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2022] [Revised: 02/01/2023] [Accepted: 01/04/2023] [Indexed: 01/21/2023] Open
Abstract
Generating and analyzing overlapping peptides through multienzymatic digestion is an efficient procedure for de novo protein sequencing from bottom-up mass spectrometry (MS). Despite improved instrumentation and software, de novo MS data analysis remains challenging. In recent years, deep learning models have represented a performance breakthrough. Incorporating that technology into de novo protein sequencing workflows requires machine-learning models capable of handling highly diverse MS data. In this study, we analyzed the requirements for assembling such generalizable deep learning models by systematically varying the composition and size of the training set. We assessed the generated models' performances using two test sets composed of peptides originating from the multienzyme digestion of samples from various species. The peptide recall values on the test sets showed that the deep learning models generated from a collection of highly N- and C-termini diverse peptides generalized 76% better than the termini-restricted ones. Moreover, expanding the training set's size by adding peptides from the multienzymatic digestion, with five proteases, of samples from several species led to a 2-3 fold generalizability gain. Furthermore, we tested the applicability of these multienzyme deep learning (MEM) models by fully de novo sequencing the heavy and light monomeric chains of five commercial antibodies (mAbs). MEMs extracted over 10000 matching and overlapping peptides across mAb samples digested with six different proteases, achieving 100% sequence coverage for 8 of the 10 polypeptide chains. We anticipate that the MEMs' proven improvements to de novo analysis will positively impact several applications, such as the analysis of samples of high complexity or unknown nature, and the peptidomics field.
Affiliation(s)
- Carlos Gueto-Tettay
- Division of Infection Medicine, Department of Clinical Sciences Lund, Faculty of Medicine, Lund University, Lund, Sweden
| | - Di Tang
- Division of Infection Medicine, Department of Clinical Sciences Lund, Faculty of Medicine, Lund University, Lund, Sweden
| | - Lotta Happonen
- Division of Infection Medicine, Department of Clinical Sciences Lund, Faculty of Medicine, Lund University, Lund, Sweden
| | - Moritz Heusel
- Division of Infection Medicine, Department of Clinical Sciences Lund, Faculty of Medicine, Lund University, Lund, Sweden
| | - Hamed Khakzad
- Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France
| | - Johan Malmström
- Division of Infection Medicine, Department of Clinical Sciences Lund, Faculty of Medicine, Lund University, Lund, Sweden
| | - Lars Malmström
- Division of Infection Medicine, Department of Clinical Sciences Lund, Faculty of Medicine, Lund University, Lund, Sweden
| |
|
42
|
Do TTT, Nguyen-Vo TH, Pham HT, Trinh QH, Nguyen BP. iNSP-GCAAP: Identifying nonclassical secreted proteins using global composition of amino acid properties. Proteomics 2023; 23:e2100134. [PMID: 36401584 DOI: 10.1002/pmic.202100134] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2021] [Revised: 08/02/2022] [Accepted: 11/10/2022] [Indexed: 11/21/2022]
Abstract
Nonclassical secreted proteins (NSPs) refer to a group of proteins released into the extracellular environment under the facilitation of different biological transporting pathways apart from the Sec/Tat system. As experimental determination of NSPs is often costly and requires skilled handling techniques, computational approaches are necessary. In this study, we introduce iNSP-GCAAP, a computational prediction framework, to identify NSPs. We propose using global composition of a customized set of amino acid properties to encode sequence data and use the random forest (RF) algorithm for classification. We used the training dataset introduced by Zhang et al. (Bioinformatics, 36(3), 704-712, 2020) to develop our model and test it with the independent test set in the same study. The area under the receiver operating characteristic curve on that test set was 0.9256, which outperformed other state-of-the-art methods using the same datasets. Our framework is also deployed as a user-friendly web-based application to support the research community to predict NSPs.
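A "global composition of amino acid properties" encoding can be sketched as the fraction of residues in a sequence that fall into each property group. The four groups below are a generic, hypothetical grouping chosen only for illustration; the abstract says the authors use a customized property set but does not list it, so this should not be read as the iNSP-GCAAP feature definition.

```python
# Hypothetical property groups (not the customized set used by the authors).
PROPERTY_GROUPS = {
    "hydrophobic": set("AILMFVW"),
    "positive": set("KRH"),
    "negative": set("DE"),
    "polar": set("STNQ"),
}
FEATURE_ORDER = list(PROPERTY_GROUPS)


def global_composition(sequence):
    """Return the fraction of residues in each property group, in FEATURE_ORDER."""
    sequence = sequence.upper()
    if not sequence:
        return [0.0] * len(FEATURE_ORDER)
    return [
        sum(residue in PROPERTY_GROUPS[name] for residue in sequence) / len(sequence)
        for name in FEATURE_ORDER
    ]


# Toy sequence: 1 hydrophobic, 1 positive, 2 negative and 2 polar residues.
features = global_composition("AKDEST")
print(dict(zip(FEATURE_ORDER, features)))
```

Because the vector has a fixed length regardless of protein length, it can be passed directly to a tabular classifier such as the random forest named in the abstract.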
Affiliation(s)
- Trang T T Do
- School of Innovation, Design and Technology, Wellington Institute of Technology, Lower Hutt, New Zealand
| | - Thanh-Hoang Nguyen-Vo
- School of Mathematics and Statistics, Victoria University of Wellington, Wellington, New Zealand
| | - Hung T Pham
- Faculty of Information Technology, Posts and Telecommunications Institute of Technology, Hanoi, Vietnam
| | - Quang H Trinh
- School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi, Vietnam
| | - Binh P Nguyen
- School of Mathematics and Statistics, Victoria University of Wellington, Wellington, New Zealand
| |
|
43
|
Rehfeldt TG, Krawczyk K, Echers SG, Marcatili P, Palczynski P, Röttger R, Schwämmle V. Variability analysis of LC-MS experimental factors and their impact on machine learning. Gigascience 2022; 12:giad096. [PMID: 37983748 PMCID: PMC10659119 DOI: 10.1093/gigascience/giad096] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2023] [Revised: 08/23/2023] [Accepted: 10/11/2023] [Indexed: 11/22/2023] Open
Abstract
BACKGROUND Machine learning (ML) technologies, especially deep learning (DL), have gained increasing attention in predictive mass spectrometry (MS) for enhancing the data-processing pipeline from raw data analysis to end-user predictions and rescoring. ML models need large-scale datasets for training and repurposing, which can be obtained from a range of public data repositories. However, applying ML to public MS datasets on larger scales is challenging, as they vary widely in terms of data acquisition methods, biological systems, and experimental designs. RESULTS We aim to facilitate ML efforts in MS data by conducting a systematic analysis of the potential sources of variability in public MS repositories. We also examine how these factors affect ML performance and perform a comprehensive transfer learning to evaluate the benefits of current best practice methods in the field for transfer learning. CONCLUSIONS Our findings show significantly higher levels of homogeneity within a project than between projects, which indicates that it is important to construct datasets most closely resembling future test cases, as transferability is severely limited for unseen datasets. We also found that transfer learning, although it did increase model performance, did not increase model performance compared to a non-pretrained model.
Affiliation(s)
- Tobias Greisager Rehfeldt
- Department of Mathematics and Computer Science, University of Southern Denmark, 5230 Odense, Denmark
| | - Konrad Krawczyk
- Department of Mathematics and Computer Science, University of Southern Denmark, 5230 Odense, Denmark
| | | | - Paolo Marcatili
- Department of Health Technology, Technical University of Denmark, 2800 Kongens Lyngby, Denmark
| | - Pawel Palczynski
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, 5230 Odense, Denmark
| | - Richard Röttger
- Department of Mathematics and Computer Science, University of Southern Denmark, 5230 Odense, Denmark
| | - Veit Schwämmle
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, 5230 Odense, Denmark
| |
|
44
|
Zeng WF, Zhou XX, Willems S, Ammar C, Wahle M, Bludau I, Voytik E, Strauss MT, Mann M. AlphaPeptDeep: a modular deep learning framework to predict peptide properties for proteomics. Nat Commun 2022; 13:7238. [PMID: 36433986 PMCID: PMC9700817 DOI: 10.1038/s41467-022-34904-3] [Citation(s) in RCA: 46] [Impact Index Per Article: 23.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2022] [Accepted: 11/10/2022] [Indexed: 11/27/2022] Open
Abstract
Machine learning and in particular deep learning (DL) are increasingly important in mass spectrometry (MS)-based proteomics. Recent DL models can predict the retention time, ion mobility and fragment intensities of a peptide just from the amino acid sequence with good accuracy. However, DL is a very rapidly developing field with new neural network architectures frequently appearing, which are challenging to incorporate for proteomics researchers. Here we introduce AlphaPeptDeep, a modular Python framework built on the PyTorch DL library that learns and predicts the properties of peptides ( https://github.com/MannLabs/alphapeptdeep ). It features a model shop that enables non-specialists to create models in just a few lines of code. AlphaPeptDeep represents post-translational modifications in a generic manner, even if only the chemical composition is known. Extensive use of transfer learning obviates the need for large data sets to refine models for particular experimental conditions. The AlphaPeptDeep models for predicting retention time, collisional cross sections and fragment intensities are at least on par with existing tools. Additional sequence-based properties can also be predicted by AlphaPeptDeep, as demonstrated with a HLA peptide prediction model to improve HLA peptide identification for data-independent acquisition ( https://github.com/MannLabs/PeptDeep-HLA ).
Affiliation(s)
- Wen-Feng Zeng
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Xie-Xuan Zhou
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Sander Willems
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Constantin Ammar
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Maria Wahle
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Isabell Bludau
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Eugenia Voytik
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany
| | - Maximillian T Strauss
- Proteomics Program, NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
| | - Matthias Mann
- Department of Proteomics and Signal Transduction, Max Planck Institute of Biochemistry, Martinsried, Germany.
- Proteomics Program, NNF Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark.
| |
|
45
|
Zhang H, Wang Y, Pan Z, Sun X, Mou M, Zhang B, Li Z, Li H, Zhu F. ncRNAInter: a novel strategy based on graph neural network to discover interactions between lncRNA and miRNA. Brief Bioinform 2022; 23:6747810. [PMID: 36198065 DOI: 10.1093/bib/bbac411] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2022] [Revised: 08/04/2022] [Accepted: 08/23/2022] [Indexed: 12/14/2022] Open
Abstract
In recent years, many studies have illustrated the significant role that non-coding RNA (ncRNA) plays in biological activities, in which lncRNA, miRNA and especially their interactions have been proved to affect many biological processes. Some in silico methods have been proposed and applied to identify novel lncRNA-miRNA interactions (LMIs), but there are still imperfections in their RNA representation and information extraction approaches, which implies that there is still room for improving their performance. Meanwhile, only a few of them are accessible at present, which limits their practical applications. The construction of a new tool for LMI prediction is thus imperative for the better understanding of their relevant biological mechanisms. This study proposed a novel method, ncRNAInter, for LMI prediction. A comprehensive strategy for RNA representation and an optimized graph neural network deep learning algorithm were utilized in this study. ncRNAInter was robust and achieved a 26.7% higher Matthews correlation coefficient than existing reputable methods for human LMI prediction. In addition, ncRNAInter proved its universal applicability in dealing with LMIs from various species and successfully identified novel LMIs associated with various diseases, which further verified its effectiveness and usability. All source code and datasets are freely available at https://github.com/idrblab/ncRNAInter.
Affiliation(s)
- Hanyu Zhang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China.,Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
| | - Yunxia Wang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Ziqi Pan
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Xiuna Sun
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Minjie Mou
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Bing Zhang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
| | - Zhaorong Li
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
| | - Honglin Li
- School of Computer Science and Technology, East China Normal University, Shanghai 200062, China.,Shanghai Key Laboratory of New Drug Design, East China University of Science and Technology, Shanghai 200237, China
| | - Feng Zhu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China.,Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
| |
|
46
|
Dai Y, Millikin R, Rolfs Z, Shortreed MR, Smith LM. A Hybrid Spectral Library and Protein Sequence Database Search Strategy for Bottom-Up and Top-Down Proteomic Data Analysis. J Proteome Res 2022; 21:2609-2618. [PMID: 36206157 PMCID: PMC9869658 DOI: 10.1021/acs.jproteome.2c00305] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/26/2023]
Abstract
Tandem mass spectrometry (MS/MS) is widely employed for the analysis of complex proteomic samples. While protein sequence database searching and spectral library searching are both well-established peptide identification methods, each has shortcomings. Protein sequence databases lack fragment peak intensity information, which can result in poor discrimination between correct and incorrect spectrum assignments. Spectral libraries usually contain fewer peptides than protein sequence databases, which limits the number of peptides that can be identified. Notably, few post-translationally modified peptides are represented in spectral libraries. This is because few search engines can both identify a broad spectrum of PTMs and create corresponding spectral libraries. Also, programs that generate spectral libraries using deep learning approaches are not yet able to accurately predict spectra for the vast majority of PTMs. Here, we address these limitations through use of a hybrid search strategy that combines protein sequence database and spectral library searches to improve identification success rates and sensitivity. This software uses Global PTM Discovery (G-PTM-D) to produce spectral libraries for a wide variety of different PTMs. These features, along with a new spectrum annotation and visualization tool, have been integrated into the freely available and open-source search engine MetaMorpheus.
Affiliation(s)
- Yuling Dai
  - Department of Chemistry, University of Wisconsin, 1101 University Avenue, Madison, Wisconsin 53706, United States
- Robert Millikin
  - Department of Chemistry, University of Wisconsin, 1101 University Avenue, Madison, Wisconsin 53706, United States
- Zach Rolfs
  - Department of Chemistry, University of Wisconsin, 1101 University Avenue, Madison, Wisconsin 53706, United States
- Michael R. Shortreed
  - Department of Chemistry, University of Wisconsin, 1101 University Avenue, Madison, Wisconsin 53706, United States
- Lloyd M. Smith
  - Department of Chemistry, University of Wisconsin, 1101 University Avenue, Madison, Wisconsin 53706, United States
|
47
|
Gill ML. The rise of the machines in chemistry. Magn Reson Chem 2022; 60:1044-1051. [PMID: 35976263 DOI: 10.1002/mrc.5304] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2021] [Revised: 08/07/2022] [Accepted: 08/09/2022] [Indexed: 06/15/2023]
Abstract
The use of artificial intelligence and, more specifically, deep learning methods in chemistry is becoming increasingly common. Applications in informatics fields, such as cheminformatics and proteomics, structural biology, and spectroscopy, including NMR, are on the rise. Recent developments in model architectures, such as graph convolutional neural networks and transformers, have been enabled by advancements in computational hardware and software. However, model architectures with more predictive power often require larger amounts of training data, which can be challenging to acquire, but this requirement can be mitigated through techniques like pretraining and fine-tuning. In spite of these successes, challenges remain, such as normalization and scaling of data, availability of experimentally acquired data, and model explainability.
|
48
|
Zhao J, Jiang H, Zou G, Lin Q, Wang Q, Liu J, Ma L. CNNArginineMe: A CNN structure for training models for predicting arginine methylation sites based on the One-Hot encoding of peptide sequence. Front Genet 2022; 13:1036862. [PMID: 36324513 PMCID: PMC9618650 DOI: 10.3389/fgene.2022.1036862] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 09/06/2022] [Accepted: 10/04/2022] [Indexed: 11/30/2022] [Open Access]
Abstract
Protein arginine methylation (PRme), a post-translational modification, plays a critical role in numerous cellular processes and regulates essential cellular functions. Although several in silico models for predicting PRme sites have been reported, new models may need to be developed because of the significant increase in identified PRme sites. In this study, we constructed multiple machine-learning and deep-learning models. The CNN deep-learning model combined with one-hot encoding showed the best performance and was dubbed CNNArginineMe. CNNArginineMe achieved the best AUC in comparisons with several reported predictors. Additionally, we employed CNNArginineMe to predict the arginine-methylated proteome and performed functional analysis. The arginine-methylated proteome is significantly enriched in the amyotrophic lateral sclerosis (ALS) pathway. CNNArginineMe is freely available at https://github.com/guoyangzou/CNNArginineMe.
Affiliation(s)
- Jiaojiao Zhao
  - Cancer Institute of the Affiliated Hospital of Qingdao University and Qingdao Cancer Institute, Qingdao University, Qingdao, China
  - School of Basic Medicine, Qingdao University, Qingdao, China
- Haoqiang Jiang
  - School of Basic Medicine, Qingdao University, Qingdao, China
- Guoyang Zou
  - School of Basic Medicine, Qingdao University, Qingdao, China
- Qian Lin
  - Cancer Institute of the Affiliated Hospital of Qingdao University and Qingdao Cancer Institute, Qingdao University, Qingdao, China
- Qiang Wang
  - Oncology Department, Shandong Second Provincial General Hospital, Jinan, China
- Jia Liu
  - Department of Pharmacology, School of Pharmacy, Qingdao University, Qingdao, China
- Leina Ma
  - Cancer Institute of the Affiliated Hospital of Qingdao University and Qingdao Cancer Institute, Qingdao University, Qingdao, China
  - *Correspondence: Leina Ma
|
49
|
Yang Y, Qiao L. Data-independent acquisition proteomics methods for analyzing post-translational modifications. Proteomics 2022; 23:e2200046. [PMID: 36036492 DOI: 10.1002/pmic.202200046] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 05/22/2022] [Revised: 08/20/2022] [Accepted: 08/23/2022] [Indexed: 11/06/2022]
Abstract
Protein post-translational modifications (PTMs) increase the functional diversity of the cellular proteome. Accurate and high-throughput identification and quantification of protein PTMs is a key task in proteomics research. Recent advancements in data-independent acquisition (DIA) mass spectrometry (MS) technology have achieved deep coverage and accurate quantification of proteins and PTMs. This review provides an overview of DIA data processing methods that cover three aspects of PTM analysis, i.e., detection of PTMs, site localization, and characterization of complex modification moieties, such as glycosylation. In addition, a survey of deep learning methods that boost DIA-based PTM analysis is presented, including in silico spectral library generation, as well as feature scoring and error rate control. The limitations and future directions of DIA methods for PTM analysis are also discussed. Novel data analysis methods will take advantage of advanced MS instrumentation techniques to empower DIA MS for in-depth and accurate PTM measurements.
Affiliation(s)
- Yi Yang
  - Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University, Shanghai, 200000, China
- Liang Qiao
  - Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University, Shanghai, 200000, China
|
50
|
Mini-review: Recent advances in post-translational modification site prediction based on deep learning. Comput Struct Biotechnol J 2022; 20:3522-3532. [PMID: 35860402 PMCID: PMC9284371 DOI: 10.1016/j.csbj.2022.06.045] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 03/11/2022] [Revised: 06/21/2022] [Accepted: 06/21/2022] [Indexed: 11/23/2022] [Open Access]
Abstract
Post-translational modifications (PTMs) are closely linked to numerous diseases and play a significant role in regulating protein structures, activities, and functions. Therefore, the identification of PTMs is crucial for understanding the mechanisms of cell biology and disease therapy. Compared to traditional machine learning methods, deep learning approaches for PTM prediction provide accurate and rapid screening, guiding downstream wet-lab experiments to use the screened information for focused studies. In this paper, we review recent work in deep learning to identify phosphorylation, acetylation, ubiquitination, and other PTM types. In addition, we summarize PTM databases and discuss future directions with critical insights.
Key Words
- AAindex, Amino acid index
- ATP, Adenosine triphosphate
- AUC, Area under curve
- Ac, Acetylation
- BE, Binary encoding
- BLOSUM, Blocks substitution matrix
- Bi-LSTM, Bidirectional LSTM
- CKSAAP, Composition of k-spaced amino acid Pairs
- CNN, Convolutional neural network
- CNNOH, CNN with the one-hot encoding
- CNNWE, CNN with the word-embedding encoding
- CNNrgb, CNN red green blue
- CV, Cross-validation
- DC-CNN, Densely connected convolutional neural network
- DL, Deep learning
- DNNs, Deep neural networks
- Deep learning
- E. coli, Escherichia coli
- EBGW, Encoding based on grouped weight
- EGAAC, Enhanced grouped amino acids content
- IG, Information gain
- K, Lysine
- KNN, k nearest neighbor
- LASSO, Least absolute shrinkage and selection operator
- LSTM, Long short-term memory
- LSTMWE, LSTM with the word-embedding encoding
- M. musculus, Mus musculus
- MDC, Modular densely connected convolutional networks
- MDCAN, Multilane dense convolutional attention network
- ML, Machine learning
- MLP, Multilayer perceptron
- MMI, Multivariate mutual information
- Machine learning
- Mass spectrometry
- NMBroto, Normalized Moreau-Broto autocorrelation
- P, Proline
- PSP, PhosphoSitePlus
- PSSM, Position-specific scoring matrix
- PTM, Post-translational modifications
- Ph, Phosphorylation
- Post-translational modification
- Prediction
- PseAAC, Pseudo-amino acid composition
- R, Arginine
- RF, Random forest
- RNN, Recurrent neural network
- ROC, Receiver operating characteristic
- S, Serine
- S. typhimurium, Salmonella typhimurium
- S. cerevisiae, Saccharomyces cerevisiae
- SE, Squeeze and excitation
- SEV, Split to Equal Validation
- ST, Source and target
- SUMO, Small ubiquitin-like modifier
- SVM, Support vector machines
- T, Threonine
- Ub, Ubiquitination
- Y, Tyrosine
- ZSL, Zero-shot learning
|