1
Kilgore HR, Chinn I, Mikhael PG, Mitnikov I, Van Dongen C, Zylberberg G, Afeyan L, Banani SF, Wilson-Hawken S, Lee TI, Barzilay R, Young RA. Protein codes promote selective subcellular compartmentalization. Science 2025:eadq2634. [PMID: 39913643 DOI: 10.1126/science.adq2634] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2024] [Revised: 11/07/2024] [Accepted: 01/28/2025] [Indexed: 02/12/2025]
Abstract
Cells have evolved mechanisms to distribute ~10 billion protein molecules to subcellular compartments where diverse proteins involved in shared functions must assemble. Here, we demonstrate that proteins with shared functions share amino acid sequence codes that guide them to compartment destinations. A protein language model, ProtGPS, was developed that predicts with high performance the compartment localization of human proteins excluded from the training set. ProtGPS successfully guided generation of novel protein sequences that selectively assemble in the nucleolus. ProtGPS identified pathological mutations that change this code and lead to altered subcellular localization of proteins. Our results indicate that protein sequences contain not only a folding code, but also a previously unrecognized code governing their distribution to diverse subcellular compartments.
Affiliation(s)
- Henry R Kilgore
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA
- Itamar Chinn
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology (MIT), Cambridge, MA, USA
  - Abdul Latif Jameel Clinic for Machine Learning in Health, MIT, Cambridge, MA, USA
- Peter G Mikhael
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology (MIT), Cambridge, MA, USA
  - Abdul Latif Jameel Clinic for Machine Learning in Health, MIT, Cambridge, MA, USA
- Ilan Mitnikov
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology (MIT), Cambridge, MA, USA
  - Abdul Latif Jameel Clinic for Machine Learning in Health, MIT, Cambridge, MA, USA
- Guy Zylberberg
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology (MIT), Cambridge, MA, USA
  - Abdul Latif Jameel Clinic for Machine Learning in Health, MIT, Cambridge, MA, USA
- Lena Afeyan
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA
  - Department of Biology, MIT, Cambridge, MA, USA
- Salman F Banani
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA
  - Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Susana Wilson-Hawken
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA
  - Computational and Systems Biology Program, MIT, Cambridge, MA, USA
- Tong Ihn Lee
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA
- Regina Barzilay
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology (MIT), Cambridge, MA, USA
  - Abdul Latif Jameel Clinic for Machine Learning in Health, MIT, Cambridge, MA, USA
- Richard A Young
  - Whitehead Institute for Biomedical Research, Cambridge, MA, USA
  - Department of Biology, MIT, Cambridge, MA, USA
2
Hui T, Secor M, Ho MN, Bayaraa N, Lin YS. Molecular Dynamics (MD)-Derived Features for Canonical and Noncanonical Amino Acids. J Chem Inf Model 2025. [PMID: 39895111 DOI: 10.1021/acs.jcim.4c02102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2025]
Abstract
Machine learning (ML) models have become increasingly popular for predicting and designing structures and properties of peptides and proteins. These ML models typically use peptides and proteins containing only canonical amino acids as the training data. Consequently, these models struggle to make accurate predictions for peptides and proteins containing new amino acids that are absent in the training data set (e.g., noncanonical amino acids). One approach to improve the accuracy of the models is to collect more training data with the desired amino acids. However, this strategy is suboptimal as new data may not be easily attainable, and additional time is required to retrain the ML models. Alternatively, the extendibility of the ML models can be improved if the amino acid features used are representative and generalizable to the unseen amino acids. Herein, we develop amino acid features using molecular dynamics (MD) simulation results. Specifically, for a given amino acid, we perform MD simulation of its dipeptide to create features based on its backbone (ϕ, ψ) distributions and its electrostatic potentials. We demonstrate that these new features enable our ML models to more accurately predict the structural ensembles of cyclic peptides containing amino acids not present in the original training data set. For example, we build ML models to predict cyclic pentapeptide structures, with the training data set containing a library of 15 amino acids and the test data set containing the same 15-amino-acid library or an extended 50-amino-acid library. When using popular features such as Morgan fingerprints and MACCS keys to represent amino acids, the ML models achieve R² = 0.963 for structural predictions of test cyclic pentapeptides containing the same 15-amino-acid library. However, these models' performances decrease significantly to R² = 0.430 and R² = 0.508, respectively, when tasked to predict the structures of cyclic pentapeptides containing a library of 50 amino acids. On the other hand, the model using our backbone (ϕ, ψ) features outperforms those using Morgan fingerprints and MACCS keys, with R² = 0.700. Overall, instead of having to collect more training data, our new features enable predictions of peptide sequences containing amino acids not originally present in the training data set at the mere cost of performing new dipeptide simulations for the new amino acids.
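As a rough illustration of the backbone features described in this abstract, sampled (ϕ, ψ) angles from a dipeptide simulation could be binned into a normalized 2D histogram and flattened into a feature vector. This is only a sketch under stated assumptions, not the authors' featurization: the function name `phi_psi_histogram`, the bin count, and the toy angles are all hypothetical.

```python
from typing import List, Sequence, Tuple


def phi_psi_histogram(angles: Sequence[Tuple[float, float]], bins: int = 6) -> List[float]:
    """Flattened, normalized 2D histogram of (phi, psi) angles given in degrees.

    Angles are wrapped into [-180, 180) and binned on a bins x bins grid.
    The result has bins * bins entries that sum to 1 for non-empty input.
    """
    width = 360.0 / bins
    counts = [0] * (bins * bins)
    for phi, psi in angles:
        # Shift to [0, 360) and clamp to guard against floating-point edge cases.
        i = min(bins - 1, int(((phi + 180.0) % 360.0) // width))
        j = min(bins - 1, int(((psi + 180.0) % 360.0) // width))
        counts[i * bins + j] += 1
    total = len(angles)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


# Toy example (made-up angles, not simulation output).
toy_angles = [(-60.0, -45.0), (-60.0, -45.0), (-120.0, 130.0), (60.0, 45.0)]
features = phi_psi_histogram(toy_angles, bins=4)
```

A vector of this kind depends only on the simulated dipeptide, which is why, in principle, it can be computed for an amino acid that never appeared in the training set.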
Affiliation(s)
- Tiffani Hui
  - Department of Chemistry, Tufts University, Medford, Massachusetts 02155, United States
- Maxim Secor
  - Department of Chemistry, Tufts University, Medford, Massachusetts 02155, United States
- Minh Ngoc Ho
  - Department of Chemistry, Tufts University, Medford, Massachusetts 02155, United States
- Nomindari Bayaraa
  - Department of Chemistry, Tufts University, Medford, Massachusetts 02155, United States
- Yu-Shan Lin
  - Department of Chemistry, Tufts University, Medford, Massachusetts 02155, United States
3
Vieira LC, Handojo ML, Wilke CO. Scaling down for efficiency: Medium-sized protein language models perform well at transfer learning on realistic datasets. bioRxiv 2025:2024.11.22.624936. [PMID: 39605589 PMCID: PMC11601519 DOI: 10.1101/2024.11.22.624936] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2024]
Abstract
Protein language models (pLMs) can offer deep insights into evolutionary and structural properties of proteins. While larger models, such as the 15 billion parameter model ESM-2, promise to capture more complex patterns in sequence space, they also present practical challenges due to their high dimensionality and high computational cost. We systematically evaluated the performance of various pLMs across multiple biological datasets to assess the impact of model size on transfer learning. Surprisingly, we found that larger models do not necessarily outperform smaller ones, in particular when data is limited. Medium-sized models, such as ESM-2 650M and ESM C 600M, demonstrated consistently good performance, falling only slightly behind their larger counterparts (ESM-2 15B and ESM C 6B) despite being many times smaller. Additionally, we compared various methods of compressing embeddings prior to transfer learning, and we found that mean embeddings consistently outperformed other compression methods. In summary, ESM C 600M with mean embeddings offers an optimal balance between performance and efficiency, making it a practical and scalable choice for transfer learning in realistic biological applications.
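The mean-embedding compression mentioned in this abstract can be sketched as averaging per-residue embedding vectors into one fixed-length vector per protein. The sketch below assumes the per-residue embeddings have already been extracted from some pLM and are available as plain lists of floats; the function name `mean_embedding` is hypothetical and no particular model API is implied.

```python
from typing import List, Sequence


def mean_embedding(per_residue: Sequence[Sequence[float]]) -> List[float]:
    """Average an L x D matrix of per-residue embeddings into one D-vector."""
    if not per_residue:
        raise ValueError("need at least one residue embedding")
    dim = len(per_residue[0])
    sums = [0.0] * dim
    for vec in per_residue:
        if len(vec) != dim:
            raise ValueError("all residue embeddings must share one dimension")
        for k, value in enumerate(vec):
            sums[k] += value
    return [s / len(per_residue) for s in sums]


# Toy example: a 3-residue protein with 2-dimensional embeddings.
pooled = mean_embedding([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Because the pooled vector has the same length regardless of sequence length, it can be passed directly to a small downstream regressor or classifier.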
Affiliation(s)
- Luiz C. Vieira
  - Department of Integrative Biology, The University of Texas at Austin, Austin, TX, United States of America
- Morgan L. Handojo
  - Department of Integrative Biology, The University of Texas at Austin, Austin, TX, United States of America
- Claus O. Wilke
  - Department of Integrative Biology, The University of Texas at Austin, Austin, TX, United States of America
4
Gelman S, Johnson B, Freschlin C, Sharma A, D'Costa S, Peters J, Gitter A, Romero PA. Biophysics-based protein language models for protein engineering. bioRxiv 2025:2024.03.15.585128. [PMID: 38559182 PMCID: PMC10980077 DOI: 10.1101/2024.03.15.585128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Protein language models trained on evolutionary data have emerged as powerful tools for predictive problems involving protein sequence, structure, and function. However, these models overlook decades of research into biophysical factors governing protein function. We propose Mutational Effect Transfer Learning (METL), a protein language model framework that unites advanced machine learning and biophysical modeling. Using the METL framework, we pretrain transformer-based neural networks on biophysical simulation data to capture fundamental relationships between protein sequence, structure, and energetics. We finetune METL on experimental sequence-function data to harness these biophysical signals and apply them when predicting protein properties like thermostability, catalytic activity, and fluorescence. METL excels in challenging protein engineering tasks like generalizing from small training sets and position extrapolation, although existing methods that train on evolutionary signals remain powerful for many types of experimental assays. We demonstrate METL's ability to design functional green fluorescent protein variants when trained on only 64 examples, showcasing the potential of biophysics-based protein language models for protein engineering.
Affiliation(s)
- Sam Gelman
  - Department of Computer Sciences, University of Wisconsin-Madison
  - Morgridge Institute for Research
- Bryce Johnson
  - Department of Computer Sciences, University of Wisconsin-Madison
  - Morgridge Institute for Research
- Arnav Sharma
  - Department of Computer Sciences, University of Wisconsin-Madison
  - Morgridge Institute for Research
- Sameer D'Costa
  - Department of Biochemistry, University of Wisconsin-Madison
- John Peters
  - Morgridge Institute for Research
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison
- Anthony Gitter
  - Department of Computer Sciences, University of Wisconsin-Madison
  - Morgridge Institute for Research
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison
- Philip A Romero
  - Department of Biochemistry, University of Wisconsin-Madison
  - Department of Biomedical Engineering, Duke University
5
Sidi T, Bahiri-Elitzur S, Tuller T, Kolodny R. Predicting gene sequences with AI to study codon usage patterns. Proc Natl Acad Sci U S A 2025; 122:e2410003121. [PMID: 39739812 PMCID: PMC11725940 DOI: 10.1073/pnas.2410003121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2024] [Accepted: 11/27/2024] [Indexed: 01/02/2025] Open
Abstract
Selective pressure acts on the codon use, optimizing multiple, overlapping signals that are only partially understood. We trained AI models to predict codons given their amino acid sequence in the eukaryotes Saccharomyces cerevisiae and Schizosaccharomyces pombe and the bacteria Escherichia coli and Bacillus subtilis to study the extent to which we can learn patterns in naturally occurring codons to improve predictions. We trained our models on a subset of the proteins and evaluated their predictions on large, separate sets of proteins of varying lengths and expression levels. Our models significantly outperformed naïve frequency-based approaches, demonstrating that there are learnable dependencies in evolutionary-selected codon usage. The prediction accuracy advantage of our models is greater for highly expressed genes and is greater in bacteria than eukaryotes, supporting the hypothesis that there is a monotonic relationship between selective pressure for complex codon patterns and effective population size. In S. cerevisiae and bacteria, our models were more accurate for longer proteins, suggesting that the learned patterns may be related to cotranslational folding. Gene functionality and conservation were also important determinants that affect the performance of our models. Finally, we showed that using information encoded in homologous proteins has only a minor effect on prediction accuracy, perhaps due to complex codon-usage codes in genes undergoing rapid evolution. Our study employing contemporary AI methods offers a unique perspective and a deep-learning-based prediction tool for evolutionary-selected codons. We hope that these can be useful to optimize codon usage in endogenous and heterologous proteins.
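The abstract compares its models against naïve frequency-based approaches without spelling them out. One plausible form of such a baseline, shown here purely as an assumption, is to predict for every amino acid the codon that occurs most often for it in the training genes. The function names and the toy sequences below are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def fit_most_frequent_codon(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Learn, per amino acid, the most frequent codon in (protein, CDS) pairs."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for protein, cds in pairs:
        if len(cds) != 3 * len(protein):
            raise ValueError("coding sequence must be three times the protein length")
        for i, aa in enumerate(protein):
            counts[aa][cds[3 * i:3 * i + 3]] += 1
    return {aa: c.most_common(1)[0][0] for aa, c in counts.items()}


def predict_codons(protein: str, table: Dict[str, str]) -> str:
    """Back-translate a protein using the learned per-amino-acid codon."""
    return "".join(table[aa] for aa in protein)


def codon_accuracy(predicted: str, actual: str) -> float:
    """Fraction of codon positions where prediction and truth agree."""
    n = len(actual) // 3
    hits = sum(predicted[3 * i:3 * i + 3] == actual[3 * i:3 * i + 3] for i in range(n))
    return hits / n


# Toy training set (made-up sequences).
table = fit_most_frequent_codon([("MKK", "ATGAAAAAA"), ("MK", "ATGAAG")])
predicted = predict_codons("MKK", table)
```

A context-free baseline like this ignores neighboring codons and gene-level properties, so any accuracy gained beyond it points to dependencies of the kind the abstract describes as learnable.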
Affiliation(s)
- Tomer Sidi
  - Department of Computer Science, University of Haifa, Haifa 3303221, Israel
- Shir Bahiri-Elitzur
  - Department of Biomedical Engineering, Tel-Aviv University, Tel Aviv 6139001, Israel
- Tamir Tuller
  - Department of Biomedical Engineering, Tel-Aviv University, Tel Aviv 6139001, Israel
  - The Sagol School of Neuroscience, Tel-Aviv University, Tel Aviv 6139001, Israel
- Rachel Kolodny
  - Department of Computer Science, University of Haifa, Haifa 3303221, Israel
6
Shen H, Li Y, Pi Q, Tian J, Xu X, Huang Z, Huang J, Pian C, Mao S. Unveiling novel antimicrobial peptides from the ruminant gastrointestinal microbiomes: A deep learning-driven approach yields an anti-MRSA candidate. J Adv Res 2025:S2090-1232(25)00005-0. [PMID: 39756573 DOI: 10.1016/j.jare.2025.01.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2024] [Revised: 01/01/2025] [Accepted: 01/02/2025] [Indexed: 01/07/2025] Open
Abstract
INTRODUCTION Antimicrobial peptides (AMPs) present a promising avenue to combat the growing threat of antibiotic resistance. The ruminant gastrointestinal microbiome serves as a unique ecosystem that offers untapped potential for AMP discovery. OBJECTIVES The aims of this study are to develop an effective methodology for the identification of novel AMPs from ruminant gastrointestinal microbiomes, followed by evaluating their antimicrobial efficacy and elucidating the mechanisms underlying their activity. METHODS We developed a deep learning-based model to identify AMP candidates from a dataset comprising 120 metagenomes and 10,373 metagenome-assembled genomes derived from the ruminant gastrointestinal tract. Both in vivo and in vitro experiments were performed to examine and validate the antimicrobial activities of the AMP candidates that were selected through bioinformatic analysis and subsequently synthesized chemically. Additionally, molecular dynamics simulations were conducted to explore the action mechanism of the most potent AMP candidate. RESULTS The deep learning model identified 27,192 potential secretory AMP candidates. Following bioinformatic analysis, 39 candidates were synthesized and tested. Remarkably, all synthesized peptides demonstrated antimicrobial activity against Staphylococcus aureus, with 79.5% showing effectiveness against multiple pathogens. Notably, Peptide 4, which exhibited the highest antimicrobial activity against methicillin-resistant Staphylococcus aureus (MRSA), confirmed this effect in a mouse model with wound infection, exhibiting a low propensity for resistance development and minimal cytotoxicity and hemolysis towards mammalian cells. Molecular dynamics simulations provided insights into the mechanism of Peptide 4, primarily its ability to disrupt bacterial cell membranes, leading to cell death. 
CONCLUSION This study highlights the power of combining deep learning with microbiome research to uncover novel therapeutic candidates, paving the way for the development of next-generation antimicrobials like Peptide 4 to combat the growing threat of MRSA wound infections. It also underscores the value of utilizing ruminant microbial resources.
Affiliation(s)
- Hong Shen
  - Bioinformatics Center, Academy for Advanced Interdisciplinary Studies, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China
- Yanru Li
  - College of Agriculture, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China
- Qingjie Pi
  - Ruminant Nutrition and Feed Engineering Technology Research Center, College of Animal Science and Technology, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China; Laboratory of Gastrointestinal Microbiology, Jiangsu Key Laboratory of Gastrointestinal Nutrition and Animal Health, National Center for International Research on Animal Gut Nutrition, College of Animal Science and Technology, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China
- Junru Tian
  - Ruminant Nutrition and Feed Engineering Technology Research Center, College of Animal Science and Technology, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China; Laboratory of Gastrointestinal Microbiology, Jiangsu Key Laboratory of Gastrointestinal Nutrition and Animal Health, National Center for International Research on Animal Gut Nutrition, College of Animal Science and Technology, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China
- Xianghan Xu
  - College of Veterinary Medicine, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China
- Zan Huang
  - Ruminant Nutrition and Feed Engineering Technology Research Center, College of Animal Science and Technology, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China; Laboratory of Gastrointestinal Microbiology, Jiangsu Key Laboratory of Gastrointestinal Nutrition and Animal Health, National Center for International Research on Animal Gut Nutrition, College of Animal Science and Technology, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China
- Jinghu Huang
  - College of Veterinary Medicine, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China
- Cong Pian
  - School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing 211198, Jiangsu, China
- Shengyong Mao
  - Ruminant Nutrition and Feed Engineering Technology Research Center, College of Animal Science and Technology, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China; Laboratory of Gastrointestinal Microbiology, Jiangsu Key Laboratory of Gastrointestinal Nutrition and Animal Health, National Center for International Research on Animal Gut Nutrition, College of Animal Science and Technology, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China
7
Kim J, Woo J, Park JY, Kim KJ, Kim D. Deep learning for NAD/NADP cofactor prediction and engineering using transformer attention analysis in enzymes. Metab Eng 2025; 87:86-94. [PMID: 39571721 DOI: 10.1016/j.ymben.2024.11.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2024] [Revised: 09/25/2024] [Accepted: 11/17/2024] [Indexed: 12/13/2024]
Abstract
Understanding and manipulating the cofactor preferences of NAD(P)-dependent oxidoreductases, the most widely distributed enzyme group in nature, is increasingly crucial in bioengineering. However, large-scale identification of cofactor preferences and the design of mutants to switch cofactor specificity remain complex tasks. Here, we introduce DISCODE (Deep learning-based Iterative pipeline to analyze Specificity of COfactors and to Design Enzyme), a novel transformer-based deep learning model to predict NAD(P) cofactor preferences. For model training, a total of 7,132 NAD(P)-dependent enzyme sequences were collected. Leveraging whole-length sequence information, DISCODE classifies the cofactor preferences of NAD(P)-dependent oxidoreductase protein sequences without structural or taxonomic limitation. The model achieved an accuracy of 97.4% and an F1 score of 97.3%. A notable feature of DISCODE is the interpretability of its transformer layers. Analysis of attention layers in the model enables identification of several residues that showed significantly higher attention weights. They were well aligned with structurally important residues that closely interact with NAD(P), facilitating the identification of key residues for determining cofactor specificities. These key residues showed high consistency with verified cofactor switching mutants. Integrated into an enzyme design pipeline, DISCODE, coupled with attention analysis, enables a fully automated approach to redesign cofactor specificity.
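For readers less familiar with the two metrics quoted in this abstract, the sketch below computes accuracy and F1 for a binary NAD versus NADP classifier from predicted and true labels. The label strings, the choice of NADP as the positive class, and the function name are assumptions for illustration; the abstract does not state how the authors defined the positive class or averaged F1.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[str], y_pred: Sequence[str],
                    positive: str = "NADP") -> Tuple[float, float]:
    """Return (accuracy, F1) for a binary labeling with one positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and equally long")
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = (2 * tp / denom) if denom else 0.0
    return accuracy, f1


# Toy labels (made up): one NADP enzyme is misclassified as NAD.
acc, f1 = accuracy_and_f1(["NAD", "NADP", "NADP", "NAD"],
                          ["NAD", "NADP", "NAD", "NAD"])
```

Reporting both numbers is useful when the two cofactor classes are unevenly represented, since accuracy alone can hide poor recall on the rarer class.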
Affiliation(s)
- Jaehyung Kim
  - School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea
- Jihoon Woo
  - School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea
- Joon Young Park
  - School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea
- Kyung-Jin Kim
  - School of Life Sciences, BK21 FOUR KNU Creative BioResearch Group, KNU Institute of Microbiology, Kyungpook National University, Daegu, 41566, Republic of Korea
- Donghyuk Kim
  - School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea
8
Doron G, Genway S, Roberts M, Jasti S. Generative AI: driving productivity and scientific breakthroughs in pharmaceutical R&D. Drug Discov Today 2025; 30:104272. [PMID: 39675517 DOI: 10.1016/j.drudis.2024.104272] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2024] [Revised: 11/20/2024] [Accepted: 12/10/2024] [Indexed: 12/17/2024]
Abstract
The rapid advancement of generative artificial intelligence (AI) is reshaping pharmaceutical research and development (R&D), offering opportunities across drug discovery and development. Generative AI (GenAI) enhances productivity by enabling virtual assistants, which help automate routine tasks. It advances novel small-molecule drug design and drives new machine learning (ML) applications through synthetic data generation. Further impact is anticipated in drug development from improving operational efficiencies to novel digital innovations. Converging technologies enable rich data set capture, and next-generation AI will enable rapid, automated hypothesis generation and testing. Here, we assess the current and future applications, and the mid-term and long-term transformative potential, of GenAI in pharmaceutical R&D.
Affiliation(s)
- Guy Doron
  - Data Sciences & AI, R&D, Pharmaceuticals, Bayer AG, Berlin, Germany
- Sam Genway
  - Hybrid Intelligence, Capgemini Engineering, Stevenage, UK
- Mark Roberts
  - Hybrid Intelligence, Capgemini Engineering, Stevenage, UK
- Sai Jasti
  - Data Sciences & AI, R&D, Pharmaceuticals, Bayer AG, Berlin, Germany
9
Totaro MG, Vide U, Zausinger R, Winkler A, Oberdorfer G. ESM-scan: A tool to guide amino acid substitutions. Protein Sci 2024; 33:e5221. [PMID: 39565080 PMCID: PMC11577456 DOI: 10.1002/pro.5221] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2024] [Revised: 09/27/2024] [Accepted: 10/28/2024] [Indexed: 11/21/2024]
Abstract
Protein structure prediction and (re)design have gone through a revolution in the last 3 years. The tremendous progress in these fields has been almost exclusively driven by readily available machine learning algorithms applied to protein folding and sequence design problems. Despite these advancements, predicting site-specific mutational effects on protein stability and function remains an unsolved problem. This is a persistent challenge, mainly because the free energy of large systems is very difficult to compute with absolute accuracy and subtle changes to protein structures are hard to capture with computational models. Here, we describe the implementation and use of ESM-Scan, which uses the ESM zero-shot predictor to scan entire protein sequences for preferential amino acid changes, thus enabling in silico deep mutational scanning experiments. We benchmark ESM-Scan on its predictive capabilities for stability and functionality of sequence changes using three publicly available datasets and proceed by experimentally testing the tool's performance on a challenging test case of a blue-light-activated diguanylate cyclase from Methylotenera species (MsLadC), where it accurately predicted the importance of a highly conserved residue in a region involved in allosteric product inhibition. Our experimental results show that the ESM zero-shot model is capable of inferring the effects of a set of amino acid substitutions, as reflected in the correlation between predicted fitness and experimental results. ESM-Scan is publicly available at https://huggingface.co/spaces/thaidaev/zsp.
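The scanning idea in this abstract can be sketched as enumerating every single-residue substitution and scoring each one as a log-odds ratio against the wild-type residue. This is not the ESM-Scan implementation: the per-position probabilities are assumed to come from some masked language model and are passed in as plain dictionaries, and the log-odds scoring rule and both function names are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def scan_substitutions(sequence: str,
                       position_probs: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Score all single substitutions as log p(mutant) - log p(wild type).

    position_probs[i] maps each amino acid to a (hypothetical) model probability
    at position i. Keys of the result look like "A1G" (wild type, 1-based
    position, mutant).
    """
    if len(sequence) != len(position_probs):
        raise ValueError("need one probability table per residue")
    scores: Dict[str, float] = {}
    for i, wt in enumerate(sequence):
        probs = position_probs[i]
        for aa in AMINO_ACIDS:
            if aa != wt:
                scores[f"{wt}{i + 1}{aa}"] = math.log(probs[aa]) - math.log(probs[wt])
    return scores


def rank_substitutions(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort substitutions from most to least favored by the score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Under this scoring rule a positive value means the model assigns the mutant residue a higher probability than the wild type at that position, and a strongly negative value flags a substitution the model considers unlikely.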
Affiliation(s)
- Uršula Vide
  - Institute of Biochemistry, Graz University of Technology, Graz, Austria
- Regina Zausinger
  - Institute of Biochemistry, Graz University of Technology, Graz, Austria
- Andreas Winkler
  - Institute of Biochemistry, Graz University of Technology, Graz, Austria
  - BioTechMed, Graz, Austria
- Gustav Oberdorfer
  - Institute of Biochemistry, Graz University of Technology, Graz, Austria
  - BioTechMed, Graz, Austria
10
Asediya VS, Anjaria PA, Mathakiya RA, Koringa PG, Nayak JB, Bisht D, Fulmali D, Patel VA, Desai DN. Vaccine development using artificial intelligence and machine learning: A review. Int J Biol Macromol 2024; 282:136643. [PMID: 39426778 DOI: 10.1016/j.ijbiomac.2024.136643] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2024] [Revised: 09/30/2024] [Accepted: 10/15/2024] [Indexed: 10/21/2024]
Abstract
The COVID-19 pandemic has underscored the critical importance of effective vaccines, yet their development is a challenging and demanding process. It requires identifying antigens that elicit protective immunity, selecting adjuvants that enhance immunogenicity, and designing delivery systems that ensure optimal efficacy. Artificial intelligence (AI) can facilitate this process by using machine learning methods to analyze large and diverse datasets, suggest novel vaccine candidates, and refine their design and predict their performance. This review explores how AI can be applied to various aspects of vaccine development, such as predicting immune response from protein sequences, discovering adjuvants, optimizing vaccine doses, modeling vaccine supply chains, and predicting protein structures. We also address the challenges and ethical issues that emerge from the use of AI in vaccine development, such as data privacy, algorithmic bias, and health data sensitivity. We contend that AI has immense potential to accelerate vaccine development and respond to future pandemics, but it also requires careful attention to the quality and validity of the data and methods used.
Affiliation(s)
- Deepanker Bisht
  - Indian Veterinary Research Institute, Izatnagar, U.P., India
11
Lange MA, Chen Y, Fu H, Korada A, Guo C, Ma YY. CalTrig: A GUI-based Machine Learning Approach for Decoding Neuronal Calcium Transients in Freely Moving Rodents. bioRxiv 2024:2024.09.30.615860. [PMID: 39372793 PMCID: PMC11451592 DOI: 10.1101/2024.09.30.615860] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/08/2024]
Abstract
Advances in in vivo Ca²⁺ imaging using miniature microscopes have enabled researchers to study single-neuron activity in freely moving animals. Tools such as MiniAN and CaImAn have been developed to convert Ca²⁺ visual signals to numerical data, collectively referred to as CalV2N. However, substantial challenges remain in analyzing the large datasets generated by CalV2N, particularly in integrating data streams, evaluating CalV2N output quality, and reliably and efficiently identifying Ca²⁺ transients. In this study, we introduce CalTrig, an open-source graphical user interface (GUI) tool designed to address these challenges at the post-CalV2N stage of data processing. CalTrig integrates multiple data streams, including Ca²⁺ imaging, neuronal footprints, Ca²⁺ traces, and behavioral tracking, and offers capabilities for evaluating the quality of CalV2N outputs. It enables synchronized visualization and efficient Ca²⁺ transient identification. We evaluated four machine learning models (i.e., GRU, LSTM, Transformer, and Local Transformer) for Ca²⁺ transient detection. Our results indicate that the GRU model offers the highest predictability and computational efficiency, achieving stable performance across training sessions, different animals, and even different brain regions. The integration of manual, parameter-based, and machine learning-based detection methods in CalTrig provides flexibility and accuracy for various research applications. The user-friendly interface and low computing demands of CalTrig make it accessible to neuroscientists without programming expertise. We further conclude that CalTrig enables deeper exploration of brain function, supports hypothesis generation about neuronal mechanisms, and opens new avenues for understanding neurological disorders and developing treatments.
Collapse
Affiliation(s)
- Michal A. Lange
- Department of Pharmacology and Toxicology, Indiana University School of Medicine, Indianapolis, IN 46202, USA
| | - Yingying Chen
- Department of Pharmacology and Toxicology, Indiana University School of Medicine, Indianapolis, IN 46202, USA
| | - Haoying Fu
- Department of Pharmacology and Toxicology, Indiana University School of Medicine, Indianapolis, IN 46202, USA
| | - Amith Korada
- Department of Pharmacology and Toxicology, Indiana University School of Medicine, Indianapolis, IN 46202, USA
| | - Changyong Guo
- Department of Pharmacology and Toxicology, Indiana University School of Medicine, Indianapolis, IN 46202, USA
| | - Yao-Ying Ma
- Department of Pharmacology and Toxicology, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Stark Neurosciences Research Institute, Indiana University School of Medicine, Indianapolis, IN 46202, USA
| |
|
12
|
Nakagawa S, Sakaguchi S. Exploring the hidden world of RNA viruses with a transformer-based tool. Patterns (N Y) 2024; 5:101095. [PMID: 39568477 PMCID: PMC11573883 DOI: 10.1016/j.patter.2024.101095] [Indexed: 11/22/2024]
Abstract
Hou and He et al.1 developed a new RNA virus identification tool named LucaProt, a transformer-based bioinformatics software using sequence and structural characteristics of RNA-dependent RNA polymerases (RdRPs), which are essential for almost all RNA viruses. LucaProt can identify RdRPs from highly diverse RNA viruses, unveiling the hidden RNA virosphere.
Affiliation(s)
- So Nakagawa
- Department of Molecular Life Science, Tokai University School of Medicine, Isehara, Japan
- Division of Omics Sciences, Institute of Medical Sciences, Tokai University, Isehara, Japan
- Division of Interdisciplinary Merging of Health Research, Micro/Nano Technology Center, Tokai University, Hiratsuka, Japan
| | - Shoichi Sakaguchi
- Department of Microbiology and Infection Control, Faculty of Medicine, Osaka Medical and Pharmaceutical University, Takatsuki, Japan
| |
|
13
|
de Crécy-Lagard V, Dias R, Friedberg I, Yuan Y, Swairjo MA. Limitations of Current Machine-Learning Models in Predicting Enzymatic Functions for Uncharacterized Proteins. bioRxiv 2024:2024.07.01.601547. [PMID: 39005379 PMCID: PMC11244979 DOI: 10.1101/2024.07.01.601547] [Indexed: 07/16/2024]
Abstract
Thirty to seventy percent of proteins in any given genome have no assigned function and have been labeled as the protein "unknome". This large knowledge gap prevents the biological community from fully leveraging the plethora of genomic data that is now available. Machine-learning approaches are showing some promise in propagating functional knowledge from experimentally characterized proteins to the correct set of isofunctional orthologs. However, they largely fail to predict enzymatic functions unseen in the training set, as shown by dissecting the predictions made for over 450 enzymes of unknown function from the model bacterium Escherichia coli using the DeepECTransformer platform. Lessons from these failures can help the community develop machine-learning methods that assist domain experts in making testable functional predictions for more members of the uncharacterized proteome.
Article Summary: Many proteins in any genome, ranging from 30 to 70%, lack an assigned function. This knowledge gap limits the full use of the vast available genomic data. Machine learning has shown promise in transferring functional knowledge from proteins of known functions to similar ones, but largely fails to predict novel functions not seen in its training data. Understanding these failures can guide the development of better machine-learning methods to help experts make accurate functional predictions for uncharacterized proteins.
|
14
|
Knapp BD, Shi H, Huang KC. Complex state transitions of the bacterial cell division protein FtsZ. Mol Biol Cell 2024; 35:ar130. [PMID: 39083352 PMCID: PMC11481701 DOI: 10.1091/mbc.e23-11-0446] [Received: 11/22/2023] [Revised: 07/25/2024] [Accepted: 07/25/2024] [Indexed: 08/02/2024]
Abstract
The key bacterial cell division protein FtsZ can adopt multiple conformations, and prevailing models suggest that transitions of FtsZ subunits from the closed to open state are necessary for filament formation and stability. Using all-atom molecular dynamics simulations, we analyzed state transitions of Staphylococcus aureus FtsZ as a monomer, dimer, and hexamer. We found that monomers can adopt intermediate states but preferentially adopt a closed state that is robust to forced reopening. Dimer subunits transitioned between open and closed states, and dimers with both subunits in the closed state remained highly stable, suggesting that open-state conformations are not necessary for filament formation. Mg2+ strongly stabilized the conformation of GTP-bound subunits and the dimer filament interface. Our hexamer simulations indicate that the plus end subunit preferentially closes and that other subunits can transition between states without affecting inter-subunit stability. We found that rather than being correlated with subunit opening, inter-subunit stability was strongly correlated with catalytic site interactions. By leveraging deep-learning models, we identified key intrasubunit interactions governing state transitions. Our findings suggest a greater range of possible monomer and filament states than previously considered and offer new insights into the nuanced interplay between subunit states and the critical role of nucleotide hydrolysis and Mg2+ in FtsZ filament dynamics.
Affiliation(s)
| | - Handuo Shi
- Department of Microbiology and Immunology, Stanford University School of Medicine, Stanford, CA 94305
- Department of Bioengineering, Stanford University, Stanford, CA 94305
| | - Kerwyn Casey Huang
- Biophysics Program, Stanford University, Stanford, CA 94305
- Department of Microbiology and Immunology, Stanford University School of Medicine, Stanford, CA 94305
- Department of Bioengineering, Stanford University, Stanford, CA 94305
- Chan Zuckerberg Biohub, San Francisco, CA 94158
| |
|
15
|
Hu X, Zhang X, Sun W, Liu C, Deng P, Cao Y, Zhang C, Xu N, Zhang T, Zhang Y, Liu JJ, Wang H. Systematic discovery of DNA-binding tandem repeat proteins. Nucleic Acids Res 2024; 52:10464-10489. [PMID: 39189466 PMCID: PMC11417379 DOI: 10.1093/nar/gkae710] [Received: 03/12/2024] [Revised: 07/30/2024] [Accepted: 08/07/2024] [Indexed: 08/28/2024]
Abstract
Tandem repeat proteins (TRPs) are widely distributed and bind to a wide variety of ligands. DNA-binding TRPs such as zinc finger (ZNF) and transcription activator-like effector (TALE) play important roles in biology and biotechnology. In this study, we first conducted an extensive analysis of TRPs in public databases, and found that the enormous diversity of TRPs is largely unexplored. We then focused our efforts on identifying novel TRPs possessing DNA-binding capabilities. We established a protein language model for DNA-binding protein prediction (PLM-DBPPred), and predicted a large number of DNA-binding TRPs. A subset was then selected for experimental screening, leading to the identification of 11 novel DNA-binding TRPs, with six showing sequence specificity. Notably, members of the STAR (Short TALE-like Repeat proteins) family can be programmed to target specific 9 bp DNA sequences with high affinity. Leveraging this property, we generated artificial transcription factors using reprogrammed STAR proteins and achieved targeted activation of endogenous gene sets. Furthermore, the members of novel families such as MOON (Marine Organism-Originated DNA binding protein) and pTERF (prokaryotic mTERF-like protein) exhibit unique features and distinct DNA-binding characteristics, revealing interesting biological clues. Our study expands the diversity of DNA-binding TRPs, and demonstrates that a systematic approach greatly enhances the discovery of new biological insights and tools.
Affiliation(s)
- Xiaoxuan Hu
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
| | - Xuechun Zhang
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
| | - Wen Sun
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
- Beijing Institute for Stem Cell and Regenerative Medicine, Beijing 100101, China
| | - Chunhong Liu
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
| | - Pujuan Deng
- State Key Laboratory of Membrane Biology, Beijing Frontier Research Center for Biological Structure, School of Life Sciences, Tsinghua University, Beijing 100084, China
- Tsinghua-Peking Center for Life Sciences, Tsinghua University, Beijing 100084, China
| | - Yuanwei Cao
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
| | - Chenze Zhang
- National Key Laboratory of Efficacy and Mechanism on Chinese Medicine for Metabolic Diseases, Beijing University of Chinese Medicine, Beijing 100029, China
| | - Ning Xu
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
| | - Tongtong Zhang
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
| | - Yong E Zhang
- University of Chinese Academy of Sciences, Beijing 100049, China
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
| | - Jun-Jie Gogo Liu
- State Key Laboratory of Membrane Biology, Beijing Frontier Research Center for Biological Structure, School of Life Sciences, Tsinghua University, Beijing 100084, China
- Tsinghua-Peking Center for Life Sciences, Tsinghua University, Beijing 100084, China
| | - Haoyi Wang
- Key Laboratory of Organ Regeneration and Reconstruction, State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Institute for Stem Cell and Regeneration, Chinese Academy of Sciences, Beijing 100101, China
- Beijing Institute for Stem Cell and Regenerative Medicine, Beijing 100101, China
| |
|
16
|
Martin C, Gitter A, Anantharaman K. Protein Set Transformer: A protein-based genome language model to power high diversity viromics. Research Square 2024:rs.3.rs-4844047. [PMID: 39399683 PMCID: PMC11469463 DOI: 10.21203/rs.3.rs-4844047/v1] [Indexed: 10/15/2024]
Abstract
Exponential increases in microbial and viral genomic data demand transformational advances in scalable, generalizable frameworks for their interpretation. Standard homology-based functional analyses are hindered by the rapid divergence of microbial and especially viral genomes and proteins that significantly decreases the volume of usable data. Here, we present Protein Set Transformer (PST), a protein-based genome language model that models genomes as sets of proteins without considering sparsely available functional labels. Trained on >100k viruses, PST outperformed other homology- and language model-based approaches for relating viral genomes based on shared protein content. Further, PST demonstrated protein structural and functional awareness by clustering capsid-fold-containing proteins with known capsid proteins and uniquely clustering late gene proteins within related viruses. Our data establish PST as a valuable method for diverse viral genomics, ecology, and evolutionary applications. We posit that the PST framework can be a foundation model for microbial genomics when trained on suitable data.
Affiliation(s)
- Cody Martin
- Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, USA
- Microbiology Doctoral Training Program, University of Wisconsin-Madison, Madison, WI, USA
| | - Anthony Gitter
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
- Morgridge Institute for Research, Madison, WI, USA
- Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA
| | - Karthik Anantharaman
- Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, USA
- Department of Integrative Biology, University of Wisconsin-Madison, Madison, WI, USA
| |
|
17
|
Niu B, Lee B, Wang L, Chen W, Johnson J. The Accurate Prediction of Antibody Deamidations by Combining High-Throughput Automated Peptide Mapping and Protein Language Model-Based Deep Learning. Antibodies (Basel) 2024; 13:74. [PMID: 39311379 PMCID: PMC11417914 DOI: 10.3390/antib13030074] [Received: 08/07/2024] [Revised: 08/30/2024] [Accepted: 09/06/2024] [Indexed: 09/26/2024]
Abstract
Therapeutic antibodies such as monoclonal antibodies (mAbs), bispecific and multispecific antibodies are pivotal in therapeutic protein development and have transformed disease treatments across various therapeutic areas. The integrity of therapeutic antibodies, however, is compromised by sequence liabilities, notably deamidation, where asparagine (N) and glutamine (Q) residues undergo chemical degradation. Deamidation negatively impacts the efficacy, stability, and safety of diverse classes of antibodies, creating a critical need for the early and accurate identification of vulnerable sites. In this article, a comprehensive antibody deamidation-specific dataset (n = 2285) of varied modalities was created by using high-throughput automated peptide mapping followed by supervised machine learning to predict the deamidation propensities, as well as the extents, throughout the entire antibody sequences. We propose a novel chimeric deep learning model, integrating protein language model (pLM)-derived embeddings with local sequence information for enhanced deamidation predictions. Remarkably, this model requires only sequence inputs, eliminating the need for laborious feature engineering. Our approach demonstrates state-of-the-art performance, offering a streamlined workflow for high-throughput automated peptide mapping and deamidation prediction, with the potential of broader applicability to other antibody sequence liabilities.
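To make the phrase "local sequence information" in the abstract above concrete, the sketch below enumerates every asparagine (N) and glutamine (Q) in a sequence together with a small flanking window. This is only an assumed preprocessing step for illustration; the function name `candidate_sites`, the `flank` parameter, and the toy sequence are hypothetical and are not taken from the paper, and the sketch makes no prediction about which sites actually deamidate.

```python
def candidate_sites(sequence, flank=2):
    """List (position, residue, window) for every N or Q in the sequence.

    Positions are 1-based. The window holds up to `flank` residues on each
    side of the site and is truncated at the sequence ends.
    """
    sites = []
    for i, residue in enumerate(sequence):
        if residue in "NQ":
            window = sequence[max(0, i - flank): i + flank + 1]
            sites.append((i + 1, residue, window))
    return sites

# Toy sequence, not a real antibody.
for position, residue, window in candidate_sites("ANGSQ", flank=1):
    print(position, residue, window)
```

In a model like the one described, windows of this kind could be encoded alongside per-residue language-model embeddings, so that only the raw sequence is needed as input.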
Affiliation(s)
- Ben Niu
- Discovery Biotherapeutics, Bristol Myers Squibb, San Diego, CA 92121, USA
| | - Benjamin Lee
- Discovery Biotherapeutics, Bristol Myers Squibb, San Diego, CA 92121, USA
| | - Lili Wang
- Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA
| | - Wen Chen
- Discovery Biotherapeutics, Bristol Myers Squibb, San Diego, CA 92121, USA
| | - Jeffrey Johnson
- Discovery Biotherapeutics, Bristol Myers Squibb, San Diego, CA 92121, USA
| |
|
18
|
Peng S, Rajjou L. Advancing plant biology through deep learning-powered natural language processing. Plant Cell Rep 2024; 43:208. [PMID: 39102077 DOI: 10.1007/s00299-024-03294-9] [Received: 05/07/2024] [Accepted: 07/19/2024] [Indexed: 08/06/2024]
Abstract
The application of deep learning methods, specifically the utilization of Large Language Models (LLMs), in the field of plant biology holds significant promise for generating novel knowledge on plant cell systems. The LLM framework exhibits exceptional potential, particularly with the development of Protein Language Models (PLMs), allowing for in-depth analyses of nucleic acid and protein sequences. This analytical capacity facilitates the discernment of intricate patterns and relationships within biological data, encompassing multi-scale information within DNA or protein sequences. The contribution of PLMs extends beyond mere sequence patterns and structure-function recognition; it also supports advancements in genetic improvements for agriculture. The integration of deep learning approaches into the domain of plant sciences offers opportunities for major breakthroughs in basic research across multi-scale plant traits. Consequently, the strategic application of deep learning methodologies, particularly leveraging the potential of LLMs, will undoubtedly play a pivotal role in advancing plant sciences, plant production, and plant uses, and in propelling the trajectory toward sustainable agroecological and agro-food transitions.
Affiliation(s)
- Shuang Peng
- Université Paris-Saclay, INRAE, AgroParisTech, Institut Jean-Pierre Bourgin for Plant Sciences (IJPB), 78000, Versailles, France
| | - Loïc Rajjou
- Université Paris-Saclay, INRAE, AgroParisTech, Institut Jean-Pierre Bourgin for Plant Sciences (IJPB), 78000, Versailles, France
| |
|
19
|
Martin C, Gitter A, Anantharaman K. Protein Set Transformer: A protein-based genome language model to power high diversity viromics. bioRxiv 2024:2024.07.26.605391. [PMID: 39131363 PMCID: PMC11312453 DOI: 10.1101/2024.07.26.605391] [Indexed: 08/13/2024]
Abstract
Exponential increases in microbial and viral genomic data demand transformational advances in scalable, generalizable frameworks for their interpretation. Standard homology-based functional analyses are hindered by the rapid divergence of microbial and especially viral genomes and proteins that significantly decreases the volume of usable data. Here, we present Protein Set Transformer (PST), a protein-based genome language model that models genomes as sets of proteins without considering sparsely available functional labels. Trained on >100k viruses, PST outperformed other homology- and language model-based approaches for relating viral genomes based on shared protein content. Further, PST demonstrated protein structural and functional awareness by clustering capsid-fold-containing proteins with known capsid proteins and uniquely clustering late gene proteins within related viruses. Our data establish PST as a valuable method for diverse viral genomics, ecology, and evolutionary applications. We posit that the PST framework can be a foundation model for microbial genomics when trained on suitable data.
Affiliation(s)
- Cody Martin
- Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, USA
- Microbiology Doctoral Training Program, University of Wisconsin-Madison, Madison, WI, USA
| | - Anthony Gitter
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
- Morgridge Institute for Research, Madison, WI, USA
- Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA
| | - Karthik Anantharaman
- Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, USA
- Department of Integrative Biology, University of Wisconsin-Madison, Madison, WI, USA
| |
|
20
|
Teimouri H, Medvedeva A, Kolomeisky AB. Unraveling the role of physicochemical differences in predicting protein-protein interactions. J Chem Phys 2024; 161:045102. [PMID: 39051836 DOI: 10.1063/5.0219501] [Received: 05/17/2024] [Accepted: 07/09/2024] [Indexed: 07/27/2024]
Abstract
The ability to accurately predict protein-protein interactions is critically important for understanding major cellular processes. However, current experimental and computational approaches for identifying them are technically very challenging and still have limited success. We propose a new computational method for predicting protein-protein interactions using only primary sequence information. It utilizes the concept of physicochemical similarity to determine which interactions will most likely occur. In our approach, the physicochemical features of proteins are extracted using bioinformatics tools for different organisms. Then they are utilized in a machine-learning method to identify successful protein-protein interactions via correlation analysis. It was found that the most important property that correlates most with the protein-protein interactions for all studied organisms is dipeptide amino acid composition (the frequency of specific amino acid pairs in a protein sequence). While current approaches often overlook the specificity of protein-protein interactions with different organisms, our method yields context-specific features that determine protein-protein interactions. The analysis is specifically applied to the bacterial two-component system that includes histidine kinase and transcriptional response regulators, as well as to the barnase-barstar complex, demonstrating the method's versatility across different biological systems. Our approach can be applied to predict protein-protein interactions in any biological system, providing an important tool for investigating complex biological processes' mechanisms.
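The dipeptide amino acid composition named in the abstract above is defined there as the frequency of specific amino acid pairs in a protein sequence. A minimal sketch of that feature follows; the function name `dipeptide_composition`, the use of overlapping adjacent pairs, and the normalization by the total number of pairs are assumptions made for illustration, and the paper's own feature extraction may differ.

```python
from collections import Counter

def dipeptide_composition(sequence):
    """Return the relative frequency of each overlapping residue pair.

    Assumption for this sketch: pairs are adjacent residues (i, i + 1) and
    frequencies are normalized by the total number of such pairs.
    """
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    total = len(pairs)
    if total == 0:
        return {}
    return {pair: count / total for pair, count in Counter(pairs).items()}

# Toy sequence: the pairs are AA, AA, AG.
composition = dipeptide_composition("AAAG")
print(composition)
```

With 20 standard amino acids this yields at most 400 features per protein, which can then be compared between candidate interaction partners.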
Affiliation(s)
- Hamid Teimouri
- Department of Chemistry, Rice University, Houston, Texas 77005, USA
- Center for Theoretical Biological Physics, Rice University, Houston, Texas 77005, USA
- Department of Chemical and Biomolecular Engineering, Rice University, Houston, Texas 77005, USA
| | - Angela Medvedeva
- Department of Chemistry, Rice University, Houston, Texas 77005, USA
- Center for Theoretical Biological Physics, Rice University, Houston, Texas 77005, USA
- Department of Chemical and Biomolecular Engineering, Rice University, Houston, Texas 77005, USA
| | - Anatoly B Kolomeisky
- Department of Chemistry, Rice University, Houston, Texas 77005, USA
- Center for Theoretical Biological Physics, Rice University, Houston, Texas 77005, USA
- Department of Chemical and Biomolecular Engineering, Rice University, Houston, Texas 77005, USA
| |
|
21
|
Yang S, Xu P. HemoDL: Hemolytic peptides prediction by double ensemble engines from Rich sequence-derived and transformer-enhanced information. Anal Biochem 2024; 690:115523. [PMID: 38552762 DOI: 10.1016/j.ab.2024.115523] [Received: 11/02/2023] [Revised: 03/20/2024] [Accepted: 03/22/2024] [Indexed: 04/02/2024]
Abstract
Hemolytic peptides can trigger hemolysis by rupturing red blood cells' membranes and triggering cell disruption. Due to the labor-intensive and time-consuming in-lab identification process, accurate, high-throughput hemolytic peptide prediction is crucial for the growth of peptide sequence data in proteomics and peptidomics. In this study, we offer the HemoDL ensemble learning model, which learns the distinct distribution of sequence characteristics for predicting the hemolytic activity of peptides using a double LightGBM framework. To determine the most informative encoding features, we compare 17 widely used features across four benchmark datasets. Our investigation reveals that CTD, BPF, Charge, AAC, GDPC, ATC, QSO, and transformer-based features exhibit more positive contributions to detecting the hemolytic activity of peptides. Comparison with eight state-of-the-art methods demonstrates that HemoDL outperforms other models, attaining higher Matthews Correlation Coefficient values on four test datasets, ranging from 6.30% to 16.04%, 6.63%-11.26%, 4.76%-9.92%, and 7.41%-15.03%, respectively. Additionally, we provide the HemoDL with a user-friendly graphical interface available at https://github.com/abcair/HemoDL. In summary, the HemoDL model, leveraging CTD, BPF, Charge, AAC, GDPC, ATC, QSO and transformer-based encoding features within a double LightGBM learning framework, achieves high accuracy in predicting the hemolytic activity of peptides.
Affiliation(s)
- Sen Yang
- School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou University, Changzhou, 213164, China
- The Affiliated Changzhou No.2 People's Hospital of Nanjing Medical University, Changzhou, 213164, China
| | - Piao Xu
- College of Economics and Management, Nanjing Forestry University, China
| |
|
22
|
Almotairi S, Badr E, Abdelbaky I, Elhakeem M, Abdul Salam M. Hybrid transformer-CNN model for accurate prediction of peptide hemolytic potential. Sci Rep 2024; 14:14263. [PMID: 38902287 PMCID: PMC11190137 DOI: 10.1038/s41598-024-63446-5] [Received: 04/02/2024] [Accepted: 05/29/2024] [Indexed: 06/22/2024]
Abstract
Hemolysis is a crucial factor in various biomedical and pharmaceutical contexts, driving our interest in developing advanced computational techniques for precise prediction. Our proposed approach takes advantage of the unique capabilities of convolutional neural networks (CNNs) and transformers to detect complex patterns inherent in the data. The integration of CNN and transformers' attention mechanisms allows for the extraction of relevant information, leading to accurate predictions of hemolytic potential. The proposed method was trained on three distinct data sets of peptide sequences known as recurrent neural network-hemolytic (RNN-Hem), Hlppredfuse, and Combined. Our computational results demonstrated the superior efficacy of our models compared to existing methods. The proposed approach demonstrated impressive Matthews correlation coefficients of 0.5962, 0.9111, and 0.7788 respectively, indicating its effectiveness in predicting hemolytic activity. With its potential to guide experimental efforts in peptide design and drug development, this method holds great promise for practical applications. Integrating CNNs and transformers proves to be a powerful tool in the fields of bioinformatics and therapeutic research, highlighting their potential to drive advancement in this area.
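The Matthews correlation coefficient (MCC) reported in the abstract above is a standard metric for binary classifiers. The sketch below computes it from the four confusion-matrix counts; the function name `mcc` and the example counts are hypothetical and are not taken from the paper.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero, a common convention for the
    case where one row or column of the confusion matrix is empty.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator

# Hypothetical counts for a hemolytic vs. non-hemolytic classifier.
score = mcc(tp=40, tn=45, fp=5, fn=10)
print(round(score, 3))
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction right), and it stays informative when the two classes are imbalanced, which is one reason it is commonly used in peptide activity benchmarks.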
Affiliation(s)
- Sultan Almotairi
- Department of Computer Science, Faculty of College of Computer and Information Sciences, Majmaah University, 11952, Majmaah, Saudi Arabia
- Department of Computer Science, Faculty of Computer and Information Systems, Islamic University of Madinah, 42351, Medinah, Saudi Arabia
| | - Elsayed Badr
- Scientific Computing Department, Faculty of Computers and Artificial Intelligence, Benha University, Benha, Egypt
- The Egyptian School of Data Science (ESDS), Benha, Egypt
| | - Ibrahim Abdelbaky
- Artificial Intelligence Department, Faculty of Computers and Artificial Intelligence, Benha University, Benha, Egypt
| | - Mohamed Elhakeem
- Artificial Intelligence Department, Faculty of Computers and Artificial Intelligence, Benha University, Benha, Egypt
| | - Mustafa Abdul Salam
- Artificial Intelligence Department, Faculty of Computers and Artificial Intelligence, Benha University, Benha, Egypt
- Department of Computer Science, College of Arts and Science, Wadi Addawasir, Prince Sattam Bin Abdulaziz University, 16273, Al-Kharj, Saudi Arabia
| |
|
23
|
Boffi NM, Vanden-Eijnden E. Deep learning probability flows and entropy production rates in active matter. Proc Natl Acad Sci U S A 2024; 121:e2318106121. [PMID: 38861599 PMCID: PMC11194503 DOI: 10.1073/pnas.2318106121] [Received: 10/18/2023] [Accepted: 05/01/2024] [Indexed: 06/13/2024]
Abstract
Active matter systems, from self-propelled colloids to motile bacteria, are characterized by the conversion of free energy into useful work at the microscopic scale. They involve physics beyond the reach of equilibrium statistical mechanics, and a persistent challenge has been to understand the nature of their nonequilibrium states. The entropy production rate and the probability current provide quantitative ways to do so by measuring the breakdown of time-reversal symmetry. Yet, their efficient computation has remained elusive, as they depend on the system's unknown and high-dimensional probability density. Here, building upon recent advances in generative modeling, we develop a deep learning framework to estimate the score of this density. We show that the score, together with the microscopic equations of motion, gives access to the entropy production rate, the probability current, and their decomposition into local contributions from individual particles. To represent the score, we introduce a spatially local transformer network architecture that learns high-order interactions between particles while respecting their underlying permutation symmetry. We demonstrate the broad utility and scalability of the method by applying it to several high-dimensional systems of active particles undergoing motility-induced phase separation (MIPS). We show that a single network trained on a system of 4,096 particles at one packing fraction can generalize to other regions of the phase diagram, including to systems with as many as 32,768 particles. We use this observation to quantify the spatial structure of the departure from equilibrium in MIPS as a function of the number of particles and the packing fraction.
Affiliation(s)
- Nicholas M. Boffi
- Courant Institute of Mathematical Sciences, New York University, New York, NY 10012
| | - Eric Vanden-Eijnden
- Courant Institute of Mathematical Sciences, New York University, New York, NY 10012
| |
|
24
|
Vu MH, Robert PA, Akbar R, Swiatczak B, Sandve GK, Haug DTT, Greiff V. Linguistics-based formalization of the antibody language as a basis for antibody language models. Nat Comput Sci 2024; 4:412-422. [PMID: 38877120 DOI: 10.1038/s43588-024-00642-3] [Received: 09/29/2022] [Accepted: 05/13/2024] [Indexed: 06/16/2024]
Abstract
Apparent parallels between natural language and antibody sequences have led to a surge in deep language models applied to antibody sequences for predicting cognate antigen recognition. However, a linguistic formal definition of antibody language does not exist, and insight into how antibody language models capture antibody-specific binding features remains largely uninterpretable. Here we describe how a linguistic formalization of the antibody language, by characterizing its tokens and grammar, could address current challenges in antibody language model rule mining.
Affiliation(s)
- Mai Ha Vu
- Department of Linguistics and Scandinavian Studies, University of Oslo, Oslo, Norway.
| | - Philippe A Robert
- Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
| | - Rahmad Akbar
- Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
| | - Bartlomiej Swiatczak
- Department of History of Science and Scientific Archeology, University of Science and Technology of China, Hefei, China
| | | | | | - Victor Greiff
- Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway.
|
25
|
Kilgore HR, Chinn I, Mikhael PG, Mitnikov I, Van Dongen C, Zylberberg G, Afeyan L, Banani S, Wilson-Hawken S, Lee TI, Barzilay R, Young RA. Protein codes promote selective subcellular compartmentalization. bioRxiv 2024:2024.04.15.589616. [PMID: 38659952 PMCID: PMC11042338 DOI: 10.1101/2024.04.15.589616] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/26/2024]
Abstract
Cells have evolved mechanisms to distribute ~10 billion protein molecules to subcellular compartments where diverse proteins involved in shared functions must efficiently assemble. Here, we demonstrate that proteins with shared functions share amino acid sequence codes that guide them to compartment destinations. A protein language model, ProtGPS, was developed that predicts with high performance the compartment localization of human proteins excluded from the training set. ProtGPS successfully guided generation of novel protein sequences that selectively assemble in targeted subcellular compartments. ProtGPS also identified pathological mutations that change this code and lead to altered subcellular localization of proteins. Our results indicate that protein sequences contain not only a folding code, but also a previously unrecognized code governing their distribution in specific cellular compartments.
Affiliation(s)
- Henry R. Kilgore
- Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
| | - Itamar Chinn
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Peter G. Mikhael
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Ilan Mitnikov
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | | | - Guy Zylberberg
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Lena Afeyan
- Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
- Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Salman Banani
- Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
- Department of Pathology, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA
| | - Susana Wilson-Hawken
- Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
- Program of Computational & Systems Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Tong Ihn Lee
- Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
| | - Regina Barzilay
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Richard A. Young
- Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
- Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
|
26
|
Carbery A, Buttenschoen M, Skyner R, von Delft F, Deane CM. Learnt representations of proteins can be used for accurate prediction of small molecule binding sites on experimentally determined and predicted protein structures. J Cheminform 2024; 16:32. [PMID: 38486231 PMCID: PMC10941399 DOI: 10.1186/s13321-024-00821-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Accepted: 03/01/2024] [Indexed: 03/17/2024] [Open Access]
Abstract
Protein-ligand binding site prediction is a useful tool for understanding the functional behaviour and potential drug-target interactions of a novel protein of interest. However, most binding site prediction methods are tested by providing crystallised ligand-bound (holo) structures as input. This testing regime is insufficient to understand the performance on novel protein targets where experimental structures are not available. An alternative option is to provide computationally predicted protein structures, but this is not commonly tested. However, due to the training data used, computationally-predicted protein structures tend to be extremely accurate, and are often biased toward a holo conformation. In this study we describe and benchmark IF-SitePred, a protein-ligand binding site prediction method which is based on the labelling of ESM-IF1 protein language model embeddings combined with point cloud annotation and clustering. We show that not only is IF-SitePred competitive with state-of-the-art methods when predicting binding sites on experimental structures, but it performs better on proxies for novel proteins where low accuracy has been simulated by molecular dynamics. Finally, IF-SitePred outperforms other methods if ensembles of predicted protein structures are generated.
Affiliation(s)
- Anna Carbery
- Oxford Protein Informatics Group, Department of Statistics, University of Oxford, Oxford, OX1 3LB, UK
- Diamond Light Source, Harwell Science and Innovation Campus, Didcot, OX11 0DE, UK
| | - Martin Buttenschoen
- Oxford Protein Informatics Group, Department of Statistics, University of Oxford, Oxford, OX1 3LB, UK
| | - Rachael Skyner
- OMass Therapeutics, Building 4000, Chancellor Court, John Smith Drive, ARC Oxford, OX4 2GX, UK
| | - Frank von Delft
- Diamond Light Source, Harwell Science and Innovation Campus, Didcot, OX11 0DE, UK
- Centre for Medicines Discovery, University of Oxford, Oxford, OX3 7DQ, UK
- Research Complex at Harwell, Harwell Science and Innovation Campus, Didcot, OX11 0FA, United Kingdom
- Department of Biochemistry, University of Johannesburg, Johannesburg, 2006, South Africa
| | - Charlotte M Deane
- Oxford Protein Informatics Group, Department of Statistics, University of Oxford, Oxford, OX1 3LB, UK.
|
27
|
Teimouri H, Medvedeva A, Kolomeisky AB. Physical-Chemical Features Selection Reveals That Differences in Dipeptide Compositions Correlate Most with Protein-Protein Interactions. bioRxiv 2024:2024.02.27.582345. [PMID: 38464064 PMCID: PMC10925282 DOI: 10.1101/2024.02.27.582345] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2024]
Abstract
The ability to accurately predict protein-protein interactions is critically important for our understanding of major cellular processes. However, current experimental and computational approaches for identifying them are technically very challenging and still have limited success. We propose a new computational method for predicting protein-protein interactions using only primary sequence information. It utilizes a concept of physical-chemical similarity to determine which interactions will most probably occur. In our approach, the physical-chemical features of proteins are extracted using bioinformatics tools for different organisms and then used in a machine-learning method to identify successful protein-protein interactions via correlation analysis. We find that the property that correlates most strongly with protein-protein interactions for all studied organisms is dipeptide amino acid composition. The analysis is specifically applied to the bacterial two-component system that includes histidine kinase and transcriptional response regulators. Our theoretical approach provides a simple and robust method for quantifying the important details of complex mechanisms of biological processes.
Affiliation(s)
- Hamid Teimouri
- Department of Chemistry, Rice University, Houston, Texas, United States
- Center for Theoretical Biological Physics, Rice University, Houston, Texas, United States
| | - Angela Medvedeva
- Department of Chemistry, Rice University, Houston, Texas, United States
- Center for Theoretical Biological Physics, Rice University, Houston, Texas, United States
| | - Anatoly B. Kolomeisky
- Department of Chemistry, Rice University, Houston, Texas, United States
- Center for Theoretical Biological Physics, Rice University, Houston, Texas, United States
- Department of Chemical and Biomolecular Engineering, Rice University, Houston, Texas, United States
- Department of Physics and Astronomy, Rice University, Houston, TX, United States
|
28
|
Du Z, Ding X, Hsu W, Munir A, Xu Y, Li Y. pLM4ACE: A protein language model based predictor for antihypertensive peptide screening. Food Chem 2024; 431:137162. [PMID: 37604011 DOI: 10.1016/j.foodchem.2023.137162] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 04/13/2023] [Revised: 08/09/2023] [Accepted: 08/13/2023] [Indexed: 08/23/2023]
Abstract
Angiotensin-I converting enzyme (ACE) regulates the renin-angiotensin system and is a drug target in clinical treatment for hypertension. This study aims to develop a protein language model (pLM) with evolutionary scale modeling (ESM-2) embeddings that is trained on experimental data to screen peptides with strong ACE inhibitory activity. Twelve conventional peptide embedding approaches and five machine learning (ML) modeling methods were also tested for performance comparison. Among the 65 classifiers tested, logistic regression with ESM-2 embeddings showed the best performance, with balanced accuracy (BACC), Matthews correlation coefficient (MCC), and area under the curve of 0.883 ± 0.017, 0.77 ± 0.032, and 0.96 ± 0.009, respectively. Multilayer perceptron and support vector machine also exhibited great compatibility with ESM-2 embeddings. The ESM-2 embeddings showed superior performance in enhancing the prediction model compared to the 12 traditional embedding methods. A user-friendly webserver (https://sqzujiduce.us-east-1.awsapprunner.com) with the top three models is now freely available.
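For readers unfamiliar with the two classification metrics quoted in this abstract, the following is a minimal sketch of how balanced accuracy (BACC) and the Matthews correlation coefficient (MCC) are conventionally computed from a binary confusion matrix. It is not the authors' code; the function names and the toy labels are assumptions made purely for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count (tp, tn, fp, fn) for binary labels encoded as 0 or 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity and specificity (assumes both classes are present)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC in [-1, 1]; returns 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy labels: 1 = ACE-inhibitory peptide, 0 = non-inhibitory.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
bacc = balanced_accuracy(*counts)
mcc = matthews_corrcoef(*counts)
print(counts, bacc, mcc)
```

Both metrics are less sensitive to class imbalance than plain accuracy, which is a common reason for reporting them together on screening datasets.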
Affiliation(s)
- Zhenjiao Du
- Department of Grain Science and Industry, Kansas State University, Manhattan, KS 66506, USA
| | - Xingjian Ding
- Department of Computer Science, Kansas State University, Manhattan, KS 66506, USA
| | - William Hsu
- Department of Computer Science, Kansas State University, Manhattan, KS 66506, USA
| | - Arslan Munir
- Department of Computer Science, Kansas State University, Manhattan, KS 66506, USA
| | - Yixiang Xu
- Healthy Processed Foods Research Unit, Western Regional Research Center, USDA-ARS, 800 Buchanan Street, Albany, CA 94710, USA
| | - Yonghui Li
- Department of Grain Science and Industry, Kansas State University, Manhattan, KS 66506, USA.
|
29
|
Li X, Perez R, Giannakoulias S, Petersson EJ. Proteins Need Extra Attention: Improving the Predictive Power of Protein Language Models on Mutational Datasets with Hint Tokens. bioRxiv 2023:2023.12.05.570055. [PMID: 38106169 PMCID: PMC10723359 DOI: 10.1101/2023.12.05.570055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/19/2023]
Abstract
In this computational study, we introduce "hint token learning," a novel machine learning approach designed to enhance protein language modeling. This method effectively addresses the unique challenges of protein mutational datasets, characterized by highly similar inputs that may differ by only a single token. Our research highlights the superiority of hint token learning over traditional fine-tuning methods through three distinct case studies. We first developed a highly accurate free energy of folding model using the largest protein stability dataset to date. Then, we applied hint token learning to predict a biophysical attribute, the brightness of green fluorescent protein mutants. In our third case, hint token learning was utilized to assess the impact of mutations on RecA bioactivity. These diverse applications collectively demonstrate the potential of hint token learning for improving protein language modeling across general and specific mutational datasets. To facilitate broader use, we have integrated our protein language models into the HuggingFace ecosystem for downstream, mutational fine-tuning tasks.
|
30
|
Li D, Jiang W. Classification of helical polymers with deep-learning language models. J Struct Biol 2023; 215:108041. [PMID: 37939748 PMCID: PMC10843845 DOI: 10.1016/j.jsb.2023.108041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2023] [Revised: 10/11/2023] [Accepted: 10/31/2023] [Indexed: 11/10/2023]
Abstract
Many macromolecules in biological systems exist in the form of helical polymers. However, the inherent polymorphism and heterogeneity of samples complicate the reconstruction of helical polymers from cryo-EM images. Currently, available 2D classification methods are effective at separating particles of interest from contaminants, but they do not effectively differentiate between polymorphs, resulting in heterogeneity in the 2D classes. As such, it is crucial to develop a method that can computationally divide a dataset of polymorphic helical structures into homogenous subsets. In this work, we utilized deep-learning language models to embed the filaments as vectors in hyperspace and group them into clusters. Tests with both simulated and experimental datasets have demonstrated that our method - HLM (Helical classification with Language Model) can effectively distinguish different types of filaments, in the presence of many contaminants and low signal-to-noise ratios. We also demonstrate that HLM can isolate homogeneous subsets of particles from a publicly available dataset, resulting in the discovery of a previously unreported filament variant with an extra density around the tau filaments.
Affiliation(s)
- Daoyi Li
- Department of Biological Sciences, Purdue University
| | - Wen Jiang
- Department of Biological Sciences, Purdue University.
|
31
|
Le NQK. Leveraging transformers-based language models in proteome bioinformatics. Proteomics 2023; 23:e2300011. [PMID: 37381841 DOI: 10.1002/pmic.202300011] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 05/14/2023] [Revised: 06/13/2023] [Accepted: 06/13/2023] [Indexed: 06/30/2023]
Abstract
In recent years, the rapid growth of biological data has increased interest in using bioinformatics to analyze and interpret this data. Proteomics, which studies the structure, function, and interactions of proteins, is a crucial area of bioinformatics. Using natural language processing (NLP) techniques in proteomics is an emerging field that combines machine learning and text mining to analyze biological data. Recently, transformer-based NLP models have gained significant attention for their ability to process variable-length input sequences in parallel, using self-attention mechanisms to capture long-range dependencies. In this review paper, we discuss the recent advancements in transformer-based NLP models in proteome bioinformatics and examine their advantages, limitations, and potential applications to improve the accuracy and efficiency of various tasks. Additionally, we highlight the challenges and future directions of using these models in proteome bioinformatics research. Overall, this review provides valuable insights into the potential of transformer-based NLP models to revolutionize proteome bioinformatics.
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
- AIBioMed Research Group, Taipei Medical University, Taipei, Taiwan
- Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
- Translational Imaging Research Center, Taipei Medical University Hospital, Taipei, Taiwan
|
32
|
Jia W, Peng J, Zhang Y, Zhu J, Qiang X, Zhang R, Shi L. Exploring novel ANGICon-EIPs through ameliorated peptidomics techniques: Can deep learning strategies as a core breakthrough in peptide structure and function prediction? Food Res Int 2023; 174:113640. [PMID: 37986483 DOI: 10.1016/j.foodres.2023.113640] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/05/2023] [Revised: 10/23/2023] [Accepted: 10/24/2023] [Indexed: 11/22/2023]
Abstract
Dairy-derived angiotensin-I-converting enzyme inhibitory peptides (ANGICon-EIPs) have been regarded as a relatively safe supplementary diet-therapy strategy for individuals with hypertension, and short-chain peptides may have more relevant antihypertensive benefits due to their direct intestinal absorption. Our previous explorations have confirmed that endogenous goat milk short-chain peptides are also an essential source of ANGICon-EIPs. Nonetheless, explorations of endogenous ANGICon-EIPs are currently limited by the difficulty of extracting and enriching endogenous peptides. This review outlines ameliorated pre-treatment strategies, data acquisition methods, and tools for the prediction of peptide structure and function, aiming to provide creative ideas for discovering novel ANGICon-EIPs. Deep learning-based peptide structure and function prediction algorithms have recently achieved significant advancements. The convolutional neural network (CNN) and peptide sequence-based multi-label deep learning approach for determining the multi-functionalities of bioactive peptides (MLBP) can predict multiple peptide functions with absolute true value and accuracy of 0.699 and 0.708, respectively. Utilizing peptide sequence input, torsion angles, and inter-residue distance to train neural networks, APPTEST predicted peptide (5-40 aa) structures with an average backbone root mean square deviation (RMSD) as low as 1.96 Å. Overall, with the exploration of more neural network architectures, deep learning could be considered a critical research tool to reduce the cost and improve the efficiency of identifying novel endogenous ANGICon-EIPs.
Affiliation(s)
- Wei Jia
- School of Food and Bioengineering, Shaanxi University of Science and Technology, Xi'an 710021, China; Inspection and Testing Center of Fuping County (Shaanxi goat milk product quality supervision and Inspection Center), Weinan 711700, China; Shaanxi Research Institute of Agricultural Products Processing Technology, Xi'an 710021, China.
| | - Jian Peng
- School of Food and Bioengineering, Shaanxi University of Science and Technology, Xi'an 710021, China
| | - Yan Zhang
- Inspection and Testing Center of Fuping County (Shaanxi goat milk product quality supervision and Inspection Center), Weinan 711700, China
| | - Jiying Zhu
- School of Food and Bioengineering, Shaanxi University of Science and Technology, Xi'an 710021, China
| | - Xin Qiang
- Inspection and Testing Center of Fuping County (Shaanxi goat milk product quality supervision and Inspection Center), Weinan 711700, China
| | - Rong Zhang
- School of Food and Bioengineering, Shaanxi University of Science and Technology, Xi'an 710021, China
| | - Lin Shi
- School of Food and Bioengineering, Shaanxi University of Science and Technology, Xi'an 710021, China
|
33
|
Liu H, Guan F, Liu T, Yang L, Fan L, Liu X, Luo H, Wu N, Yao B, Tian J, Huang H. MECE: a method for enhancing the catalytic efficiency of glycoside hydrolase based on deep neural networks and molecular evolution. Sci Bull (Beijing) 2023; 68:2793-2805. [PMID: 37867059 DOI: 10.1016/j.scib.2023.09.039] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 05/05/2023] [Revised: 07/14/2023] [Accepted: 09/25/2023] [Indexed: 10/24/2023]
Abstract
The demand for high-efficiency glycoside hydrolases (GHs) is on the rise due to their various industrial applications. However, improving the catalytic efficiency of an enzyme remains a challenge. This investigation showcases the capability of a deep neural network and method for enhancing the catalytic efficiency (MECE) platform to predict mutations that improve catalytic activity in GHs. The MECE platform includes DeepGH, a deep learning model that is able to identify GH families and functional residues. This model was developed utilizing 119 GH family protein sequences obtained from the Carbohydrate-Active enZYmes (CAZy) database. In ten-fold cross-validation, the DeepGH models exhibited a predictive accuracy of 96.73%. Gradient-weighted class activation mapping (Grad-CAM) was used to help interpret the classification features, which in turn facilitated the creation of enzyme mutants. The MECE platform was validated with the development of CHIS1754-MUT7, a mutant carrying seven amino acid substitutions. The kcat/Km of CHIS1754-MUT7 was found to be 23.53 times greater than that of the wild-type CHIS1754. Due to its high computational efficiency and low experimental cost, this method offers significant advantages and presents a novel approach for the intelligent design of enzyme catalytic efficiency. As a result, it holds great promise for a wide range of applications.
Affiliation(s)
- Hanqing Liu
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China; Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
| | - Feifei Guan
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China.
| | - Tuoyu Liu
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
| | - Lixin Yang
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
| | - Lingxi Fan
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
| | - Xiaoqing Liu
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
| | - Huiying Luo
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China
| | - Ningfeng Wu
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
| | - Bin Yao
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China
| | - Jian Tian
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China; Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China.
| | - Huoqing Huang
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China.
|
34
|
Chandra A, Sharma A, Dehzangi I, Tsunoda T, Sattar A. PepCNN deep learning tool for predicting peptide binding residues in proteins using sequence, structural, and language model features. Sci Rep 2023; 13:20882. [PMID: 38016996 PMCID: PMC10684570 DOI: 10.1038/s41598-023-47624-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/07/2023] [Accepted: 11/16/2023] [Indexed: 11/30/2023] [Open Access]
Abstract
Protein-peptide interactions play a crucial role in various cellular processes and are implicated in abnormal cellular behaviors leading to diseases such as cancer. Therefore, understanding these interactions is vital for both functional genomics and drug discovery efforts. Despite a significant increase in the availability of protein-peptide complexes, experimental methods for studying these interactions remain laborious, time-consuming, and expensive. Computational methods offer a complementary approach but often fall short in terms of prediction accuracy. To address these challenges, we introduce PepCNN, a deep learning-based prediction model that incorporates structural and sequence-based information from primary protein sequences. By utilizing a combination of half-sphere exposure, position-specific scoring matrices from a multiple-sequence alignment tool, and embeddings from a pre-trained protein language model, PepCNN outperforms state-of-the-art methods in terms of specificity, precision, and AUC. The PepCNN software and datasets are publicly available at https://github.com/abelavit/PepCNN.git.
Affiliation(s)
- Abel Chandra
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia.
| | - Alok Sharma
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia.
- Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan.
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.
| | - Iman Dehzangi
- Department of Computer Science, Rutgers University, Camden, NJ, USA
- Center for Computational and Integrative Biology, Rutgers University, Camden, USA
| | - Tatsuhiko Tsunoda
- Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Laboratory for Medical Science Mathematics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan
| | - Abdul Sattar
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia
|
35
|
McGibbon M, Shave S, Dong J, Gao Y, Houston DR, Xie J, Yang Y, Schwaller P, Blay V. From intuition to AI: evolution of small molecule representations in drug discovery. Brief Bioinform 2023; 25:bbad422. [PMID: 38033290 PMCID: PMC10689004 DOI: 10.1093/bib/bbad422] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 09/01/2023] [Revised: 10/13/2023] [Accepted: 11/01/2023] [Indexed: 12/02/2023] [Open Access]
Abstract
Within drug discovery, the goal of AI scientists and cheminformaticians is to help identify molecular starting points that will develop into safe and efficacious drugs while reducing costs, time and failure rates. To achieve this goal, it is crucial to represent molecules in a digital format that makes them machine-readable and facilitates the accurate prediction of properties that drive decision-making. Over the years, molecular representations have evolved from intuitive and human-readable formats to bespoke numerical descriptors and fingerprints, and now to learned representations that capture patterns and salient features across vast chemical spaces. Among these, sequence-based and graph-based representations of small molecules have become highly popular. However, each approach has strengths and weaknesses across dimensions such as generality, computational cost, inversibility for generative applications and interpretability, which can be critical in informing practitioners' decisions. As the drug discovery landscape evolves, opportunities for innovation continue to emerge. These include the creation of molecular representations for high-value, low-data regimes, the distillation of broader biological and chemical knowledge into novel learned representations and the modeling of up-and-coming therapeutic modalities.
Affiliation(s)
- Miles McGibbon
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| | - Steven Shave
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| | - Jie Dong
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, 410013, China
| | - Yumiao Gao
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| | - Douglas R Houston
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| | - Jiancong Xie
- Key Laboratory of Machine Intelligence and Advanced Computing, Sun Yat-Sen University, Guangzhou, 510000, China
| | - Yuedong Yang
- Key Laboratory of Machine Intelligence and Advanced Computing, Sun Yat-Sen University, Guangzhou, 510000, China
| | - Philippe Schwaller
- Laboratory of Artificial Chemical Intelligence (LIAC), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
| | - Vincent Blay
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
|
36
|
Balakrishnan N, Katkar R, Pham PV, Downey T, Kashyap P, Anastasiu DC, Ramasubramanian AK. Prospection of Peptide Inhibitors of Thrombin from Diverse Origins Using a Machine Learning Pipeline. Bioengineering (Basel) 2023; 10:1300. [PMID: 38002424 PMCID: PMC10669389 DOI: 10.3390/bioengineering10111300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Revised: 10/30/2023] [Accepted: 11/04/2023] [Indexed: 11/26/2023] [Open Access]
Abstract
Thrombin is a key enzyme involved in the development and progression of many cardiovascular diseases. Direct thrombin inhibitors (DTIs), with their minimal off-target effects and immediacy of action, have greatly improved the treatment of these diseases. However, the risk of bleeding, pharmacokinetic issues, and thrombotic complications remain major concerns. In an effort to increase the effectiveness of the DTI discovery pipeline, we developed a two-stage machine learning pipeline to identify and rank peptide sequences based on their effective thrombin inhibitory potential. The positive dataset for our model consisted of thrombin inhibitor peptides and their binding affinities (KI) curated from published literature, and the negative dataset consisted of peptides with no known thrombin inhibitory or related activity. The first stage of the model identified thrombin inhibitory sequences with a Matthews correlation coefficient (MCC) of 83.6%. The second stage of the model, which covers a range of eight orders of magnitude in KI values, predicted the binding affinity of new sequences with a log root mean square error (RMSE) of 1.114. These models also revealed physicochemical and structural characteristics that are hidden but unique to thrombin inhibitor peptides. Using the model, we classified more than 10 million peptides from diverse sources and identified unique short peptide sequences (<15 aa) of interest, based on their predicted KI. Based on the binding energies of the interaction of the peptide with thrombin, we identified a promising set of putative DTI candidates. The prediction pipeline is available on a web server.
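Because the KI values in this abstract span many orders of magnitude, the regression error is reported on a logarithmic scale. The sketch below shows one common way to compute such an error: an RMSE over base-10 logarithms of measured and predicted values. The abstract does not define the exact formula, so the base-10 choice, the function name, and the toy values are all assumptions for illustration only.

```python
import math


def log10_rmse(measured, predicted):
    """RMSE between log10(measured) and log10(predicted).

    Assumption: errors are taken on a base-10 log scale; the source
    abstract does not state which logarithm or units were used.
    """
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(v <= 0 for v in list(measured) + list(predicted)):
        raise ValueError("all values must be positive to take a logarithm")
    squared = [
        (math.log10(m) - math.log10(p)) ** 2
        for m, p in zip(measured, predicted)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical inhibition constants in molar units, not data from the paper.
measured_ki = [1e-9, 1e-6, 1e-3]
predicted_ki = [1e-8, 1e-6, 1e-4]

error = log10_rmse(measured_ki, predicted_ki)
print(round(error, 4))
```

Under this convention, an error of 1 corresponds to predictions that are off by roughly one order of magnitude in KI on average.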
Affiliation(s)
- Nivedha Balakrishnan
- Department of Chemical and Materials Engineering, San José State University, San Jose, CA 95192, USA (P.K.)
| | - Rahul Katkar
- Department of Chemical and Materials Engineering, San José State University, San Jose, CA 95192, USA (P.K.)
| | - Peter V. Pham
- Department of Chemical and Materials Engineering, San José State University, San Jose, CA 95192, USA (P.K.)
| | - Taylor Downey
- Department of Computer Science and Engineering, Santa Clara University, Santa Clara, CA 95053, USA (D.C.A.)
| | - Prarthna Kashyap
- Department of Chemical and Materials Engineering, San José State University, San Jose, CA 95192, USA (P.K.)
| | - David C. Anastasiu
- Department of Computer Science and Engineering, Santa Clara University, Santa Clara, CA 95053, USA (D.C.A.)
| | - Anand K. Ramasubramanian
- Department of Chemical and Materials Engineering, San José State University, San Jose, CA 95192, USA (P.K.)
| |
|
37
|
Kouba P, Kohout P, Haddadi F, Bushuiev A, Samusevich R, Sedlar J, Damborsky J, Pluskal T, Sivic J, Mazurenko S. Machine Learning-Guided Protein Engineering. ACS Catal 2023; 13:13863-13895. [PMID: 37942269 PMCID: PMC10629210 DOI: 10.1021/acscatal.3c02743] [Citation(s) in RCA: 36] [Impact Index Per Article: 18.0] [Received: 06/15/2023] [Revised: 09/20/2023] [Indexed: 11/10/2023]
Abstract
Recent progress in engineering highly promising biocatalysts has increasingly involved machine learning methods. These methods leverage existing experimental and simulation data to aid in the discovery and annotation of promising enzymes, as well as in suggesting beneficial mutations for improving known targets. The field of machine learning for protein engineering is gathering steam, driven by recent success stories and notable progress in other areas. It already encompasses ambitious tasks such as understanding and predicting protein structure and function, catalytic efficiency, enantioselectivity, protein dynamics, stability, solubility, aggregation, and more. Nonetheless, the field is still evolving, with many challenges to overcome and questions to address. In this Perspective, we provide an overview of ongoing trends in this domain, highlight recent case studies, and examine the current limitations of machine learning-based methods. We emphasize the crucial importance of thorough experimental validation of emerging models before their use for rational protein design. We present our opinions on the fundamental problems and outline the potential directions for future research.
Affiliation(s)
- Petr Kouba
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
- Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
- Faculty of Electrical Engineering, Czech Technical University in Prague, Technicka 2, 166 27 Prague 6, Czech Republic
| | - Pavel Kohout
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| | - Faraneh Haddadi
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| | - Anton Bushuiev
- Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
| | - Raman Samusevich
- Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
- Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo nám. 2, 160 00 Prague 6, Czech Republic
| | - Jiri Sedlar
- Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
| | - Jiri Damborsky
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| | - Tomas Pluskal
- Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo nám. 2, 160 00 Prague 6, Czech Republic
| | - Josef Sivic
- Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
| | - Stanislav Mazurenko
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| |
|
38
|
Khazaaleh MK, Alsharaiah MA, Alsharafat W, Abu-Shareha AA, Haziemeh FA, Al-Nawashi MM, abu alhija M. Handling DNA malfunctions by unsupervised machine learning model. J Pathol Inform 2023; 14:100340. [PMID: 38028128 PMCID: PMC10630639 DOI: 10.1016/j.jpi.2023.100340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2023] [Revised: 09/25/2023] [Accepted: 10/09/2023] [Indexed: 12/01/2023] Open
Abstract
The cell cycle is a rich field for research, especially DNA damage. DNA damage, which occurs naturally or as a result of environmental influences, changes the chemical structure of DNA. The extent of DNA damage has a significant impact on the fate of the cell in later stages. In this paper, we introduce an unsupervised machine learning model for DNA damage diagnosis and analysis, based mainly on K-means clustering. Unsupervised algorithms commonly draw conclusions from datasets by using input vectors alone, disregarding any known or labeled outcomes. The model provided deep insight into DNA damage and exposed the levels of the proteins that work together in a sub-network model to deal with DNA damage. It explained the activity of the biological sub-network in terms of changes in protein concentrations across clusters, which were grouped into five DNA damage clusters (0: no damage, 1: low, 2: medium, 3: high, 4: excess). The results provided a rational and persuasive explanation for numerous important phenomena, including the oscillation of the protein p53, in a clear and understandable manner. This is encouraging, since it demonstrates that the K-means clustering approach can be readily applied to many similar biological systems, aiding a better understanding of their key dynamics.
Affiliation(s)
- Mutaz Kh. Khazaaleh
- Department of Computer Science, Al-Balqa Applied University, Al-Salt, Jordan
| | - Mohammad A. Alsharaiah
- Department of Data Science and Artificial Intelligence, Al-Ahliyya Amman University, Amman, Jordan
| | - Wafa Alsharafat
- Department of Information Systems, Al al-Bayt University, Mafraq, Jordan
| | - Ahmad Adel Abu-Shareha
- Department of Data Science and Artificial Intelligence, Al-Ahliyya Amman University, Amman, Jordan
| | - Feras A. Haziemeh
- Department of Computer Science, Al-Balqa Applied University, Al-Salt, Jordan
| | - Malek M. Al-Nawashi
- Department of Computer Science, Al-Balqa Applied University, Al-Salt, Jordan
| | - Mwaffaq abu alhija
- Department of Data Science and Artificial Intelligence, Al-Ahliyya Amman University, Amman, Jordan
| |
|
39
|
Huang Y, Huang HY, Chen Y, Lin YCD, Yao L, Lin T, Leng J, Chang Y, Zhang Y, Zhu Z, Ma K, Cheng YN, Lee TY, Huang HD. A Robust Drug-Target Interaction Prediction Framework with Capsule Network and Transfer Learning. Int J Mol Sci 2023; 24:14061. [PMID: 37762364 PMCID: PMC10531393 DOI: 10.3390/ijms241814061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2023] [Revised: 08/27/2023] [Accepted: 08/28/2023] [Indexed: 09/29/2023] Open
Abstract
Drug-target interactions (DTIs) are considered a crucial component of drug design and drug discovery. To date, many computational methods have been developed for predicting DTIs, but they are insufficiently informative for accurate prediction due to the lack of experimentally verified negative datasets, inaccurate molecular feature representation, and ineffective DTI classifiers. We therefore address the limitations of randomly selecting negative DTI data from unknown drug-target pairs by establishing two experimentally validated datasets, and we propose a capsule network-based framework called CapBM-DTI to capture hierarchical relationships of drugs and targets. CapBM-DTI adopts pre-trained bidirectional encoder representations from transformers (BERT) for contextual sequence feature extraction from target proteins through transfer learning, and a message-passing neural network (MPNN) for 2-D graph feature extraction from compounds, to accurately and robustly identify drug-target interactions. We compared the performance of CapBM-DTI with state-of-the-art methods using four experimentally validated DTI datasets of different sizes, including human (Homo sapiens) and worm (Caenorhabditis elegans) species datasets, as well as three subsets (new compounds, new proteins, and new pairs). Our results demonstrate that the proposed model achieved robust performance and powerful generalization ability in all experiments. The case study on treating COVID-19 demonstrates the applicability of the model in virtual screening.
Affiliation(s)
- Yixian Huang
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (Y.H.); (Y.C.); (J.L.)
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Hsi-Yuan Huang
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (Y.H.); (Y.C.); (J.L.)
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Yigang Chen
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (Y.H.); (Y.C.); (J.L.)
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Yang-Chi-Dung Lin
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (Y.H.); (Y.C.); (J.L.)
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Lantian Yao
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Tianxiu Lin
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (Y.H.); (Y.C.); (J.L.)
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Junlin Leng
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (Y.H.); (Y.C.); (J.L.)
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Yuan Chang
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Yuntian Zhang
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Zihao Zhu
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (Y.H.); (Y.C.); (J.L.)
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Kun Ma
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (Y.H.); (Y.C.); (J.L.)
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| | - Yeong-Nan Cheng
- Institute of Bioinformatics and Systems Biology, Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan; (Y.-N.C.)
| | - Tzong-Yi Lee
- Institute of Bioinformatics and Systems Biology, Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan; (Y.-N.C.)
| | - Hsien-Da Huang
- School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (Y.H.); (Y.C.); (J.L.)
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China; (L.Y.); (Y.C.)
| |
|
40
|
Yan Y, Shi Z, Wei H. ROSes-FINDER: a multi-task deep learning framework for accurate prediction of microorganism reactive oxygen species scavenging enzymes. Front Microbiol 2023; 14:1245805. [PMID: 37744924 PMCID: PMC10513406 DOI: 10.3389/fmicb.2023.1245805] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2023] [Accepted: 08/21/2023] [Indexed: 09/26/2023] Open
Abstract
Reactive oxygen species (ROS) are highly reactive molecules that play important roles in microbial biological processes. However, excessive accumulation of ROS can lead to oxidative stress and cellular damage. Microorganisms have evolved a diverse suite of enzymes to mitigate the harmful effects of ROS. Accurate prediction of ROS scavenging enzyme classes (ROSes) is crucial for understanding the mechanisms of oxidative stress and developing strategies to combat related diseases. Nevertheless, the existing approaches for categorizing ROS-related proteins exhibit certain drawbacks with regard to their precision and inclusiveness. To address this, we propose a new multi-task deep learning framework called ROSes-FINDER. This framework integrates three component methods using a voting-based approach to predict multiple ROSes properties simultaneously. It can identify whether a given protein sequence belongs to the ROSes and determine its type. The three component methods used in the framework are ROSes-CNN, which extracts raw sequence encoding features; ROSes-NN, which predicts protein functions based on sequence information; and ROSes-XGBoost, which performs functional classification using ensemble machine learning. Comprehensive experiments demonstrate the superior performance and robustness of our method. ROSes-FINDER is freely available at https://github.com/alienn233/ROSes-Finder for predicting ROSes classes.
Affiliation(s)
- Yueyang Yan
- College of Veterinary Medicine, Jilin University, Changchun, China
| | - Zhanpeng Shi
- College of Veterinary Medicine, Jilin University, Changchun, China
| | - Haijian Wei
- Department of Organ Transplantation, The Affiliated Yantai Yuhuangding Hospital of Qingdao University, Yantai City, China
| |
|
41
|
Karlsen ST, Rau MH, Sánchez BJ, Jensen K, Zeidan AA. From genotype to phenotype: computational approaches for inferring microbial traits relevant to the food industry. FEMS Microbiol Rev 2023; 47:fuad030. [PMID: 37286882 PMCID: PMC10337747 DOI: 10.1093/femsre/fuad030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2023] [Revised: 05/31/2023] [Accepted: 06/06/2023] [Indexed: 06/09/2023] Open
Abstract
When selecting microbial strains for the production of fermented foods, various microbial phenotypes need to be taken into account to achieve target product characteristics, such as biosafety, flavor, texture, and health-promoting effects. Through continuous advances in sequencing technologies, microbial whole-genome sequences of increasing quality can now be obtained more cheaply and quickly, which increases the relevance of genome-based characterization of microbial phenotypes. Prediction of microbial phenotypes from genome sequences makes it possible to quickly screen large strain collections in silico to identify candidates with desirable traits. Several microbial phenotypes relevant to the production of fermented foods can be predicted using knowledge-based approaches, leveraging our existing understanding of the genetic and molecular mechanisms underlying those phenotypes. In the absence of this knowledge, data-driven approaches can be applied to estimate genotype-phenotype relationships based on large experimental datasets. Here, we review computational methods that implement knowledge- and data-driven approaches for phenotype prediction, as well as methods that combine elements from both approaches. Furthermore, we provide examples of how these methods have been applied in industrial biotechnology, with special focus on the fermented food industry.
Affiliation(s)
- Signe T Karlsen
- Bioinformatics & Modeling, R&D Digital Innovation, Chr. Hansen A/S, Bøge Allé 10-12, 2970 Hørsholm, Denmark
| | - Martin H Rau
- Bioinformatics & Modeling, R&D Digital Innovation, Chr. Hansen A/S, Bøge Allé 10-12, 2970 Hørsholm, Denmark
| | - Benjamín J Sánchez
- Bioinformatics & Modeling, R&D Digital Innovation, Chr. Hansen A/S, Bøge Allé 10-12, 2970 Hørsholm, Denmark
| | - Kristian Jensen
- Bioinformatics & Modeling, R&D Digital Innovation, Chr. Hansen A/S, Bøge Allé 10-12, 2970 Hørsholm, Denmark
| | - Ahmad A Zeidan
- Bioinformatics & Modeling, R&D Digital Innovation, Chr. Hansen A/S, Bøge Allé 10-12, 2970 Hørsholm, Denmark
| |
|