1
|
Maier A, Cha M, Burgess S, Wang A, Cuellar C, Kim S, Rajan NS, Neyyan J, Sengupta R, O’Connor K, Ott N, Williams A. Predicting purification process fit of monoclonal antibodies using machine learning. MAbs 2025; 17:2439988. [PMID: 39782766 PMCID: PMC11730362 DOI: 10.1080/19420862.2024.2439988] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2024] [Revised: 12/03/2024] [Accepted: 12/04/2024] [Indexed: 01/12/2025] Open
Abstract
In early-stage development of therapeutic monoclonal antibodies, assessment of the viability and ease of their purification typically requires extensive experimentation. However, the work required for upstream protein expression and downstream purification development often conflicts with timeline pressures and material constraints, limiting the number of molecules and process conditions that can reasonably be assessed. Recently, high-throughput batch-binding screen data along with improved molecular descriptors have enabled development of robust quantitative structure-property relationship (QSPR) models that predict monoclonal antibody chromatographic binding behavior from the amino acid sequence. Here, we describe a QSPR strategy for in silico monoclonal antibody purification process fit assessment. Principal Component Analysis is applied to extract a one-dimensional basis for comparison of molecular chromatographic binding behavior from multi-dimensional high-throughput batch-binding screen data. Kernel Ridge Regression is used to predict the first principal component for new molecular sequences. This workflow is demonstrated with a set of 97 monoclonal antibodies for five chromatography resins in two salt types across a range of pH and salt concentrations. Model development benchmarks four descriptor sets from biophysical structural models and protein language models. The investigation illustrates the value QSPR models can provide to purification process fit assessment, and selection of resins and operating conditions from sequence alone.
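The two-step workflow in this abstract (reduce the batch-binding screen to its first principal component, then regress that component on sequence-derived descriptors) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the descriptor matrix, the number of screen conditions, the RBF kernel, and the `gamma` and `alpha` values are all assumptions chosen for the example, and only NumPy is used.

```python
import numpy as np

def first_principal_component(X):
    """Return (scores, loading) of the first principal component; rows of X are molecules."""
    Xc = X - X.mean(axis=0)                      # center each screen condition
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loading = Vt[0]                              # unit vector of the first principal axis
    return Xc @ loading, loading

def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel between the rows of A and the rows of B."""
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def fit_kernel_ridge(D, y, gamma=0.1, alpha=1.0):
    """Solve (K + alpha * I) c = y for the dual coefficients c."""
    K = rbf_kernel(D, D, gamma)
    return np.linalg.solve(K + alpha * np.eye(len(D)), y)

def predict_kernel_ridge(D_train, coef, D_new, gamma=0.1):
    return rbf_kernel(D_new, D_train, gamma) @ coef

# Synthetic stand-ins (assumed sizes): 97 molecules, 40 screen conditions, 8 descriptors.
rng = np.random.default_rng(0)
n_mabs, n_conditions, n_descriptors = 97, 40, 8
descriptors = rng.normal(size=(n_mabs, n_descriptors))
latent = descriptors @ rng.normal(size=n_descriptors)
binding = np.outer(latent, rng.normal(size=n_conditions))
binding += 0.1 * rng.normal(size=binding.shape)

# For brevity the PCA is fitted on all molecules; a real study would fit it on training data only.
pc1, loading = first_principal_component(binding)
train, test = np.arange(80), np.arange(80, n_mabs)
coef = fit_kernel_ridge(descriptors[train], pc1[train])
pred = predict_kernel_ridge(descriptors[train], coef, descriptors[test])
```

Here `pred` holds the predicted first-component score for the 17 held-out molecules, which is the single number the abstract proposes as a basis for comparing binding behavior.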
Affiliation(s)
- Andrew Maier
- Department of Purification, Microbiology and Virology, Genentech Inc, South San Francisco, CA, USA
| | - Minjeong Cha
- Department of Purification, Microbiology and Virology, Genentech Inc, South San Francisco, CA, USA
| | - Sean Burgess
- Department of Purification, Microbiology and Virology, Genentech Inc, South San Francisco, CA, USA
| | - Amy Wang
- Department of Purification, Microbiology and Virology, Genentech Inc, South San Francisco, CA, USA
| | - Carlos Cuellar
- Department of Purification, Microbiology and Virology, Genentech Inc, South San Francisco, CA, USA
| | - Soo Kim
- Department of Purification, Microbiology and Virology, Genentech Inc, South San Francisco, CA, USA
| | - Neeraja Sundar Rajan
- Department of Purification, Microbiology and Virology, Genentech Inc, South San Francisco, CA, USA
| | - Josephine Neyyan
- Department of Purification, Microbiology and Virology, Genentech Inc, South San Francisco, CA, USA
| | - Rituparna Sengupta
- Department of Purification, Microbiology and Virology, Genentech Inc, South San Francisco, CA, USA
| | - Kelly O’Connor
- Department of Purification, Microbiology and Virology, Genentech Inc, South San Francisco, CA, USA
| | - Nicole Ott
- Department of Purification, Microbiology and Virology, Genentech Inc, South San Francisco, CA, USA
| | - Ambrose Williams
- Department of Purification, Microbiology and Virology, Genentech Inc, South San Francisco, CA, USA
| |
|
2
|
Le VT, Malik MS, Lin YJ, Liu YC, Chang YY, Ou YY. ATP_mCNN: Predicting ATP binding sites through pretrained language models and multi-window neural networks. Comput Biol Med 2025; 185:109541. [PMID: 39653625 DOI: 10.1016/j.compbiomed.2024.109541] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2024] [Revised: 11/20/2024] [Accepted: 12/05/2024] [Indexed: 01/26/2025]
Abstract
Adenosine triphosphate plays a vital role in providing energy and enabling key cellular processes through interactions with binding proteins. The increasing amount of protein sequence data necessitates computational methods for identifying binding sites. However, experimental identification of adenosine triphosphate-binding residues remains challenging. To address the challenge, we developed a multi-window convolutional neural network architecture taking pre-trained protein language model embeddings as input features. In particular, multiple parallel convolutional layers scan for motifs localized to different window sizes. Max pooling extracts salient features concatenated across windows into a final multi-scale representation for residue-level classification. On benchmark datasets, our model achieves an area under the ROC curve of 0.95, significantly improving on prior sequence-based models and outperforming convolutional neural network baselines. This demonstrates the utility of pre-trained language models and multi-window convolutional neural networks for advanced sequence-based prediction of adenosine triphosphate-binding residues. Our approach provides a promising new direction for elucidating binding mechanisms and interactions from primary structure.
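The multi-window idea described above (parallel convolutions with different kernel widths, max pooling, then concatenation) can be sketched in NumPy. This is an illustrative forward pass with random weights, not the ATP_mCNN architecture itself: the context length of 15 residues, the embedding size of 32, the kernel widths 3, 5, and 7, and the logistic output layer are all assumptions made for the example.

```python
import numpy as np

def conv1d_relu(x, kernel):
    """Valid 1-D convolution over positions followed by ReLU.

    x: (length, dim) residue embeddings; kernel: (width, dim, filters).
    Returns an array of shape (length - width + 1, filters).
    """
    width = kernel.shape[0]
    out = np.stack([
        np.tensordot(x[i:i + width], kernel, axes=([0, 1], [0, 1]))
        for i in range(x.shape[0] - width + 1)
    ])
    return np.maximum(out, 0.0)

def multi_window_features(context, kernels):
    """Max-pool each branch over positions and concatenate the pooled vectors."""
    pooled = [conv1d_relu(context, k).max(axis=0) for k in kernels]
    return np.concatenate(pooled)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
dim, filters, widths = 32, 4, (3, 5, 7)          # assumed sizes
context = rng.normal(size=(15, dim))             # embeddings around one target residue
kernels = [0.1 * rng.normal(size=(w, dim, filters)) for w in widths]

features = multi_window_features(context, kernels)   # multi-scale representation
weights = rng.normal(size=features.shape[0])         # untrained output layer
probability = float(sigmoid(features @ weights))     # binding score for the residue
```

Each branch sees motifs at a different scale, and pooling makes the branch outputs the same size regardless of kernel width, which is what allows them to be concatenated.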
Affiliation(s)
- Van-The Le
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan
| | - Muhammad-Shahid Malik
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan; Department of Computer Sciences, Karakoram International University, Gilgit-Baltistan, 15100, Pakistan
| | - Yi-Jing Lin
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan
| | - Yu-Chen Liu
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan
| | - Yan-Yun Chang
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan
| | - Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan; Graduate Program in Biomedical Informatics, Yuan Ze University, Chung-Li, 32003, Taiwan.
| |
|
3
|
Lv Z, Wei M, Pei H, Peng S, Li M, Jiang L. PTSP-BERT: Predict the thermal stability of proteins using sequence-based bidirectional representations from transformer-embedded features. Comput Biol Med 2025; 185:109598. [PMID: 39708499 DOI: 10.1016/j.compbiomed.2024.109598] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/16/2024] [Revised: 12/16/2024] [Accepted: 12/17/2024] [Indexed: 12/23/2024]
Abstract
Thermophilic, mesophilic, and psychrophilic proteins have wide industrial applications, as enzymes with different optimal temperatures are often needed for different purposes. Convenient methods are needed to determine the optimal temperatures for proteins; however, laboratory methods for this purpose are time-consuming and laborious, and existing machine learning methods can only perform binary classification of thermophilic and non-thermophilic proteins, or psychrophilic and non-psychrophilic proteins. Here, we developed a deep learning model, PTSP-BERT, based on protein sequences that can directly perform three-class identification of thermophilic, mesophilic, and psychrophilic proteins. By comparing BERT-bfd with other deep learning models using five-fold cross-validation, we found that BERT-bfd-extracted features achieved the highest accuracy under six classifiers. Furthermore, to improve the model's accuracy, we used SMOTE (synthetic minority oversampling technique) to balance the dataset and a light gradient-boosting machine to rank BERT-bfd-extracted features according to their weights. We obtained the best-performing model with a five-fold cross-validation accuracy of 89.59% and an independent test accuracy of 85.42%. The performance of PTSP-BERT is significantly better than that of existing models in the three-class identification task. To compare with previous binary classification models, we used PTSP-BERT to perform binary classification of thermophilic versus non-thermophilic proteins and of psychrophilic versus non-psychrophilic proteins on an independent test set. PTSP-BERT achieved the highest accuracy on both binary classification tasks, with an accuracy of 93.33% for thermophilic protein binary classification and 88.33% for psychrophilic protein binary classification. After training and optimization on training sets with different sequence similarities, the independent test accuracy of the model ranges from 89.8% to 92.9%, and the prediction accuracy on new data can exceed 97%. For the convenience of future researchers, we have uploaded the source code of PTSP-BERT to GitHub.
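The class-balancing step named in this abstract, SMOTE, creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. A minimal pure-Python sketch of that interpolation follows; the toy two-dimensional points, the neighbour count `k`, and the function name `smote` are assumptions for illustration, and the feature-ranking step with a gradient-boosting model is not shown.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # The k nearest minority neighbours of x by squared Euclidean distance, excluding x.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

# Toy minority class (hypothetical feature vectors).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6)
```

Because every synthetic point lies on a segment between two real minority samples, oversampling should be applied to the training folds only, so that no interpolated copy of a test sample leaks into training.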
Affiliation(s)
- Zhibin Lv
- College of Biomedical Engineering, Sichuan University, Chengdu, 610065, China.
| | - Mingxuan Wei
- College of Biomedical Engineering, Sichuan University, Chengdu, 610065, China
| | - Hongdi Pei
- Department of Biomedical Engineering, Johns Hopkins University, MD, 21218, USA
| | - Shiyu Peng
- College of Biomedical Engineering, Sichuan University, Chengdu, 610065, China
| | - Mingxin Li
- College of Biomedical Engineering, Sichuan University, Chengdu, 610065, China
| | - Liangzhen Jiang
- College of Food and Biological Engineering, Chengdu University, Chengdu, 610106, China; Country Key Laboratory of Coarse Cereal Processing, Ministry of Agriculture and Rural Affairs, Chengdu, 610106, China
| |
|
4
|
Ji S, Wu J, An F, Lou M, Zhang T, Guo J, Wu P, Zhu Y, Wu R. Umami-gcForest: Construction of a predictive model for umami peptides based on deep forest. Food Chem 2025; 464:141826. [PMID: 39522377 DOI: 10.1016/j.foodchem.2024.141826] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2024] [Revised: 10/07/2024] [Accepted: 10/27/2024] [Indexed: 11/16/2024]
Abstract
Umami peptides have recently gained attention for their ability to enhance umami flavor, reduce salt content, and provide nutritional benefits. However, traditional wet laboratory methods to identify them are time-consuming, laborious, and costly. Therefore, we developed the Umami-gcForest model using the deep forest algorithm. It constructs amino acid feature matrices using ProtBERT, amino acid composition, composition-transition-distribution, and pseudo amino acid composition, applying mutual information for feature selection to optimize dimensions. Compared to other machine learning baselines, umami peptide prediction models, and composite models, Umami-gcForest demonstrated outstanding predictive accuracy in validation on different test sets. Using SHapley Additive exPlanations to calculate feature contributions, we found that the key features of Umami-gcForest were hydrophobicity, charge, and polarity. Based on this, an online platform was developed to facilitate its use. In conclusion, Umami-gcForest serves as a powerful tool, providing a solid foundation for the efficient and accurate screening of umami peptides.
Affiliation(s)
- Shuaiqi Ji
- College of Food Science, Shenyang Agricultural University, Shenyang 110866, PR China; Shenyang Key Laboratory of Microbial Fermentation Technology Innovation, Shenyang 110866, PR China
| | - Junrui Wu
- College of Food Science, Shenyang Agricultural University, Shenyang 110866, PR China; Shenyang Key Laboratory of Microbial Fermentation Technology Innovation, Shenyang 110866, PR China
| | - Feiyu An
- College of Food Science, Shenyang Agricultural University, Shenyang 110866, PR China; Liaoning Engineering Research Center of Food Fermentation Technology, Shenyang 110866, PR China
| | - Mengxue Lou
- College of Food Science, Shenyang Agricultural University, Shenyang 110866, PR China; Shenyang Key Laboratory of Microbial Fermentation Technology Innovation, Shenyang 110866, PR China
| | - Taowei Zhang
- College of Food Science, Shenyang Agricultural University, Shenyang 110866, PR China; Shenyang Key Laboratory of Microbial Fermentation Technology Innovation, Shenyang 110866, PR China
| | - Jiawei Guo
- College of Food Science, Shenyang Agricultural University, Shenyang 110866, PR China; Shenyang Key Laboratory of Microbial Fermentation Technology Innovation, Shenyang 110866, PR China
| | - Penggong Wu
- College of Food Science, Shenyang Agricultural University, Shenyang 110866, PR China; Liaoning Engineering Research Center of Food Fermentation Technology, Shenyang 110866, PR China
| | - Yi Zhu
- College of Food Science, Shenyang Agricultural University, Shenyang 110866, PR China; Shenyang Key Laboratory of Microbial Fermentation Technology Innovation, Shenyang 110866, PR China
| | - Rina Wu
- College of Food Science, Shenyang Agricultural University, Shenyang 110866, PR China; Liaoning Engineering Research Center of Food Fermentation Technology, Shenyang 110866, PR China.
| |
|
5
|
Hu X, Li J, Liu T. Alg-MFDL: A multi-feature deep learning framework for allergenic proteins prediction. Anal Biochem 2025; 697:115701. [PMID: 39481588 DOI: 10.1016/j.ab.2024.115701] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2024] [Revised: 10/26/2024] [Accepted: 10/28/2024] [Indexed: 11/02/2024]
Abstract
The escalating global incidence of allergy patients illustrates the growing impact of allergic issues on global health. Allergens are small molecule antigens that trigger allergic reactions. A widely recognized strategy for allergy prevention involves identifying allergens and avoiding re-exposure. However, the laboratory methods to identify allergenic proteins are often time-consuming and resource-intensive. There is a crucial need to establish efficient and reliable computational approaches for the identification of allergenic proteins. In this study, we developed a novel allergenic protein predictor named Alg-MFDL, which integrates pre-trained protein language models (PLMs) and traditional handcrafted features to achieve a more complete protein representation. First, we compared the performance of eight pre-trained PLMs from ProtTrans and ESM-2 and selected the best-performing one from each of the two groups. In addition, we evaluated the performance of three handcrafted features and different combinations of them to select the optimal feature or feature combination. Then, these three protein representations were fused and used as inputs to train a convolutional neural network (CNN). Finally, independent validation was performed on benchmark datasets to evaluate the performance of Alg-MFDL. As a result, Alg-MFDL achieved an accuracy of 0.973, a precision of 0.996, a sensitivity of 0.951, and an F1 value of 0.973, outperforming most current state-of-the-art (SOTA) methods across all key metrics. We anticipate that the proposed model could be a useful tool for predicting allergenic proteins.
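The reported metrics can be cross-checked against each other, since the F1 value is the harmonic mean of precision and sensitivity (recall). The short check below uses only the precision and sensitivity quoted in the abstract; the function name `f1_score` is just a label for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

precision, sensitivity = 0.996, 0.951   # values quoted in the abstract
f1 = f1_score(precision, sensitivity)
print(round(f1, 3))  # 0.973
```

The result agrees with the F1 value of 0.973 stated in the abstract.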
Affiliation(s)
- Xiang Hu
- College of Information Technology, Shanghai Ocean University, Shanghai, 201306, China
| | - Jingyi Li
- AIEN Institute, Shanghai Ocean University, Shanghai, 201306, China
| | - Taigang Liu
- College of Information Technology, Shanghai Ocean University, Shanghai, 201306, China.
| |
|
6
|
Kumar N, Du Z, Li Y. pLM4CPPs: Protein Language Model-Based Predictor for Cell Penetrating Peptides. J Chem Inf Model 2025. [PMID: 39878455 DOI: 10.1021/acs.jcim.4c01338] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/31/2025]
Abstract
Cell-penetrating peptides (CPPs) are short peptides capable of penetrating cell membranes, making them valuable for drug delivery and intracellular targeting. Accurate prediction of CPPs can streamline experimental validation in the lab. This study aims to assess pretrained protein language models (pLMs) for their effectiveness in representing CPPs and develop a reliable model for CPP classification. We evaluated peptide embeddings generated from BEPLER, CPCProt, SeqVec, various ESM variants (ESM, ESM-2 with expanded feature set, ESM-1b, and ESM-1v), ProtT5-XL UniRef50, ProtT5-XL BFD, and ProtBERT. We developed pLM4CPPs, a novel deep learning architecture using convolutional neural networks (CNNs) as the classifier for binary classification of CPPs. pLM4CPPs demonstrated superior performance over existing state-of-the-art CPP prediction models, achieving improvements in accuracy (ACC) by 4.9-5.5%, Matthews correlation coefficient (MCC) by 9.3-10.2%, and sensitivity (Sn) by 14.1-19.6%. Among all the tested models, ESM-1280 and ProtT5-XL BFD demonstrated the highest overall performance on the kelm data set. ESM-1280 achieved an ACC of 0.896, an MCC of 0.796, an Sn of 0.844, and a specificity (Sp) of 0.978. ProtT5-XL BFD exhibited superior performance with an ACC of 0.901, an MCC of 0.802, an Sn of 0.885, and an Sp of 0.917. pLM4CPPs combines predictions from multiple models to provide a consensus on whether a given peptide sequence is classified as a CPP or non-CPP. This approach will enhance prediction reliability by leveraging the strengths of each individual model. A user-friendly web server for bioactivity predictions, along with data sets, is available at https://ry2acnp6ep.us-east-1.awsapprunner.com. The source code and protocol for adapting pLM4CPPs can be accessed on GitHub at https://github.com/drkumarnandan/pLM4CPPs. This platform aims to advance CPP prediction and peptide functionality modeling, aiding researchers in exploring peptide functionality effectively.
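The consensus step mentioned above can be illustrated with a simple majority vote over per-model probabilities. The abstract does not state the exact combination rule, so the majority rule, the 0.5 threshold, the tie handling, and the example scores below are all assumptions made for this sketch.

```python
def consensus(probabilities, threshold=0.5):
    """Majority vote over per-model CPP probabilities.

    probabilities: dict mapping a model name to its predicted CPP probability.
    Returns (label, fraction of models voting CPP). Ties resolve to "non-CPP".
    """
    votes = [p >= threshold for p in probabilities.values()]
    n_cpp = sum(votes)
    label = "CPP" if n_cpp * 2 > len(votes) else "non-CPP"
    return label, n_cpp / len(votes)

# Hypothetical scores for one peptide; the numbers are illustrative only.
scores = {"ESM-1280": 0.91, "ProtT5-XL BFD": 0.78, "ProtBERT": 0.42}
label, agreement = consensus(scores)
```

Returning the agreement fraction alongside the label lets a user see how strongly the individual models concur, rather than only the final call.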
Affiliation(s)
- Nandan Kumar
- Department of Grain Science and Industry, Kansas State University, Manhattan, Kansas 66506, United States
| | - Zhenjiao Du
- Department of Grain Science and Industry, Kansas State University, Manhattan, Kansas 66506, United States
| | - Yonghui Li
- Department of Grain Science and Industry, Kansas State University, Manhattan, Kansas 66506, United States
| |
|
7
|
Yuan Y, Chen S, Hu R, Wang X. MutualDTA: An Interpretable Drug-Target Affinity Prediction Model Leveraging Pretrained Models and Mutual Attention. J Chem Inf Model 2025. [PMID: 39878060 DOI: 10.1021/acs.jcim.4c01893] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/31/2025]
Abstract
Efficient and accurate drug-target affinity (DTA) prediction can significantly accelerate the drug development process. Recently, deep learning models have been widely applied to DTA prediction and have achieved notable success. However, existing methods often encounter several common issues: first, the data representations lack sufficient information; second, the extracted features are not comprehensive; and third, most methods lack interpretability when modeling drug-target binding. To overcome these problems, we propose an interpretable deep learning model called MutualDTA for predicting DTA. MutualDTA leverages the power of pretrained models to obtain accurate representations of drugs and targets. It also employs well-designed modules to extract hidden features from these representations. Furthermore, the interpretability of MutualDTA is realized by the Mutual-Attention module, which (i) establishes relationships between drugs and proteins from the perspective of intermolecular interactions between drug atoms and protein amino acid residues and (ii) allows MutualDTA to capture the binding sites based on attention scores. The test results on two benchmark data sets show that MutualDTA achieves the best performance compared to 12 state-of-the-art models. Attention visualization experiments show that MutualDTA can capture partial interaction sites, which not only helps drug developers reduce the search space for binding sites, but also demonstrates the interpretability of MutualDTA. Finally, the trained MutualDTA is applied to screen for high-affinity drugs targeting Alzheimer's disease (AD)-related proteins, and the screened drugs are partially present in the anti-AD drug library. These results demonstrate the reliability of MutualDTA in drug development.
Affiliation(s)
- Yongna Yuan
- School of Information Science & Engineering, Lanzhou University, Lanzhou 730000, China
| | - Siming Chen
- School of Information Science & Engineering, Lanzhou University, Lanzhou 730000, China
| | - Rizhen Hu
- School of Information Science & Engineering, Lanzhou University, Lanzhou 730000, China
| | - Xin Wang
- School of Information Science & Engineering, Lanzhou University, Lanzhou 730000, China
| |
|
8
|
Lytras S, Lamb KD, Ito J, Grove J, Yuan K, Sato K, Hughes J, Robertson DL. Pathogen genomic surveillance and the AI revolution. J Virol 2025:e0160124. [PMID: 39878472 DOI: 10.1128/jvi.01601-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/31/2025] Open
Abstract
The unprecedented sequencing efforts during the COVID-19 pandemic paved the way for genomic surveillance to become a powerful tool for monitoring the evolution of circulating viruses. Herein, we discuss how a state-of-the-art artificial intelligence approach called protein language models (pLMs) can be used for effectively analyzing pathogen genomic data. We highlight examples of pLMs applied to predicting viral properties and evolution and lay out a framework for integrating pLMs into genomic surveillance pipelines.
Affiliation(s)
- Spyros Lytras
- Division of Systems Virology, Department of Microbiology and Immunology, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
- MRC-University of Glasgow Centre for Virus Research, Glasgow, Scotland, United Kingdom
| | - Kieran D Lamb
- MRC-University of Glasgow Centre for Virus Research, Glasgow, Scotland, United Kingdom
- School of Computing Science, University of Glasgow, Glasgow, Scotland, United Kingdom
| | - Jumpei Ito
- Division of Systems Virology, Department of Microbiology and Immunology, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
- International Research Center for Infectious Diseases, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
| | - Joe Grove
- MRC-University of Glasgow Centre for Virus Research, Glasgow, Scotland, United Kingdom
| | - Ke Yuan
- School of Computing Science, University of Glasgow, Glasgow, Scotland, United Kingdom
- School of Cancer Sciences, University of Glasgow, Glasgow, Scotland, United Kingdom
- Cancer Research UK Scotland Institute, Glasgow, Scotland, United Kingdom
| | - Kei Sato
- Division of Systems Virology, Department of Microbiology and Immunology, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
- MRC-University of Glasgow Centre for Virus Research, Glasgow, Scotland, United Kingdom
- International Research Center for Infectious Diseases, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
- Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Japan
- International Vaccine Design Center, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
- Collaboration Unit for Infection, Joint Research Center for Human Retrovirus Infection, Kumamoto University, Kumamoto, Japan
| | - Joseph Hughes
- MRC-University of Glasgow Centre for Virus Research, Glasgow, Scotland, United Kingdom
| | - David L Robertson
- MRC-University of Glasgow Centre for Virus Research, Glasgow, Scotland, United Kingdom
| |
|
9
|
Dosajh A, Agrawal P, Chatterjee P, Priyakumar UD. Modern machine learning methods for protein property prediction. Curr Opin Struct Biol 2025; 90:102990. [PMID: 39881454 DOI: 10.1016/j.sbi.2025.102990] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/24/2024] [Revised: 12/06/2024] [Accepted: 01/04/2025] [Indexed: 01/31/2025]
Abstract
Recent progress and development of artificial intelligence and machine learning (AI/ML) techniques have enabled addressing complex biomolecular problems. AI/ML models learn the underlying distribution of data they are trained on and when exposed to new inputs, they make predictions based on patterns and relationships previously observed in the training set. Further, generative artificial intelligence (GenAI) can be used to accurately generate protein structure or sequence from specific selected properties. This review specifically focuses on the applications of AI/ML in predicting important functional properties of proteins, and the potential prospects of reverse-engineering in depicting the sequence and structure, from available protein-property information.
Affiliation(s)
- Arjun Dosajh
- Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, Telangana, India
| | - Prakul Agrawal
- Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, Telangana, India
| | - Prathit Chatterjee
- Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, Telangana, India
| | - U Deva Priyakumar
- Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500032, Telangana, India.
| |
|
10
|
Feller AL, Wilke CO. Peptide-Aware Chemical Language Model Successfully Predicts Membrane Diffusion of Cyclic Peptides. J Chem Inf Model 2025; 65:571-579. [PMID: 39772542 DOI: 10.1021/acs.jcim.4c01441] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/11/2025]
Abstract
Language modeling applied to biological data has significantly advanced the prediction of membrane penetration for small-molecule drugs and natural peptides. However, accurately predicting membrane diffusion for peptides with pharmacologically relevant modifications remains a substantial challenge. Here, we introduce PeptideCLM, a peptide-focused chemical language model capable of encoding peptides with chemical modifications, unnatural or noncanonical amino acids, and cyclizations. We assess this model by predicting membrane diffusion of cyclic peptides, demonstrating greater predictive power than existing chemical language models. Our model is versatile and can be extended beyond membrane diffusion predictions to other target values. Its advantages include the ability to model macromolecules using chemical string notation, a largely unexplored domain, and a simple, flexible architecture that allows for adaptation to any peptide or other macromolecule data set.
Affiliation(s)
- Aaron L Feller
- Interdisciplinary Life Sciences, The University of Texas at Austin, Austin, Texas 78712, United States
| | - Claus O Wilke
- Interdisciplinary Life Sciences, The University of Texas at Austin, Austin, Texas 78712, United States
- Department of Integrative Biology, The University of Texas at Austin, Austin, Texas 78712, United States
| |
|
11
|
Wu J, Liu Y, Zhang Y, Wang X, Yan H, Zhu Y, Song J, Yu DJ. Identifying Protein-Nucleotide Binding Residues via Grouped Multi-task Learning and Pre-trained Protein Language Models. J Chem Inf Model 2025; 65:1040-1052. [PMID: 39788787 DOI: 10.1021/acs.jcim.4c02092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/12/2025]
Abstract
The accurate identification of protein-nucleotide binding residues is crucial for protein function annotation and drug discovery. Numerous computational methods have been proposed to predict these binding residues, achieving remarkable performance. However, due to the limited availability and high variability of nucleotides, predicting binding residues for diverse nucleotides remains a significant challenge. To address this challenge, we propose NucGMTL, a new grouped deep multi-task learning approach designed for predicting binding residues of all observed nucleotides in the BioLiP database. NucGMTL leverages pre-trained protein language models to generate robust sequence embeddings and incorporates multi-scale learning along with scale-based self-attention mechanisms to capture a broader range of feature dependencies. To effectively harness the shared binding patterns across various nucleotides, deep multi-task learning is utilized to distill common representations, taking advantage of auxiliary information from similar nucleotides selected based on task grouping. Performance evaluation on benchmark data sets shows that NucGMTL achieves an average area under the Precision-Recall curve (AUPRC) of 0.594, surpassing other state-of-the-art methods. Further analyses highlight that the predominant advantage of NucGMTL lies in its effective integration of grouped multi-task learning and pre-trained protein language models. The data set and source code are freely accessible at: https://github.com/jerry1984Y/NucGMTL.
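The AUPRC metric reported above is commonly estimated as average precision: the mean of the precision values at the rank of each true positive when residues are sorted by predicted score. The sketch below shows that estimator in pure Python on a toy example; the labels and scores are invented for illustration, and it is not known whether the authors used this estimator or a trapezoidal one.

```python
def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    labels: 1 for a binding residue, 0 otherwise; scores: predicted scores.
    Tied scores are ranked in input order in this simple sketch.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_positives += 1
            total += true_positives / rank   # precision at this recall step
    return total / n_pos

# Toy example: positives land at ranks 1 and 3, so AP = (1/1 + 2/3) / 2 = 5/6.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
ap = average_precision(labels, scores)
```

For a random classifier the expected AUPRC is roughly the fraction of positive residues, so on heavily imbalanced binding-site data a value such as 0.594 should be read against that low baseline rather than against 0.5.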
Affiliation(s)
- Jiashun Wu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Yan Liu
- School of Information Engineering, Yangzhou University, Yangzhou 225100, China
| | - Ying Zhang
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Xiaoyu Wang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - He Yan
- College of Information Science and Technology & Artificial Intelligence, Nanjing Forestry University, Nanjing 210037, China
| | - Yiheng Zhu
- College of Artificial Intelligence, Nanjing Agricultural University, Nanjing 210095, China
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| |
|
12
|
Creanza TM, Alberga D, Patruno C, Mangiatordi GF, Ancona N. Transformer Decoder Learns from a Pretrained Protein Language Model to Generate Ligands with High Affinity. J Chem Inf Model 2025. [PMID: 39871540 DOI: 10.1021/acs.jcim.4c02019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/29/2025]
Abstract
The drug discovery process can be significantly accelerated by using deep learning methods to suggest molecules that have druglike features and, more importantly, are good candidates to bind specific proteins of interest. We present a novel deep learning generative model, Prot2Drug, that learns to generate ligands binding specific targets by leveraging (i) the information carried by a pretrained protein language model and (ii) the ability of transformers to capitalize on the knowledge gathered from thousands of protein-ligand interactions. The embedding unveils the recipe to follow for designing molecules binding a given protein, and Prot2Drug translates such instructions using the syntax of the molecular language, generating novel compounds which are predicted to have favorable physicochemical properties and high affinity toward specific targets. Moreover, Prot2Drug reproduced numerous known interactions between compounds and proteins used for generating them and suggested novel protein targets for known compounds, indicating potential drug repurposing strategies. Remarkably, Prot2Drug facilitates the design of promising ligands even for protein targets with limited or no information about their ligands or 3D structure.
Collapse
Affiliation(s)
- Teresa Maria Creanza
- Institute of Intelligent Industrial Technologies and Systems for Advanced Manufacturing, Consiglio Nazionale delle Ricerche, Via G. Amendola, 122/d, Bari 70126, Italy
- Domenico Alberga
- Institute of Crystallography, Consiglio Nazionale delle Ricerche, Via G. Amendola, 122/d, Bari 70126, Italy
- Cosimo Patruno
- Institute of Intelligent Industrial Technologies and Systems for Advanced Manufacturing, Consiglio Nazionale delle Ricerche, Via G. Amendola, 122/d, Bari 70126, Italy
- Nicola Ancona
- Institute of Intelligent Industrial Technologies and Systems for Advanced Manufacturing, Consiglio Nazionale delle Ricerche, Via G. Amendola, 122/d, Bari 70126, Italy
13
Yue Y, Cheng Y, Marquet C, Xiao C, Guo J, Li S, He S. Meta-Learning Enables Complex Cluster-Specific Few-Shot Binding Affinity Prediction for Protein-Protein Interactions. J Chem Inf Model 2025; 65:580-588. [PMID: 39772708 DOI: 10.1021/acs.jcim.4c01607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/11/2025]
Abstract
Predicting protein-protein interaction (PPI) binding affinities in unseen protein complex clusters is essential for elucidating complex protein interactions and for the targeted screening of peptide- or protein-based drugs. We introduce MCGLPPI++, a meta-learning framework designed to improve the adaptability of pretrained geometric models in such scenarios. To effectively boost the meta-learning optimization by injecting prior intersample distribution knowledge, three specially designed training sample cluster splitting patterns based on protein interaction interfaces are introduced. Additionally, MCGLPPI++ is equipped with an independent energy component which explicitly models interface nonbonded interaction energies closely related to the strengths of PPIs. To validate our approach, we curate a new data set featuring a challenging test cluster of T-cell receptors binding to antigenic peptide-MHC molecules (TCR-pMHC). Experimental results show that geometric models enhanced by the MCGLPPI++ framework achieve significantly more robust binding affinity predictions after fine-tuning on a few samples from this novel cluster compared to their vanilla counterparts, which demonstrates the effectiveness of the framework.
Affiliation(s)
- Yang Yue
- School of Computer Science, The University of Birmingham, Edgbaston, Birmingham B15 2TT, U.K
- Yihua Cheng
- School of Computer Science, The University of Birmingham, Edgbaston, Birmingham B15 2TT, U.K
- Céline Marquet
- Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching 85748, Munich, Germany
- Chenguang Xiao
- School of Computer Science, The University of Birmingham, Edgbaston, Birmingham B15 2TT, U.K
- Jingjing Guo
- Centre of Artificial Intelligence Driven Drug Discovery, Faculty of Applied Science, Macao Polytechnic University, Macao SAR 999078, China
- Shu Li
- Centre of Artificial Intelligence Driven Drug Discovery, Faculty of Applied Science, Macao Polytechnic University, Macao SAR 999078, China
- Shan He
- School of Computer Science, The University of Birmingham, Edgbaston, Birmingham B15 2TT, U.K
14
Bhat S, Palepu K, Hong L, Mao J, Ye T, Iyer R, Zhao L, Chen T, Vincoff S, Watson R, Wang TZ, Srijay D, Kavirayuni VS, Kholina K, Goel S, Vure P, Deshpande AJ, Soderling SH, DeLisa MP, Chatterjee P. De novo design of peptide binders to conformationally diverse targets with contrastive language modeling. Sci Adv 2025; 11:eadr8638. [PMID: 39841846 PMCID: PMC11753435 DOI: 10.1126/sciadv.adr8638] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2024] [Accepted: 12/20/2024] [Indexed: 01/24/2025]
Abstract
Designing binders to target undruggable proteins presents a formidable challenge in drug discovery. In this work, we provide an algorithmic framework to design short, target-binding linear peptides, requiring only the amino acid sequence of the target protein. To do this, we propose a process to generate naturalistic peptide candidates through Gaussian perturbation of the peptidic latent space of the ESM-2 protein language model and subsequently screen these novel sequences for target-selective interaction activity via a contrastive language-image pretraining (CLIP)-based contrastive learning architecture. By integrating these generative and discriminative steps, we create a Peptide Prioritization via CLIP (PepPrCLIP) pipeline and validate highly ranked, target-specific peptides experimentally, both as inhibitory peptides and as fusions to E3 ubiquitin ligase domains. PepPrCLIP-derived constructs demonstrate functionally potent binding and degradation of conformationally diverse, disease-driving targets in vitro. In total, PepPrCLIP empowers the modulation of previously inaccessible proteins without reliance on stable and ordered tertiary structures.
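The generative step described in this abstract, Gaussian perturbation of a latent embedding, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the function `perturb_embedding`, the noise scale `sigma`, and the toy three-dimensional vector are hypothetical, and the sketch does not use ESM-2 or decode the perturbed vectors back into peptide sequences as the published pipeline would.

```python
import random


def perturb_embedding(embedding, sigma=0.1, n_samples=5, seed=0):
    """Return n_samples noisy copies of a latent vector.

    Each copy adds independent zero-mean Gaussian noise with standard
    deviation sigma to every coordinate. sigma and n_samples are
    illustrative defaults, not values taken from the paper.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [
        [x + rng.gauss(0.0, sigma) for x in embedding]
        for _ in range(n_samples)
    ]


# Toy stand-in for a peptide embedding (real PLM embeddings are much larger).
seed_embedding = [0.1, -0.2, 0.3]
candidates = perturb_embedding(seed_embedding, sigma=0.1, n_samples=5)
print(len(candidates), "perturbed candidates generated")
```

In a pipeline of the kind the abstract describes, each perturbed vector would then be decoded to a candidate sequence and scored by a separate discriminative model; both of those steps are outside this sketch.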
Affiliation(s)
- Suhaas Bhat
- Department of Biomedical Engineering, Duke University, Durham, NC, USA
- Kalyan Palepu
- Department of Biomedical Engineering, Duke University, Durham, NC, USA
- Lauren Hong
- Department of Biomedical Engineering, Duke University, Durham, NC, USA
- Joey Mao
- Department of Cell Biology, Duke University, Durham, NC, USA
- Tianzheng Ye
- Robert F. Smith School of Chemical and Biomolecular Engineering, Cornell University, Ithaca, NY, USA
- Rema Iyer
- Cancer Genome and Epigenetics Program, Sanford Burnham Prebys Institute, San Diego, CA, USA
- Lin Zhao
- Department of Biomedical Engineering, Duke University, Durham, NC, USA
- Tianlai Chen
- Department of Biomedical Engineering, Duke University, Durham, NC, USA
- Sophia Vincoff
- Department of Biomedical Engineering, Duke University, Durham, NC, USA
- Rio Watson
- Department of Biomedical Engineering, Duke University, Durham, NC, USA
- Tian Z. Wang
- Department of Biomedical Engineering, Duke University, Durham, NC, USA
- Divya Srijay
- Department of Biomedical Engineering, Duke University, Durham, NC, USA
- Kseniia Kholina
- Department of Biomedical Engineering, Duke University, Durham, NC, USA
- Shrey Goel
- Department of Biomedical Engineering, Duke University, Durham, NC, USA
- Pranay Vure
- Department of Biomedical Engineering, Duke University, Durham, NC, USA
- Aniruddha J. Deshpande
- Cancer Genome and Epigenetics Program, Sanford Burnham Prebys Institute, San Diego, CA, USA
- Matthew P. DeLisa
- Robert F. Smith School of Chemical and Biomolecular Engineering, Cornell University, Ithaca, NY, USA
- Meinig School of Biomedical Engineering, Cornell University, Ithaca, NY, USA
- Cornell Institute of Biotechnology, Cornell University, Ithaca, NY, USA
- Pranam Chatterjee
- Department of Biomedical Engineering, Duke University, Durham, NC, USA
- Department of Computer Science, Duke University, Durham, NC, USA
- Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA
15
Jiang K, Yan Z, Di Bernardo M, Sgrizzi SR, Villiger L, Kayabolen A, Kim BJ, Carscadden JK, Hiraizumi M, Nishimasu H, Gootenberg JS, Abudayyeh OO. Rapid in silico directed evolution by a protein language model with EVOLVEpro. Science 2025; 387:eadr6006. [PMID: 39571002 DOI: 10.1126/science.adr6006] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/09/2024] [Accepted: 11/12/2024] [Indexed: 01/25/2025]
Abstract
Directed protein evolution is central to biomedical applications but faces challenges such as experimental complexity, inefficient multiproperty optimization, and local maxima traps. Although in silico methods that use protein language models (PLMs) can provide modeled fitness landscape guidance, they struggle to generalize across diverse protein families and map to protein activity. We present EVOLVEpro, a few-shot active learning framework that combines PLMs and regression models to rapidly improve protein activity. EVOLVEpro surpasses current methods, yielding up to 100-fold improvements in desired properties. We demonstrate its effectiveness across six proteins in RNA production, genome editing, and antibody binding applications. These results highlight the advantages of few-shot active learning with minimal experimental data over zero-shot predictions. EVOLVEpro opens new possibilities for artificial intelligence-guided protein engineering in biology and medicine.
Affiliation(s)
- Kaiyi Jiang
- Department of Medicine Division of Engineering in Medicine Brigham and Women's Hospital Harvard Medical School, Boston, MA, USA
- Gene and Cell Therapy Institute Mass General Brigham, Cambridge, MA, USA
- Center for Virology and Vaccine Research Beth Israel Deaconess Medical Center Harvard Medical School, Boston, MA, USA
- Department of Bioengineering Massachusetts Institute of Technology, Cambridge, MA, USA
- Zhaoqing Yan
- Department of Medicine Division of Engineering in Medicine Brigham and Women's Hospital Harvard Medical School, Boston, MA, USA
- Gene and Cell Therapy Institute Mass General Brigham, Cambridge, MA, USA
- Center for Virology and Vaccine Research Beth Israel Deaconess Medical Center Harvard Medical School, Boston, MA, USA
- Matteo Di Bernardo
- Whitehead Institute Massachusetts Institute of Technology, Cambridge, MA, USA
- Samantha R Sgrizzi
- Department of Medicine Division of Engineering in Medicine Brigham and Women's Hospital Harvard Medical School, Boston, MA, USA
- Gene and Cell Therapy Institute Mass General Brigham, Cambridge, MA, USA
- Center for Virology and Vaccine Research Beth Israel Deaconess Medical Center Harvard Medical School, Boston, MA, USA
- Lukas Villiger
- Department of Dermatology and Allergology Kantonspital St. Gallen, St. Gallen, Switzerland
- Alisan Kayabolen
- Department of Medicine Division of Engineering in Medicine Brigham and Women's Hospital Harvard Medical School, Boston, MA, USA
- Gene and Cell Therapy Institute Mass General Brigham, Cambridge, MA, USA
- Center for Virology and Vaccine Research Beth Israel Deaconess Medical Center Harvard Medical School, Boston, MA, USA
- B J Kim
- Koch Institute for Integrative Cancer Research at MIT Massachusetts Institute of Technology, Cambridge, MA, USA
- Josephine K Carscadden
- Department of Medicine Division of Engineering in Medicine Brigham and Women's Hospital Harvard Medical School, Boston, MA, USA
- Gene and Cell Therapy Institute Mass General Brigham, Cambridge, MA, USA
- Center for Virology and Vaccine Research Beth Israel Deaconess Medical Center Harvard Medical School, Boston, MA, USA
- Masahiro Hiraizumi
- Department of Chemistry and Biotechnology, Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
- Hiroshi Nishimasu
- Department of Chemistry and Biotechnology, Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
- Structural Biology Division, Research Center for Advanced Science and Technology, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo, Japan
- Inamori Research Institute for Science, 620 Suiginya-cho, Shimogyo-ku, Kyoto, Japan
- Jonathan S Gootenberg
- Department of Medicine Division of Engineering in Medicine Brigham and Women's Hospital Harvard Medical School, Boston, MA, USA
- Gene and Cell Therapy Institute Mass General Brigham, Cambridge, MA, USA
- Center for Virology and Vaccine Research Beth Israel Deaconess Medical Center Harvard Medical School, Boston, MA, USA
- Omar O Abudayyeh
- Department of Medicine Division of Engineering in Medicine Brigham and Women's Hospital Harvard Medical School, Boston, MA, USA
- Gene and Cell Therapy Institute Mass General Brigham, Cambridge, MA, USA
- Center for Virology and Vaccine Research Beth Israel Deaconess Medical Center Harvard Medical School, Boston, MA, USA
16
De Waele G, Menschaert G, Vandamme P, Waegeman W. Pre-trained Maldi Transformers improve MALDI-TOF MS-based prediction. Comput Biol Med 2025; 186:109695. [PMID: 39847945 DOI: 10.1016/j.compbiomed.2025.109695] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2024] [Revised: 01/10/2025] [Accepted: 01/13/2025] [Indexed: 01/25/2025]
Abstract
For the last decade, matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) has been the reference method for species identification in clinical microbiology. Hampered by a historical lack of open data, machine learning research towards models specifically adapted to MALDI-TOF MS remains in its infancy. Given the growing complexity of available datasets (such as large-scale antimicrobial resistance prediction), a need for models that (1) are specifically designed for MALDI-TOF MS data, and (2) have high representational capacity, presents itself. Here, we introduce Maldi Transformer, an adaptation of the state-of-the-art transformer architecture to the MALDI-TOF mass spectral domain. We propose the first self-supervised pre-training technique specifically designed for mass spectra. The technique is based on shuffling peaks across spectra, and pre-training the transformer as a peak discriminator. Extensive benchmarks confirm the efficacy of this novel design. The final result is a model exhibiting state-of-the-art (or competitive) performance on downstream prediction tasks. In addition, we show that Maldi Transformer's identification of noisy spectra may be leveraged towards higher predictive performance. All code supporting this study is distributed on PyPI and is packaged under: https://github.com/gdewael/maldi-nn.
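The pre-training idea in this abstract, shuffling peaks across spectra and training a discriminator to spot the swapped ones, can be sketched as a data-corruption routine. This is a minimal illustration under stated assumptions, not the maldi-nn implementation: the function `shuffle_peaks`, the swap probability, and the representation of a spectrum as a list of `(m/z, intensity)` tuples are all hypothetical.

```python
import random


def shuffle_peaks(spectra, swap_prob=0.15, seed=0):
    """Corrupt spectra by swapping in peaks drawn from other spectra.

    Returns (corrupted_spectra, labels), where labels[i][k] is 1 if peak k
    of spectrum i was replaced by a foreign peak and 0 if it is original.
    A peak discriminator would be trained to predict these labels.
    swap_prob is an illustrative value, not one taken from the paper.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for i, spectrum in enumerate(spectra):
        # Pool of peaks from every spectrum except the current one.
        foreign = [p for j, s in enumerate(spectra) if j != i for p in s]
        new_spectrum, new_labels = [], []
        for peak in spectrum:
            if foreign and rng.random() < swap_prob:
                new_spectrum.append(rng.choice(foreign))
                new_labels.append(1)
            else:
                new_spectrum.append(peak)
                new_labels.append(0)
        corrupted.append(new_spectrum)
        labels.append(new_labels)
    return corrupted, labels


# Two toy spectra, each a list of (m/z, intensity) peaks.
toy_spectra = [
    [(1000.0, 0.5), (2000.0, 0.2)],
    [(1500.0, 0.9), (2500.0, 0.1)],
]
corrupted, labels = shuffle_peaks(toy_spectra, swap_prob=0.5)
```

The per-peak labels give a self-supervised target that needs no species or resistance annotation, which is what makes this kind of objective usable for pre-training.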
Affiliation(s)
- Gaetan De Waele
- Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, Ghent, 9000, Belgium.
- Gerben Menschaert
- Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, Ghent, 9000, Belgium
- Peter Vandamme
- Laboratory of Microbiology, Ghent University, K. L. Ledeganckstraat 35, Ghent, 9000, Belgium
- Willem Waegeman
- Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, Ghent, 9000, Belgium
17
Majila K, Ullanat V, Viswanath S. A deep learning method for predicting interactions for intrinsically disordered regions of proteins. bioRxiv 2025:2024.12.19.629373. [PMID: 39763873 PMCID: PMC11702703 DOI: 10.1101/2024.12.19.629373] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/14/2025]
Abstract
Intrinsically disordered proteins or regions (IDPs/IDRs) adopt diverse binding modes with different partners, ranging from ordered to multivalent to fuzzy conformations in the bound state. Characterizing IDR interfaces is challenging experimentally and computationally. AlphaFold-multimer and AlphaFold3, the state-of-the-art structure prediction methods, are less accurate at predicting IDR binding sites at their benchmarked confidence cutoffs. Their performance improves upon lowering the confidence cutoffs. Here, we developed Disobind, a deep-learning method that predicts inter-protein contact maps and interface residues for an IDR and a partner protein, given their sequences. It outperforms AlphaFold-multimer and AlphaFold3 at multiple confidence cutoffs. Combining the Disobind and AlphaFold-multimer predictions further improves the performance. In contrast to most current methods, Disobind considers the context of the binding partner and does not depend on structures and multiple sequence alignments. Its predictions can be used to localize IDRs in integrative structures of large assemblies and characterize and modulate IDR-mediated interactions.
Affiliation(s)
- Kartik Majila
- National Center for Biological Sciences, Tata Institute of Fundamental Research, Bangalore, India 560065
- Varun Ullanat
- National Center for Biological Sciences, Tata Institute of Fundamental Research, Bangalore, India 560065
- Shruthi Viswanath
- National Center for Biological Sciences, Tata Institute of Fundamental Research, Bangalore, India 560065
18
Elkin ME, Zhu X. Paying attention to the SARS-CoV-2 dialect: a deep neural network approach to predicting novel protein mutations. Commun Biol 2025; 8:98. [PMID: 39838059 PMCID: PMC11751191 DOI: 10.1038/s42003-024-07262-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2024] [Accepted: 11/13/2024] [Indexed: 01/23/2025] Open
Abstract
Predicting novel mutations has long-lasting impacts on life science research. Traditionally, this problem is addressed through wet-lab experiments, which are often expensive and time consuming. The recent advancement in neural language models has provided stunning results in modeling and deciphering sequences. In this paper, we propose a Deep Novel Mutation Search (DNMS) method, using deep neural networks, to model protein sequence for mutation prediction. We use SARS-CoV-2 spike protein as the target and use a protein language model to predict novel mutations. Different from existing research which is often limited to mutating the reference sequence for prediction, we propose a parent-child mutation prediction paradigm where a parent sequence is modeled for mutation prediction. Because mutations introduce changing context to the underlying sequence, DNMS models three aspects of the protein sequences: semantic changes, grammatical changes, and attention changes, each modeling protein sequence aspects from shifting of semantics, grammar coherence, and amino-acid interactions in latent space. A ranking approach is proposed to combine all three aspects to capture mutations demonstrating evolving traits, in accordance with real-world SARS-CoV-2 spike protein sequence evolution. DNMS can be adopted for an early warning variant detection system, creating public health awareness of future SARS-CoV-2 mutations.
Affiliation(s)
- Magdalyn E Elkin
- Dept. Electrical Engineering and Computer Science, Florida Atlantic University, 777 Glades Road, Boca Raton, FL, 33431, USA.
- Xingquan Zhu
- Dept. Electrical Engineering and Computer Science, Florida Atlantic University, 777 Glades Road, Boca Raton, FL, 33431, USA.
19
Meng L, Wei L, Wu R. MVGNN-PPIS: A novel multi-view graph neural network for protein-protein interaction sites prediction based on Alphafold3-predicted structures and transfer learning. Int J Biol Macromol 2025; 300:140096. [PMID: 39848362 DOI: 10.1016/j.ijbiomac.2025.140096] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2024] [Revised: 01/04/2025] [Accepted: 01/17/2025] [Indexed: 01/25/2025]
Abstract
Protein-protein interactions (PPI) are crucial for understanding numerous biological processes and pathogenic mechanisms. Identifying interaction sites is essential for biomedical research and targeted drug development. Compared to experimental methods, accurate computational approaches for protein-protein interaction sites (PPIS) prediction can save significant time and costs. In this study, we propose a novel model named MVGNN-PPIS. To the best of our knowledge, it is the first to combine predicted structures generated by AlphaFold3 with transfer learning techniques for predicting PPIS. This approach addresses the limitations of traditional methods that depend on native protein structures and multiple sequence alignments (MSA). Additionally, we introduce a multi-view graph framework based on two types of graph structures: the k-nearest neighbor graph and the adjacency matrix. By alternately employing a Graph Transformer and Graph Convolutional Networks (GCN) to aggregate node information, this framework effectively captures both local and global dependencies of each residue in the predicted structures, thereby significantly enhancing the model's sensitivity to binding sites. This framework further integrates direction, distance, and angular information between the 3D coordinates of side-chain atom centroids to construct a relative coordinate system, generating enhanced edge features that ensure the model's equivariance to molecular translations and rotations in space. During training, the Focal Loss function is employed to effectively address the class imbalance in the dataset. Experimental results demonstrate that MVGNN-PPIS outperforms the current state-of-the-art methods across multiple PPIS benchmark datasets. To further validate the model's generalization capability, we extended MVGNN-PPIS to the domain of predicting protein-nucleic acid interaction sites, where it also achieved superior performance.
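The abstract mentions Focal Loss as the remedy for class imbalance (interface residues are far rarer than non-interface ones). A minimal sketch of the standard binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), is given here for orientation. The function name and the defaults `gamma=2.0` and `alpha=0.25` are assumptions taken from common practice, not values reported for MVGNN-PPIS.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for a single residue.

    p: predicted probability that the residue is an interaction site.
    y: true label (1 = interaction site, 0 = not).
    gamma down-weights well-classified examples; alpha balances the classes.
    Both defaults are illustrative assumptions.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-dependent weight
    p_t = max(p_t, 1e-12)                       # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.95, 1)  # confident and correct: tiny loss
hard = binary_focal_loss(0.10, 1)  # confident and wrong: much larger loss
print(f"easy={easy:.6f} hard={hard:.6f}")
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the ordinary cross-entropy, which is a convenient sanity check when wiring such a loss into a training loop.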
Affiliation(s)
- Lu Meng
- College of Information Science and Engineering, Northeastern University, China.
- Lishuai Wei
- College of Information Science and Engineering, Northeastern University, China
- Rina Wu
- College of Information Science and Engineering, Northeastern University, China
20
Howladar N, Kabir MWU, Hoque F, Katebi A, Hoque MT. PPILS: Protein-protein interaction prediction with language of biological coding. Comput Biol Med 2025; 186:109678. [PMID: 39832439 DOI: 10.1016/j.compbiomed.2025.109678] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2024] [Revised: 01/03/2025] [Accepted: 01/12/2025] [Indexed: 01/22/2025]
Abstract
Protein-protein interactions within a cell are essential for various fundamental biological processes. Computational techniques have arisen in bioinformatics due to the challenging and resource-intensive nature of experimental protein pair interaction studies. This research seeks to create a cutting-edge machine learning method for predicting protein pair interactions using carefully chosen input features and leveraging evolutionary data. PPILS leverages evolutionary knowledge from the protein language model. It develops an encoder-decoder architecture with light attention. The trained model obtains protein embeddings from a language model and employs a light attention-based encoder, where a single convolution operation generates attention. A subsequent convolution is applied to input features, creating a representative construct for the protein interaction prediction. These encoded representations are then channeled into the decoder to predict protein interactions. Our findings indicated that PPILS outperformed existing methods in PPI prediction. The proposed method could be essential in protein-protein interaction prediction, further accelerating the discovery of protein-based drugs.
Affiliation(s)
- Nayan Howladar
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA.
- Md Wasi Ul Kabir
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA.
- Foyzul Hoque
- Department of Computer Science & Engineering, Independent University, Bangladesh.
- Ataur Katebi
- Department of Bioengineering, Northeastern University, Boston, MA, USA; Center for Theoretical Biological Physics, Northeastern University, Boston, MA, USA.
- Md Tamjidul Hoque
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA.
21
Mall R, Kaushik R, Martinez ZA, Thomson MW, Castiglione F. Benchmarking protein language models for protein crystallization. Sci Rep 2025; 15:2381. [PMID: 39827171 PMCID: PMC11743144 DOI: 10.1038/s41598-025-86519-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2024] [Accepted: 01/13/2025] [Indexed: 01/22/2025] Open
Abstract
The problem of protein structure determination is usually solved by X-ray crystallography. Several in silico deep learning methods have been developed to overcome the high attrition rate, cost of experiments and extensive trial-and-error settings, by predicting the crystallization propensities of proteins based on their sequences. In this work, we benchmark the power of open protein language models (PLMs) through the TRILL platform, a bespoke framework democratizing the usage of PLMs, for the task of predicting crystallization propensities of proteins. By comparing LightGBM / XGBoost classifiers built on the average embedding representations of proteins learned by different PLMs, such as ESM2, Ankh, ProtT5-XL, ProstT5, xTrimoPGLM, and SaProt, with the performance of state-of-the-art sequence-based methods like DeepCrystal, ATTCrys and CLPred, we identify the most effective methods for predicting crystallization outcomes. The LightGBM classifiers utilizing embeddings from ESM2 models with 30 and 36 transformer layers and 150 and 3000 million parameters, respectively, achieve performance gains of 3-[Formula: see text] over all compared models for various evaluation metrics, including AUPR (Area Under Precision-Recall Curve), AUC (Area Under the Receiver Operating Characteristic Curve), and F1 on independent test sets. Furthermore, we fine-tune the ProtGPT2 model available via TRILL to generate crystallizable proteins. Starting with 3000 generated proteins and applying a series of filtration steps, including consensus of all open PLM-based classifiers, sequence identity through CD-HIT, secondary structure compatibility, aggregation screening, homology search and foldability evaluation, we identified a set of 5 novel proteins as potentially crystallizable.
Affiliation(s)
- Raghvendra Mall
- Biotechnology Research Center, Technology Innovation Institute, P.O. Box 9639, Abu Dhabi, United Arab Emirates.
- Rahul Kaushik
- Biotechnology Research Center, Technology Innovation Institute, P.O. Box 9639, Abu Dhabi, United Arab Emirates
- Zachary A Martinez
- Division of Biology and Bioengineering, California Institute of Technology, Pasadena, 91125, CA, USA
- Matt W Thomson
- Division of Biology and Bioengineering, California Institute of Technology, Pasadena, 91125, CA, USA
- Filippo Castiglione
- Biotechnology Research Center, Technology Innovation Institute, P.O. Box 9639, Abu Dhabi, United Arab Emirates.
- Institute for Applied Computing, National Research Council of Italy, 00185, Rome, Italy.
22
Hayes T, Rao R, Akin H, Sofroniew NJ, Oktay D, Lin Z, Verkuil R, Tran VQ, Deaton J, Wiggert M, Badkundri R, Shafkat I, Gong J, Derry A, Molina RS, Thomas N, Khan YA, Mishra C, Kim C, Bartie LJ, Nemeth M, Hsu PD, Sercu T, Candido S, Rives A. Simulating 500 million years of evolution with a language model. Science 2025:eads0018. [PMID: 39818825 DOI: 10.1126/science.ads0018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2024] [Accepted: 01/07/2025] [Indexed: 01/19/2025]
Abstract
More than three billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here we show that language models trained at scale on evolutionary data can generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins. ESM3 can follow complex prompts combining its modalities and is highly responsive to alignment to improve its fidelity. We have prompted ESM3 to generate fluorescent proteins. Among the generations that we synthesized, we found a bright fluorescent protein at a far distance (58% sequence identity) from known fluorescent proteins, which we estimate is equivalent to simulating five hundred million years of evolution.
Affiliation(s)
- Roshan Rao
- EvolutionaryScale, PBC, New York, NY, USA
- Halil Akin
- EvolutionaryScale, PBC, New York, NY, USA
- Zeming Lin
- EvolutionaryScale, PBC, New York, NY, USA
- Vincent Q Tran
- Arc Institute, Palo Alto, CA, USA
- University of California, Berkeley, Berkeley, CA, USA
- Jun Gong
- EvolutionaryScale, PBC, New York, NY, USA
- Patrick D Hsu
- Arc Institute, Palo Alto, CA, USA
- University of California, Berkeley, Berkeley, CA, USA
- Tom Sercu
- EvolutionaryScale, PBC, New York, NY, USA
23
Yang J, Lal RG, Bowden JC, Astudillo R, Hameedi MA, Kaur S, Hill M, Yue Y, Arnold FH. Active learning-assisted directed evolution. Nat Commun 2025; 16:714. [PMID: 39821082 PMCID: PMC11739421 DOI: 10.1038/s41467-025-55987-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2024] [Accepted: 01/02/2025] [Indexed: 01/19/2025] Open
Abstract
Directed evolution (DE) is a powerful tool to optimize protein fitness for a specific application. However, DE can be inefficient when mutations exhibit non-additive, or epistatic, behavior. Here, we present Active Learning-assisted Directed Evolution (ALDE), an iterative machine learning-assisted DE workflow that leverages uncertainty quantification to explore the search space of proteins more efficiently than current DE methods. We apply ALDE to an engineering landscape that is challenging for DE: optimization of five epistatic residues in the active site of an enzyme. In three rounds of wet-lab experimentation, we improve the yield of a desired product of a non-native cyclopropanation reaction from 12% to 93%. We also perform computational simulations on existing protein sequence-fitness datasets to support our argument that ALDE can be more effective than DE. Overall, ALDE is a practical and broadly applicable strategy to unlock improved protein engineering outcomes.
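To make "leverages uncertainty quantification" concrete, the sketch here shows one common way an active-learning round can pick the next variants to test: score each candidate by predicted mean plus a multiple of predictive spread, then take the top few. This is an illustrative assumption, not the acquisition function reported for ALDE; the name `select_batch`, the upper-confidence-bound rule, the weight `beta`, and the toy ensemble predictions are all hypothetical.

```python
from statistics import mean, pstdev


def select_batch(ensemble_preds, batch_size=2, beta=1.0):
    """Pick variants with the highest upper-confidence-bound score.

    ensemble_preds maps a variant name to the fitness predictions of
    several models (an ensemble serves as a cheap uncertainty estimate).
    score = mean + beta * standard deviation; beta = 0 is pure exploitation.
    """
    scores = {
        variant: mean(preds) + beta * pstdev(preds)
        for variant, preds in ensemble_preds.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:batch_size]


# Toy predictions from a three-model ensemble (made-up numbers).
preds = {
    "variant_A": [0.60, 0.62, 0.61],  # models agree: high mean, low uncertainty
    "variant_B": [0.20, 0.90, 0.55],  # models disagree: high uncertainty
    "variant_C": [0.10, 0.12, 0.11],  # models agree: low mean
}
print(select_batch(preds, batch_size=1, beta=0.0))  # exploit only
print(select_batch(preds, batch_size=1, beta=1.0))  # reward uncertainty
```

In a full workflow the selected variants would be assayed in the wet lab, the measurements added to the training set, and the models refit before the next round.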
Affiliation(s)
- Jason Yang
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
- Ravi G Lal
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
- James C Bowden
- Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, CA, USA
- Computer Science, University of California-Berkeley, Berkeley, CA, USA
- Raul Astudillo
- Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, CA, USA
- Mikhail A Hameedi
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- Matthew Hill
- Elegen Corp, 1300 Industrial Road #16, San Carlos, CA, USA
- Yisong Yue
- Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, CA, USA.
- Frances H Arnold
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA.
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA.
24
Ovchinnikov V, Karplus M. Phenomenological Modeling of Antibody Response from Vaccine Strain Composition. Antibodies (Basel) 2025; 14:6. [PMID: 39846614 PMCID: PMC11755667 DOI: 10.3390/antib14010006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2024] [Revised: 01/11/2025] [Accepted: 01/14/2025] [Indexed: 01/24/2025] Open
Abstract
The elicitation of broadly neutralizing antibodies (bnAbs) is a major goal of vaccine design for highly mutable pathogens, such as influenza, HIV, and coronavirus. Although many rational vaccine design strategies for eliciting bnAbs have been devised, their efficacies need to be evaluated in preclinical animal models and in clinical trials. To improve outcomes for such vaccines, it would be useful to develop methods that can predict vaccine efficacies against arbitrary pathogen variants. As a step in this direction, here, we describe a simple biologically motivated model of antibody reactivity elicited by nanoparticle-based vaccines using only antigen amino acid sequences, parametrized with a small sample of experimental antibody binding data from influenza or SARS-CoV-2 nanoparticle vaccinations. Results: The model is able to recapitulate the experimental data to within experimental uncertainty, is relatively insensitive to the choice of the parametrization/training set, and provides qualitative predictions about the antigenic epitopes exploited by the vaccine, which are testable by experiment. For the mosaic nanoparticle vaccines considered here, model results suggest indirectly that the sera obtained from vaccinated mice contain bnAbs, rather than simply different strain-specific Abs. Although the present model was motivated by nanoparticle vaccines, we also apply it to a multivalent mRNA flu vaccination study, and demonstrate good recapitulation of experimental results. This suggests that the model formalism is, in principle, sufficiently flexible to accommodate different vaccination strategies. Finally, we show how the model could be used to rank the efficacies of vaccines with different antigen compositions. Conclusions: Overall, this study suggests that simple models of vaccine efficacy parametrized with modest amounts of experimental data could be used to compare the effectiveness of designed vaccines.
Affiliation(s)
- Victor Ovchinnikov
  - Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA 02138, USA
- Martin Karplus
  - Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA 02138, USA
  - Laboratoire de Chimie Biophysique, ISIS, Université de Strasbourg, 67000 Strasbourg, France
|
25
|
Nagano Y, Pyo AGT, Milighetti M, Henderson J, Shawe-Taylor J, Chain B, Tiffeau-Mayer A. Contrastive learning of T cell receptor representations. Cell Syst 2025; 16:101165. [PMID: 39778580 DOI: 10.1016/j.cels.2024.12.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2024] [Revised: 10/09/2024] [Accepted: 12/06/2024] [Indexed: 01/11/2025]
Abstract
Computational prediction of the interaction of T cell receptors (TCRs) and their ligands is a grand challenge in immunology. Despite advances in high-throughput assays, specificity-labeled TCR data remain sparse. In other domains, the pre-training of language models on unlabeled data has been successfully used to address data bottlenecks. However, it is unclear how to best pre-train protein language models for TCR specificity prediction. Here, we introduce a TCR language model called SCEPTR (simple contrastive embedding of the primary sequence of T cell receptors), which is capable of data-efficient transfer learning. Through our model, we introduce a pre-training strategy combining autocontrastive learning and masked-language modeling, which enables SCEPTR to achieve its state-of-the-art performance. In contrast, existing protein language models and a variant of SCEPTR pre-trained without autocontrastive learning are outperformed by sequence alignment-based methods. We anticipate that contrastive learning will be a useful paradigm to decode the rules of TCR specificity. A record of this paper's transparent peer review process is included in the supplemental information.
Affiliation(s)
- Yuta Nagano
  - Division of Infection and Immunity, University College London, London WC1E 6BT, UK; Division of Medicine, University College London, London WC1E 6BT, UK
- Andrew G T Pyo
  - Center for the Physics of Biological Function, Princeton University, Princeton, NJ 08544, USA
- Martina Milighetti
  - Division of Infection and Immunity, University College London, London WC1E 6BT, UK; Cancer Institute, University College London, London WC1E 6DD, UK
- James Henderson
  - Division of Infection and Immunity, University College London, London WC1E 6BT, UK; Institute for the Physics of Living Systems, University College London, London WC1E 6BT, UK
- John Shawe-Taylor
  - Department of Computer Science, University College London, London WC1E 6BT, UK
- Benny Chain
  - Division of Infection and Immunity, University College London, London WC1E 6BT, UK; Department of Computer Science, University College London, London WC1E 6BT, UK
- Andreas Tiffeau-Mayer
  - Division of Infection and Immunity, University College London, London WC1E 6BT, UK; Institute for the Physics of Living Systems, University College London, London WC1E 6BT, UK
|
26
|
Gelman S, Johnson B, Freschlin C, Sharma A, D'Costa S, Peters J, Gitter A, Romero PA. Biophysics-based protein language models for protein engineering. bioRxiv 2025:2024.03.15.585128. [PMID: 38559182 PMCID: PMC10980077 DOI: 10.1101/2024.03.15.585128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Protein language models trained on evolutionary data have emerged as powerful tools for predictive problems involving protein sequence, structure, and function. However, these models overlook decades of research into biophysical factors governing protein function. We propose Mutational Effect Transfer Learning (METL), a protein language model framework that unites advanced machine learning and biophysical modeling. Using the METL framework, we pretrain transformer-based neural networks on biophysical simulation data to capture fundamental relationships between protein sequence, structure, and energetics. We finetune METL on experimental sequence-function data to harness these biophysical signals and apply them when predicting protein properties like thermostability, catalytic activity, and fluorescence. METL excels in challenging protein engineering tasks like generalizing from small training sets and position extrapolation, although existing methods that train on evolutionary signals remain powerful for many types of experimental assays. We demonstrate METL's ability to design functional green fluorescent protein variants when trained on only 64 examples, showcasing the potential of biophysics-based protein language models for protein engineering.
|
27
|
Changiarath A, Arya A, Xenidis VA, Padeken J, Stelzl LS. Sequence determinants of protein phase separation and recognition by protein phase-separated condensates through molecular dynamics and active learning. Faraday Discuss 2025; 256:235-254. [PMID: 39319382 DOI: 10.1039/d4fd00099d] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/26/2024]
Abstract
Elucidating how protein sequence determines the properties of disordered proteins and their phase-separated condensates is a great challenge in computational chemistry, biology, and biophysics. Quantitative molecular dynamics simulations and derived free energy values can in principle capture how a sequence encodes the chemical and biological properties of a protein. These calculations are, however, computationally demanding, even after reducing the representation by coarse-graining; exploring the large spaces of potentially relevant sequences remains a formidable task. We employ an "active learning" scheme introduced by Yang et al. (bioRxiv, 2022, https://doi.org/10.1101/2022.08.05.502972) to reduce the number of labelled examples needed from simulations, where a neural network-based model suggests the most useful examples for the next training cycle. Applying this Bayesian optimisation framework, we determine properties of protein sequences with coarse-grained molecular dynamics, which enables the network to establish sequence-property relationships for disordered proteins and their self-interactions and their interactions in phase-separated condensates. We show how iterative training with second virial coefficients derived from the simulations of disordered protein sequences leads to a rapid improvement in predicting peptide self-interactions. We employ this Bayesian approach to efficiently search for new sequences that bind to condensates of the disordered C-terminal domain (CTD) of RNA Polymerase II, by simulating molecular recognition of peptides to phase-separated condensates in coarse-grained molecular dynamics. By searching for protein sequences which prefer to self-interact rather than interact with another protein sequence we are able to shape the morphology of protein condensates and design multiphasic protein condensates.
Affiliation(s)
- Arya Changiarath
  - Institute of Physics, Johannes Gutenberg University (JGU) Mainz, Germany
- Aayush Arya
  - Institute of Physics, Johannes Gutenberg University (JGU) Mainz, Germany
- Jan Padeken
  - Institute of Molecular Biology (IMB) Mainz, Germany
- Lukas S Stelzl
  - Institute of Molecular Biology (IMB) Mainz, Germany
  - Institute of Molecular Physiology, Johannes Gutenberg University (JGU) Mainz, Germany
  - KOMET1, Institute of Physics, Johannes Gutenberg University (JGU) Mainz, Germany
|
28
|
Lee J, Bang D, Kim S. Residue-Level Multiview Deep Learning for ATP Binding Site Prediction and Applications in Kinase Inhibitors. J Chem Inf Model 2025; 65:50-61. [PMID: 39690486 DOI: 10.1021/acs.jcim.4c01255] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/19/2024]
Abstract
Accurate identification of adenosine triphosphate (ATP) binding sites is crucial for understanding cellular functions and advancing drug discovery, particularly in targeting kinases for cancer treatment. Existing methods face significant challenges due to their reliance on time-consuming precomputed features and the heavily imbalanced nature of binding site data, and their utility in drug discovery has not been investigated further. To address these limitations, we introduced Multiview-ATPBind and ResiBoost. Multiview-ATPBind is an end-to-end deep learning model that integrates one-dimensional (1D) sequence and three-dimensional (3D) structural information for rapid and precise residue-level pocket-ligand interaction predictions. Additionally, ResiBoost is a novel residue-level boosting algorithm designed to mitigate data imbalance by enhancing the prediction of rare positive binding residues. Our approach outperforms state-of-the-art models on benchmark data sets, showing significant improvements in balanced metrics with both experimental and AI-predicted structures. Furthermore, our model seamlessly transfers to predicting binding sites and enhancing docking simulations for kinase inhibitors, including imatinib and dasatinib, underscoring the potential of our method in drug discovery applications.
Affiliation(s)
- Jaechan Lee
  - Department of Computer Science and Engineering, Seoul National University, Seoul 08826, Republic of Korea
  - AIGENDRUG Co., Ltd., Seoul 08826, Republic of Korea
- Dongmin Bang
  - AIGENDRUG Co., Ltd., Seoul 08826, Republic of Korea
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 08826, Republic of Korea
- Sun Kim
  - Department of Computer Science and Engineering, Seoul National University, Seoul 08826, Republic of Korea
  - AIGENDRUG Co., Ltd., Seoul 08826, Republic of Korea
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 08826, Republic of Korea
  - Interdisciplinary Program in Artificial Intelligence, Seoul National University, Seoul 08826, Republic of Korea
|
29
|
Huang H, Shi X, Lei H, Hu F, Cai Y. ProtChat: An AI Multi-Agent for Automated Protein Analysis Leveraging GPT-4 and Protein Language Model. J Chem Inf Model 2025; 65:62-70. [PMID: 39690112 DOI: 10.1021/acs.jcim.4c01345] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/19/2024]
Abstract
Large language models (LLMs) have transformed natural language processing, enabling advanced human-machine communication. Similarly, in computational biology, protein sequences are interpreted as natural language, facilitating the creation of protein large language models (PLLMs). However, applying PLLMs requires specialized preprocessing and script development, increasing the complexity of their use. Researchers have integrated LLMs with PLLMs to develop automated protein analysis tools to address these challenges, simplifying analytical workflows. Existing technologies often require substantial human intervention for specific protein-related tasks, maintaining high barriers to implementing automated protein analysis systems. Here, we propose ProtChat, an AI multiagent system for protein analysis that integrates the inference capabilities of PLLMs with the task-planning abilities of LLMs. ProtChat integrates GPT-4 with multiple PLLMs, like ESM and MASSA, to automate tasks such as protein property prediction and protein-drug interactions without human intervention. This AI agent enables users to input instructions directly, significantly improving efficiency and usability, making it suitable for researchers without a computational background. Experiments demonstrate that ProtChat can automate complex protein tasks accurately, avoiding manual intervention and delivering results rapidly. This advancement opens new research avenues in computational biology and drug discovery. Future applications may extend ProtChat's capabilities to broader biological data analysis. Our code and data are publicly available at github.com/SIAT-code/ProtChat.
Affiliation(s)
- Huazhen Huang
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Xianguo Shi
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Hongyang Lei
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Fan Hu
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Yunpeng Cai
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
|
30
|
Subramanian AM, Martinez ZA, Lourenço AL, Liu S, Thomson M. Unexplored regions of the protein sequence-structure map revealed at scale by a library of foldtuned language models. bioRxiv 2025:2023.12.22.573145. [PMID: 38187750 PMCID: PMC10769378 DOI: 10.1101/2023.12.22.573145] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 01/09/2024]
Abstract
The combinatorial scale of amino-acid sequence-space has traditionally precluded substantive study of the full protein sequence-structure map. It remains unknown, for instance, how much of the vast uncharted landscape of far-from-natural sequences encodes the familiar ensemble of natural folds in a fashion consistent with the laws of biophysics but seemingly untouched by evolution on Earth. The scale of sequence perturbations required to access these spaces exceeds the reach of even gold-standard experimental approaches such as directed evolution. We surpass this limitation guided by the innate capacity of protein language models (pLMs) to explore sequences outside their natural training data through generation and self-feedback. We recast pLMs as probes that explore regions of protein "deep space" that possess little-to-no detectable homology to natural examples, while enforcing core structural constraints, in a novel sequence design approach that we term "foldtuning." We build a library of foldtuned pLMs for >700 natural folds in the SCOP database, covering numerous high-priority targets for synthetic biology, including GPCRs and small GTPases, composable cell-surface-receptor and DNA-binding domains, and small signaling/regulatory domains. Candidate proteins generated by foldtuned pLMs reflect distinctive new "rules of language" for sequence innovation beyond detectable homology to any known protein and sample subtle structural alterations in a manner reminiscent of natural structural evolution and diversification. Experimental validation of two markedly different fold targets, the tyrosine-kinase- and small-GTPase-regulating SH3 domain and the bacterial RNase inhibitor barstar, demonstrates that foldtuning proposes protein variants that express and fold stably in vitro and function in vivo.
Foldtuning reveals protein sequence-structure information at scale outside of the context of evolution and promises to push forward the redesign and reconstitution of novel-to-nature synthetic biological systems for applications in health and catalysis.
|
31
|
Johnson S, Weigele P, Fomenkov A, Ge A, Vincze A, Eaglesham J, Roberts R, Sun Z. Domainator, a flexible software suite for domain-based annotation and neighborhood analysis, identifies proteins involved in antiviral systems. Nucleic Acids Res 2025; 53:gkae1175. [PMID: 39657740 PMCID: PMC11754643 DOI: 10.1093/nar/gkae1175] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2024] [Revised: 11/07/2024] [Accepted: 11/15/2024] [Indexed: 12/12/2024] Open
Abstract
The availability of large databases of biological sequences presents an opportunity for in-depth exploration of gene diversity and function. Bacterial defense systems are a rich source of diverse but difficult-to-annotate genes with biotechnological applications. In this work, we present Domainator, a flexible and modular software suite for domain-based gene neighborhood and protein search, extraction and clustering. We demonstrate the utility of Domainator through three examples related to bacterial defense systems. First, we cluster CRISPR-associated Rossmann fold (CARF) containing proteins with difficult-to-annotate effector domains, classifying most of them as likely transcriptional regulators and a subset as likely RNases. Second, we extract and cluster P4-like phage satellite defense hotspots, identify an abundant variant of Lamassu defense systems and demonstrate its in vivo activity against several T-even phages. Third, we integrate a protein language model into Domainator and use it to identify restriction endonucleases with low similarity to known reference sequences, validating the activity of one example in vitro. Domainator is made available as an open-source package with detailed documentation and usage examples.
Affiliation(s)
- Andrew Ge
  - New England Biolabs Inc., Ipswich, MA 01938, USA
- Anna Vincze
  - New England Biolabs Inc., Ipswich, MA 01938, USA
- Zhiyi Sun
  - New England Biolabs Inc., Ipswich, MA 01938, USA
|
32
|
Chen Z, Ji C, Xu W, Gao J, Huang J, Xu H, Qian G, Huang J. UniAMP: enhancing AMP prediction using deep neural networks with inferred information of peptides. BMC Bioinformatics 2025; 26:10. [PMID: 39799358 PMCID: PMC11725221 DOI: 10.1186/s12859-025-06033-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2024] [Accepted: 01/02/2025] [Indexed: 01/15/2025] Open
Abstract
Antimicrobial peptides (AMPs) have been widely recognized as a promising solution to combat the antimicrobial resistance that has arisen in microorganisms from the increasing abuse of antibiotics in medicine and agriculture around the globe. In this study, we propose UniAMP, a systematic prediction framework for discovering AMPs. We observe that feature vectors used in various existing studies constructed from peptide information, such as sequence, composition, and structure, can be augmented and even replaced by information inferred by deep learning models. Specifically, we use a feature vector with 2924 values inferred by two deep learning models, UniRep and ProtT5, to demonstrate that such inferred information of peptides suffices for the task, with the help of our proposed deep neural network model composed of fully connected layers and transformer encoders for predicting the antibacterial activity of peptides. Evaluation results demonstrate superior performance of our proposed model on both balanced benchmark datasets and imbalanced test datasets compared with existing studies. Subsequently, we analyze the relations among peptide sequences, manually extracted features, and automatically inferred information by deep learning models, leading to observations that the inferred information is more comprehensive and non-redundant for the task of predicting AMPs. Moreover, this approach alleviates the impact of the scarcity of positive data and demonstrates great potential in future research and applications.
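As a rough illustration of how a 2924-value feature vector could arise from the two models named in the abstract, the sketch below concatenates a 1900-value and a 1024-value embedding. The split into 1900 and 1024 values is an assumption based on the commonly used output sizes of UniRep and ProtT5; the abstract states only the total. The random arrays are placeholders for real model outputs, and `combine_features` is a hypothetical helper, not part of UniAMP.

```python
import numpy as np

UNIREP_DIM = 1900  # assumption: length of a UniRep representation
PROTT5_DIM = 1024  # assumption: length of a mean-pooled ProtT5 embedding


def combine_features(unirep_vec, prott5_vec):
    """Concatenate two per-peptide embeddings into one feature vector."""
    unirep_vec = np.asarray(unirep_vec, dtype=np.float32)
    prott5_vec = np.asarray(prott5_vec, dtype=np.float32)
    if unirep_vec.shape != (UNIREP_DIM,):
        raise ValueError(f"expected {UNIREP_DIM} UniRep values")
    if prott5_vec.shape != (PROTT5_DIM,):
        raise ValueError(f"expected {PROTT5_DIM} ProtT5 values")
    return np.concatenate([unirep_vec, prott5_vec])


# Placeholder embeddings; real values would come from the two models.
rng = np.random.default_rng(0)
features = combine_features(rng.normal(size=UNIREP_DIM),
                            rng.normal(size=PROTT5_DIM))
print(features.shape)
```

Under these assumed sizes the concatenated vector has 1900 + 1024 = 2924 entries, matching the figure quoted in the abstract.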
Affiliation(s)
- Zixin Chen
  - College of Artificial Intelligence, Nanjing Agricultural University, Weigang No.1, Nanjing, 210095, Jiangsu, China
- Chengming Ji
  - College of Artificial Intelligence, Nanjing Agricultural University, Weigang No.1, Nanjing, 210095, Jiangsu, China
- Wenwen Xu
  - College of Artificial Intelligence, Nanjing Agricultural University, Weigang No.1, Nanjing, 210095, Jiangsu, China
- Jianfeng Gao
  - StarHelix Inc, Jiangmiao Road, Nanjing, 210000, Jiangsu, China
- Ji Huang
  - College of Agriculture, Nanjing Agricultural University, Weigang No.1, Nanjing, 210095, Jiangsu, China
- Huanliang Xu
  - College of Artificial Intelligence, Nanjing Agricultural University, Weigang No.1, Nanjing, 210095, Jiangsu, China
- Guoliang Qian
  - College of Plant Protection, Nanjing Agricultural University, Weigang No.1, Nanjing, 210095, Jiangsu, China
- Junxian Huang
  - College of Artificial Intelligence, Nanjing Agricultural University, Weigang No.1, Nanjing, 210095, Jiangsu, China
|
33
|
Wang R, Ji Y, Li Y, Lee ST. Applications of Transformers in Computational Chemistry: Recent Progress and Prospects. J Phys Chem Lett 2025; 16:421-434. [PMID: 39737793 DOI: 10.1021/acs.jpclett.4c03128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/01/2025]
Abstract
The powerful data processing and pattern recognition capabilities of machine learning (ML) technology have provided technical support for the innovation in computational chemistry. Compared with traditional ML and deep learning (DL) techniques, transformers possess fine-grained feature-capturing abilities, which are able to efficiently and accurately model the dependencies of long-sequence data, simulate complex and diverse chemical spaces, and explore the computational logic behind the data. In this Perspective, we provide an overview of the application of transformer models in computational chemistry. We first introduce the working principle of transformer models and analyze the transformer-based architectures in computational chemistry. Next, we explore the practical applications of the model in a number of specific scenarios such as property prediction and chemical structure generation. Finally, based on these applications and research results, we provide an outlook for the research of this field in the future.
Affiliation(s)
- Rui Wang
  - Macao Institute of Materials Science and Engineering, Faculty of Innovation Engineering, Macau University of Science and Technology, Taipa, Macau SAR 999078, China
- Yujin Ji
  - Institute of Functional Nano & Soft Materials (FUNSOM), Jiangsu Key Laboratory for Carbon-Based Functional Materials & Devices, Soochow University, Suzhou, Jiangsu 215123, China
- Youyong Li
  - Macao Institute of Materials Science and Engineering, Faculty of Innovation Engineering, Macau University of Science and Technology, Taipa, Macau SAR 999078, China
  - Institute of Functional Nano & Soft Materials (FUNSOM), Jiangsu Key Laboratory for Carbon-Based Functional Materials & Devices, Soochow University, Suzhou, Jiangsu 215123, China
- Shuit-Tong Lee
  - Macao Institute of Materials Science and Engineering, Faculty of Innovation Engineering, Macau University of Science and Technology, Taipa, Macau SAR 999078, China
  - Institute of Functional Nano & Soft Materials (FUNSOM), Jiangsu Key Laboratory for Carbon-Based Functional Materials & Devices, Soochow University, Suzhou, Jiangsu 215123, China
|
34
|
Das S, Ghosh S, Jana ND. TransConv: convolution-infused transformer for protein secondary structure prediction. J Mol Model 2025; 31:37. [PMID: 39776295 DOI: 10.1007/s00894-024-06259-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/03/2024] [Accepted: 12/15/2024] [Indexed: 01/11/2025]
Abstract
CONTEXT: Protein secondary structure prediction is essential for understanding protein function and characteristics and can also facilitate drug discovery. Traditional methods for experimentally determining protein structures are both time-consuming and costly. Computational biology offers a viable alternative by predicting protein structures from their sequences. Protein secondary structure is defined by the local spatial arrangement of the protein backbone, resulting from hydrogen bonds between amino acids. METHODS: In this study, we introduce TransConv, a model that combines transformer models with convolutional blocks to predict protein secondary structures from amino acid sequences. Transformer models are effective at capturing long-range dependencies through self-attention mechanisms. We integrate convolutional blocks into the transformer architecture to improve the detection of important local features. This hybrid model captures both long-range interactions and local features, leading to more accurate predictions of protein secondary structures, thus offering an efficient solution for this critical task. The experimental outcomes on the benchmark datasets demonstrate the superiority of the proposed approach over the state-of-the-art (SOTA) models in the literature.
Affiliation(s)
- Sayantan Das
  - National Institute of Technology Durgapur, Durgapur, India
- Subhayu Ghosh
  - National Institute of Technology Durgapur, Durgapur, India
|
35
|
Yan B, Nam Y, Li L, Deek RA, Li H, Ma S. Recent advances in deep learning and language models for studying the microbiome. Front Genet 2025; 15:1494474. [PMID: 39840283 PMCID: PMC11747409 DOI: 10.3389/fgene.2024.1494474] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/10/2024] [Accepted: 12/13/2024] [Indexed: 01/23/2025] Open
Abstract
Recent advancements in deep learning, particularly large language models (LLMs), have made a significant impact on how researchers study microbiome and metagenomics data. Microbial protein and genomic sequences, like natural languages, form a language of life, enabling the adoption of LLMs to extract useful insights from complex microbial ecologies. In this paper, we review applications of deep learning and language models in analyzing microbiome and metagenomics data. We focus on problem formulations, necessary datasets, and the integration of language modeling techniques. We provide an extensive overview of protein and genomic language modeling and its contributions to microbiome studies. We also discuss applications such as novel viromics language modeling, biosynthetic gene cluster prediction, and knowledge integration for metagenomics studies.
Affiliation(s)
- Binghao Yan
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Yunbi Nam
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, United States
- Lingyao Li
  - School of Information, University of South Florida, Tampa, FL, United States
- Rebecca A. Deek
  - Department of Biostatistics and Health Data Science, University of Pittsburgh, Pittsburgh, PA, United States
- Hongzhe Li
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Siyuan Ma
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, United States
|
36
|
Eom H, Park S, Cho K, Lee J, Kim H, Kim S, Yang J, Han YH, Lee J, Seok C, Lee M, Song W, Steinegger M. Discovery of highly active kynureninases for cancer immunotherapy through protein language model. Nucleic Acids Res 2025; 53:gkae1245. [PMID: 39777462 PMCID: PMC11704957 DOI: 10.1093/nar/gkae1245] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2024] [Revised: 11/16/2024] [Accepted: 12/05/2024] [Indexed: 01/11/2025] Open
Abstract
Tailor-made enzymes empower a wide range of versatile applications, although searching for the desirable enzymes often requires high-throughput screening and thus poses significant challenges. In this study, we employed homology searches and protein language models to discover and prioritize enzymes by their kinetic parameters. We aimed to discover kynureninases as potentially versatile therapeutic enzymes that hydrolyse L-kynurenine, a potent immunosuppressive metabolite, to overcome the immunosuppressive tumor microenvironment in anticancer therapy. Subsequently, we experimentally validated the efficacy of four top-ranked kynureninases under in vitro and in vivo conditions. Our findings revealed the most catalytically active candidate, with a nearly twofold increase in turnover number over the prior best and a 3.4-fold reduction in tumor weight in mouse model comparisons. Consequently, our approach holds promise for the targeted quantitative enzyme discovery and selection suitable for specific applications with higher accuracy, significantly broadening the scope of enzyme utilization. A web-executable version of our workflow is available at seekrank.steineggerlab.com and our code is available as free open-source software at github.com/steineggerlab/SeekRank.
Affiliation(s)
- Hyunuk Eom
- Department of Chemistry, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
| | - Sukhwan Park
- School of Biological Sciences, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
| | - Kye Soo Cho
- Galux Inc, 1837 Nambusunhwan-ro, Gwanak-gu, Seoul 08738, Republic of Korea
| | - Jihyeon Lee
- Department of Chemistry, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
| | - Hyunbin Kim
- School of Biological Sciences, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
| | - Stephanie Kim
- School of Biological Sciences, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
| | - Jinsol Yang
- Galux Inc, 1837 Nambusunhwan-ro, Gwanak-gu, Seoul 08738, Republic of Korea
| | - Young-Hyun Han
- Galux Inc, 1837 Nambusunhwan-ro, Gwanak-gu, Seoul 08738, Republic of Korea
| | - Juyong Lee
- Molecular Medicine and Biopharmaceutical Sciences, Graduate School of Convergence Science and Technology, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
- School of Pharmacy, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
- Arontier Co., 241 Gangnam-daero, Seocho-gu, Seoul 06735, Republic of Korea
| | - Chaok Seok
- Artificial Intelligence Institute, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
- Institute of Molecular Biology and Genetics, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
- Department of Chemistry, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
- Galux Inc, 1837 Nambusunhwan-ro, Gwanak-gu, Seoul 08738, Republic of Korea
| | - Myeong Sup Lee
- Department of Biomedical Sciences, University of Ulsan College of Medicine, Asan Medical Center, 88 Olympic-ro 43-gil, Songpa-gu, Seoul 05505, Republic of Korea
- Galux Inc, 1837 Nambusunhwan-ro, Gwanak-gu, Seoul 08738, Republic of Korea
| | - Woon Ju Song
- Department of Chemistry, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
| | - Martin Steinegger
- School of Biological Sciences, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
- Artificial Intelligence Institute, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
- Institute of Molecular Biology and Genetics, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
| |
|
37
|
Piovesan D, Del Conte A, Mehdiabadi M, Aspromonte M, Blum M, Tesei G, von Bülow S, Lindorff-Larsen K, Tosatto SE. MOBIDB in 2025: integrating ensemble properties and function annotations for intrinsically disordered proteins. Nucleic Acids Res 2025; 53:D495-D503. [PMID: 39470701 PMCID: PMC11701742 DOI: 10.1093/nar/gkae969] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2024] [Revised: 10/07/2024] [Accepted: 10/11/2024] [Indexed: 10/30/2024] Open
Abstract
The MobiDB database (URL: https://mobidb.org/) aims to provide structural and functional information about intrinsic protein disorder, aggregating annotations from the literature, experimental data, and predictions for all known protein sequences. Here, we describe the improvements made to our resource to capture more information, simplify access to the aggregated data, and increase documentation of all MobiDB features. Compared to the previous release, all underlying pipeline modules were updated. The prediction module is ten times faster and can detect if a predicted disordered region is structurally extended or compact. The PDB component is now able to process large cryo-EM structures extending the number of processed entries. The entry page has been restyled to highlight functional aspects of disorder and all graphical modules have been completely reimplemented for better flexibility and faster rendering. The server has been improved to optimise bulk downloads. Annotation provenance has been standardised by adopting ECO terms. Finally, we propagated disorder function (IDPO and GO terms) from the DisProt database exploiting sequence similarity and protein embeddings. These improvements, along with the addition of comprehensive training material, offer a more intuitive interface and novel functional knowledge about intrinsic disorder.
Affiliation(s)
- Damiano Piovesan
- Department of Biomedical Sciences, University of Padova, Padua 35131, Italy
| | - Alessio Del Conte
- Department of Biomedical Sciences, University of Padova, Padua 35131, Italy
| | - Mahta Mehdiabadi
- Department of Biomedical Sciences, University of Padova, Padua 35131, Italy
| | | | - Matthias Blum
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Giulio Tesei
- Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Sören von Bülow
- Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Kresten Lindorff-Larsen
- Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Silvio C E Tosatto
- Department of Biomedical Sciences, University of Padova, Padua 35131, Italy
- Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR-IBIOM), Bari, Italy
| |
|
38
|
Szklarczyk D, Nastou K, Koutrouli M, Kirsch R, Mehryary F, Hachilif R, Hu D, Peluso ME, Huang Q, Fang T, Doncheva NT, Pyysalo S, Bork P, Jensen LJ, von Mering C. The STRING database in 2025: protein networks with directionality of regulation. Nucleic Acids Res 2025; 53:D730-D737. [PMID: 39558183 PMCID: PMC11701646 DOI: 10.1093/nar/gkae1113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2024] [Revised: 10/18/2024] [Accepted: 10/29/2024] [Indexed: 11/20/2024] Open
Abstract
Proteins cooperate, regulate and bind each other to achieve their functions. Understanding the complex network of their interactions is essential for a systems-level description of cellular processes. The STRING database compiles, scores and integrates protein-protein association information drawn from experimental assays, computational predictions and prior knowledge. Its goal is to create comprehensive and objective global networks that encompass both physical and functional interactions. Additionally, STRING provides supplementary tools such as network clustering and pathway enrichment analysis. The latest version, STRING 12.5, introduces a new 'regulatory network', for which it gathers evidence on the type and directionality of interactions using curated pathway databases and a fine-tuned language model parsing the literature. This update enables users to visualize and access three distinct network types (functional, physical and regulatory) separately, each applicable to distinct research needs. In addition, the pathway enrichment detection functionality has been updated, with better false discovery rate corrections, redundancy filtering and improved visual displays. The resource now also offers improved annotations of clustered networks and provides users with downloadable network embeddings, which facilitate the use of STRING networks in machine learning and allow cross-species transfer of protein information. The STRING database is available online at https://string-db.org/.
Affiliation(s)
- Damian Szklarczyk
- Department of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Amphipôle, Quartier UNIL-Sorge, 1015 Lausanne, Switzerland
| | - Katerina Nastou
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3B, 2200 Copenhagen N, Denmark
| | - Mikaela Koutrouli
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3B, 2200 Copenhagen N, Denmark
| | - Rebecca Kirsch
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3B, 2200 Copenhagen N, Denmark
| | - Farrokh Mehryary
- TurkuNLP Lab, Department of Computing, University of Turku, Vesilinnantie 5, 20014 Turku, Finland
| | - Radja Hachilif
- Department of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Amphipôle, Quartier UNIL-Sorge, 1015 Lausanne, Switzerland
| | - Dewei Hu
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3B, 2200 Copenhagen N, Denmark
| | - Matteo E Peluso
- Department of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Amphipôle, Quartier UNIL-Sorge, 1015 Lausanne, Switzerland
| | - Qingyao Huang
- Department of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Amphipôle, Quartier UNIL-Sorge, 1015 Lausanne, Switzerland
| | - Tao Fang
- Department of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Amphipôle, Quartier UNIL-Sorge, 1015 Lausanne, Switzerland
| | - Nadezhda T Doncheva
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3B, 2200 Copenhagen N, Denmark
| | - Sampo Pyysalo
- TurkuNLP Lab, Department of Computing, University of Turku, Vesilinnantie 5, 20014 Turku, Finland
| | - Peer Bork
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
- Max Delbrück Centre for Molecular Medicine, Robert-Rössle-Strasse 10, 13125 Berlin, Germany
- Department of Bioinformatics, Biozentrum, University of Würzburg, Am Hubland, 97074 Würzburg, Germany
| | - Lars J Jensen
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3B, 2200 Copenhagen N, Denmark
| | - Christian von Mering
- Department of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Amphipôle, Quartier UNIL-Sorge, 1015 Lausanne, Switzerland
| |
|
39
|
Shen H, Li Y, Pi Q, Tian J, Xu X, Huang Z, Huang J, Pian C, Mao S. Unveiling novel antimicrobial peptides from the ruminant gastrointestinal microbiomes: A deep learning-driven approach yields an anti-MRSA candidate. J Adv Res 2025:S2090-1232(25)00005-0. [PMID: 39756573 DOI: 10.1016/j.jare.2025.01.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2024] [Revised: 01/01/2025] [Accepted: 01/02/2025] [Indexed: 01/07/2025] Open
Abstract
INTRODUCTION Antimicrobial peptides (AMPs) present a promising avenue to combat the growing threat of antibiotic resistance. The ruminant gastrointestinal microbiome serves as a unique ecosystem that offers untapped potential for AMP discovery. OBJECTIVES The aims of this study are to develop an effective methodology for the identification of novel AMPs from ruminant gastrointestinal microbiomes, followed by evaluating their antimicrobial efficacy and elucidating the mechanisms underlying their activity. METHODS We developed a deep learning-based model to identify AMP candidates from a dataset comprising 120 metagenomes and 10,373 metagenome-assembled genomes derived from the ruminant gastrointestinal tract. Both in vivo and in vitro experiments were performed to examine and validate the antimicrobial activities of the AMP candidates that were selected through bioinformatic analysis and subsequently synthesized chemically. Additionally, molecular dynamics simulations were conducted to explore the action mechanism of the most potent AMP candidate. RESULTS The deep learning model identified 27,192 potential secretory AMP candidates. Following bioinformatic analysis, 39 candidates were synthesized and tested. Remarkably, all synthesized peptides demonstrated antimicrobial activity against Staphylococcus aureus, with 79.5% showing effectiveness against multiple pathogens. Notably, Peptide 4, which exhibited the highest antimicrobial activity against methicillin-resistant Staphylococcus aureus (MRSA), confirmed this effect in a mouse model with wound infection, exhibiting a low propensity for resistance development and minimal cytotoxicity and hemolysis towards mammalian cells. Molecular dynamics simulations provided insights into the mechanism of Peptide 4, primarily its ability to disrupt bacterial cell membranes, leading to cell death. 
CONCLUSION This study highlights the power of combining deep learning with microbiome research to uncover novel therapeutic candidates, paving the way for the development of next-generation antimicrobials like Peptide 4 to combat the growing threat of MRSA wound infections. It also underscores the value of utilizing ruminant microbial resources.
Affiliation(s)
- Hong Shen
- Bioinformatics Center, Academy for Advanced Interdisciplinary Studies, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China
| | - Yanru Li
- College of Agriculture, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China
| | - Qingjie Pi
- Ruminant Nutrition and Feed Engineering Technology Research Center, College of Animal Science and Technology, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China; Laboratory of Gastrointestinal Microbiology, Jiangsu Key Laboratory of Gastrointestinal Nutrition and Animal Health, National Center for International Research on Animal Gut Nutrition, College of Animal Science and Technology, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China
| | - Junru Tian
- Ruminant Nutrition and Feed Engineering Technology Research Center, College of Animal Science and Technology, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China; Laboratory of Gastrointestinal Microbiology, Jiangsu Key Laboratory of Gastrointestinal Nutrition and Animal Health, National Center for International Research on Animal Gut Nutrition, College of Animal Science and Technology, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China
| | - Xianghan Xu
- College of Veterinary Medicine, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China
| | - Zan Huang
- Ruminant Nutrition and Feed Engineering Technology Research Center, College of Animal Science and Technology, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China; Laboratory of Gastrointestinal Microbiology, Jiangsu Key Laboratory of Gastrointestinal Nutrition and Animal Health, National Center for International Research on Animal Gut Nutrition, College of Animal Science and Technology, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China.
| | - Jinghu Huang
- College of Veterinary Medicine, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China.
| | - Cong Pian
- School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing 211198, Jiangsu, China.
| | - Shengyong Mao
- Ruminant Nutrition and Feed Engineering Technology Research Center, College of Animal Science and Technology, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China; Laboratory of Gastrointestinal Microbiology, Jiangsu Key Laboratory of Gastrointestinal Nutrition and Animal Health, National Center for International Research on Animal Gut Nutrition, College of Animal Science and Technology, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China.
| |
|
40
|
Hennig J, Paulino C. 4D structural biology-The 9th Murnau Conference on structural biology. Structure 2025; 33:1-5. [PMID: 39753099 DOI: 10.1016/j.str.2024.11.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2024] [Revised: 11/13/2024] [Accepted: 11/18/2024] [Indexed: 01/11/2025]
Abstract
The data presented at the 9th International Murnau Conference on September 18-21, 2024, the largest recurring structural biology meeting in Central Europe, illustrated the thriving state of the structural biology community. This is largely attributed to the ground-breaking developments over the last decade, which were intensely discussed during the meeting.
Affiliation(s)
- Janosch Hennig
- Chair Biochemistry IV, Biophysical Chemistry, University of Bayreuth, 95448 Bayreuth, Germany; Molecular Systems Biology Unit, European Molecular Biology Laboratory (EMBL) Heidelberg, 69117 Heidelberg, Germany.
| | - Cristina Paulino
- Biochemistry Center Heidelberg, Heidelberg University, 69120 Heidelberg, Germany.
| |
|
41
|
Chatzimiltis S, Agathocleous M, Promponas VJ, Christodoulou C. Post-processing enhances protein secondary structure prediction with second order deep learning and embeddings. Comput Struct Biotechnol J 2025; 27:243-251. [PMID: 39866664 PMCID: PMC11764030 DOI: 10.1016/j.csbj.2024.12.022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2024] [Revised: 12/20/2024] [Accepted: 12/21/2024] [Indexed: 01/28/2025] Open
Abstract
Protein Secondary Structure Prediction (PSSP) is regarded as a challenging task in bioinformatics, and numerous approaches to achieve a more accurate prediction have been proposed. Accurate PSSP can be instrumental in inferring protein tertiary structure and function. Machine Learning and in particular Deep Learning approaches show promising results for the PSSP problem. In this paper, we deploy a Convolutional Neural Network (CNN) trained with the Subsampled Hessian Newton (SHN) method (a Hessian Free Optimisation variant), with a two-dimensional input representation of embeddings extracted from a language model pretrained with protein sequences. Utilising a CNN trained with the SHN method and the input embeddings, we achieved on average a 79.96% per residue (Q3) accuracy on the CB513 dataset and 81.45% Q3 accuracy on the PISCES dataset (without any post-processing techniques applied). The application of ensembles and filtering techniques to the results of the CNN improved the overall prediction performance. The Q3 accuracy on the CB513 dataset increased to 93.65% and on the PISCES dataset to 87.13%. Moreover, our method was evaluated using the CASP13 dataset where we showed that as the post-processing window size increased, the prediction performance increased as well. In fact, with the biggest post-processing window size (limited by the smallest CASP13 protein), we achieved a Q3 accuracy of 98.12% and a Segment Overlap (SOV) score of 96.98 on the CASP13 dataset when the CNNs were trained with the PISCES dataset. Finally, we showed that input representations from embeddings can perform as well as representations extracted from multiple sequence alignments.
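For readers unfamiliar with the metric quoted in the abstract above: Q3 accuracy is the percentage of residues whose predicted three-state label (helix H, strand E, coil C) matches the reference assignment. A minimal Python sketch follows; the function name `q3_accuracy` and the toy label strings are illustrative assumptions, not code or data from the paper.

```python
from typing import Sequence


def q3_accuracy(true_states: Sequence[str], predicted_states: Sequence[str]) -> float:
    """Return the percentage of positions where the predicted 3-state label
    (H, E or C) equals the reference label."""
    if len(true_states) != len(predicted_states):
        raise ValueError("reference and prediction must have the same length")
    if len(true_states) == 0:
        raise ValueError("sequences must be non-empty")
    # Count per-residue agreements between reference and prediction.
    matches = sum(t == p for t, p in zip(true_states, predicted_states))
    return 100.0 * matches / len(true_states)


# Toy example with made-up labels: one of ten residues is mispredicted.
reference = "HHHHEEEECC"
predicted = "HHHCEEEECC"
print(q3_accuracy(reference, predicted))  # 90.0
```

Dataset-level figures such as those reported above are typically aggregated over many proteins; whether the paper averages per protein or per residue is not stated in the abstract.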
Affiliation(s)
- Sotiris Chatzimiltis
- University of Cyprus, Department of Computer Science, Nicosia, Cyprus
- 5G/6GIC, Institute for Communication Systems (ICS), University of Surrey, Guildford, United Kingdom
| | - Michalis Agathocleous
- University of Cyprus, Department of Computer Science, Nicosia, Cyprus
- University of Nicosia, Department of Computer Science, Nicosia, Cyprus
| | | | | |
|
42
|
Daoud A, Ben-Hur A. The role of chromatin state in intron retention: A case study in leveraging large scale deep learning models. PLoS Comput Biol 2025; 21:e1012755. [PMID: 39792954 PMCID: PMC11756788 DOI: 10.1371/journal.pcbi.1012755] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2024] [Revised: 01/23/2025] [Accepted: 12/30/2024] [Indexed: 01/12/2025] Open
Abstract
Complex deep learning models trained on very large datasets have become key enabling tools for current research in natural language processing and computer vision. By providing pre-trained models that can be fine-tuned for specific applications, they enable researchers to create accurate models with minimal effort and computational resources. Large scale genomics deep learning models come in two flavors: the first are large language models of DNA sequences trained in a self-supervised fashion, similar to the corresponding natural language models; the second are supervised learning models that leverage large scale genomics datasets from ENCODE and other sources. We argue that these models are the equivalent of foundation models in natural language processing in their utility, as they encode within them chromatin state in its different aspects, providing useful representations that allow quick deployment of accurate models of gene regulation. We demonstrate this premise by leveraging the recently created Sei model to develop simple, interpretable models of intron retention, and demonstrate their advantage over models based on the DNA language model DNABERT-2. Our work also demonstrates the impact of chromatin state on the regulation of intron retention. Using representations learned by Sei, our model is able to discover the involvement of transcription factors and chromatin marks in regulating intron retention, providing better accuracy than a recently published custom model developed for this purpose.
Affiliation(s)
- Ahmed Daoud
- Department of Computer Science, Colorado State University, Fort Collins, Colorado, United States of America
| | - Asa Ben-Hur
- Department of Computer Science, Colorado State University, Fort Collins, Colorado, United States of America
| |
|
43
|
Song J, Kurgan L. Two decades of advances in sequence-based prediction of MoRFs, disorder-to-order transitioning binding regions. Expert Rev Proteomics 2025; 22:1-9. [PMID: 39789785 DOI: 10.1080/14789450.2025.2451715] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2024] [Revised: 12/20/2024] [Accepted: 12/26/2024] [Indexed: 01/12/2025]
Abstract
INTRODUCTION Molecular recognition features (MoRFs) are regions in protein sequences that undergo induced folding upon binding partner molecules. MoRFs are common in nature and can be predicted from sequences based on their distinctive sequence signatures. AREAS COVERED We overview 20 years of progress in the sequence-based prediction of MoRFs which resulted in the development of 25 predictors of MoRFs that interact with proteins, peptides, and lipids. These methods range from simple discriminant analysis to sophisticated deep transformer networks that use protein language models. They generate relatively accurate predictions as evidenced by the results of a recently published community-driven assessment. EXPERT OPINION MoRFs prediction is a mature field of research that is poised to continue at a steady pace in the foreseeable future. We anticipate further expansion of the scope of MoRF predictions to additional partner molecules, such as nucleic acids, and continued use of recent machine learning advances. Other future efforts should concentrate on improving availability of MoRF predictions by releasing, maintaining, and popularizing web servers and by depositing MoRF predictions to large databases of protein structure and function predictions. Furthermore, accurate MoRF predictions should be coupled with the equally accurate prediction and modeling of the resulting structures of complexes.
Affiliation(s)
- Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC, Australia
- Monash Data Futures Institute, Monash University, Melbourne, VIC, Australia
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
| |
|
44
|
Boadu F, Lee A, Cheng J. Deep learning methods for protein function prediction. Proteomics 2025; 25:e2300471. [PMID: 38996351 PMCID: PMC11735672 DOI: 10.1002/pmic.202300471] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2024] [Revised: 06/15/2024] [Accepted: 06/18/2024] [Indexed: 07/14/2024]
Abstract
Predicting protein function from protein sequence, structure, interaction, and other relevant information is important for generating hypotheses for biological experiments and studying biological systems, and therefore has been a major challenge in protein bioinformatics. Numerous computational methods have been developed over the last two decades to gradually advance protein function prediction. In recent years in particular, leveraging the revolutionary advances in artificial intelligence (AI), more and more deep learning methods have been developed to improve protein function prediction at a faster pace. Here, we provide an in-depth review of the recent developments of deep learning methods for protein function prediction. We summarize the significant advances in the field, identify several remaining major challenges to be tackled, and suggest some potential directions to explore. The data sources and evaluation metrics widely used in protein function prediction are also discussed to assist the machine learning, AI, and bioinformatics communities in developing more cutting-edge methods to advance protein function prediction.
Affiliation(s)
- Frimpong Boadu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
| | - Ahhyun Lee
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
| |
|
45
|
Pratyush P, Pokharel S, Ismail HD, Bahmani S, Kc DB. LMPTMSite: A Platform for PTM Site Prediction in Proteins Leveraging Transformer-Based Protein Language Models. Methods Mol Biol 2025; 2867:261-297. [PMID: 39576587 DOI: 10.1007/978-1-0716-4196-5_16] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2024]
Abstract
Protein post-translational modifications (PTMs) introduce new functionalities and play a critical role in the regulation of protein functions. Characterizing these modifications, especially PTM sites, is essential for unraveling complex biological systems. However, traditional experimental approaches, such as mass spectrometry, are time-consuming and expensive. Machine learning and deep learning techniques offer promising alternatives for predicting PTM sites. In this chapter, we introduce our LMPTMSite (language model-based post-translational modification site predictor) platform, which emphasizes two transformer-based protein language model (pLM) approaches: pLMSNOSite and LMSuccSite, for the prediction of S-nitrosylation sites and succinylation sites in proteins, respectively. We highlight the various methods of using pLM-based sequence encoding, explain the underlying deep learning architectures, and discuss the superior efficacy of these tools compared to other state-of-the-art tools. Subsequently, we present an analysis of runtime and memory usage for pLMSNOSite, with a focus on CPU and RAM usage as the input sequence length is scaled up. Finally, we showcase a case study predicting succinylation sites in proteins active within the tricarboxylic acid (TCA) cycle pathway using LMSuccSite, demonstrating its potential utility and efficiency in real-world biological contexts. The LMPTMSite platform, inclusive of pLMSNOSite and LMSuccSite, is freely available both as a web server ( http://kcdukkalab.org/pLMSNOSite/ and http://kcdukkalab.org/LMSuccSite/ ) and as standalone packages ( https://github.com/KCLabMTU/pLMSNOSite and https://github.com/KCLabMTU/LMSuccSite ), providing valuable tools for researchers in the field.
Affiliation(s)
- Pawel Pratyush
- Computer Science Department, Rochester Institute of Technology, Rochester, NY, USA
| | - Suresh Pokharel
- Computer Science Department, Rochester Institute of Technology, Rochester, NY, USA
| | - Hamid D Ismail
- Computer Science Department, Rochester Institute of Technology, Rochester, NY, USA
- North Carolina A&T State University, Computational Data Science and Engineering, Greensboro, NC, USA
| | - Soufia Bahmani
- Computer Science Department, Rochester Institute of Technology, Rochester, NY, USA
- Michigan Technological University, Computer Science Department, Houghton, MI, USA
| | - Dukka B Kc
- Computer Science Department, Rochester Institute of Technology, Rochester, NY, USA.
| |
|
46
|
Brizuela CA, Liu G, Stokes JM, de la Fuente‐Nunez C. AI Methods for Antimicrobial Peptides: Progress and Challenges. Microb Biotechnol 2025; 18:e70072. [PMID: 39754551 PMCID: PMC11702388 DOI: 10.1111/1751-7915.70072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2024] [Revised: 11/18/2024] [Accepted: 12/16/2024] [Indexed: 01/06/2025] Open
Abstract
Antimicrobial peptides (AMPs) are promising candidates to combat multidrug-resistant pathogens. However, the high cost of extensive wet-lab screening has made AI methods for identifying and designing AMPs increasingly important, with machine learning (ML) techniques playing a crucial role. AI approaches have recently revolutionised this field by accelerating the discovery of new peptides with anti-infective activity, particularly in preclinical mouse models. Initially, classical ML approaches dominated the field, but recently there has been a shift towards deep learning (DL) models. Despite significant contributions, existing reviews have not thoroughly explored the potential of large language models (LLMs), graph neural networks (GNNs) and structure-guided AMP discovery and design. This review aims to fill that gap by providing a comprehensive overview of the latest advancements, challenges and opportunities in using AI methods, with a particular emphasis on LLMs, GNNs and structure-guided design. We discuss the limitations of current approaches and highlight the most relevant topics to address in the coming years for AMP discovery and design.
Affiliation(s)
| | - Gary Liu
- Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
| | - Jonathan M. Stokes
- Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
| | - Cesar de la Fuente‐Nunez
- Machine Biology Group, Department of Psychiatry and Microbiology, Institute for Biomedical Informatics, Institute for Translational Medicine and Therapeutics, Perelman School of MedicineUniversity of PennsylvaniaPhiladelphiaPennsylvaniaUSA
- Department of Bioengineering and Chemical and Biomolecular Engineering, School of Engineering and Applied ScienceUniversity of PennsylvaniaPhiladelphiaPennsylvaniaUSA
- Department of Chemistry, School of Arts and SciencesUniversity of PennsylvaniaPhiladelphiaPennsylvaniaUSA
- Penn Institute for Computational ScienceUniversity of PennsylvaniaPhiladelphiaPennsylvaniaUSA
| |
Collapse
|
47
|
Gao M, Song C, Liu T. PLM-T3SE: Accurate Prediction of Type III Secretion Effectors Using Protein Language Model Embeddings. J Cell Biochem 2025; 126:e30642. [PMID: 39164870 DOI: 10.1002/jcb.30642] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2024] [Revised: 08/04/2024] [Accepted: 08/07/2024] [Indexed: 08/22/2024]
Abstract
The Type III secretion effectors (T3SEs) are bacterial proteins synthesized by Gram-negative pathogens and delivered into host cells via the Type III secretion system (T3SS). These effectors usually play a pivotal role in the interactions between bacteria and hosts. Hence, the precise identification of T3SEs aids researchers in exploring the pathogenic mechanisms of bacterial infections. Since the diversity and complexity of T3SE sequences often make traditional experimental methods time-consuming, it is imperative to explore more efficient and convenient computational approaches for T3SE prediction. Inspired by the promising potential exhibited by pre-trained language models in protein recognition tasks, we proposed a method called PLM-T3SE that utilizes protein language models (PLMs) for effective recognition of T3SEs. First, we utilized PLM embeddings and evolutionary features from the position-specific scoring matrix (PSSM) profiles to transform protein sequences into fixed-length vectors for model training. Second, we employed the extreme gradient boosting (XGBoost) algorithm to rank these features based on their importance. Finally, an MLP neural network model was used to predict T3SEs based on the selected optimal feature set. Experimental results from the cross-validation and independent test demonstrated that our model exhibited superior performance compared to the existing models. Specifically, our model achieved an accuracy of 98.1%, which is 1.8%-42.4% higher than the state-of-the-art predictors on the same independent test data set. These findings highlight the superiority of the PLM-T3SE and the remarkable characterization ability of PLM embeddings for T3SE prediction.
Affiliation(s)
- Mengru Gao
- College of Information Technology, Shanghai Ocean University, Shanghai, China
- Chen Song
- College of Information Technology, Shanghai Ocean University, Shanghai, China
- Taigang Liu
- College of Information Technology, Shanghai Ocean University, Shanghai, China
|
48
|
Pratyush P, Kc DB. Advances in Prediction of Posttranslational Modification Sites Known to Localize in Protein Supersecondary Structures. Methods Mol Biol 2025; 2870:117-151. [PMID: 39543034 DOI: 10.1007/978-1-0716-4213-9_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2024]
Abstract
Posttranslational modifications (PTMs) play a crucial role in modulating the structure, function, localization, and interactions of proteins, with many PTMs being localized within supersecondary structures, such as helical pairs. These modifications can significantly influence the conformation and stability of these structures. For instance, phosphorylation introduces negative charges that alter electrostatic interactions, while acetylation or methylation of lysine residues affects the stability and interactions of alpha helices or beta strands. Given the pivotal role of supersecondary structures in the overall protein architecture, their modulation by PTMs is essential for protein functionality. This chapter explores the latest advancements in predicting sites for the five PTMs (phosphorylation, acetylation, glycosylation, methylation, and ubiquitination) known to be localized within supersecondary structures. The chapter highlights the recent advances in the prediction of these PTM sites, including the use of global contextualized embeddings from protein language models, integration of structural information, utilization of reliable positive and negative sites, and application of contrastive learning. These methodologies and emerging trends offer a roadmap for novel innovations in addressing PTM prediction challenges, particularly those linked to supersecondary structures.
Affiliation(s)
- Pawel Pratyush
- Computer Science Department, Michigan Technological University, Houghton, MI, USA
- Computer Science Department, Rochester Institute of Technology, Henrietta, NY, USA
- Dukka B Kc
- Computer Science Department, Michigan Technological University, Houghton, MI, USA
- Computer Science Department, Rochester Institute of Technology, Henrietta, NY, USA
|
49
|
Zhou Y, Liu W, Luo C, Huang Z, Samarappuli Mudiyanselage Savini G, Zhao L, Wang R, Huang J. Ab-Amy 2.0: Predicting light chain amyloidogenic risk of therapeutic antibodies based on antibody language model. Methods 2025; 233:11-18. [PMID: 39550021 DOI: 10.1016/j.ymeth.2024.11.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2024] [Revised: 10/28/2024] [Accepted: 11/06/2024] [Indexed: 11/18/2024] Open
Abstract
Therapeutic antibodies have emerged as a promising treatment option for a wide range of diseases. However, the light chain of antibodies can potentially induce amyloidosis, a condition characterized by protein misfolding and aggregation, posing a significant safety concern. Therefore, it is crucial to assess the amyloidogenic risk of therapeutic antibodies during the early stages of drug development. In this study, we introduce AB-Amy 2.0, a new computational model with enhanced performance for assessing the light chain amyloidogenic risk of therapeutic antibodies. By employing pretrained protein language model (PLM) embeddings, AB-Amy 2.0 achieves higher accuracy in amyloidogenic risk prediction compared with traditional features, offering a crucial tool for early-stage identification of antibodies with low aggregation propensity. AB-Amy 2.0 was trained on antiBERTy embeddings and utilizes the SVM algorithm, resulting in superior performance metrics. On an independent test dataset, the model achieved high sensitivity, specificity, ACC, MCC and AUC of 93.47%, 89.23%, 91.92%, 0.8261 and 0.9739, respectively. These results highlight the effectiveness and robustness of AB-Amy 2.0 in predicting light chain amyloidogenic risk accurately. To facilitate user-friendly access, we have developed an online web server (http://i.uestc.edu.cn/AB-Amy2) and a command line tool (https://github.com/zzyywww/ABAmy2). These resources enable the broader application of this advanced model and promise to enhance the development of safer therapeutic antibodies.
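For readers unfamiliar with the threshold-based metrics quoted in this abstract, the following is a minimal sketch of how sensitivity, specificity, accuracy (ACC) and the Matthews correlation coefficient (MCC) are conventionally derived from a binary confusion matrix. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper; AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary-classification metrics from confusion-matrix counts.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # MCC denominator is zero when any row or column of the matrix is empty;
    # the conventional fallback in that case is 0.0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only; not the paper's test set.
example = binary_metrics(tp=90, fp=10, tn=90, fn=10)
print(example)
```

Because MCC uses all four cells of the confusion matrix, it is often reported alongside accuracy when the positive and negative classes are unevenly sized.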
Affiliation(s)
- Yuwei Zhou
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
- Wenwen Liu
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
- Chunmei Luo
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu 611731, China
- Ziru Huang
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
- Lening Zhao
- Yingcai Honors College, University of Electronic Science and Technology of China, Chengdu 611731, China
- Rong Wang
- Sichuan Academy of Medical Sciences and Sichuan Provincial People's Hospital, University of Electronic Science and Technology of China, Chengdu 611731, China
- Jian Huang
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China; School of Healthcare Technology, Chengdu Neusoft University, Chengdu 611731, China
|
50
|
Zhang M, Zhang Y, Dong K, Lin J, Cui X, Zhang Y. Identification of Critical Phosphorylation Sites Enhancing Kinase Activity With a Bimodal Fusion Framework. Mol Cell Proteomics 2025; 24:100889. [PMID: 39617062 PMCID: PMC11774822 DOI: 10.1016/j.mcpro.2024.100889] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2024] [Revised: 11/26/2024] [Accepted: 11/28/2024] [Indexed: 01/12/2025] Open
Abstract
Phosphorylation is an indispensable regulatory mechanism in cells, with specific sites on kinases that can significantly enhance their activity. Although several such critical phosphorylation sites (phos-sites) have been experimentally identified, many more remain to be explored. To date, no computational method exists to systematically identify these critical phos-sites on kinases. In this study, we introduce PhoSiteformer, a transformer-inspired foundational model designed to generate embeddings of phos-sites using phosphorylation mass spectrometry data. Recognizing the complementary insights offered by protein sequence data and phosphorylation mass spectrometry data, we developed a classification model, CSPred, which employs a bimodal fusion strategy. CSPred combines embeddings from PhoSiteformer with those from the protein language model ProtT5. Our approach successfully identified 77 critical phos-sites on 58 human kinases. Two of these sites, T517 on PKG1 and T735 on PRKD3, have been experimentally verified. This study presents the first systematic and computational approach to identify critical phos-sites that enhance kinase activity.
Affiliation(s)
- Menghuan Zhang
- State Key Laboratory of Cardiovascular Diseases and Medical Innovation Center, Institute for Regenerative Medicine, Department of Neurosurgery, Shanghai East Hospital, Shanghai Key Laboratory of Signaling and Disease Research, Frontier Science Center for Stem Cell Research, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Yizhi Zhang
- State Key Laboratory of Cardiovascular Diseases and Medical Innovation Center, Institute for Regenerative Medicine, Department of Neurosurgery, Shanghai East Hospital, Shanghai Key Laboratory of Signaling and Disease Research, Frontier Science Center for Stem Cell Research, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Keqin Dong
- Department of Urology, School of Medicine, Xinhua Hospital Affiliated to Shanghai Jiao Tong University, Shanghai, China
- Jin Lin
- State Key Laboratory of Cardiovascular Diseases and Medical Innovation Center, Institute for Regenerative Medicine, Department of Neurosurgery, Shanghai East Hospital, Shanghai Key Laboratory of Signaling and Disease Research, Frontier Science Center for Stem Cell Research, School of Life Sciences and Technology, Tongji University, Shanghai, China
- Xingang Cui
- Department of Urology, School of Medicine, Xinhua Hospital Affiliated to Shanghai Jiao Tong University, Shanghai, China
- Yong Zhang
- State Key Laboratory of Cardiovascular Diseases and Medical Innovation Center, Institute for Regenerative Medicine, Department of Neurosurgery, Shanghai East Hospital, Shanghai Key Laboratory of Signaling and Disease Research, Frontier Science Center for Stem Cell Research, School of Life Sciences and Technology, Tongji University, Shanghai, China
|