1
Gao J, Liu H, Zhuo C, Zeng C, Zhao Y. Predicting Small Molecule Binding Nucleotides in RNA Structures Using RNA Surface Topography. J Chem Inf Model 2024. [PMID: 39230508 DOI: 10.1021/acs.jcim.4c01264] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2024]
Abstract
RNA small molecule interactions play a crucial role in drug discovery and inhibitor design. Identifying RNA small molecule binding nucleotides is essential and requires methods that exhibit a high predictive ability to facilitate drug discovery and inhibitor design. Existing methods can predict the binding nucleotides of simple RNA structures, but it is hard to predict binding nucleotides in complex RNA structures with junctions. To address this limitation, we developed a new deep learning model based on spatial correlation, ZHmolReSTasite, which can accurately predict binding nucleotides of small and large RNA with junctions. We utilize RNA surface topography to consider the spatial correlation, characterizing nucleotides from sequence and tertiary structures to learn a high-level representation. Our method outperforms existing methods for benchmark test sets composed of simple RNA structures, achieving precision values of 72.9% on TE18 and 76.7% on RB9 test sets. For a challenging test set composed of RNA structures with junctions, our method outperforms the second best method by 11.6% in precision. Moreover, ZHmolReSTasite demonstrates robustness regarding the predicted RNA structures. In summary, ZHmolReSTasite successfully incorporates spatial correlation, outperforms previous methods on small and large RNA structures using RNA surface topography, and can provide valuable insights into RNA small molecule prediction and accelerate RNA inhibitor design.
Affiliation(s)
- Jiaming Gao
  - Institute of Biophysics and Department of Physics, Central China Normal University, Wuhan 430079, China
- Haoquan Liu
  - Institute of Biophysics and Department of Physics, Central China Normal University, Wuhan 430079, China
- Chen Zhuo
  - Institute of Biophysics and Department of Physics, Central China Normal University, Wuhan 430079, China
- Chengwei Zeng
  - Institute of Biophysics and Department of Physics, Central China Normal University, Wuhan 430079, China
- Yunjie Zhao
  - Institute of Biophysics and Department of Physics, Central China Normal University, Wuhan 430079, China
2
Li P, Liu ZP. MuToN Quantifies Binding Affinity Changes upon Protein Mutations by Geometric Deep Learning. Advanced Science (Weinheim, Baden-Wurttemberg, Germany) 2024; 11:e2402918. [PMID: 38995072 PMCID: PMC11425207 DOI: 10.1002/advs.202402918] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2024] [Revised: 06/04/2024] [Indexed: 07/13/2024]
Abstract
Assessing changes in protein-protein binding affinity due to mutations helps in understanding a wide range of crucial biological processes within cells. Despite significant efforts to create accurate computational models, predicting how mutations affect affinity remains challenging due to the complexity of the biological mechanisms involved. In the present work, a geometric deep learning framework called MuToN is introduced for quantifying protein binding affinity change upon residue mutations. The method, designed with geometric attention networks, is mechanism-aware. It captures changes in the protein binding interfaces of mutated complexes and assesses the allosteric effects of amino acids. Experimental results highlight MuToN's superiority compared to existing methods. Additionally, MuToN's flexibility and effectiveness are illustrated by its precise predictions of binding affinity changes between SARS-CoV-2 variants and the ACE2 complex.
Affiliation(s)
- Pengpai Li
  - Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, Shandong, 250061, China
- Zhi-Ping Liu
  - Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, Shandong, 250061, China
3
Wang B, Li W. Advances in the Application of Protein Language Modeling for Nucleic Acid Protein Binding Site Prediction. Genes (Basel) 2024; 15:1090. [PMID: 39202449 PMCID: PMC11353971 DOI: 10.3390/genes15081090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2024] [Revised: 08/13/2024] [Accepted: 08/14/2024] [Indexed: 09/03/2024]
Abstract
Protein and nucleic acid binding site prediction is a critical computational task that benefits a wide range of biological processes. Previous studies have shown that feature selection holds particular significance for this prediction task, making the generation of more discriminative features a key area of interest for many researchers. Recent progress has shown the power of protein language models in handling protein sequences, in leveraging the strengths of attention networks, and in successful applications to tasks such as protein structure prediction. This naturally raises the question of the applicability of protein language models in predicting protein and nucleic acid binding sites. Various approaches have explored this potential. This paper first describes the development of protein language models. Then, a systematic review of the latest methods for predicting protein and nucleic acid binding sites is conducted by covering benchmark sets, feature generation methods, performance comparisons, and feature ablation studies. These comparisons demonstrate the importance of protein language models for the prediction task. Finally, the paper discusses the challenges of protein and nucleic acid binding site prediction and proposes possible research directions and future trends. The purpose of this survey is to furnish researchers with actionable suggestions for comprehending the methodologies used in predicting protein-nucleic acid binding sites, fostering the creation of protein-centric language models, and tackling real-world obstacles encountered in this field.
Affiliation(s)
- Wenjin Li
  - Institute for Advanced Study, Shenzhen University, Shenzhen 518061, China
4
Yuan Q, Tian C, Song Y, Ou P, Zhu M, Zhao H, Yang Y. GPSFun: geometry-aware protein sequence function predictions with language models. Nucleic Acids Res 2024; 52:W248-W255. [PMID: 38738636 PMCID: PMC11223820 DOI: 10.1093/nar/gkae381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2024] [Revised: 04/22/2024] [Accepted: 04/26/2024] [Indexed: 05/14/2024]
Abstract
Knowledge of protein function is essential for elucidating disease mechanisms and discovering new drug targets. However, there is a widening gap between the exponential growth of protein sequences and their limited function annotations. In our prior studies, we have developed a series of methods including GraphPPIS, GraphSite, LMetalSite and SPROF-GO for protein function annotations at residue or protein level. To further enhance their applicability and performance, we now present GPSFun, a versatile web server for Geometry-aware Protein Sequence Function annotations, which equips our previous tools with language models and geometric deep learning. Specifically, GPSFun employs large language models to efficiently predict 3D conformations of the input protein sequences and extract informative sequence embeddings. Subsequently, geometric graph neural networks are utilized to capture the sequence and structure patterns in the protein graphs, facilitating various downstream predictions including protein-ligand binding sites, gene ontologies, subcellular locations and protein solubility. Notably, GPSFun achieves superior performance to state-of-the-art methods across diverse tasks without requiring multiple sequence alignments or experimental protein structures. GPSFun is freely available to all users at https://bio-web1.nscc-gz.cn/app/GPSFun with user-friendly interfaces and rich visualizations.
Affiliation(s)
- Qianmu Yuan
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong 510000, China
- Chong Tian
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong 510000, China
- Yidong Song
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong 510000, China
- Peihua Ou
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong 510000, China
- Mingming Zhu
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong 510000, China
- Huiying Zhao
  - Sun Yat-sen Memorial Hospital, Sun Yat-sen University, Guangzhou, Guangdong 510000, China
- Yuedong Yang
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong 510000, China
5
Sagendorf JM, Mitra R, Huang J, Chen XS, Rohs R. Structure-based prediction of protein-nucleic acid binding using graph neural networks. Biophys Rev 2024; 16:297-314. [PMID: 39345796 PMCID: PMC11427629 DOI: 10.1007/s12551-024-01201-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/19/2024] [Accepted: 05/28/2024] [Indexed: 10/01/2024]
Abstract
Protein-nucleic acid (PNA) binding plays critical roles in the transcription, translation, regulation, and three-dimensional organization of the genome. Structural models of proteins bound to nucleic acids (NA) provide insights into the chemical, electrostatic, and geometric properties of the protein structure that give rise to NA binding but are scarce relative to models of unbound proteins. We developed a deep learning approach, which we call PNAbind, for predicting PNA binding given the unbound structure of a protein. Our method utilizes graph neural networks to encode the spatial distribution of physicochemical and geometric properties of protein structures that are predictive of NA binding. Using global physicochemical encodings, our models predict the overall binding function of a protein, and using local encodings, they predict the location of individual NA binding residues. Our models can discriminate between specificity for DNA or RNA binding, and we show that predictions made on computationally derived protein structures can be used to gain mechanistic understanding of chemical and structural features that determine NA recognition. Binding site predictions were validated against benchmark datasets, achieving AUROC scores in the range of 0.92-0.95. We applied our models to the HIV-1 restriction factor APOBEC3G and showed that our model predictions are consistent with and help explain experimental RNA binding data. Supplementary information: The online version contains supplementary material available at 10.1007/s12551-024-01201-w.
Affiliation(s)
- Jared M. Sagendorf
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089 USA
  - Present Address: Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA 94158 USA
- Raktim Mitra
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089 USA
- Jiawei Huang
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089 USA
- Xiaojiang S. Chen
  - Molecular and Computational Biology Section, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089 USA
  - Department of Chemistry, University of Southern California, Los Angeles, CA 90089 USA
- Remo Rohs
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089 USA
  - Department of Chemistry, University of Southern California, Los Angeles, CA 90089 USA
  - Department of Physics and Astronomy, University of Southern California, Los Angeles, CA 90089 USA
  - Thomas Lord Department of Computer Science, University of Southern California, Los Angeles, CA 90089 USA
6
Yuan Q, Tian C, Yang Y. Genome-scale annotation of protein binding sites via language model and geometric deep learning. eLife 2024; 13:RP93695. [PMID: 38630609 PMCID: PMC11023698 DOI: 10.7554/elife.93695] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/19/2024]
Abstract
Revealing protein binding sites with other molecules, such as nucleic acids, peptides, or small ligands, sheds light on disease mechanism elucidation and novel drug design. With the explosive growth of proteins in sequence databases, how to accurately and efficiently identify these binding sites from sequences becomes essential. However, current methods mostly rely on expensive multiple sequence alignments or experimental protein structures, limiting their genome-scale applications. Besides, these methods haven't fully explored the geometry of the protein structures. Here, we propose GPSite, a multi-task network for simultaneously predicting binding residues of DNA, RNA, peptide, protein, ATP, HEM, and metal ions on proteins. GPSite was trained on informative sequence embeddings and predicted structures from protein language models, while comprehensively extracting residual and relational geometric contexts in an end-to-end manner. Experiments demonstrate that GPSite substantially surpasses state-of-the-art sequence-based and structure-based approaches on various benchmark datasets, even when the structures are not well-predicted. The low computational cost of GPSite enables rapid genome-scale binding residue annotations for over 568,000 sequences, providing opportunities to unveil unexplored associations of binding sites with molecular functions, biological processes, and genetic variants. The GPSite webserver and annotation database can be freely accessed at https://bio-web1.nscc-gz.cn/app/GPSite.
Affiliation(s)
- Qianmu Yuan
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Chong Tian
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Yuedong Yang
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
7
Roche R, Moussad B, Shuvo MH, Tarafder S, Bhattacharya D. EquiPNAS: improved protein-nucleic acid binding site prediction using protein-language-model-informed equivariant deep graph neural networks. Nucleic Acids Res 2024; 52:e27. [PMID: 38281252 PMCID: PMC10954458 DOI: 10.1093/nar/gkae039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2023] [Revised: 12/22/2023] [Accepted: 01/11/2024] [Indexed: 01/30/2024]
Abstract
Protein language models (pLMs) trained on a large corpus of protein sequences have shown unprecedented scalability and broad generalizability in a wide range of predictive modeling tasks, but their power has not yet been harnessed for predicting protein-nucleic acid binding sites, critical for characterizing the interactions between proteins and nucleic acids. Here, we present EquiPNAS, a new pLM-informed E(3) equivariant deep graph neural network framework for improved protein-nucleic acid binding site prediction. By combining the strengths of pLM and symmetry-aware deep graph learning, EquiPNAS consistently outperforms the state-of-the-art methods for both protein-DNA and protein-RNA binding site prediction on multiple datasets across a diverse set of predictive modeling scenarios ranging from using experimental input to AlphaFold2 predictions. Our ablation study reveals that the pLM embeddings used in EquiPNAS are sufficiently powerful to dramatically reduce the dependence on the availability of evolutionary information without compromising on accuracy, and that the symmetry-aware nature of the E(3) equivariant graph-based neural architecture offers remarkable robustness and performance resilience. EquiPNAS is freely available at https://github.com/Bhattacharya-Lab/EquiPNAS.
Affiliation(s)
- Rahmatullah Roche
  - Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
- Bernard Moussad
  - Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
- Md Hossain Shuvo
  - Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
- Sumit Tarafder
  - Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, USA
8
Sagendorf JM, Mitra R, Huang J, Chen XS, Rohs R. PNAbind: Structure-based prediction of protein-nucleic acid binding using graph neural networks. bioRxiv 2024:2024.02.27.582387. [PMID: 38529493 PMCID: PMC10962711 DOI: 10.1101/2024.02.27.582387] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/27/2024]
Abstract
The recognition and binding of nucleic acids (NAs) by proteins depends upon complementary chemical, electrostatic and geometric properties of the protein-NA binding interface. Structural models of protein-NA complexes provide insights into these properties but are scarce relative to models of unbound proteins. We present a deep learning approach for predicting protein-NA binding given the apo structure of a protein (PNAbind). Our method utilizes graph neural networks to encode spatial distributions of physicochemical and geometric properties of the protein molecular surface that are predictive of NA binding. Using global physicochemical encodings, our models predict the overall binding function of a protein and can discriminate between specificity for DNA or RNA binding. We show that such predictions made on protein structures modeled with AlphaFold2 can be used to gain mechanistic understanding of chemical and structural features that determine NA recognition. Using local encodings, our models predict the location of NA binding sites at the level of individual binding residues. Binding site predictions were validated against benchmark datasets, achieving AUROC scores in the range of 0.92-0.95. We applied our models to the HIV-1 restriction factor APOBEC3G and show that our predictions are consistent with experimental RNA binding data.
9
Rao B, Yu X, Bai J, Hu J. E2EATP: Fast and High-Accuracy Protein-ATP Binding Residue Prediction via Protein Language Model Embedding. J Chem Inf Model 2024; 64:289-300. [PMID: 38127815 DOI: 10.1021/acs.jcim.3c01298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2023]
Abstract
Identifying the ATP-binding sites of proteins is fundamentally important for uncovering the mechanisms of protein functions and for drug discovery. Many computational methods have been proposed to predict ATP-binding sites. However, because the quality of feature representation is limited, prediction performance still has considerable room for improvement. In this study, we propose an end-to-end deep learning model, E2EATP, to extract more discriminative information from a protein sequence and thereby improve ATP-binding site prediction performance. Concretely, we employ a pretrained deep learning-based protein language model (ESM2) to automatically extract high-latent discriminative representations of protein sequences relevant for protein functions. Based on ESM2, we design a residual convolutional neural network to train a protein-ATP binding site prediction model. Furthermore, a weighted focal loss function is used to reduce the negative impact of imbalanced data on the model training stage. Experimental results on the two independent testing data sets demonstrate that E2EATP could achieve higher Matthews correlation coefficient and AUC values than most existing state-of-the-art prediction methods. E2EATP (about 0.05 s per protein) is much faster than the other existing prediction methods. Detailed data analyses show that the major advantage of E2EATP lies in the utilization of the pretrained protein language model, which extracts more discriminative information from the protein sequence only. The standalone package of E2EATP is freely available for academic use at https://github.com/jun-csbio/e2eatp/.
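The abstract does not state the exact form of the weighted focal loss. As an illustration only, the sketch below uses the commonly cited binary focal loss with a class weight, which may differ from what E2EATP implements. The function name `weighted_focal_loss` and the default values of `alpha` and `gamma` are assumptions chosen for the example, not values taken from the paper.

```python
import math


def weighted_focal_loss(probs, labels, alpha=0.75, gamma=2.0, eps=1e-7):
    """Mean class-weighted focal loss for binary per-residue predictions.

    probs  -- predicted probabilities that each residue binds ATP
    labels -- 1 for binding residues, 0 for non-binding residues
    alpha  -- weight on the (rare) positive class; hypothetical default
    gamma  -- focusing parameter that down-weights easy examples; hypothetical default
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)         # clamp to avoid log(0)
        p_t = p if y == 1 else 1.0 - p          # probability assigned to the true class
        w = alpha if y == 1 else 1.0 - alpha    # class weight
        total += -w * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)
```

With `gamma=0` and `alpha=0.5` this reduces to half the ordinary binary cross-entropy, so the two parameters can be read as separate knobs: `alpha` rebalances the classes, while `gamma` reduces the contribution of residues the model already classifies confidently.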
Affiliation(s)
- Bing Rao
  - School of Information and Electrical Engineering, Hangzhou City University, Hangzhou 310015, China
- Xuan Yu
  - Glasgow College, University of Electronic Science and Technology of China, Chengdu 611731, China
- Jie Bai
  - School of Information and Electrical Engineering, Hangzhou City University, Hangzhou 310015, China
- Jun Hu
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China