1
Shen X, Zhang S, Long J, Chen C, Wang M, Cui Z, Chen B, Tan T. A Highly Sensitive Model Based on Graph Neural Networks for Enzyme Key Catalytic Residue Prediction. J Chem Inf Model 2023; 63:4277-4290. [PMID: 37399293 DOI: 10.1021/acs.jcim.3c00273] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/05/2023]
Abstract
Determining the catalytic site of enzymes is a great help for understanding the relationship between protein sequence, structure, and function, which provides the basis and targets for designing, modifying, and enhancing enzyme activity. The unique local spatial configuration bound to the substrate at the active center of the enzyme determines the catalytic ability of enzymes and plays an important role in the catalytic site prediction. As a suitable tool, the graph neural network can better understand and identify the residue sites with unique local spatial configurations due to its remarkable ability to characterize the three-dimensional structural features of proteins. Consequently, a novel model for predicting enzyme catalytic sites has been developed, which incorporates a uniquely designed adaptive edge-gated graph attention neural network (AEGAN). This model is capable of effectively handling sequential and structural characteristics of proteins at various levels, and the extracted features enable an accurate description of the local spatial configuration of the enzyme active site by sampling the local space around candidate residues and special design of amino acid physical and chemical properties. To evaluate its performance, the model was compared with existing catalytic site prediction models using different benchmark datasets and achieved the best results on each benchmark dataset. The model exhibited a sensitivity of 0.9659, accuracy of 0.9226, and area under the precision-recall curve (AUPRC) of 0.9241 on the independent test set constructed for evaluation. Furthermore, the F1-score of this model is nearly four times higher than that of the best-performing similar model in previous studies. This research can serve as a valuable tool to help researchers understand protein sequence-structure-function relationships while facilitating the characterization of novel enzymes of unknown function.
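The sensitivity, accuracy, and F1-score quoted in this abstract are standard confusion-matrix metrics. The following is a minimal sketch of how such metrics are computed; the function name and the example counts are illustrative assumptions and are not taken from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion counts.

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives and false negatives (here: residues).
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # also called recall
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts for illustration only (not the paper's data).
metrics = classification_metrics(tp=8, fp=2, tn=88, fn=2)
print(metrics)
```

Because catalytic residues are rare relative to non-catalytic ones, accuracy alone can look high even for a weak predictor, which is why sensitivity, F1, and AUPRC are reported alongside it.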
Affiliation(s)
- Xiaowei Shen
  - National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
- Shiding Zhang
  - National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
- Jianyu Long
  - National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
- Changjing Chen
  - National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
- Meng Wang
  - National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
- Ziheng Cui
  - National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
- Biqiang Chen
  - National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
- Tianwei Tan
  - National Energy R&D Center for Biorefinery, Beijing University of Chemical Technology, 100029, Beijing, China
2
Amirkhani A, Kolahdoozi M, Wang C, Kurgan LA. Prediction of DNA-Binding Residues in Local Segments of Protein Sequences with Fuzzy Cognitive Maps. IEEE/ACM Trans Comput Biol Bioinform 2020; 17:1372-1382. [PMID: 30602422 DOI: 10.1109/tcbb.2018.2890261] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 06/09/2023]
Abstract
While protein-DNA interactions are crucial for a wide range of cellular functions, only a small fraction of these interactions has been annotated to date. One solution to close this annotation gap is to employ computational methods that accurately predict protein-DNA interactions from widely available protein sequences. We present and empirically test a first-of-its-kind predictor of DNA-binding residues in local segments of protein sequences that relies on the Fuzzy Cognitive Map (FCM) model. The FCM model uses information about putative solvent accessibility, evolutionary conservation, and relative propensities of amino acids to interact with DNA to generate putative DNA-binding residues. Empirical tests on a benchmark dataset reveal that the FCM model secures AUC = 0.72 and outperforms the recently released hybridNAP predictor and several popular machine learning methods including Support Vector Machines, Naïve Bayes, and k-Nearest Neighbor. The improvements in the predictive performance result from an intrinsic feature of FCMs that incorporate relations between the input features, besides the relations between the inputs and output that are modelled by other algorithms. We also empirically demonstrate that the use of a short sliding window results in further improvements in the predictive quality. The funDNApred webserver that implements the FCM predictor is available at http://biomine.cs.vcu.edu/servers/funDNApred/.
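The abstract does not say how the short sliding window is applied. One common reading is that each residue's score is combined with the scores of its sequence neighbors. The sketch below assumes a simple centered average, which is an illustrative choice and not necessarily what funDNApred does; the function name and the scores are hypothetical.

```python
def sliding_window_average(scores, half_width=1):
    """Smooth per-residue scores with a centered window.

    The window is truncated at the sequence ends, so terminal residues
    are averaged over fewer neighbors.
    """
    n = len(scores)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        window = scores[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical raw per-residue binding scores.
raw = [0.0, 0.9, 0.0, 0.0, 0.6]
smoothed = sliding_window_average(raw, half_width=1)
print(smoothed)
```

Smoothing of this kind reflects the observation that binding residues tend to cluster along the chain, so an isolated high score is pulled down while a run of moderately high scores is reinforced.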
3
Predicting Apoptosis Protein Subcellular Locations based on the Protein Overlapping Property Matrix and Tri-Gram Encoding. Int J Mol Sci 2019; 20:ijms20092344. [PMID: 31083553 PMCID: PMC6539631 DOI: 10.3390/ijms20092344] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 04/01/2019] [Revised: 04/25/2019] [Accepted: 05/08/2019] [Indexed: 12/22/2022] Open
Abstract
To reveal the working pattern of programmed cell death, knowledge of the subcellular location of apoptosis proteins is essential. Besides the costly and time-consuming method of experimental determination, research into computational locating schemes, focusing mainly on the innovation of representation techniques on protein sequences and the selection of classification algorithms, has become popular in recent decades. In this study, a novel tri-gram encoding model is proposed, which is based on using the protein overlapping property matrix (POPM) for predicting apoptosis protein subcellular location. Next, a 1000-dimensional feature vector is built to represent a protein. Finally, with the help of support vector machine-recursive feature elimination (SVM-RFE), we select the optimal features and put them into a support vector machine (SVM) classifier for predictions. The results of jackknife tests on two benchmark datasets demonstrate that our proposed method can achieve satisfactory prediction performance level with less computing capacity required and could work as a promising tool to predict the subcellular locations of apoptosis proteins.
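A 1000-dimensional vector is what tri-gram counting produces over a 10-symbol alphabet (10 × 10 × 10). The sketch below illustrates that idea only. It assumes a made-up, non-overlapping grouping of the 20 amino acids into 10 classes, whereas the POPM in the paper is built on overlapping properties, so the grouping, names, and indexing here are hypothetical.

```python
# Hypothetical 10-class reduced alphabet (NOT the paper's POPM grouping).
GROUP_STRINGS = ["AG", "VI", "LM", "FW", "YH", "ST", "CP", "NQ", "DE", "KR"]
GROUP_OF = {aa: idx for idx, members in enumerate(GROUP_STRINGS) for aa in members}


def trigram_vector(sequence):
    """Count tri-grams of group codes, giving a 10**3 = 1000-dim vector."""
    vec = [0] * 1000
    for i in range(len(sequence) - 2):
        codes = [GROUP_OF.get(aa) for aa in sequence[i:i + 3]]
        if None in codes:
            continue  # skip windows containing non-standard residues
        index = codes[0] * 100 + codes[1] * 10 + codes[2]
        vec[index] += 1
    return vec


features = trigram_vector("ACDEFX")
print(len(features), sum(features))
```

A vector of this kind would then go to a feature-selection step such as SVM-RFE, as the abstract describes, before classification.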
4
Song J, Li F, Takemoto K, Haffari G, Akutsu T, Chou KC, Webb GI. PREvaIL, an integrative approach for inferring catalytic residues using sequence, structural, and network features in a machine-learning framework. J Theor Biol 2018; 443:125-137. [DOI: 10.1016/j.jtbi.2018.01.023] [Citation(s) in RCA: 95] [Impact Index Per Article: 15.8] [Received: 09/12/2017] [Revised: 01/17/2018] [Accepted: 01/18/2018] [Indexed: 10/18/2022]
5
Prediction of Protein Phosphorylation Sites by Integrating Secondary Structure Information and Other One-Dimensional Structural Properties. Methods Mol Biol 2018; 1484:265-274. [PMID: 27787832 DOI: 10.1007/978-1-4939-6406-2_18] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 02/07/2023]
Abstract
Studies on phosphorylation are important but challenging for both wet-bench experiments and computational studies, and accurate non-kinase-specific prediction tools are highly desirable for whole-genome annotation in a wide variety of species. Here, we describe a phosphorylation site prediction webserver, PhosphoSVM, that employs Support Vector Machine to combine protein secondary structure information and seven other one-dimensional structural properties, including Shannon entropy, relative entropy, predicted protein disorder information, predicted solvent accessible area, amino acid overlapping properties, averaged cumulative hydrophobicity, and subsequence k-nearest neighbor profiles. This method achieved AUC values of 0.8405/0.8183/0.7383 for serine (S), threonine (T), and tyrosine (Y) phosphorylation sites, respectively, in animals with a tenfold cross-validation. The model trained by the animal phosphorylation sites was also applied to a plant phosphorylation site dataset as an independent test. The AUC values for the independent test data set were 0.7761/0.6652/0.5958 for S/T/Y phosphorylation sites, respectively. This algorithm with the optimally trained model was implemented as a webserver. The webserver, trained model, and all datasets used in the current study are available at http://sysbio.unl.edu/PhosphoSVM .
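Shannon entropy, the first of the listed one-dimensional properties, is commonly computed per position from the residue frequencies in a column of aligned sequences. The sketch below assumes that reading. The column strings are hypothetical, and PhosphoSVM's exact feature definition may differ.

```python
import math
from collections import Counter


def shannon_entropy(column):
    """Shannon entropy (in bits) of the residue distribution in one column."""
    counts = Counter(column)
    total = len(column)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# Hypothetical alignment columns: fully conserved, two-state, four-state.
conserved = shannon_entropy("SSSS")
two_state = shannon_entropy("SSTT")
four_state = shannon_entropy("STYA")
print(conserved, two_state, four_state)
```

Low entropy indicates a conserved position, and conserved positions are often functionally important.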
6
Zhang J, Ma Z, Kurgan L. Comprehensive review and empirical analysis of hallmarks of DNA-, RNA- and protein-binding residues in protein chains. Brief Bioinform 2017; 20:1250-1268. [DOI: 10.1093/bib/bbx168] [Citation(s) in RCA: 60] [Impact Index Per Article: 8.6] [Received: 08/30/2017] [Revised: 11/15/2017] [Indexed: 11/13/2022] Open
Abstract
Proteins interact with a variety of molecules including proteins and nucleic acids. We review a comprehensive collection of over 50 studies that analyze and/or predict these interactions. While the majority of these studies address solely protein–DNA or protein–RNA binding, only a few have a wider scope that covers both protein–protein and protein–nucleic acid binding. Our analysis reveals that binding residues are typically characterized by three hallmarks: relative solvent accessibility (RSA), evolutionary conservation and propensity of amino acids (AAs) for binding. Motivated by drawbacks of the prior studies, we perform a large-scale analysis to quantify and contrast the three hallmarks for DNA-, RNA- and protein-binding residues and (for the first time) multi-ligand-binding residues that interact with DNA and proteins, and with RNA and proteins. Results generated on a well-annotated data set of over 23 000 proteins show that conservation of binding residues is higher for nucleic acid- than protein-binding residues. Multi-ligand-binding residues are more conserved and have higher RSA than single-ligand-binding residues. We empirically show that each hallmark discriminates between binding and nonbinding residues, even predicted RSA, and that combining them improves discriminatory power for each of the five types of interactions. Linear scoring functions that combine these hallmarks offer good predictive performance of residue-level propensity for binding and provide intuitive interpretation of predictions. Better understanding of these residue-level interactions will facilitate development of methods that accurately predict binding in the exponentially growing databases of protein sequences.
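A linear scoring function over the three hallmarks can be sketched as a weighted sum. The weights, bias, residue labels, and input values below are hypothetical placeholders. The abstract does not give the fitted coefficients, and the inputs are assumed to be pre-scaled to comparable ranges.

```python
def linear_binding_score(rsa, conservation, propensity,
                         weights=(1.0, 1.0, 1.0), bias=0.0):
    """Weighted sum of the three hallmarks for one residue.

    Each weight shows directly how much a hallmark contributes, which is
    what makes a linear score easy to interpret.
    """
    w_rsa, w_cons, w_prop = weights
    return bias + w_rsa * rsa + w_cons * conservation + w_prop * propensity


# Hypothetical residues: (label, RSA, conservation, propensity).
residues = [("R12", 0.5, 0.75, 0.5), ("L40", 0.25, 0.25, 0.0)]
scored = sorted(
    ((label, linear_binding_score(r, c, p)) for label, r, c, p in residues),
    key=lambda item: item[1],
    reverse=True,
)
print(scored)
```

In practice the weights would be fitted separately for each interaction type, since the abstract reports that the hallmarks differ between, for example, nucleic acid-binding and protein-binding residues.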
7
Ismail HD, Newman RH, Kc DB. RF-Hydroxysite: a random forest based predictor for hydroxylation sites. Mol Biosyst 2017; 12:2427-35. [PMID: 27292874 DOI: 10.1039/c6mb00179c] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Indexed: 12/27/2022]
Abstract
Protein hydroxylation is an emerging posttranslational modification involved in both normal cellular processes and a growing number of pathological states, including several cancers. Protein hydroxylation is mediated by members of the hydroxylase family of enzymes, which catalyze the conversion of an alkyne group at select lysine or proline residues on their target substrates to a hydroxyl. Traditionally, hydroxylation has been identified using expensive and time-consuming experimental methods, such as tandem mass spectrometry. Therefore, to facilitate identification of putative hydroxylation sites and to complement existing experimental approaches, computational methods designed to predict the hydroxylation sites in protein sequences have recently been developed. Building on these efforts, we have developed a new method, termed RF-hydroxysite, that uses random forest to identify putative hydroxylysine and hydroxyproline residues in proteins using only the primary amino acid sequence as input. RF-Hydroxysite integrates features previously shown to contribute to hydroxylation site prediction with several new features that we found to augment the performance remarkably. These include features that capture physicochemical, structural, sequence-order and evolutionary information from the protein sequences. The features used in the final model were selected based on their contribution to the prediction. Physicochemical information was found to contribute the most to the model. The present study also sheds light on the contribution of evolutionary, sequence order, and protein disordered region information to hydroxylation site prediction. The web server for RF-hydroxysite is available online at .
Affiliation(s)
- Hamid D Ismail
  - Department of Computational Science and Engineering, NCA&T State University, Greensboro, NC 27411, USA
- Robert H Newman
  - Department of Biology, NCA&T State University, Greensboro, NC 27411, USA
- Dukka B Kc
  - Department of Computational Science and Engineering, NCA&T State University, Greensboro, NC 27411, USA
8
RF-Phos: A Novel General Phosphorylation Site Prediction Tool Based on Random Forest. Biomed Res Int 2016; 2016:3281590. [PMID: 27066500 PMCID: PMC4811047 DOI: 10.1155/2016/3281590] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 10/02/2015] [Revised: 01/13/2016] [Accepted: 01/31/2016] [Indexed: 01/17/2023]
Abstract
Protein phosphorylation is one of the most widespread regulatory mechanisms in eukaryotes. Over the past decade, phosphorylation site prediction has emerged as an important problem in the field of bioinformatics. Here, we report a new method, termed Random Forest-based Phosphosite predictor 2.0 (RF-Phos 2.0), to predict phosphorylation sites given only the primary amino acid sequence of a protein as input. RF-Phos 2.0, which uses random forest with sequence and structural features, is able to identify putative sites of phosphorylation across many protein families. In side-by-side comparisons based on 10-fold cross validation and an independent dataset, RF-Phos 2.0 compares favorably to other popular mammalian phosphosite prediction methods, such as PhosphoSVM, GPS2.1, and Musite.
9
Xiao X, Hui MJ, Liu Z, Qiu WR. iCataly-PseAAC: Identification of Enzymes Catalytic Sites Using Sequence Evolution Information with Grey Model GM (2,1). J Membr Biol 2015; 248:1033-41. [DOI: 10.1007/s00232-015-9815-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 03/23/2015] [Accepted: 06/06/2015] [Indexed: 11/25/2022]
10
Dou Y, Yao B, Zhang C. PhosphoSVM: prediction of phosphorylation sites by integrating various protein sequence attributes with a support vector machine. Amino Acids 2014; 46:1459-69. [DOI: 10.1007/s00726-014-1711-5] [Citation(s) in RCA: 105] [Impact Index Per Article: 10.5] [Received: 07/23/2013] [Accepted: 02/21/2014] [Indexed: 02/01/2023]
11
Han L, Zhang YJ, Song J, Liu MS, Zhang Z. Identification of catalytic residues using a novel feature that integrates the microenvironment and geometrical location properties of residues. PLoS One 2012; 7:e41370. [PMID: 22829945 PMCID: PMC3400608 DOI: 10.1371/journal.pone.0041370] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 03/22/2012] [Accepted: 06/20/2012] [Indexed: 11/18/2022] Open
Abstract
Enzymes play a fundamental role in almost all biological processes and identification of catalytic residues is a crucial step for deciphering the biological functions and understanding the underlying catalytic mechanisms. In this work, we developed a novel structural feature called MEDscore to identify catalytic residues, which integrated the microenvironment (ME) and geometrical properties of amino acid residues. Firstly, we converted a residue's ME into a series of spatially neighboring residue pairs, whose likelihood of being located in a catalytic ME was deduced from a benchmark enzyme dataset. We then calculated an ME-based score, termed as MEscore, by summing up the likelihood of all residue pairs. Secondly, we defined a parameter called Dscore to measure the relative distance of a residue to the center of the protein, provided that catalytic residues are typically located in the center of the protein structure. Finally, we defined the MEDscore feature based on an effective nonlinear integration of MEscore and Dscore. When evaluated on a well-prepared benchmark dataset using five-fold cross-validation tests, MEDscore achieved a robust performance in identifying catalytic residues with an AUC1.0 of 0.889. At a ≤ 10% false positive rate control, MEDscore correctly identified approximately 70% of the catalytic residues. Remarkably, MEDscore achieved a competitive performance compared with the residue conservation score (e.g. CONscore), the most informative singular feature predominantly employed to identify catalytic residues. To the best of our knowledge, MEDscore is the first singular structural feature exhibiting such an advantage. More importantly, we found that MEDscore is complementary with CONscore and a significantly improved performance can be achieved by combining CONscore with MEDscore in a linear manner. As an implementation of this work, MEDscore has been made freely accessible at http://protein.cau.edu.cn/mepi/.
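The Dscore idea, a residue's relative distance to the protein center, can be illustrated with a small geometric sketch. It assumes one representative coordinate per residue (for example the C-alpha atom), takes the center as the coordinate centroid, and normalizes by the largest distance. These choices are illustrative assumptions, not the paper's exact definition.

```python
import math


def relative_distance_to_center(coords):
    """Return each residue's distance to the centroid, scaled to [0, 1].

    coords is a list of (x, y, z) tuples, one per residue.
    A value near 0 means the residue sits close to the protein center.
    """
    n = len(coords)
    center = tuple(sum(c[k] for c in coords) / n for k in range(3))
    distances = [math.dist(c, center) for c in coords]
    max_distance = max(distances)
    if max_distance == 0:
        return [0.0] * n
    return [d / max_distance for d in distances]


# Hypothetical three-residue "protein" laid out along the x-axis.
toy_coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
dscore_like = relative_distance_to_center(toy_coords)
print(dscore_like)
```

The abstract then describes combining this geometric term nonlinearly with the microenvironment-based MEscore. That integration step is not reproduced here because its functional form is not given.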
Affiliation(s)
- Lei Han
  - State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing, People's Republic of China
- Yong-Jun Zhang
  - State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, People's Republic of China
- Jiangning Song
  - National Engineering Laboratory for Industrial Enzymes and Key Laboratory of Systems Microbial Biotechnology, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, People's Republic of China
  - Department of Biochemistry and Molecular Biology, Faculty of Medicine, Monash University, Melbourne, Victoria, Australia
- Ming S. Liu
  - CSIRO - Mathematics, Informatics and Statistics, Clayton, Victoria, Australia
  - Corresponding author
- Ziding Zhang
  - State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing, People's Republic of China
  - Corresponding author
12
Dou Y, Wang J, Yang J, Zhang C. L1pred: a sequence-based prediction tool for catalytic residues in enzymes with the L1-logreg classifier. PLoS One 2012; 7:e35666. [PMID: 22558194 PMCID: PMC3338704 DOI: 10.1371/journal.pone.0035666] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Received: 01/04/2012] [Accepted: 03/19/2012] [Indexed: 12/01/2022] Open
Abstract
To understand enzyme functions, identifying the catalytic residues is a usual first step. Moreover, knowledge about catalytic residues is also useful for protein engineering and drug-design. However, to experimentally identify catalytic residues remains challenging for reasons of time and cost. Therefore, computational methods have been explored to predict catalytic residues. Here, we developed a new algorithm, L1pred, for catalytic residue prediction, by using the L1-logreg classifier to integrate eight sequence-based scoring functions. We tested L1pred and compared it against several existing sequence-based methods on carefully designed datasets Data604 and Data63. With ten-fold cross-validation, L1pred showed the area under precision-recall curve (AUPR) and the area under ROC curve (AUC) of 0.2198 and 0.9494 on the training dataset, Data604, respectively. In addition, on the independent test dataset, Data63, it showed the AUPR and AUC values of 0.2636 and 0.9375, respectively. Compared with other sequence-based methods, L1pred showed the best performance on both datasets. We also analyzed the importance of each attribute in the algorithm, and found that all the scores contributed more or less equally to the L1pred performance.
Affiliation(s)
- Yongchao Dou
  - School of Biological Sciences, Center for Plant Science and Innovation, University of Nebraska, Lincoln, Nebraska, United States of America
- Jun Wang
  - Scientific Computing Key Laboratory of Shanghai Universities, Shanghai, People’s Republic of China
  - Department of Mathematics, Shanghai Normal University, Shanghai, People’s Republic of China
- Jialiang Yang
  - MPI-Institute of Computational Biology, Chinese Academy of Sciences, Shanghai, People’s Republic of China
- Chi Zhang
  - School of Biological Sciences, Center for Plant Science and Innovation, University of Nebraska, Lincoln, Nebraska, United States of America
  - Corresponding author
13
Dou Y, Geng X, Gao H, Yang J, Zheng X, Wang J. Sequence Conservation in the Prediction of Catalytic Sites. Protein J 2011; 30:229-39. [PMID: 21465136 DOI: 10.1007/s10930-011-9324-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 10/18/2022]