1
|
Zhang Y, Li S, Meng K, Sun S. Machine Learning for Sequence and Structure-Based Protein-Ligand Interaction Prediction. J Chem Inf Model 2024; 64:1456-1472. [PMID: 38385768 DOI: 10.1021/acs.jcim.3c01841] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/23/2024]
Abstract
Developing new drugs is too expensive and time -consuming. Accurately predicting the interaction between drugs and targets will likely change how the drug is discovered. Machine learning-based protein-ligand interaction prediction has demonstrated significant potential. In this paper, computational methods, focusing on sequence and structure to study protein-ligand interactions, are examined. Therefore, this paper starts by presenting an overview of the data sets applied in this area, as well as the various approaches applied for representing proteins and ligands. Then, sequence-based and structure-based classification criteria are subsequently utilized to categorize and summarize both the classical machine learning models and deep learning models employed in protein-ligand interaction studies. Moreover, the evaluation methods and interpretability of these models are proposed. Furthermore, delving into the diverse applications of protein-ligand interaction models in drug research is presented. Lastly, the current challenges and future directions in this field are addressed.
Collapse
Affiliation(s)
- Yunjiang Zhang
- Beijing Key Laboratory for Green Catalysis and Separation, The Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, P. R. China
| | - Shuyuan Li
- Beijing Key Laboratory for Green Catalysis and Separation, The Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, P. R. China
| | - Kong Meng
- Beijing Key Laboratory for Green Catalysis and Separation, The Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, P. R. China
| | - Shaorui Sun
- Beijing Key Laboratory for Green Catalysis and Separation, The Faculty of Environment and Life, Beijing University of Technology, Beijing 100124, P. R. China
| |
Collapse
|
2
|
Dhakal A, McKay C, Tanner JJ, Cheng J. Artificial intelligence in the prediction of protein-ligand interactions: recent advances and future directions. Brief Bioinform 2022; 23:bbab476. [PMID: 34849575 PMCID: PMC8690157 DOI: 10.1093/bib/bbab476] [Citation(s) in RCA: 72] [Impact Index Per Article: 36.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2021] [Revised: 09/28/2021] [Accepted: 10/15/2021] [Indexed: 12/13/2022] Open
Abstract
New drug production, from target identification to marketing approval, takes over 12 years and can cost around $2.6 billion. Furthermore, the COVID-19 pandemic has unveiled the urgent need for more powerful computational methods for drug discovery. Here, we review the computational approaches to predicting protein-ligand interactions in the context of drug discovery, focusing on methods using artificial intelligence (AI). We begin with a brief introduction to proteins (targets), ligands (e.g. drugs) and their interactions for nonexperts. Next, we review databases that are commonly used in the domain of protein-ligand interactions. Finally, we survey and analyze the machine learning (ML) approaches implemented to predict protein-ligand binding sites, ligand-binding affinity and binding pose (conformation) including both classical ML algorithms and recent deep learning methods. After exploring the correlation between these three aspects of protein-ligand interaction, it has been proposed that they should be studied in unison. We anticipate that our review will aid exploration and development of more accurate ML-based prediction strategies for studying protein-ligand interactions.
Collapse
Affiliation(s)
- Ashwin Dhakal
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, 65211, USA
| | - Cole McKay
- Department of Biochemistry, University of Missouri, Columbia, MO, 65211, USA
| | - John J Tanner
- Department of Biochemistry, University of Missouri, Columbia, MO, 65211, USA
- Department of Chemistry, University of Missouri, Columbia, MO, 65211, USA
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, 65211, USA
| |
Collapse
|
3
|
Hu X, Feng Z, Zhang X, Liu L, Wang S. The Identification of Metal Ion Ligand-Binding Residues by Adding the Reclassified Relative Solvent Accessibility. Front Genet 2020; 11:214. [PMID: 32265982 PMCID: PMC7096583 DOI: 10.3389/fgene.2020.00214] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2019] [Accepted: 02/24/2020] [Indexed: 11/13/2022] Open
Abstract
Many proteins realize their special functions by binding with specific metal ion ligands during a cell's life cycle. The ability to correctly identify metal ion ligand-binding residues is valuable for the human health and the design of molecular drug. Precisely identifying these residues, however, remains challenging work. We have presented an improved computational approach for predicting the binding residues of 10 metal ion ligands (Zn2+, Cu2+, Fe2+, Fe3+, Co2+, Ca2+, Mg2+, Mn2+, Na+, and K+) by adding reclassified relative solvent accessibility (RSA). The best accuracy of fivefold cross-validation was higher than 77.9%, which was about 16% higher than the previous result on the same dataset. It was found that different reclassification of the RSA information can make different contributions to the identification of specific ligand binding residues. Our study has provided an additional understanding of the effect of the RSA on the identification of metal ion ligand binding residues.
Collapse
Affiliation(s)
| | - Zhenxing Feng
- College of Sciences, Inner Mongolla University of Technology, Hohhot, China
| | - Xiaojin Zhang
- College of Sciences, Inner Mongolla University of Technology, Hohhot, China
| | | | | |
Collapse
|
4
|
Pirzada RH, Javaid N, Choi S. The Roles of the NLRP3 Inflammasome in Neurodegenerative and Metabolic Diseases and in Relevant Advanced Therapeutic Interventions. Genes (Basel) 2020; 11:E131. [PMID: 32012695 PMCID: PMC7074480 DOI: 10.3390/genes11020131] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/25/2019] [Revised: 01/22/2020] [Accepted: 01/22/2020] [Indexed: 02/07/2023] Open
Abstract
Inflammasomes are intracellular multiprotein complexes in the cytoplasm that regulate inflammation activation in the innate immune system in response to pathogens and to host self-derived molecules. Recent advances greatly improved our understanding of the activation of nucleotide-binding oligomerization domain-like receptor (NLR) family pyrin domain containing 3 (NLRP3) inflammasomes at the molecular level. The NLRP3 belongs to the subfamily of NLRP which activates caspase 1, thus causing the production of proinflammatory cytokines (interleukin 1β and interleukin 18) and pyroptosis. This inflammasome is involved in multiple neurodegenerative and metabolic disorders including Alzheimer's disease, multiple sclerosis, type 2 diabetes mellitus, and gout. Therefore, therapeutic targeting to the NLRP3 inflammasome complex is a promising way to treat these diseases. Recent research advances paved the way toward drug research and development using a variety of machine learning-based and artificial intelligence-based approaches. These state-of-the-art approaches will lead to the discovery of better drugs after the training of such a system.
Collapse
Affiliation(s)
| | | | - Sangdun Choi
- Department of Molecular Science and Technology, Ajou University, Suwon 16499, Korea; (R.H.P.); (N.J.)
| |
Collapse
|
5
|
Yan R, Wang X, Tian Y, Xu J, Xu X, Lin J. Prediction of zinc-binding sites using multiple sequence profiles and machine learning methods. Mol Omics 2019; 15:205-215. [PMID: 31046040 DOI: 10.1039/c9mo00043g] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/11/2023]
Abstract
The zinc (Zn2+) cofactor has been proven to be involved in numerous biological mechanisms and the zinc-binding site is recognized as one of the most important post-translation modifications in proteins. Therefore, accurate knowledge of zinc ions in protein structures can provide potential clues for elucidation of protein folding and functions. However, determining zinc-binding residues by experimental means is usually lab-intensive and associated with high cost in most cases. In this context, the development of computational tools for identifying zinc-binding sites is highly desired, especially in the current post-genomic era. In this work, we developed a novel zinc-binding site prediction method by combining several intensively-trained machine learning models. To establish an accurate and generative method, we downloaded all zinc-binding proteins from the Protein Data Bank and prepared a non-redundant dataset. Meanwhile, a well-prepared dataset by other groups was also used. Then, effective and complementary features were extracted from sequences and three-dimensional structures of these proteins. Moreover, several well-designed machine learning models were intensively trained to construct accurate models. To assess the performance, the obtained predictors were stringently benchmarked using the diverse zinc-binding sites. Furthermore, several state-of-the-art in silico methods developed specifically for zinc-binding sites were also evaluated and compared. The results confirmed that our method is very competitive in real world applications and could become a complementary tool to wet lab experiments. To facilitate research in the community, a web server and stand-alone program implementing our method were constructed and are publicly available at . The downloadable program of our method can be easily used for the high-throughput screening of potential zinc-binding sites across proteomes.
Collapse
Affiliation(s)
- Renxiang Yan
- School of Biological Sciences and Engineering, Fuzhou University, Fuzhou 350002, China. and Fujian Key Laboratory of Marine Enzyme Engineering, Fuzhou 350002, China
| | - Xiaofeng Wang
- College of Mathematics and Computer Science, Shanxi Normal University, Linfen 041004, China
| | - Yarong Tian
- Institute of Biomedicine, Sahlgrenska Academy, University of Gothenburg, 40530, Sweden
| | - Jing Xu
- School of Biological Sciences and Engineering, Fuzhou University, Fuzhou 350002, China. and Fujian Key Laboratory of Marine Enzyme Engineering, Fuzhou 350002, China
| | - Xiaoli Xu
- School of Biological Sciences and Engineering, Fuzhou University, Fuzhou 350002, China.
| | - Juan Lin
- School of Biological Sciences and Engineering, Fuzhou University, Fuzhou 350002, China. and Fujian Key Laboratory of Marine Enzyme Engineering, Fuzhou 350002, China
| |
Collapse
|
6
|
Srivastava A, Kumar M. Prediction of zinc binding sites in proteins using sequence derived information. J Biomol Struct Dyn 2018; 36:4413-4423. [PMID: 29241411 DOI: 10.1080/07391102.2017.1417910] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/30/2023]
Abstract
Zinc is one the most abundant catalytic cofactor and also an important structural component of a large number of metallo-proteins. Hence prediction of zinc metal binding sites in proteins can be a significant step in annotation of molecular function of a large number of proteins. Majority of existing methods for zinc-binding site predictions are based on a data-set of proteins, which has been compiled nearly a decade ago. Hence there is a need to develop zinc-binding site prediction system using the current updated data to include recently added proteins. Herein, we propose a support vector machine-based method, named as ZincBinder, for prediction of zinc metal-binding site in a protein using sequence profile information. The predictor was trained using fivefold cross validation approach and achieved 85.37% sensitivity with 86.20% specificity during training. Benchmarking on an independent non-redundant data-set, which was not used during training, showed better performance of ZincBinder vis-à-vis existing methods. Executable versions, source code, sample datasets, and usage instructions are available at http://proteininformatics.org/mkumar/znbinder/.
Collapse
Affiliation(s)
- Abhishikha Srivastava
- a Department of Biophysics , University of Delhi South Campus , Benito Juarez Road, New Delhi 110021 , India
| | - Manish Kumar
- a Department of Biophysics , University of Delhi South Campus , Benito Juarez Road, New Delhi 110021 , India
| |
Collapse
|
7
|
Zhang J, Chai H, Gao B, Yang G, Ma Z. HEMEsPred: Structure-Based Ligand-Specific Heme Binding Residues Prediction by Using Fast-Adaptive Ensemble Learning Scheme. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:147-156. [PMID: 28029626 DOI: 10.1109/tcbb.2016.2615010] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Heme is an essential biomolecule that widely exists in numerous extant organisms. Accurately identifying heme binding residues (HEMEs) is of great importance in disease progression and drug development. In this study, a novel predictor named HEMEsPred was proposed for predicting HEMEs. First, several sequence- and structure-based features, including amino acid composition, motifs, surface preferences, and secondary structure, were collected to construct feature matrices. Second, a novel fast-adaptive ensemble learning scheme was designed to overcome the serious class-imbalance problem as well as to enhance the prediction performance. Third, we further developed ligand-specific models considering that different heme ligands varied significantly in their roles, sizes, and distributions. Statistical test proved the effectiveness of ligand-specific models. Experimental results on benchmark datasets demonstrated good robustness of our proposed method. Furthermore, our method also showed good generalization capability and outperformed many state-of-art predictors on two independent testing datasets. HEMEsPred web server was available at http://www.inforstation.com/HEMEsPred/ for free academic use.
Collapse
|
8
|
Lima AN, Philot EA, Trossini GHG, Scott LPB, Maltarollo VG, Honorio KM. Use of machine learning approaches for novel drug discovery. Expert Opin Drug Discov 2016; 11:225-39. [PMID: 26814169 DOI: 10.1517/17460441.2016.1146250] [Citation(s) in RCA: 138] [Impact Index Per Article: 17.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022]
Abstract
INTRODUCTION The use of computational tools in the early stages of drug development has increased in recent decades. Machine learning (ML) approaches have been of special interest, since they can be applied in several steps of the drug discovery methodology, such as prediction of target structure, prediction of biological activity of new ligands through model construction, discovery or optimization of hits, and construction of models that predict the pharmacokinetic and toxicological (ADMET) profile of compounds. AREAS COVERED This article presents an overview on some applications of ML techniques in drug design. These techniques can be employed in ligand-based drug design (LBDD) and structure-based drug design (SBDD) studies, such as similarity searches, construction of classification and/or prediction models of biological activity, prediction of secondary structures and binding sites docking and virtual screening. EXPERT OPINION Successful cases have been reported in the literature, demonstrating the efficiency of ML techniques combined with traditional approaches to study medicinal chemistry problems. Some ML techniques used in drug design are: support vector machine, random forest, decision trees and artificial neural networks. Currently, an important application of ML techniques is related to the calculation of scoring functions used in docking and virtual screening assays from a consensus, combining traditional and ML techniques in order to improve the prediction of binding sites and docking solutions.
Collapse
Affiliation(s)
- Angélica Nakagawa Lima
- a Centro de Ciências Naturais e Humanas , Universidade Federal do ABC , São Paulo , Brazil
| | - Eric Allison Philot
- a Centro de Ciências Naturais e Humanas , Universidade Federal do ABC , São Paulo , Brazil
| | | | - Luis Paulo Barbour Scott
- c Centro de Matemática, Computação e Cognição , Universidade Federal do ABC , São Paulo , Brazil
| | | | - Kathia Maria Honorio
- a Centro de Ciências Naturais e Humanas , Universidade Federal do ABC , São Paulo , Brazil.,d Escola de Artes, Ciências e Humanidades , Universidade de São Paulo , São Paulo , Brazil
| |
Collapse
|
9
|
Zhou W, Tang GW, Altman RB. High Resolution Prediction of Calcium-Binding Sites in 3D Protein Structures Using FEATURE. J Chem Inf Model 2015; 55:1663-72. [PMID: 26226489 PMCID: PMC4731830 DOI: 10.1021/acs.jcim.5b00367] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
Metal-binding proteins are ubiquitous in biological systems ranging from enzymes to cell surface receptors. Among the various biologically active metal ions, calcium plays a large role in regulating cellular and physiological changes. With the increasing number of high-quality crystal structures of proteins associated with their metal ion ligands, many groups have built models to identify Ca(2+) sites in proteins, utilizing information such as structure, geometry, or homology to do the inference. We present a FEATURE-based approach in building such a model and show that our model is able to discriminate between nonsites and calcium-binding sites with a very high precision of more than 98%. We demonstrate the high specificity of our model by applying it to test sets constructed from other ions. We also introduce an algorithm to convert high scoring regions into specific site predictions and demonstrate the usage by scanning a test set of 91 calcium-binding protein structures (190 calcium sites). The algorithm has a recall of more than 93% on the test set with predictions found within 3 Å of the actual sites.
Collapse
Affiliation(s)
- Weizhuang Zhou
- Department of Bioengineering, Stanford University , 443 Via Ortega, Stanford, California 94305-4145, United States
| | - Grace W Tang
- Department of Bioengineering, Stanford University , 443 Via Ortega, Stanford, California 94305-4145, United States
| | - Russ B Altman
- Department of Bioengineering, Stanford University , 443 Via Ortega, Stanford, California 94305-4145, United States
| |
Collapse
|
10
|
Morshed N, Echols N, Adams PD. Using support vector machines to improve elemental ion identification in macromolecular crystal structures. ACTA CRYSTALLOGRAPHICA. SECTION D, BIOLOGICAL CRYSTALLOGRAPHY 2015; 71:1147-58. [PMID: 25945580 PMCID: PMC4427199 DOI: 10.1107/s1399004715004241] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/29/2014] [Accepted: 03/01/2015] [Indexed: 11/11/2022]
Abstract
In the process of macromolecular model building, crystallographers must examine electron density for isolated atoms and differentiate sites containing structured solvent molecules from those containing elemental ions. This task requires specific knowledge of metal-binding chemistry and scattering properties and is prone to error. A method has previously been described to identify ions based on manually chosen criteria for a number of elements. Here, the use of support vector machines (SVMs) to automatically classify isolated atoms as either solvent or one of various ions is described. Two data sets of protein crystal structures, one containing manually curated structures deposited with anomalous diffraction data and another with automatically filtered, high-resolution structures, were constructed. On the manually curated data set, an SVM classifier was able to distinguish calcium from manganese, zinc, iron and nickel, as well as all five of these ions from water molecules, with a high degree of accuracy. Additionally, SVMs trained on the automatically curated set of high-resolution structures were able to successfully classify most common elemental ions in an independent validation test set. This method is readily extensible to other elemental ions and can also be used in conjunction with previous methods based on a priori expectations of the chemical environment and X-ray scattering.
Collapse
Affiliation(s)
- Nader Morshed
- College of Letters and Science, University of California, Berkeley, CA 94720, USA
- Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Nathaniel Echols
- Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Paul D. Adams
- Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Department of Bioengineering, University of California, Berkeley, CA 94720, USA
| |
Collapse
|
11
|
Krivák R, Hoksza D. Improving protein-ligand binding site prediction accuracy by classification of inner pocket points using local features. J Cheminform 2015; 7:12. [PMID: 25932051 PMCID: PMC4414931 DOI: 10.1186/s13321-015-0059-5] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2014] [Accepted: 02/24/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Protein-ligand binding site prediction from a 3D protein structure plays a pivotal role in rational drug design and can be helpful in drug side-effects prediction or elucidation of protein function. Embedded within the binding site detection problem is the problem of pocket ranking - how to score and sort candidate pockets so that the best scored predictions correspond to true ligand binding sites. Although there exist multiple pocket detection algorithms, they mostly employ a fairly simple ranking function leading to sub-optimal prediction results. RESULTS We have developed a new pocket scoring approach (named PRANK) that prioritizes putative pockets according to their probability to bind a ligand. The method first carefully selects pocket points and labels them by physico-chemical characteristics of their local neighborhood. Random Forests classifier is subsequently applied to assign a ligandability score to each of the selected pocket point. The ligandability scores are finally merged into the resulting pocket score to be used for prioritization of the putative pockets. With the used of multiple datasets the experimental results demonstrate that the application of our method as a post-processing step greatly increases the quality of the prediction of Fpocket and ConCavity, two state of the art protein-ligand binding site prediction algorithms. CONCLUSIONS The positive experimental results show that our method can be used to improve the success rate, validity and applicability of existing protein-ligand binding site prediction tools. The method was implemented as a stand-alone program that currently contains support for Fpocket and Concavity out of the box, but is easily extendible to support other tools. PRANK is made freely available at http://siret.ms.mff.cuni.cz/prank.
Collapse
Affiliation(s)
- Radoslav Krivák
- Department of Software Engineering, Charles University in Prague, Prague, Czech Republic
| | - David Hoksza
- Department of Software Engineering, Charles University in Prague, Prague, Czech Republic
| |
Collapse
|
12
|
He W, Liang Z, Teng M, Niu L. mFASD: a structure-based algorithm for discriminating different types of metal-binding sites. Bioinformatics 2015; 31:1938-44. [DOI: 10.1093/bioinformatics/btv044] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2014] [Accepted: 01/17/2015] [Indexed: 11/13/2022] Open
|
13
|
Zhou Y, Xue S, Yang JJ. Calciomics: integrative studies of Ca2+-binding proteins and their interactomes in biological systems. Metallomics 2013; 5:29-42. [PMID: 23235533 DOI: 10.1039/c2mt20009k] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022]
Abstract
Calcium ion (Ca(2+)), the fifth most common chemical element in the earth's crust, represents the most abundant mineral in the human body. By binding to a myriad of proteins distributed in different cellular organelles, Ca(2+) impacts nearly every aspect of cellular life. In prokaryotes, Ca(2+) plays an important role in bacterial movement, chemotaxis, survival reactions and sporulation. In eukaryotes, Ca(2+) has been chosen through evolution to function as a universal and versatile intracellular signal. Viruses, as obligate intracellular parasites, also develop smart strategies to manipulate the host Ca(2+) signaling machinery to benefit their own life cycles. This review focuses on recent advances in applying both bioinformatic and experimental approaches to predict and validate Ca(2+)-binding proteins and their interactomes in biological systems on a genome-wide scale (termed "calciomics"). Calmodulin is used as an example of Ca(2+)-binding protein (CaBP) to demonstrate the role of CaBPs on the regulation of biological functions. This review is anticipated to rekindle interest in investigating Ca(2+)-binding proteins and Ca(2+)-modulated functions at the systems level in the post-genomic era.
Collapse
Affiliation(s)
- Yubin Zhou
- Center for Translational Cancer Research, Institute of Biosciences and Technology, Texas A&M University System Health Science Center, Houston, TX 77030, USA
| | | | | |
Collapse
|
14
|
Liu Z, Wang Y, Zhou C, Xue Y, Zhao W, Liu H. Computationally characterizing and comprehensive analysis of zinc-binding sites in proteins. BIOCHIMICA ET BIOPHYSICA ACTA-PROTEINS AND PROTEOMICS 2013; 1844:171-80. [PMID: 23499845 DOI: 10.1016/j.bbapap.2013.03.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/03/2012] [Revised: 03/02/2013] [Accepted: 03/04/2013] [Indexed: 10/27/2022]
Abstract
Zinc is one of the most essential metals utilized by organisms, and zinc-binding proteins play an important role in a variety of biological processes such as transcription regulation, cell metabolism and apoptosis. Thus, characterizing the precise zinc-binding sites is fundamental to an elucidation of the biological functions and molecular mechanisms of zinc-binding proteins. Using systematic analyses of structural characteristics, we observed that 4-residue and 3-residue zinc-binding sites have distinctly specific geometric features. Based on the results, we developed the novel computational program Geometric REstriction for Zinc-binding (GRE4Zn) to characterize the zinc-binding sites in protein structures, by restricting the distances between zinc and its coordinating atoms. The comparison between GRE4Zn and analogous tools revealed that it achieved a superior performance. A large-scale prediction for structurally characterized proteins was performed with this powerful predictor, and statistical analyses for the results indicated zinc-binding proteins have come to be significantly involved in more complicated biological processes in higher species than simpler species during the course of evolution. Further analyses suggested that zinc-binding proteins are preferentially implicated in a variety of diseases and highly enriched in known drug targets, and the prediction of zinc-binding sites can be helpful for the investigation of molecular mechanisms. In this regard, these prediction and analysis results should prove to be highly useful be helpful for further biomedical study and drug design. The online service of GRE4Zn is freely available at: http://biocomp.ustc.edu.cn/gre4zn/. This article is part of a Special Issue entitled: Computational Proteomics, Systems Biology & Clinical Implications. Guest Editor: Yudong Cai.
Collapse
Affiliation(s)
- Zexian Liu
- Hefei National Laboratory for Physical Sciences at Microscale and School of Life Sciences, University of Science & Technology of China, Hefei, Anhui 230027, China
| | | | | | | | | | | |
Collapse
|
15
|
Chen Z, Wang Y, Zhai YF, Song J, Zhang Z. ZincExplorer: an accurate hybrid method to improve the prediction of zinc-binding sites from protein sequences. MOLECULAR BIOSYSTEMS 2013; 9:2213-22. [DOI: 10.1039/c3mb70100j] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
|
16
|
An integrative computational framework based on a two-step random forest algorithm improves prediction of zinc-binding sites in proteins. PLoS One 2012; 7:e49716. [PMID: 23166753 PMCID: PMC3499040 DOI: 10.1371/journal.pone.0049716] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2012] [Accepted: 10/12/2012] [Indexed: 11/30/2022] Open
Abstract
Zinc-binding proteins are the most abundant metalloproteins in the Protein Data Bank where the zinc ions usually have catalytic, regulatory or structural roles critical for the function of the protein. Accurate prediction of zinc-binding sites is not only useful for the inference of protein function but also important for the prediction of 3D structure. Here, we present a new integrative framework that combines multiple sequence and structural properties and graph-theoretic network features, followed by an efficient feature selection to improve prediction of zinc-binding sites. We investigate what information can be retrieved from the sequence, structure and network levels that is relevant to zinc-binding site prediction. We perform a two-step feature selection using random forest to remove redundant features and quantify the relative importance of the retrieved features. Benchmarking on a high-quality structural dataset containing 1,103 protein chains and 484 zinc-binding residues, our method achieved >80% recall at a precision of 75% for the zinc-binding residues Cys, His, Glu and Asp on 5-fold cross-validation tests, which is a 10%-28% higher recall at the 75% equal precision compared to SitePredict and zincfinder at residue level using the same dataset. The independent test also indicates that our method has achieved recall of 0.790 and 0.759 at residue and protein levels, respectively, which is a performance better than the other two methods. Moreover, AUC (the Area Under the Curve) and AURPC (the Area Under the Recall-Precision Curve) by our method are also respectively better than those of the other two methods. Our method can not only be applied to large-scale identification of zinc-binding sites when structural information of the target is available, but also give valuable insights into important features arising from different levels that collectively characterize the zinc-binding sites. The scripts and datasets are available at http://protein.cau.edu.cn/zincidentifier/.
Collapse
|
17
|
Zhao K, Wang X, Wong HC, Wohlhueter R, Kirberger MP, Chen G, Yang JJ. Predicting Ca2+ -binding sites using refined carbon clusters. Proteins 2012; 80:2666-79. [PMID: 22821762 DOI: 10.1002/prot.24149] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2012] [Revised: 06/14/2012] [Accepted: 07/11/2012] [Indexed: 12/13/2022]
Abstract
Identifying Ca(2+) -binding sites in proteins is the first step toward understanding the molecular basis of diseases related to Ca(2+) -binding proteins. Currently, these sites are identified in structures either through X-ray crystallography or NMR analysis. However, Ca(2+) -binding sites are not always visible in X-ray structures due to flexibility in the binding region or low occupancy in a Ca(2+) -binding site. Similarly, both Ca(2+) and its ligand oxygens are not directly observed in NMR structures. To improve our ability to predict Ca(2+) -binding sites in both X-ray and NMR structures, we report a new graph theory algorithm (MUG(C) ) to predict Ca(2+) -binding sites. Using carbon atoms covalently bonded to the chelating oxygen atoms, and without explicit reference to side-chain oxygen ligand co-ordinates, MUG(C) is able to achieve 94% sensitivity with 76% selectivity on a dataset of X-ray structures composed of 43 Ca(2+) -binding proteins. Additionally, prediction of Ca(2+) -binding sites in NMR structures was obtained by MUG(C) using a different set of parameters, which were determined by the analysis of both Ca(2+) -constrained and unconstrained Ca(2+) -loaded structures derived from NMR data. MUG(C) identified 20 of 21 Ca(2+) -binding sites in NMR structures inferred without the use of Ca(2+) constraints. MUG(C) predictions are also highly selective for Ca(2+) -binding sites as analyses of binding sites for Mg(2+) , Zn(2+) , and Pb(2+) were not identified as Ca(2+) -binding sites. These results indicate that the geometric arrangement of the second-shell carbon cluster is sufficient not only for accurate identification of Ca(2+) -binding sites in NMR and X-ray structures but also for selective differentiation between Ca(2+) and other relevant divalent cations.
Collapse
Affiliation(s)
- Kun Zhao
- Department of Mathematics and Statistics, Georgia State University, Atlanta, GA 30303, USA
| | | | | | | | | | | | | |
Collapse
|
18
|
Passerini A, Lippi M, Frasconi P. Predicting metal-binding sites from protein sequence. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2012; 9:203-213. [PMID: 21606549 DOI: 10.1109/tcbb.2011.94] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/30/2023]
Abstract
Prediction of binding sites from sequence can significantly help toward determining the function of uncharacterized proteins on a genomic scale. The task is highly challenging due to the enormous amount of alternative candidate configurations. Previous research has only considered this prediction problem starting from 3D information. When starting from sequence alone, only methods that predict the bonding state of selected residues are available. The sole exception consists of pattern-based approaches, which rely on very specific motifs and cannot be applied to discover truly novel sites. We develop new algorithmic ideas based on structured-output learning for determining transition-metal-binding sites coordinated by cysteines and histidines. The inference step (retrieving the best scoring output) is intractable for general output types (i.e., general graphs). However, under the assumption that no residue can coordinate more than one metal ion, we prove that metal binding has the algebraic structure of a matroid, allowing us to employ a very efficient greedy algorithm. We test our predictor in a highly stringent setting where the training set consists of protein chains belonging to SCOP folds different from the ones used for accuracy estimation. In this setting, our predictor achieves 56 percent precision and 60 percent recall in the identification of ligand-ion bonds.
Collapse
|
19
|
Tang Y, Sheng X, Stumpf MPH. The roles of contact residue disorder and domain composition in characterizing protein-ligand binding specificity and promiscuity. MOLECULAR BIOSYSTEMS 2011; 7:3280-6. [PMID: 22002096 DOI: 10.1039/c1mb05325f] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
Abstract
Most protein chains interact with only one ligand but a small number of protein chains can bind several ligands, and many examples are available in the protein-ligand complex database of PDB. Among these proteins, some show preferences for the ligands or types of ligands they bind; however, so far we have only poor understanding of what determines protein-ligand binding and its specificity. Here we investigate the structural and functional properties of proteins in protein-ligand complexes. Analysis of the protein-ligand complex dataset from the PDB structure database reveals that proteins with more interactions have more disordered contact residues. Those proteins containing few disordered contact residues that bind multiple ligands have a tendency to consist of several domains. Analysis of physicochemical properties of hub contact residues binding multiple ligands indicates that they are enriched for hydrophilic, charged, polar and His-Asp catalytic triad residues. Finally, in order to differentiate proteins binding different classes of ligands, we mapped the three most prominent classes of ligands onto different superfamily domains. Our results demonstrate that contact residue disorder and ordered multiple domains are complementary factors that play a crucial role in determining ligand binding specificity and promiscuity.
Collapse
Affiliation(s)
- Yurong Tang
- State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing, China
| | | | | |
Collapse
|
20
|
Kochańczyk M. Prediction of functionally important residues in globular proteins from unusual central distances of amino acids. BMC STRUCTURAL BIOLOGY 2011; 11:34. [PMID: 21923943 PMCID: PMC3188475 DOI: 10.1186/1472-6807-11-34] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/22/2011] [Accepted: 09/18/2011] [Indexed: 12/12/2022]
Abstract
BACKGROUND Well-performing automated protein function recognition approaches usually comprise several complementary techniques. Beside constructing better consensus, their predictive power can be improved by either adding or refining independent modules that explore orthogonal features of proteins. In this work, we demonstrated how the exploration of global atomic distributions can be used to indicate functionally important residues. RESULTS Using a set of carefully selected globular proteins, we parametrized continuous probability density functions describing preferred central distances of individual protein atoms. Relative preferred burials were estimated using mixture models of radial density functions dependent on the amino acid composition of a protein under consideration. The unexpectedness of extraordinary locations of atoms was evaluated in the information-theoretic manner and used directly for the identification of key amino acids. In the validation study, we tested capabilities of a tool built upon our approach, called SurpResi, by searching for binding sites interacting with ligands. The tool indicated multiple candidate sites achieving success rates comparable to several geometric methods. We also showed that the unexpectedness is a property of regions involved in protein-protein interactions, and thus can be used for the ranking of protein docking predictions. The computational approach implemented in this work is freely available via a Web interface at http://www.bioinformatics.org/surpresi. CONCLUSIONS Probabilistic analysis of atomic central distances in globular proteins is capable of capturing distinct orientational preferences of amino acids as resulting from different sizes, charges and hydrophobic characters of their side chains. When idealized spatial preferences can be inferred from the sole amino acid composition of a protein, residues located in hydrophobically unfavorable environments can be easily detected. Such residues turn out to be often directly involved in binding ligands or interfacing with other proteins.
Collapse
Affiliation(s)
- Marek Kochańczyk
- Faculty of Physics, Jagiellonian University, ul, Reymonta 4, 30-059 Krakow, Poland.
| |
Collapse
|
21
|
Regad L, Martin J, Camproux AC. Dissecting protein loops with a statistical scalpel suggests a functional implication of some structural motifs. BMC Bioinformatics 2011; 12:247. [PMID: 21689388 PMCID: PMC3158783 DOI: 10.1186/1471-2105-12-247] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2010] [Accepted: 06/20/2011] [Indexed: 12/24/2022] Open
Abstract
Background One of the strategies for protein function annotation is to search particular structural motifs that are known to be shared by proteins with a given function. Results Here, we present a systematic extraction of structural motifs of seven residues from protein loops and we explore their correspondence with functional sites. Our approach is based on the structural alphabet HMM-SA (Hidden Markov Model - Structural Alphabet), which allows simplification of protein structures into uni-dimensional sequences, and advanced pattern statistics adapted to short sequences. Structural motifs of interest are selected by looking for structural motifs significantly over-represented in SCOP superfamilies in protein loops. We discovered two types of structural motifs significantly over-represented in SCOP superfamilies: (i) ubiquitous motifs, shared by several superfamilies and (ii) superfamily-specific motifs, over-represented in few superfamilies. A comparison of ubiquitous words with known small structural motifs shows that they contain well-described motifs as turn, niche or nest motifs. A comparison between superfamily-specific motifs and biological annotations of Swiss-Prot reveals that some of them actually correspond to functional sites involved in the binding sites of small ligands, such as ATP/GTP, NAD(P) and SAH/SAM. Conclusions Our findings show that statistical over-representation in SCOP superfamilies is linked to functional features. The detection of over-represented motifs within structures simplified by HMM-SA is therefore a promising approach for prediction of functional sites and annotation of uncharacterized proteins.
Collapse
|
22
|
Regad L, Saladin A, Maupetit J, Geneix C, Camproux AC. SA-Mot: a web server for the identification of motifs of interest extracted from protein loops. Nucleic Acids Res 2011; 39:W203-9. [PMID: 21665924 PMCID: PMC3125790 DOI: 10.1093/nar/gkr410] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023] Open
Abstract
The detection of functional motifs is an important step for the determination of protein functions. We present here a new web server SA-Mot (Structural Alphabet Motif) for the extraction and location of structural motifs of interest from protein loops. Contrary to other methods, SA-Mot does not focus only on functional motifs, but it extracts recurrent and conserved structural motifs involved in structural redundancy of loops. SA-Mot uses the structural word notion to extract all structural motifs from uni-dimensional sequences corresponding to loop structures. Then, SA-Mot provides a description of these structural motifs using statistics computed in the loop data set and in SCOP superfamily, sequence and structural parameters. SA-Mot results correspond to an interactive table listing all structural motifs extracted from a target structure and their associated descriptors. Using this information, the users can easily locate loop regions that are important for the protein folding and function. The SA-Mot web server is available at http://sa-mot.mti.univ-paris-diderot.fr.
Collapse
Affiliation(s)
- Leslie Regad
- INSERM, U973, Université Paris 7-Paris Diderot, UMR-S973, MTi F-75013 Paris, France.
| | | | | | | | | |
Collapse
|
23
|
Liu R, Hu J. HemeBIND: a novel method for heme binding residue prediction by combining structural and sequence information. BMC Bioinformatics 2011; 12:207. [PMID: 21612668 PMCID: PMC3124436 DOI: 10.1186/1471-2105-12-207] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/06/2010] [Accepted: 05/26/2011] [Indexed: 02/03/2023] Open
Abstract
BACKGROUND Accurate prediction of binding residues involved in the interactions between proteins and small ligands is one of the major challenges in structural bioinformatics. Heme is an essential and commonly used ligand that plays critical roles in electron transfer, catalysis, signal transduction and gene expression. Although much effort has been devoted to the development of various generic algorithms for ligand binding site prediction over the last decade, no algorithm has been specifically designed to complement experimental techniques for identification of heme binding residues. Consequently, an urgent need is to develop a computational method for recognizing these important residues. RESULTS Here we introduced an efficient algorithm HemeBIND for predicting heme binding residues by integrating structural and sequence information. We systematically investigated the characteristics of binding interfaces based on a non-redundant dataset of heme-protein complexes. It was found that several sequence and structural attributes such as evolutionary conservation, solvent accessibility, depth and protrusion clearly illustrate the differences between heme binding and non-binding residues. These features can then be separately used or combined to build the structure-based classifiers using support vector machine (SVM). The results showed that the information contained in these features is largely complementary and their combination achieved the best performance. To further improve the performance, an attempt has been made to develop a post-processing procedure to reduce the number of false positives. In addition, we built a sequence-based classifier based on SVM and sequence profile as an alternative when only sequence information can be used. Finally, we employed a voting method to combine the outputs of structure-based and sequence-based classifiers, which demonstrated remarkably better performance than the individual classifier alone. CONCLUSIONS HemeBIND is the first specialized algorithm used to predict binding residues in protein structures for heme ligands. Extensive experiments indicated that both the structure-based and sequence-based methods have effectively identified heme binding residues while the complementary relationship between them can result in a significant improvement in prediction performance. The value of our method is highlighted through the development of HemeBIND web server that is freely accessible at http://mleg.cse.sc.edu/hemeBIND/.
Collapse
Affiliation(s)
- Rong Liu
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208, USA
| | | |
Collapse
|
24
|
Zhao W, Xu M, Liang Z, Ding B, Niu L, Liu H, Teng M. Structure-based de novo prediction of zinc-binding sites in proteins of unknown function. ACTA ACUST UNITED AC 2011; 27:1262-8. [PMID: 21414989 DOI: 10.1093/bioinformatics/btr133] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
MOTIVATION Zinc-binding proteins are the most abundant metallo-proteins in Protein Data Bank (PDB). Accurate prediction of zinc-binding sites in proteins of unknown function may provide important clues for the inference of protein function. As zinc binding is often associated with characteristic 3D arrangements of zinc ligand residues, its prediction may benefit from using not only the sequence information but also the structure information of proteins. RESULTS In this work, we present a structure-based method, TEMSP (3D TEmplate-based Metal Site Prediction), to predict zinc-binding sites. TEMSP significantly improves over previously reported best methods in predicting as many as possible true ligand residues for zinc with minimum overpredictions: if only those results in which all zinc ligand residues have been correctly predicted are defined as true positives, our method improves sensitivity from less than 30% to above 60%, and selectivity from around 25% to 80%. These results are for predictions based on apo state structures. In addition, the method can predict the zinc-bound local structures reliably, generating predictions useful for function inference. We applied TEMSP to 1888 protein structures of the 'Unknown Function' class in the PDB database. A number of zinc-binding sites have been discovered de novo, i.e. based solely on the protein structures. Using the predicted local structures of these sites, possible functional roles were analyzed. AVAILABILITY TEMSP is freely available from http://netalign.ustc.edu.cn/temsp/.
Collapse
Affiliation(s)
- Wei Zhao
- Hefei National Laboratory for Physical Sciences at Microscale and School of Life Sciences, University of Science and Technology of China and Key Laboratory of Structural Biology, Chinese Academy of Sciences, 96 Jinzhai Road, Hefei, Anhui, China
| | | | | | | | | | | | | |
Collapse
|
25
|
Harding MM, Nowicki MW, Walkinshaw MD. Metals in protein structures: a review of their principal features. CRYSTALLOGR REV 2010. [DOI: 10.1080/0889311x.2010.485616] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
Affiliation(s)
- Marjorie M. Harding
- a Centre for Translational and Chemical Biology, University of Edinburgh, Michael Swann Building , Mayfield Road, Edinburgh , EH9 3JR , UK
| | - Matthew W. Nowicki
- a Centre for Translational and Chemical Biology, University of Edinburgh, Michael Swann Building , Mayfield Road, Edinburgh , EH9 3JR , UK
| | - Malcolm D. Walkinshaw
- a Centre for Translational and Chemical Biology, University of Edinburgh, Michael Swann Building , Mayfield Road, Edinburgh , EH9 3JR , UK
| |
Collapse
|
26
|
Wang X, Zhao K, Kirberger M, Wong H, Chen G, Yang JJ. Analysis and prediction of calcium-binding pockets from apo-protein structures exhibiting calcium-induced localized conformational changes. Protein Sci 2010; 19:1180-90. [PMID: 20512971 DOI: 10.1002/pro.394] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022]
Abstract
Calcium binding in proteins exhibits a wide range of polygonal geometries that relate directly to an equally diverse set of biological functions. The binding process stabilizes protein structures and typically results in local conformational change and/or global restructuring of the backbone. Previously, we established the MUG program, which utilized multiple geometries in the Ca(2+)-binding pockets of holoproteins to identify such pockets, ignoring possible Ca(2+)-induced conformational change. In this article, we first report our progress in the analysis of Ca(2+)-induced conformational changes followed by improved prediction of Ca(2+)-binding sites in the large group of Ca(2+)-binding proteins that exhibit only localized conformational changes. The MUG(SR) algorithm was devised to incorporate side chain torsional rotation as a predictor. The output from MUG(SR) presents groups of residues where each group, typically containing two to five residues, is a potential binding pocket. MUG(SR) was applied to both X-ray apo structures and NMR holo structures, which did not use calcium distance constraints in structure calculations. Predicted pockets were validated by comparison with homologous holo structures. Defining a "correct hit" as a group of residues containing at least two true ligand residues, the sensitivity was at least 90%; whereas for a "correct hit" defined as a group of residues containing at least three true ligand residues, the sensitivity was at least 78%. These data suggest that Ca(2+)-binding pockets are at least partially prepositioned to chelate the ion in the apo form of the protein.
Collapse
Affiliation(s)
- Xue Wang
- Department of Computer Science, Georgia State University, Atlanta, Georgia 30303, USA
| | | | | | | | | | | |
Collapse
|
27
|
López G, Ezkurdia I, Tress ML. Assessment of ligand binding residue predictions in CASP8. Proteins 2010; 77 Suppl 9:138-46. [PMID: 19714771 DOI: 10.1002/prot.22557] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
Here we detail the assessment process for the binding site prediction category of the eighth Critical Assessment of Protein Structure Prediction experiment (CASP8). Predictions were only evaluated for those targets that bound biologically relevant ligands and were assessed using the Matthews Correlation Coefficient. The results of the analysis clearly demonstrate that three predictors from two groups (Lee and Sternberg) stand out from the rest. A further two groups perform well over subsets of metal binding or nonmetal ligand binding targets. The best methods were able to make consistently reliable predictions based on model structures, though it was noticeable that the two targets that were not well predicted were also the hardest targets. The number of predictors that submitted new methods in this category was highly encouraging and suggests that current technology is at the level that experimental biochemists and structural biologists could benefit from what is clearly a growing field.
Collapse
Affiliation(s)
- Gonzalo López
- Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
| | | | | |
Collapse
|
28
|
Capra JA, Laskowski RA, Thornton JM, Singh M, Funkhouser TA. Predicting protein ligand binding sites by combining evolutionary sequence conservation and 3D structure. PLoS Comput Biol 2009; 5:e1000585. [PMID: 19997483 PMCID: PMC2777313 DOI: 10.1371/journal.pcbi.1000585] [Citation(s) in RCA: 285] [Impact Index Per Article: 19.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2009] [Accepted: 10/30/2009] [Indexed: 11/20/2022] Open
Abstract
Identifying a protein's functional sites is an important step towards characterizing its molecular function. Numerous structure- and sequence-based methods have been developed for this problem. Here we introduce ConCavity, a small molecule binding site prediction algorithm that integrates evolutionary sequence conservation estimates with structure-based methods for identifying protein surface cavities. In large-scale testing on a diverse set of single- and multi-chain protein structures, we show that ConCavity substantially outperforms existing methods for identifying both 3D ligand binding pockets and individual ligand binding residues. As part of our testing, we perform one of the first direct comparisons of conservation-based and structure-based methods. We find that the two approaches provide largely complementary information, which can be combined to improve upon either approach alone. We also demonstrate that ConCavity has state-of-the-art performance in predicting catalytic sites and drug binding pockets. Overall, the algorithms and analysis presented here significantly improve our ability to identify ligand binding sites and further advance our understanding of the relationship between evolutionary sequence conservation and structural and functional attributes of proteins. Data, source code, and prediction visualizations are available on the ConCavity web site (http://compbio.cs.princeton.edu/concavity/). Protein molecules are ubiquitous in the cell; they perform thousands of functions crucial for life. Proteins accomplish nearly all of these functions by interacting with other molecules. These interactions are mediated by specific amino acid positions in the proteins. Knowledge of these “functional sites” is crucial for understanding the molecular mechanisms by which proteins carry out their functions; however, functional sites have not been identified in the vast majority of proteins. Here, we present ConCavity, a computational method that predicts small molecule binding sites in proteins by combining analysis of evolutionary sequence conservation and protein 3D structure. ConCavity provides significant improvement over previous approaches, especially on large, multi-chain proteins. In contrast to earlier methods which only predict entire binding sites, ConCavity makes specific predictions of positions in space that are likely to overlap ligand atoms and of residues that are likely to contact bound ligands. These predictions can be used to aid computational function prediction, to guide experimental protein analysis, and to focus computationally intensive techniques used in drug discovery.
Collapse
Affiliation(s)
- John A. Capra
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
| | - Roman A. Laskowski
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
| | - Janet M. Thornton
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
| | - Mona Singh
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- * E-mail: (MS); (TAF)
| | - Thomas A. Funkhouser
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- * E-mail: (MS); (TAF)
| |
Collapse
|
29
|
Geurts P, Irrthum A, Wehenkel L. Supervised learning with decision tree-based methods in computational and systems biology. MOLECULAR BIOSYSTEMS 2009; 5:1593-605. [PMID: 20023720 DOI: 10.1039/b907946g] [Citation(s) in RCA: 124] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/17/2022]
Abstract
At the intersection between artificial intelligence and statistics, supervised learning allows algorithms to automatically build predictive models from just observations of a system. During the last twenty years, supervised learning has been a tool of choice to analyze the always increasing and complexifying data generated in the context of molecular biology, with successful applications in genome annotation, function prediction, or biomarker discovery. Among supervised learning methods, decision tree-based methods stand out as non parametric methods that have the unique feature of combining interpretability, efficiency, and, when used in ensembles of trees, excellent accuracy. The goal of this paper is to provide an accessible and comprehensive introduction to this class of methods. The first part of the review is devoted to an intuitive but complete description of decision tree-based methods and a discussion of their strengths and limitations with respect to other supervised learning methods. The second part of the review provides a survey of their applications in the context of computational and systems biology.
Collapse
Affiliation(s)
- Pierre Geurts
- Department of EE and CS & GIGA-Research, University of Liège, Belgium.
| | | | | |
Collapse
|
30
|
Bordner AJ. Predicting protein-protein binding sites in membrane proteins. BMC Bioinformatics 2009; 10:312. [PMID: 19778442 PMCID: PMC2761413 DOI: 10.1186/1471-2105-10-312] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2009] [Accepted: 09/24/2009] [Indexed: 01/15/2023] Open
Abstract
Background Many integral membrane proteins, like their non-membrane counterparts, form either transient or permanent multi-subunit complexes in order to carry out their biochemical function. Computational methods that provide structural details of these interactions are needed since, despite their importance, relatively few structures of membrane protein complexes are available. Results We present a method for predicting which residues are in protein-protein binding sites within the transmembrane regions of membrane proteins. The method uses a Random Forest classifier trained on residue type distributions and evolutionary conservation for individual surface residues, followed by spatial averaging of the residue scores. The prediction accuracy achieved for membrane proteins is comparable to that for non-membrane proteins. Also, like previous results for non-membrane proteins, the accuracy is significantly higher for residues distant from the binding site boundary. Furthermore, a predictor trained on non-membrane proteins was found to yield poor accuracy on membrane proteins, as expected from the different distribution of surface residue types between the two classes of proteins. Thus, although the same procedure can be used to predict binding sites in membrane and non-membrane proteins, separate predictors trained on each class of proteins are required. Finally, the contribution of each residue property to the overall prediction accuracy is analyzed and prediction examples are discussed. Conclusion Given a membrane protein structure and a multiple alignment of related sequences, the presented method gives a prioritized list of which surface residues participate in intramembrane protein-protein interactions. The method has potential applications in guiding the experimental verification of membrane protein interactions, structure-based drug discovery, and also in constraining the search space for computational methods, such as protein docking or threading, that predict membrane protein complex structures.
Collapse
Affiliation(s)
- Andrew J Bordner
- Mayo Clinic, 13400 East Shea Boulevard, Scottsdale, AZ 85259, USA.
| |
Collapse
|
31
|
Masso M, Mathe E, Parvez N, Hijazi K, Vaisman II. Modeling the functional consequences of single residue replacements in bacteriophage f1 gene V protein. Protein Eng Des Sel 2009; 22:665-71. [PMID: 19690089 DOI: 10.1093/protein/gzp050] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
A computational mutagenesis methodology utilizing a four-body, knowledge-based, statistical contact potential is applied toward globally quantifying relative environmental perturbations (residual scores) in bacteriophage f1 gene V protein (GVP) due to single amino acid substitutions. We show that residual scores correlate well with experimentally measured relative changes in protein function upon mutation. Residual scores also distinguish between GVP amino acid positions grouped according to protein structural or functional roles or based on similarities in physicochemical characteristics. For each mutant, the in silico mutagenesis additionally yields local measures of environmental change (EC scores) occurring at every residue position (residual profile) relative to the native protein. Implementation of the random forest (RF) algorithm, utilizing experimental GVP mutants whose feature vector components include EC scores at the mutated position and at six structurally nearest neighbors, correctly classifies mutants based on function with up to 77% cross-validation accuracy while achieving 0.82 area under the receiver operating characteristic curve. A control experiment highlights the effectiveness of mutant feature vector signals, and a variety of learning curves are generated to analyze the impact of GVP mutant data set size on performance measures. An optimally trained RF model is subsequently used for inferring function for all the remaining unexplored GVP mutants.
Collapse
Affiliation(s)
- Majid Masso
- Laboratory for Structural Bioinformatics, Department of Bioinformatics and Computational Biology, George Mason University, Manassas, VA 20110, USA.
| | | | | | | | | |
Collapse
|