1
Derry A, Altman RB. COLLAPSE: A representation learning framework for identification and characterization of protein structural sites. Protein Sci 2023; 32:e4541. [PMID: 36519247 PMCID: PMC9847082 DOI: 10.1002/pro.4541] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2022] [Revised: 12/02/2022] [Accepted: 12/08/2022] [Indexed: 12/23/2022]
Abstract
The identification and characterization of the structural sites which contribute to protein function are crucial for understanding biological mechanisms, evaluating disease risk, and developing targeted therapies. However, the quantity of known protein structures is rapidly outpacing our ability to functionally annotate them. Existing methods for function prediction either do not operate on local sites, suffer from high false positive or false negative rates, or require large site-specific training datasets, necessitating the development of new computational methods for annotating functional sites at scale. We present COLLAPSE (Compressed Latents Learned from Aligned Protein Structural Environments), a framework for learning deep representations of protein sites. COLLAPSE operates directly on the 3D positions of atoms surrounding a site and uses evolutionary relationships between homologous proteins as a self-supervision signal, enabling learned embeddings to implicitly capture structure-function relationships within each site. Our representations generalize across disparate tasks in a transfer learning context, achieving state-of-the-art performance on standardized benchmarks (protein-protein interactions and mutation stability) and on the prediction of functional sites from the Prosite database. We use COLLAPSE to search for similar sites across large protein datasets and to annotate proteins based on a database of known functional sites. These methods demonstrate that COLLAPSE is computationally efficient, tunable, and interpretable, providing a general-purpose platform for computational protein analysis.
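The site search described in this abstract (finding similar sites across a large dataset by comparing learned embeddings) can be illustrated with a minimal nearest-neighbour sketch. This is not the COLLAPSE implementation: the function names, the use of cosine similarity, and the three-dimensional toy embeddings are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar_sites(query, database, top_k=3):
    """Rank database entries (site_id -> embedding) by cosine similarity to the query."""
    scored = [(site_id, cosine_similarity(query, emb)) for site_id, emb in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; real learned embeddings would have many more dimensions.
toy_database = {
    "site_A": [1.0, 0.0, 0.0],
    "site_B": [0.9, 0.1, 0.0],
    "site_C": [0.0, 1.0, 0.0],
}
hits = most_similar_sites([1.0, 0.05, 0.0], toy_database, top_k=2)
print(hits)
```

At the scale mentioned in the abstract, an exhaustive scan like this would presumably be replaced by an approximate nearest-neighbour index, but the ranking idea is the same.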
Affiliation(s)
- Alexander Derry
  - Department of Biomedical Data Science, Stanford University, Stanford, California, USA
- Russ B. Altman
  - Department of Biomedical Data Science, Stanford University, Stanford, California, USA
  - Departments of Bioengineering, Genetics, and Medicine, Stanford University, Stanford, California, USA
2
Riziotis IG, Thornton JM. Capturing the geometry, function, and evolution of enzymes with 3D templates. Protein Sci 2022; 31:e4363. [PMID: 35762726 PMCID: PMC9207746 DOI: 10.1002/pro.4363] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2022] [Revised: 05/06/2022] [Accepted: 05/14/2022] [Indexed: 11/05/2022]
Abstract
Structural templates are 3D signatures representing protein functional sites, such as ligand binding cavities, metal coordination motifs, or catalytic sites. Here we explore methods to generate template libraries and algorithms to query structures for conserved 3D motifs. Applications of templates are discussed, as well as some exemplar cases for examining evolutionary links in enzymes. We also introduce the concept of using more than one template per structure to represent flexible sites, as an approach to better understand catalysis through snapshots captured in enzyme structures. Functional annotation from structure is an important topic that has recently resurfaced due to the new more accurate methods of protein structure prediction. Therefore, we anticipate that template-based functional site detection will be a powerful tool in the task of characterizing a vast number of new protein models.
3
Torng W, Altman RB. High precision protein functional site detection using 3D convolutional neural networks. Bioinformatics 2020; 35:1503-1512. [PMID: 31051039 PMCID: PMC6499237 DOI: 10.1093/bioinformatics/bty813] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 01/31/2018] [Revised: 08/14/2018] [Accepted: 09/19/2018] [Indexed: 12/02/2022]
Abstract
Motivation: Accurate annotation of protein functions is fundamental for understanding molecular and cellular physiology. Data-driven methods hold promise for systematically deriving rules underlying the relationship between protein structure and function. However, the choice of protein structural representation is critical. Pre-defined biochemical features emphasize certain aspects of protein properties while ignoring others, and therefore may fail to capture critical information in complex protein sites. Results: In this paper, we present a general framework that applies 3D convolutional neural networks (3DCNNs) to structure-based protein functional site detection. The framework can extract task-dependent features automatically from the raw atom distributions. We benchmarked our method against other methods and demonstrate better or comparable performance for site detection. Our deep 3DCNNs achieved an average recall of 0.955 at a precision threshold of 0.99 on PROSITE families, detected 98.89 and 92.88% of nitric oxide synthase and TRYPSIN-like enzyme sites in Catalytic Site Atlas, and showed good performance on challenging cases where sequence motifs are absent but a function is known to exist. Finally, we inspected the individual contributions of each atom to the classification decisions and show that our models successfully recapitulate known 3D features within protein functional sites. Availability and implementation: The 3DCNN models described in this paper are available at https://simtk.org/projects/fscnn. Supplementary information: Supplementary data are available at Bioinformatics online.
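The abstract reports "recall at a precision threshold". One common way to compute such a figure from scored predictions is sketched below. It is a generic illustration rather than the authors' evaluation code; the function name, the rank-by-rank sweep over cutoffs, and the toy scores are assumptions.

```python
def recall_at_precision(scores, labels, min_precision):
    """Highest recall over all score cutoffs whose precision is at least min_precision.

    scores: classifier scores (higher means more likely positive)
    labels: 1 for a true functional site, 0 otherwise
    Tied scores are handled in sorted order, which is a simplification.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        return 0.0
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best_recall = 0.0
    true_pos = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        precision = true_pos / rank  # fraction of accepted predictions that are correct
        if precision >= min_precision:
            best_recall = max(best_recall, true_pos / total_positives)
    return best_recall


# Toy data: three true sites among five scored candidates.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 1, 0, 1, 0]
print(recall_at_precision(toy_scores, toy_labels, 0.99))  # two of three positives recovered
```

In this toy case only the two top-ranked candidates can be accepted without dropping below 0.99 precision, so the recall at that threshold is 2/3.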
Affiliation(s)
- Wen Torng
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
- Russ B Altman
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
  - Department of Genetics, Stanford University, Stanford, CA, USA
4
Sagar A, Xue B. Recent Advances in Machine Learning Based Prediction of RNA-protein Interactions. Protein Pept Lett 2019; 26:601-619. [PMID: 31215361 DOI: 10.2174/0929866526666190619103853] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 10/16/2018] [Revised: 04/04/2019] [Accepted: 06/01/2019] [Indexed: 12/18/2022]
Abstract
The interactions between RNAs and proteins play critical roles in many biological processes. Therefore, characterizing these interactions becomes critical for mechanistic, biomedical, and clinical studies. Many experimental methods can be used to determine RNA-protein interactions in multiple aspects. However, due to the facts that RNA-protein interactions are tissue-specific and condition-specific, as well as these interactions are weak and frequently compete with each other, those experimental techniques cannot be made full use of to discover the complete spectrum of RNA-protein interactions. To moderate these issues, continuous efforts have been devoted to developing high quality computational techniques to study the interactions between RNAs and proteins. Many important progresses have been achieved with the application of novel techniques and strategies, such as machine learning techniques. Especially, with the development and application of CLIP techniques, more and more experimental data on RNA-protein interaction under specific biological conditions are available. These CLIP data altogether provide a rich source for developing advanced machine learning predictors. In this review, recent progresses on computational predictors for RNA-protein interaction were summarized in the following aspects: dataset, prediction strategies, and input features. Possible future developments were also discussed at the end of the review.
Affiliation(s)
- Amit Sagar
  - Department of Cell Biology, Microbiology and Molecular Biology, School of Natural Sciences and Mathematics, College of Arts and Sciences, University of South Florida, Tampa, Florida 33620, United States
- Bin Xue
  - Department of Cell Biology, Microbiology and Molecular Biology, School of Natural Sciences and Mathematics, College of Arts and Sciences, University of South Florida, Tampa, Florida 33620, United States
5
Han M, Song Y, Qian J, Ming D. Sequence-based prediction of physicochemical interactions at protein functional sites using a function-and-interaction-annotated domain profile database. BMC Bioinformatics 2018; 19:204. [PMID: 29859055 PMCID: PMC5984826 DOI: 10.1186/s12859-018-2206-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/19/2017] [Accepted: 05/15/2018] [Indexed: 01/16/2023]
Abstract
Background: Identifying protein functional sites (PFSs) and, particularly, the physicochemical interactions at these sites is critical to understanding protein functions and the biochemical reactions involved. Several knowledge-based methods have been developed for the prediction of PFSs; however, accurate methods for predicting the physicochemical interactions associated with PFSs are still lacking. Results: In this paper, we present a sequence-based method for the prediction of physicochemical interactions at PFSs. The method is based on a functional site and physicochemical interaction-annotated domain profile database, called fiDPD, which was built using protein domains found in the Protein Data Bank. This method was applied to 13 target proteins from the very recent Critical Assessment of Structure Prediction (CASP10/11), and our calculations gave a Matthews correlation coefficient (MCC) value of 0.66 for PFS prediction and an 80% recall in the prediction of the associated physicochemical interactions. Conclusions: Our results show that, in addition to the PFSs, the physical interactions at these sites are also conserved in the evolution of proteins. This work provides a valuable sequence-based tool for rational drug design and side-effect assessment. The method is freely available and can be accessed at http://202.119.249.49.
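For readers unfamiliar with the Matthews correlation coefficient quoted in this abstract, the standard definition from confusion-matrix counts can be written as a short sketch. The function name and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math


def mcc_from_counts(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (the usual convention
    for an otherwise undefined value).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Hypothetical residue-level counts: 10 true site residues among 100.
print(mcc_from_counts(tp=8, fp=2, fn=2, tn=88))
```

Unlike plain accuracy, the coefficient stays informative when functional-site residues are a small minority of all residues, which is the typical situation in site prediction.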
Affiliation(s)
- Min Han
  - Department of Physiology and Biophysics, School of Life Science, Fudan University, Shanghai, 200438, People's Republic of China
- Yifan Song
  - Department of Physiology and Biophysics, School of Life Science, Fudan University, Shanghai, 200438, People's Republic of China
- Jiaqiang Qian
  - Department of Physiology and Biophysics, School of Life Science, Fudan University, Shanghai, 200438, People's Republic of China
- Dengming Ming
  - College of Biotechnology and Pharmaceutical Engineering, Nanjing Tech University, Biotech Building Room B1-404, 30 South Puzhu Road, Jiangsu, 211816, Nanjing, People's Republic of China
6
Affiliation(s)
- Jacquelyn S. Fetrow
  - Office of the President, Albright College, Reading, Pennsylvania, United States of America
- Patricia C. Babbitt
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, California, United States of America
7
Pradhan D, Padhy S, Sahoo B. Enzyme classification using multiclass support vector machine and feature subset selection. Comput Biol Chem 2017; 70:211-219. [PMID: 28934693 DOI: 10.1016/j.compbiolchem.2017.08.009] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 05/25/2017] [Revised: 07/15/2017] [Accepted: 08/15/2017] [Indexed: 10/19/2022]
Abstract
Proteins are the macromolecules responsible for almost all biological processes in a cell. With the availability of large number of protein sequences from different sequencing projects, the challenge with the scientist is to characterize their functions. As the wet lab methods are time consuming and expensive, many computational methods such as FASTA, PSI-BLAST, DNA microarray clustering, and Nearest Neighborhood classification on protein-protein interaction network have been proposed. Support vector machine is one such method that has been used successfully for several problems such as protein fold recognition, protein structure prediction etc. Cai et al. in 2003 have used SVM for classifying proteins into different functional classes and to predict their function. They used the physico-chemical properties of proteins to represent the protein sequences. In this paper a model comprising of feature subset selection followed by multiclass Support Vector Machine is proposed to determine the functional class of a newly generated protein sequence. To train and test the model for its performance, 32 physico-chemical properties of enzymes from 6 enzyme classes are considered. To determine the features that contribute significantly for functional classification, Sequential Forward Floating Selection (SFFS), Orthogonal Forward Selection (OFS), and SVM Recursive Feature Elimination (SVM-RFE) algorithms are used and it is observed that out of 32 properties considered initially, only 20 features are sufficient to classify the proteins into its functional classes with an accuracy ranging from 91% to 94%. On comparison it is seen that, OFS followed by SVM performs better than other methods. Our model generalizes the existing model to include multiclass classification and to identify most significant features affecting the protein function.
Affiliation(s)
- Debasmita Pradhan
  - Department of Computer Science and Engineering, Silicon Institute of Technology, Silicon Hills, Patia, Bhubaneswar, 751024, India
- Sudarsan Padhy
  - Department of Computer Science and Engineering, Silicon Institute of Technology, Silicon Hills, Patia, Bhubaneswar, 751024, India
- Biswajit Sahoo
  - School of Computer Engineering, KIIT University, Bhubaneswar, 751024, India
8
Abstract
Motivation: Comparing protein tertiary structures is a fundamental procedure in structural biology and protein bioinformatics. Structure comparison is important particularly for evaluating computational protein structure models. Most of the model structure evaluation methods perform rigid body superimposition of a structure model to its crystal structure and measure the difference of the corresponding residue or atom positions between them. However, these methods neglect intrinsic flexibility of proteins by treating the native structure as a rigid molecule. Because different parts of proteins have different levels of flexibility, for example, exposed loop regions are usually more flexible than the core region of a protein structure, disagreement of a model to the native needs to be evaluated differently depending on the flexibility of residues in a protein. Results: We propose a score named FlexScore for comparing protein structures that consider flexibility of each residue in the native state of proteins. Flexibility information may be extracted from experiments such as NMR or molecular dynamics simulation. FlexScore considers an ensemble of conformations of a protein described as a multivariate Gaussian distribution of atomic displacements and compares a query computational model with the ensemble. We compare FlexScore with other commonly used structure similarity scores over various examples. FlexScore agrees with experts’ intuitive assessment of computational models and provides information of practical usefulness of models. Availability and implementation: https://bitbucket.org/mjamroz/flexscore Contact: dkihara@purdue.edu Supplementary information: Supplementary data are available at Bioinformatics online.
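The idea of judging a model's deviations relative to per-residue flexibility can be sketched as follows. This is explicitly not the published FlexScore formula: it assumes the ensemble and the model are already superimposed, replaces the full multivariate Gaussian with one isotropic variance per atom, and uses hypothetical function names and toy coordinates.

```python
def per_atom_mean_and_variance(ensemble):
    """ensemble: list of conformations, each a list of (x, y, z) tuples, one per atom."""
    n_conf = len(ensemble)
    n_atoms = len(ensemble[0])
    means, variances = [], []
    for i in range(n_atoms):
        mean = tuple(sum(conf[i][k] for conf in ensemble) / n_conf for k in range(3))
        # Mean squared displacement of atom i from its average position.
        var = sum(
            sum((conf[i][k] - mean[k]) ** 2 for k in range(3)) for conf in ensemble
        ) / n_conf
        means.append(mean)
        variances.append(var)
    return means, variances


def flexibility_weighted_deviation(model, means, variances, floor=1e-3):
    """Average squared deviation of the model, each atom scaled by its ensemble variance."""
    total = 0.0
    for pos, mean, var in zip(model, means, variances):
        sq_dev = sum((pos[k] - mean[k]) ** 2 for k in range(3))
        total += sq_dev / max(var, floor)  # floor avoids division by zero for rigid atoms
    return total / len(model)


# Toy ensemble: atom 0 is nearly rigid, atom 1 is flexible.
toy_ensemble = [
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
    [(0.2, 0.0, 0.0), (7.0, 0.0, 0.0)],
]
means, variances = per_atom_mean_and_variance(toy_ensemble)
# The same 1.0 displacement is applied to the rigid atom, then to the flexible atom.
score_rigid_atom_off = flexibility_weighted_deviation(
    [(1.1, 0.0, 0.0), (6.0, 0.0, 0.0)], means, variances
)
score_flexible_atom_off = flexibility_weighted_deviation(
    [(0.1, 0.0, 0.0), (7.0, 0.0, 0.0)], means, variances
)
print(score_rigid_atom_off, score_flexible_atom_off)
```

The same displacement is penalised far more heavily on the rigid atom than on the flexible one, which is the qualitative behaviour the abstract argues a flexibility-aware score should have.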
Affiliation(s)
- Michal Jamroz
  - Department of Chemistry, University of Warsaw, Warsaw, 02-093, Poland
- Andrzej Kolinski
  - Department of Chemistry, University of Warsaw, Warsaw, 02-093, Poland
- Daisuke Kihara
  - Department of Biological Sciences, Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA
9
Kaiser F, Eisold A, Labudde D. A Novel Algorithm for Enhanced Structural Motif Matching in Proteins. J Comput Biol 2015; 22:698-713. [PMID: 25695840 DOI: 10.1089/cmb.2014.0263] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Indexed: 11/13/2022]
Abstract
As widely discussed in literature, spatial patterns of amino acids, so-called structural motifs, play an important role in protein function. The functionally responsible part of proteins often lies in an evolutionarily highly conserved spatial arrangement of only a few amino acids, which are held in place tightly by the rest of the structure. Those recurring amino acid arrangements can be seen as patterns in the three-dimensional space and are known as structural motifs. In general, these motifs can mediate various functional interactions, such as DNA/RNA targeting and binding, ligand interactions, substrate catalysis, and stabilization of the protein structure. Hence, characterizing and identifying such conserved structural motifs can contribute to the understanding of structure-function relationships. Therefore, and because of the rapidly increasing number of solved protein structures, it is highly desirable to identify, understand, and moreover to search for structurally scattered amino acid motifs. This work aims at the development and the implementation of a novel and robust matching algorithm to detect structural motifs in large sets of target structures. The proposed methods were combined and implemented to a feature-rich and easy-to-use command line software tool written in Java.
Affiliation(s)
- Florian Kaiser
  - Department of Bioinformatics, University of Applied Sciences Mittweida, Mittweida, Germany
- Alexander Eisold
  - Department of Bioinformatics, University of Applied Sciences Mittweida, Mittweida, Germany
- Dirk Labudde
  - Department of Bioinformatics, University of Applied Sciences Mittweida, Mittweida, Germany
10
Buturovic L, Wong M, Tang GW, Altman RB, Petkovic D. High precision prediction of functional sites in protein structures. PLoS One 2014; 9:e91240. [PMID: 24632601 PMCID: PMC3954699 DOI: 10.1371/journal.pone.0091240] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 09/26/2013] [Accepted: 02/11/2014] [Indexed: 11/29/2022]
Abstract
We address the problem of assigning biological function to solved protein structures. Computational tools play a critical role in identifying potential active sites and informing screening decisions for further lab analysis. A critical parameter in the practical application of computational methods is the precision, or positive predictive value. Precision measures the level of confidence the user should have in a particular computed functional assignment. Low precision annotations lead to futile laboratory investigations and waste scarce research resources. In this paper we describe an advanced version of the protein function annotation system FEATURE, which achieved 99% precision and average recall of 95% across 20 representative functional sites. The system uses a Support Vector Machine classifier operating on the microenvironment of physicochemical features around an amino acid. We also compared performance of our method with state-of-the-art sequence-level annotator Pfam in terms of precision, recall and localization. To our knowledge, no other functional site annotator has been rigorously evaluated against these key criteria. The software and predictive models are incorporated into the WebFEATURE service at http://feature.stanford.edu/wf4.0-beta.
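Precision (positive predictive value) and recall, the two criteria emphasized in this abstract, can be computed from a set of predicted sites and a set of known sites as in the sketch below. The function name and the residue identifiers are illustrative assumptions and are not outputs of FEATURE.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for a set of predicted sites against known sites."""
    predicted, actual = set(predicted), set(actual)
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical example: four residues predicted, three of them in the known site.
predicted_sites = {"HIS57", "ASP102", "SER195", "GLY193"}
known_sites = {"HIS57", "ASP102", "SER195"}
print(precision_recall(predicted_sites, known_sites))  # (0.75, 1.0)
```

Here every known residue is recovered (recall 1.0), but one in four predictions would send a laboratory follow-up in the wrong direction (precision 0.75), which is the cost the abstract highlights.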
Affiliation(s)
- Ljubomir Buturovic
  - Department of Computer Science, San Francisco State University, San Francisco, California, United States of America
- Mike Wong
  - Center for Computing for Life Sciences, San Francisco State University, San Francisco, California, United States of America
- Grace W. Tang
  - Department of Bioengineering, Stanford University, Stanford, California, United States of America
- Russ B. Altman
  - Department of Bioengineering, Stanford University, Stanford, California, United States of America
- Dragutin Petkovic
  - Department of Computer Science, San Francisco State University, San Francisco, California, United States of America
  - Center for Computing for Life Sciences, San Francisco State University, San Francisco, California, United States of America
11
Yu D, Kim M, Xiao G, Hwang TH. Review of biological network data and its applications. Genomics Inform 2013; 11:200-10. [PMID: 24465231 PMCID: PMC3897847 DOI: 10.5808/gi.2013.11.4.200] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.9] [Received: 10/15/2013] [Revised: 11/20/2013] [Accepted: 11/21/2013] [Indexed: 12/16/2022]
Abstract
Studying biological networks, such as protein-protein interactions, is key to understanding complex biological activities. Various types of large-scale biological datasets have been collected and analyzed with high-throughput technologies, including DNA microarray, next-generation sequencing, and the two-hybrid screening system, for this purpose. In this review, we focus on network-based approaches that help in understanding biological systems and identifying biological functions. Accordingly, this paper covers two major topics in network biology: reconstruction of gene regulatory networks and network-based applications, including protein function prediction, disease gene prioritization, and network-based genome-wide association study.
Affiliation(s)
- Donghyeon Yu
  - Department of Clinical Sciences, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
- Minsoo Kim
  - Department of Clinical Sciences, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
- Guanghua Xiao
  - Department of Clinical Sciences, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
- Tae Hyun Hwang
  - Department of Clinical Sciences, Quantitative Biomedical Research Center, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
12
Van Voorst JR, Finzel BC. Searching for likeness in a database of macromolecular complexes. J Chem Inf Model 2013; 53:2634-47. [PMID: 24047445 DOI: 10.1021/ci4002537] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/30/2022]
Abstract
A software tool and workflow based on distance geometry is presented that can be used to search for local similarity in substructures in a comprehensive database of experimentally derived macromolecular structure. The method does not rely on fold annotation, specific secondary structure assignments, or sequence homology and may be used to locate compound substructures of multiple segments spanning different macromolecules that share a queried backbone geometry. This generalized substructure searching capability is intended to allow users to play an active part in exploring the role specific substructures play in larger protein domains, quaternary assemblies of proteins, and macromolecular complexes of proteins and polynucleotides. The user may select any portion or portions of an existing structure or complex to serve as a template for searching, and other structures that share the same structural features are identified, retrieved and overlaid to emphasize substructural likeness. Matching structures may be compared using a variety of integrated tools including molecular graphics for structure visualization and matching substructure sequence logos. A number of examples are provided that illustrate how generalized substructure searching may be used to understand both the similarity, and individuality of specific macromolecular structures. Web-based access to our substructure searching services is freely available at https://drugsite.msi.umn.edu.
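Distance-based comparison of substructures, as mentioned in this abstract, rests on the fact that pairwise distances within a set of atoms do not change under rotation or translation. The sketch below compares two equally sized, equally ordered point sets this way. It is a generic illustration, not the authors' tool; the function names and coordinates are assumptions, and it does not address the harder problem of finding which atoms correspond.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """All inter-point distances, in a fixed pair order."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def distance_rmsd(template, candidate):
    """Root-mean-square difference between corresponding pairwise distances."""
    if len(template) != len(candidate):
        raise ValueError("template and candidate must have the same number of points")
    d_template = pairwise_distances(template)
    d_candidate = pairwise_distances(candidate)
    if not d_template:
        return 0.0
    squared = sum((a - b) ** 2 for a, b in zip(d_template, d_candidate))
    return math.sqrt(squared / len(d_template))


toy_template = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
# Same shape, rotated 90 degrees about z and shifted by (5, 5, 5).
toy_moved = [(5.0, 5.0, 5.0), (5.0, 6.0, 5.0), (4.0, 5.0, 5.0)]
# Different shape: the first edge is stretched to length 2.
toy_stretched = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(distance_rmsd(toy_template, toy_moved), distance_rmsd(toy_template, toy_stretched))
```

Because no superposition step is needed, a comparison of this kind can be applied to substructures drawn from different chains or different molecules, which fits the abstract's emphasis on multi-segment queries.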
Affiliation(s)
- Jeffrey R Van Voorst
  - Department of Medicinal Chemistry, University of Minnesota College of Pharmacy, Minneapolis, Minnesota 55455, United States
13
Structure prediction of partial-length protein sequences. Int J Mol Sci 2013; 14:14892-907. [PMID: 23867606 PMCID: PMC3742278 DOI: 10.3390/ijms140714892] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 04/26/2013] [Revised: 07/01/2013] [Accepted: 07/02/2013] [Indexed: 12/17/2022]
Abstract
Protein structure information is essential to understand protein function. Computational methods to accurately predict protein structure from the sequence have primarily been evaluated on protein sequences representing full-length native proteins. Here, we demonstrate that top-performing structure prediction methods can accurately predict the partial structures of proteins encoded by sequences that contain approximately 50% or more of the full-length protein sequence. We hypothesize that structure prediction may be useful for predicting functions of proteins whose corresponding genes are mapped expressed sequence tags (ESTs) that encode partial-length amino acid sequences. Additionally, we identify a confidence score representing the quality of a predicted structure as a useful means of predicting the likelihood that an arbitrary polypeptide sequence represents a portion of a foldable protein sequence (“foldability”). This work has ramifications for the prediction of protein structure with limited or noisy sequence information, as well as genome annotation.
14
Kirshner DA, Nilmeier JP, Lightstone FC. Catalytic site identification--a web server to identify catalytic site structural matches throughout PDB. Nucleic Acids Res 2013; 41:W256-65. [PMID: 23680785 PMCID: PMC3692059 DOI: 10.1093/nar/gkt403] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Indexed: 01/11/2023]
Abstract
The catalytic site identification web server provides the innovative capability to find structural matches to a user-specified catalytic site among all Protein Data Bank proteins rapidly (in less than a minute). The server also can examine a user-specified protein structure or model to identify structural matches to a library of catalytic sites. Finally, the server provides a database of pre-calculated matches between all Protein Data Bank proteins and the library of catalytic sites. The database has been used to derive a set of hypothesized novel enzymatic function annotations. In all cases, matches and putative binding sites (protein structure and surfaces) can be visualized interactively online. The website can be accessed at http://catsid.llnl.gov.
Affiliation(s)
- Felice C. Lightstone
  - To whom correspondence should be addressed. Tel: +1 925 423 8657; Fax: +1 925 423 0785
15
Manoharan M, Sankar K, Offmann B, Ramanathan S. Association of Putative Members to Family of Mosquito Odorant Binding Proteins: Scoring Scheme Using Fuzzy Functional Templates and Cys Residue Positions. Bioinform Biol Insights 2013; 7:231-51. [PMID: 23908587 PMCID: PMC3728099 DOI: 10.4137/bbi.s11096] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/22/2022]
Abstract
Proteins may be related to each other very specifically as homologous subfamilies. Proteins can also be related to diverse proteins at the super family level. It has become highly important to characterize the existing sequence databases by their signatures to facilitate the function annotation of newly added sequences. The algorithm described here uses a scheme for the classification of odorant binding proteins on the basis of functional residues and Cys-pairing. The cysteine-based scoring scheme not only helps in unambiguously identifying families like odorant binding proteins (OBPs), but also aids in their classification at the subfamily level with reliable accuracy. The algorithm was also applied to yet another cysteine-rich family, where similar accuracy was observed that ensures the application of the protocol to other families.
Affiliation(s)
- Malini Manoharan
  - Université de La Reunion, DSIMB, INSERM UMR-S 665, La Reunion, France
  - National Centre for Biological Sciences, Tata Institute for Fundamental Research, GKVK campus, Bangalore, India
  - Manipal University, Madhav Nagar, Manipal, Karnataka, India
- Kannan Sankar
  - National Centre for Biological Sciences, Tata Institute for Fundamental Research, GKVK campus, Bangalore, India
  - Birla Institute of Technology, Pilani, Rajasthan, India
  - Current address: Iowa State University, Ames, IA, USA
- Bernard Offmann
  - Université de La Reunion, DSIMB, INSERM UMR-S 665, La Reunion, France
  - Université de Nantes, UFIP CNRS FRE 3478, Nantes, France
- Sowdhamini Ramanathan
  - National Centre for Biological Sciences, Tata Institute for Fundamental Research, GKVK campus, Bangalore, India
16
Abstract
A computational pipeline PocketAnnotate for functional annotation of proteins at the level of binding sites has been proposed in this study. The pipeline integrates three in-house algorithms for site-based function annotation: PocketDepth, for prediction of binding sites in protein structures; PocketMatch, for rapid comparison of binding sites and PocketAlign, to obtain detailed alignment between pair of binding sites. A novel scheme has been developed to rapidly generate a database of non-redundant binding sites. For a given input protein structure, putative ligand-binding sites are identified, matched in real time against the database and the query substructure aligned with the promising hits, to obtain a set of possible ligands that the given protein could bind to. The input can be either whole protein structures or merely the substructures corresponding to possible binding sites. Structure-based function annotation at the level of binding sites thus achieved could prove very useful for cases where no obvious functional inference can be obtained based purely on sequence or fold-level analyses. An attempt has also been made to analyse proteins of no known function from Protein Data Bank. PocketAnnotate would be a valuable tool for the scientific community and contribute towards structure-based functional inference. The web server can be freely accessed at http://proline.biochem.iisc.ernet.in/pocketannotate/.
Collapse
Affiliation(s)
- Praveen Anand
- Department of Biochemistry, Indian Institute of Science, Bangalore 560012, Karnataka, India
17
Fomenko DE, Gladyshev VN. Comparative genomics of thiol oxidoreductases reveals widespread and essential functions of thiol-based redox control of cellular processes. Antioxid Redox Signal 2012; 16:193-201. [PMID: 21902454 PMCID: PMC3234660 DOI: 10.1089/ars.2011.3980] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Indexed: 01/31/2023]
Abstract
AIMS Redox regulation of cellular processes is an important mechanism that operates in organisms from bacteria to mammals. Much of the redox control is provided by thiol oxidoreductases: proteins that employ cysteine residues for redox catalysis. We wanted to identify thiol oxidoreductases on a genome-wide scale and use this information to obtain insights into the general principles of thiol-based redox control. RESULTS Thiol oxidoreductases were identified by three independent methods that took advantage of the occurrence of selenocysteine homologs of these proteins and functional linkages among thiol oxidoreductases revealed by comparative genomics. Based on these searches, we describe thioredoxomes, which are sets of thiol oxidoreductases in organisms. Their analyses revealed that these proteins are present in all living organisms, generally account for 0.5%-1% of the proteome and that their use correlates with proteome size, distinguishing these proteins from those involved in core metabolic functions. We further describe thioredoxomes of Saccharomyces cerevisiae and humans, including proteins which have not been characterized previously. Thiol oxidoreductases occur in various cellular compartments and are enriched in the endoplasmic reticulum and cytosol. INNOVATION We developed bioinformatics methods and used them to characterize thioredoxomes on a genome-wide scale, which in turn revealed properties of thioredoxomes. CONCLUSION These data provide information about organization and properties of thiol-based redox control, whose use is increased with the increase in complexity of organisms. Our data also show an essential combined function of a set of thiol oxidoreductases, and of thiol-based redox regulation in general, in all living organisms.
Affiliation(s)
- Dmitri E Fomenko
- Department of Biochemistry and Redox Biology Center, University of Nebraska-Lincoln, USA.
18
Tang GW, Altman RB. Remote thioredoxin recognition using evolutionary conservation and structural dynamics. Structure 2011; 19:461-70. [PMID: 21481770 DOI: 10.1016/j.str.2011.02.007] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 11/28/2010] [Revised: 02/06/2011] [Accepted: 02/16/2011] [Indexed: 12/25/2022]
Abstract
The thioredoxin family of oxidoreductases plays an important role in redox signaling and control of protein function. Not only are thioredoxins linked to a variety of disorders, but their stable structure has also seen application in protein engineering. Both sequence-based and structure-based tools exist for thioredoxin identification, but remote homolog detection remains a challenge. We developed a thioredoxin predictor using the approach of integrating sequence with structural information. We combined a sequence-based Hidden Markov Model (HMM) with a molecular dynamics enhanced structure-based recognition method (dynamic FEATURE, DF). This hybrid method (HMMDF) has high precision and recall (0.90 and 0.95, respectively) compared with HMM (0.92 and 0.87, respectively) and DF (0.82 and 0.97, respectively). Dynamic FEATURE is sensitive but struggles to resolve closely related protein families, while HMM identifies these evolutionary differences by compromising sensitivity. Our method applied to structural genomics targets makes a strong prediction of a novel thioredoxin.
Affiliation(s)
- Grace W Tang
- Department of Bioengineering, Stanford University, Stanford, CA 94305, USA
19
Kato T, Nagano N. Discriminative structural approaches for enzyme active-site prediction. BMC Bioinformatics 2011; 12 Suppl 1:S49. [PMID: 21342581 PMCID: PMC3044306 DOI: 10.1186/1471-2105-12-s1-s49] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022]
Abstract
BACKGROUND Predicting enzyme active-sites in proteins is an important issue not only for protein sciences but also for a variety of practical applications such as drug design. Because enzyme reaction mechanisms are based on the local structures of enzyme active-sites, various template-based methods that compare local structures in proteins have been developed to date. In comparing such local sites, a simple measurement, RMSD, has been used so far. RESULTS This paper introduces new machine learning algorithms that refine the similarity/deviation for comparison of local structures. The similarity/deviation is applied to two types of applications, single template analysis and multiple template analysis. In the single template analysis, a single template is used as a query to search proteins for active sites, whereas a protein structure is examined as a query to discover the possible active-sites using a set of templates in the multiple template analysis. CONCLUSIONS This paper experimentally illustrates that the machine learning algorithms effectively improve the similarity/deviation measurements for both the analyses.
Affiliation(s)
- Tsuyoshi Kato
- Graduate school of Engineering, Gunma University, Tenjin-cho 1-5-1, Kiryu, Gunma 376-8515, Japan.
20
Horst JA, Samudrala R. A protein sequence meta-functional signature for calcium binding residue prediction. Pattern Recognit Lett 2010; 31:2103-2112. [PMID: 20824111 PMCID: PMC2932634 DOI: 10.1016/j.patrec.2010.04.012] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Indexed: 12/23/2022]
Abstract
The diversity of characterized protein functions found amongst experimentally interrogated proteins suggests that a vast array of unknown functions remains undiscovered. These protein functions are imparted by specific geometric distributions of amino acid residue chemical moieties, each contributing a functional interaction. We hypothesize that individual residue function contributions are predictable through sequence-analytic, knowledge-based algorithms, and that they can be recombined to understand composite protein function by predicting spatial relation in tertiary structure. We assess the former by training a meta-functional signature algorithm to specifically predict calcium ion binding residues from protein sequence. We estimate the latter by testing for match between predictive contribution of positions in predicted secondary structures and patterns of side chain proximity forced by secondary structure moieties. Specific training for calcium binding results in 83% area under the receiver operator characteristic curve added value over random (AUCoR) and p < 10^-300 significance as measured by Kendall's τ in ten-fold cross-validation for parallel sets of 811 residues in 336 proteins and 696 residues in 299 proteins. Training for generalized function results in 63% AUCoR and p ≅ 10^-221 for the same tests. Including inference of side chain proximity improves predictive ability by 2% AUCoR consistently. The results demonstrate that protein meta-functional signatures can be trained to predict specific protein functions by considering amino acid identity and structural features accessible from sequence, laying the groundwork for composite sequence-based function site prediction.
Affiliation(s)
- Jeremy A Horst
- Department of Oral Biology, School of Dentistry, University of Washington, 1959 NE Pacific St #357132, Seattle, WA 98195
- Department of Microbiology, School of Medicine, University of Washington, 1959 NE Pacific St #357132, Seattle, WA 98195
- Ram Samudrala
- Department of Oral Biology, School of Dentistry, University of Washington, 1959 NE Pacific St #357132, Seattle, WA 98195
- Department of Microbiology, School of Medicine, University of Washington, 1959 NE Pacific St #357132, Seattle, WA 98195
21
Abstract
Motivation: Finding functionally analogous enzymes based on the local structures of active sites is an important problem. Conventional methods use templates of local structures to search for analogous sites, but their performance depends on the selection of atoms for inclusion in the templates. Results: The proposed metric learning algorithm selects atoms automatically so that site matches can be discriminated from mismatches. The algorithm provides not only good predictions, but also some insights into which atoms are important for the prediction. Our experimental results suggest that the metric learning automatically provides more effective templates than those whose atoms are selected manually. Availability: Online software is available at http://www.net-machine.net/~kato/lpmetric1/ Contact: kato-tsuyoshi@k.u-tokyo.ac.jp Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Tsuyoshi Kato
- GSFS, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba, Japan.
22
Li GH, Huang JF. CMASA: an accurate algorithm for detecting local protein structural similarity and its application to enzyme catalytic site annotation. BMC Bioinformatics 2010; 11:439. [PMID: 20796320 PMCID: PMC2936402 DOI: 10.1186/1471-2105-11-439] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Received: 11/26/2009] [Accepted: 08/27/2010] [Indexed: 11/10/2022]
Abstract
BACKGROUND The rapid development of structural genomics has resulted in many "unknown function" proteins being deposited in Protein Data Bank (PDB), thus, the functional prediction of these proteins has become a challenge for structural bioinformatics. Several sequence-based and structure-based methods have been developed to predict protein function, but these methods need to be improved further, such as, enhancing the accuracy, sensitivity, and the computational speed. Here, an accurate algorithm, the CMASA (Contact MAtrix based local Structural Alignment algorithm), has been developed to predict unknown functions of proteins based on the local protein structural similarity. This algorithm has been evaluated by building a test set including 164 enzyme families, and also been compared to other methods. RESULTS The evaluation of CMASA shows that the CMASA is highly accurate (0.96), sensitive (0.86), and fast enough to be used in the large-scale functional annotation. Comparing to both sequence-based and global structure-based methods, not only the CMASA can find remote homologous proteins, but also can find the active site convergence. Comparing to other local structure comparison-based methods, the CMASA can obtain the better performance than both FFF (a method using geometry to predict protein function) and SPASM (a local structure alignment method); and the CMASA is more sensitive than PINTS and is more accurate than JESS (both are local structure alignment methods). The CMASA was applied to annotate the enzyme catalytic sites of the non-redundant PDB, and at least 166 putative catalytic sites have been suggested, these sites can not be observed by the Catalytic Site Atlas (CSA). CONCLUSIONS The CMASA is an accurate algorithm for detecting local protein structural similarity, and it holds several advantages in predicting enzyme active sites. The CMASA can be used in large-scale enzyme active site annotation. 
CMASA is available through a mail-based server (http://159.226.149.45/other1/CMASA/CMASA.htm).
Affiliation(s)
- Gong-Hua Li
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan, China
23
Bandyopadhyay D, Huan J, Liu J, Prins J, Snoeyink J, Wang W, Tropsha A. Functional neighbors: inferring relationships between nonhomologous protein families using family-specific packing motifs. IEEE Trans Inf Technol Biomed 2010; 14:1137-43. [PMID: 20570776 DOI: 10.1109/titb.2010.2053550] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/06/2022]
Abstract
We describe a new approach for inferring the functional relationships between nonhomologous protein families by looking at statistical enrichment of alternative function predictions in classification hierarchies such as Gene Ontology (GO) and Structural Classification of Proteins (SCOP). Protein structures are represented by robust graph representations, and the fast frequent subgraph mining algorithm is applied to protein families to generate sets of family-specific packing motifs, i.e., amino acid residue-packing patterns shared by most family members but infrequent in other proteins. The function of a protein is inferred by identifying in it motifs characteristic of a known family. We employ these family-specific motifs to elucidate functional relationships between families in the GO and SCOP hierarchies. Specifically, we postulate that two families are functionally related if one family is statistically enriched by motifs characteristic of another family, i.e., if the number of proteins in a family containing a motif from another family is greater than expected by chance. This function-inference method can help annotate proteins of unknown function, establish functional neighbors of existing families, and help specify alternate functions for known proteins.
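The enrichment criterion in this abstract (more proteins in a family carrying another family's motif than expected by chance) can be illustrated with a small sketch. The abstract does not name the statistical test, so the one-sided hypergeometric tail used here is an assumption, and the function name, parameter names, and counts are hypothetical.

```python
from math import comb

def enrichment_p_value(population_size, motif_carriers, family_size, family_carriers):
    """One-sided hypergeometric tail P(X >= family_carriers).

    Assumption (not stated in the abstract): the chance expectation is
    modelled as drawing `family_size` proteins without replacement from
    `population_size` proteins, of which `motif_carriers` contain the motif.
    """
    upper = min(motif_carriers, family_size)
    favourable = sum(
        comb(motif_carriers, i) * comb(population_size - motif_carriers, family_size - i)
        for i in range(family_carriers, upper + 1)
    )
    return favourable / comb(population_size, family_size)

# Hypothetical counts: 10 proteins overall, 4 carry the motif,
# and 3 of the 5 proteins in the family under test carry it.
p = enrichment_p_value(10, 4, 5, 3)
print(f"p = {p:.4f}")
```

A small p-value would suggest that the family is enriched for the motif; the significance threshold is left to the reader.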
Affiliation(s)
- Deepak Bandyopadhyay
- Department of Computational and Structural Chemistry, GlaxoSmithKline, Collegeville, PA UP12-210, USA.
24
Molecular surface mesh generation by filtering electron density map. Int J Biomed Imaging 2010; 2010:923780. [PMID: 20414352 PMCID: PMC2856016 DOI: 10.1155/2010/923780] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 09/25/2009] [Revised: 11/23/2009] [Accepted: 01/06/2010] [Indexed: 11/17/2022]
Abstract
Bioinformatics applied to macromolecules are now widely spread and in continuous expansion. In this context, representing external molecular surface such as the Van der Waals Surface or the Solvent Excluded Surface can be useful for several applications. We propose a fast and parameterizable algorithm giving good visual quality meshes representing molecular surfaces. It is obtained by isosurfacing a filtered electron density map. The density map is the result of the maximum of Gaussian functions placed around atom centers. This map is filtered by an ideal low-pass filter applied on the Fourier Transform of the density map. Applying the marching cubes algorithm on the inverse transform provides a mesh representation of the molecular surface.
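The first step described in this abstract, a density map defined as the maximum of Gaussian functions placed around atom centers, can be sketched as follows. The isotropic Gaussian form, the single `sigma` width shared by all atoms, and the coordinates are assumptions made for illustration; the low-pass filtering and marching cubes steps are not shown.

```python
import math

def density(point, atom_centers, sigma=1.0):
    """Density at `point` as the maximum over per-atom Gaussians.

    Assumption: each atom contributes exp(-d^2 / (2 * sigma^2)), where d is
    the distance from the point to the atom center.
    """
    return max(
        math.exp(-math.dist(point, center) ** 2 / (2.0 * sigma ** 2))
        for center in atom_centers
    )

# Hypothetical two-atom molecule (coordinates in arbitrary units).
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
print(density((0.0, 0.0, 0.0), atoms))   # at an atom center
print(density((0.75, 0.0, 0.0), atoms))  # midway between the atoms
```

Sampling this function on a regular 3D grid would give the map that the abstract then filters in the Fourier domain.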
25
Vacic V, Iakoucheva LM, Lonardi S, Radivojac P. Graphlet kernels for prediction of functional residues in protein structures. J Comput Biol 2010; 17:55-72. [PMID: 20078397 DOI: 10.1089/cmb.2009.0029] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.5] [Indexed: 11/13/2022]
Abstract
We introduce a novel graph-based kernel method for annotating functional residues in protein structures. A structure is first modeled as a protein contact graph, where nodes correspond to residues and edges connect spatially neighboring residues. Each vertex in the graph is then represented as a vector of counts of labeled non-isomorphic subgraphs (graphlets), centered on the vertex of interest. A similarity measure between two vertices is expressed as the inner product of their respective count vectors and is used in a supervised learning framework to classify protein residues. We evaluated our method on two function prediction problems: identification of catalytic residues in proteins, which is a well-studied problem suitable for benchmarking, and a much less explored problem of predicting phosphorylation sites in protein structures. The performance of the graphlet kernel approach was then compared against two alternative methods, a sequence-based predictor and our implementation of the FEATURE framework. On both tasks, the graphlet kernel performed favorably; however, the margin of difference was considerably higher on the problem of phosphorylation site prediction. While there is data that phosphorylation sites are preferentially positioned in intrinsically disordered regions, we provide evidence that for the sites that are located in structured regions, neither the surface accessibility alone nor the averaged measures calculated from the residue microenvironments utilized by FEATURE were sufficient to achieve high accuracy. The key benefit of the graphlet representation is its ability to capture neighborhood similarities in protein structures via enumerating the patterns of local connectivity in the corresponding labeled graphs.
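The similarity measure described in this abstract, an inner product of per-vertex graphlet count vectors, can be sketched as below. The sketch assumes the graphlet counts have already been computed and stored as sparse dictionaries; the graphlet labels and counts are hypothetical, and graphlet enumeration itself is not shown.

```python
def graphlet_kernel(counts_a, counts_b):
    """Inner product of two sparse graphlet-count vectors.

    Each argument maps a graphlet label to the number of times that
    labeled graphlet occurs around the vertex (residue) of interest.
    """
    return sum(n * counts_b.get(label, 0) for label, n in counts_a.items())

# Hypothetical count vectors for two residues.
residue_a = {"g1": 2, "g2": 1, "g3": 4}
residue_b = {"g1": 3, "g3": 1, "g4": 5}
print(graphlet_kernel(residue_a, residue_b))
```

Because the result is a plain inner product, it can be passed as a precomputed kernel to a kernel-based classifier, which matches the supervised setting the abstract describes.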
Affiliation(s)
- Vladimir Vacic
- Department of Computer Science and Engineering, University of California, Riverside, California, USA
26
Cammer S, Carter CW. Six Rossmannoid folds, including the Class I aminoacyl-tRNA synthetases, share a partial core with the anti-codon-binding domain of a Class II aminoacyl-tRNA synthetase. Bioinformatics 2010; 26:709-14. [PMID: 20130031 DOI: 10.1093/bioinformatics/btq039] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.4] [Indexed: 12/20/2022]
Abstract
MOTIVATION Similarities in core residue packing provide evidence for divergence or convergence not reported using other methods. RESULTS We apply a new method for rapid structure comparison based on Simplicial Neighborhood Analysis of Protein Packing (SNAPP) to the diverse structural classification of proteins (SCOP) alpha/beta-class of protein folds. The procedure identifies inter-residue packing motifs shared by protein pairs from different folds. A threshold of 0.67 Å RMSD for all atoms of corresponding residues ensures inclusion of only highly significant similarities comparable with those observed for identical catalytic residues in homologues. Many tertiary packing motifs are shared among the three classical Rossmannoid folds, as well as thousands of other motifs that occur in at least two distinct folds. Merging of neighboring packing motifs facilitated recognition of larger, recurrent substructures or cores. The anti-codon-binding domain of an archaeal aminoacyl-tRNA synthetase (aaRS) was discovered to possess a packed core in which eight identical amino acid residues are within 0.55 Å RMSD of the comparable structure in the FixJ receiver, a member of the Rossmannoid family that also includes the CheY signaling protein and flavodoxin-like proteins. Further investigation identified close variants of this core in five other Rossmannoid folds, including a functionally relevant core in Class Ia aminoacyl-tRNA synthetases. Although it is possible that the two essentially identical cores in the ProRS anti-codon-binding domain and the FixJ receiver converged to the same structure, the consensus core obtained from the structural and sequence alignments suggests that all the implicated protein folds descended from a simpler ancestral protein in which this core provided nucleotide binding and proto-allosteric functions. 
AVAILABILITY Programs are available at http://staff.vbi.vt.edu/cammer/snapp/download/. IMPLEMENTATION Programs were written in Perl and C and run under Linux. CONTACT cammer@vbi.vt.edu.
Affiliation(s)
- Stephen Cammer
- Virginia Bioinformatics Institute at Virginia Tech, Blacksburg, VA 24061, USA.
27
Sankararaman S, Sha F, Kirsch JF, Jordan MI, Sjölander K. Active site prediction using evolutionary and structural information. Bioinformatics 2010; 26:617-24. [PMID: 20080507 PMCID: PMC2828116 DOI: 10.1093/bioinformatics/btq008] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.9] [Indexed: 11/17/2022]
Abstract
Motivation: The identification of catalytic residues is a key step in understanding the function of enzymes. While a variety of computational methods have been developed for this task, accuracies have remained fairly low. The best existing method exploits information from sequence and structure to achieve a precision (the fraction of predicted catalytic residues that are catalytic) of 18.5% at a corresponding recall (the fraction of catalytic residues identified) of 57% on a standard benchmark. Here we present a new method, Discern, which provides a significant improvement over the state-of-the-art through the use of statistical techniques to derive a model with a small set of features that are jointly predictive of enzyme active sites. Results: In cross-validation experiments on two benchmark datasets from the Catalytic Site Atlas and CATRES resources containing a total of 437 manually curated enzymes spanning 487 SCOP families, Discern increases catalytic site recall between 12% and 20% over methods that combine information from both sequence and structure, and by ≥50% over methods that make use of sequence conservation signal only. Controlled experiments show that Discern's improvement in catalytic residue prediction is derived from the combination of three ingredients: the use of the INTREPID phylogenomic method to extract conservation information; the use of 3D structure data, including features computed for residues that are proximal in the structure; and a statistical regularization procedure to prevent overfitting. Contact: kimmen@berkeley.edu Supplementary information: Supplementary data are available at Bioinformatics online.
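The precision and recall definitions given in this abstract can be made concrete with a short sketch. The residue numbers below are hypothetical and serve only to illustrate the two ratios; they are not taken from the benchmark.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two collections of residue identifiers.

    precision: fraction of predicted catalytic residues that are catalytic
    recall:    fraction of catalytic residues that were identified
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical residue numbers, for illustration only.
predicted_catalytic = {12, 45, 57, 102, 195}
known_catalytic = {57, 102, 195, 214}
precision, recall = precision_recall(predicted_catalytic, known_catalytic)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Here three of the five predictions are correct and three of the four known residues are recovered, which shows how a predictor can trade one measure against the other.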
28
Giard J, Ambroise J, Gala JL, Macq B. Regression applied to protein binding site prediction and comparison with classification. BMC Bioinformatics 2009; 10:276. [PMID: 19728868 PMCID: PMC2749839 DOI: 10.1186/1471-2105-10-276] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 02/26/2009] [Accepted: 09/03/2009] [Indexed: 11/13/2022]
Abstract
Background The structural genomics centers provide hundreds of protein structures of unknown function. Therefore, developing methods enabling the determination of a protein function automatically is imperative. The determination of a protein function can be achieved by studying the network of its physical interactions. In this context, identifying a potential binding site between proteins is of primary interest. In the literature, methods for predicting a potential binding site location generally are based on classification tools. The aim of this paper is to show that regression tools are more efficient than classification tools for patches based binding site predictors. For this purpose, we developed a patches based binding site localization method usable with either regression or classification tools. Results We compared predictive performances of regression tools with performances of machine learning classifiers. Using leave-one-out cross-validation, we showed that regression tools provide better predictions than classification ones. Among regression tools, Multilayer Perceptron ranked highest in the quality of predictions. We compared also the predictive performance of our patches based method using Multilayer Perceptron with the performance of three other methods usable through a web server. Our method performed similarly to the other methods. Conclusion Regression is more efficient than classification when applied to our binding site localization method. When it is possible, using regression instead of classification for other existing binding site predictors will probably improve results. Furthermore, the method presented in this work is flexible because the size of the predicted binding site is adjustable. This adaptability is useful when either false positive or negative rates have to be limited.
Affiliation(s)
- Joachim Giard
- Communications and Remote Sensing Laboratory, Université Catholique de Louvain, Place du Levant 2, 1348 Louvain-la-Neuve, Belgium.
29
Kelley LA, Shrimpton PJ, Muggleton SH, Sternberg MJE. Discovering rules for protein-ligand specificity using support vector inductive logic programming. Protein Eng Des Sel 2009; 22:561-7. [PMID: 19574295 DOI: 10.1093/protein/gzp035] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/12/2022]
Abstract
Structural genomics initiatives are rapidly generating vast numbers of protein structures. Comparative modelling is also capable of producing accurate structural models for many protein sequences. However, for many of the known structures, functions are not yet determined, and in many modelling tasks, an accurate structural model does not necessarily tell us about function. Thus, there is a pressing need for high-throughput methods for determining function from structure. The spatial arrangement of key amino acids in a folded protein, on the surface or buried in clefts, is often the determinant of its biological function. A central aim of molecular biology is to understand the relationship between such substructures or surfaces and biological function, leading both to function prediction and to function design. We present a new general method for discovering the features of binding pockets that confer specificity for particular ligands. Using a recently developed machine-learning technique which couples the rule-discovery approach of inductive logic programming with the statistical learning power of support vector machines, we are able to discriminate, with high precision (90%) and recall (86%), between pockets that bind FAD and those that bind NAD on a large benchmark set, given only the geometry and composition of the backbone of the binding pocket and without the use of docking. In addition, we learn rules governing this specificity which can feed into protein functional design protocols. An analysis of the rules found suggests that key features of the binding pocket may be tied to conformational freedom in the ligand. The representation is sufficiently general to be applicable to any discriminatory binding problem. All programs and data sets are freely available to non-commercial users at http://www.sbg.bio.ic.ac.uk/svilp_ligand/.
Affiliation(s)
- Lawrence A Kelley
- Structural Bioinformatics Group, Division of Molecular Biosciences, Imperial College London, London, UK.
30
Identification of family-specific residue packing motifs and their use for structure-based protein function prediction: I. Method development. J Comput Aided Mol Des 2009; 23:773-84. [DOI: 10.1007/s10822-009-9273-4] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 07/21/2008] [Accepted: 04/15/2009] [Indexed: 12/12/2022]
31
Tang K, Pugalenthi G, Suganthan PN, Lanczycki CJ, Chakrabarti S. Prediction of functionally important sites from protein sequences using sparse kernel least squares classifiers. Biochem Biophys Res Commun 2009; 384:155-9. [PMID: 19394310 DOI: 10.1016/j.bbrc.2009.04.096] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 04/13/2009] [Accepted: 04/20/2009] [Indexed: 11/25/2022]
Abstract
Identification of functionally important sites (FIS) in proteins is a critical problem and can have profound importance where protein structural information is limited. Machine learning techniques have been very useful in successful classification of many important biological problems. In this paper, we adopt the sparse kernel least squares classifiers (SKLSC) approach for classification and/or prediction of FIS using protein sequence derived features. The SKLSC algorithm was applied to 5435 FIS that have been extracted from 312 reliable alignments for a wide range of protein families. We obtained 68.28% sensitivity and 68.66% specificity for training dataset and 65.34% sensitivity and 66.88% specificity for testing dataset. Further, large scale benchmarking study using alignments of 101 protein families containing 1899 FIS showed that our method achieved an average approximately 70% sensitivity in predicting different types of FIS, such as active sites, metal, ligand or protein binding sites. Our findings also indicate that active sites and metal binding sites are comparably easier to predict compared to the ligand and protein binding sites. Despite moderate success, our results suggest the usefulness and potential of SKLSC approach in prediction of FIS using only protein sequence derived information.
Affiliation(s)
- Ke Tang
- NICAL, Department of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui, China
32
Skolnick J, Brylinski M. FINDSITE: a combined evolution/structure-based approach to protein function prediction. Brief Bioinform 2009; 10:378-91. [PMID: 19324930 DOI: 10.1093/bib/bbp017] [Citation(s) in RCA: 72] [Impact Index Per Article: 4.8] [Indexed: 11/13/2022]
Abstract
A key challenge of the post-genomic era is the identification of the function(s) of all the molecules in a given organism. Here, we review the status of sequence and structure-based approaches to protein function inference and ligand screening that can provide functional insights for a significant fraction of the approximately 50% of ORFs of unassigned function in an average proteome. We then describe FINDSITE, a recently developed algorithm for ligand binding site prediction, ligand screening and molecular function prediction, which is based on binding site conservation across evolutionary distant proteins identified by threading. Importantly, FINDSITE gives comparable results when high-resolution experimental structures as well as predicted protein models are used.
Affiliation(s)
- Jeffrey Skolnick
- Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology 250 14th St NW, Atlanta, GA 30318, USA.
33
Dunker AK, Oldfield CJ, Meng J, Romero P, Yang JY, Chen JW, Vacic V, Obradovic Z, Uversky VN. The unfoldomics decade: an update on intrinsically disordered proteins. BMC Genomics 2008; 9 Suppl 2:S1. [PMID: 18831774 PMCID: PMC2559873 DOI: 10.1186/1471-2164-9-s2-s1] [Citation(s) in RCA: 386] [Impact Index Per Article: 24.1] [Indexed: 01/14/2023]
Abstract
BACKGROUND Our first predictor of protein disorder was published just over a decade ago in the Proceedings of the IEEE International Conference on Neural Networks (Romero P, Obradovic Z, Kissinger C, Villafranca JE, Dunker AK (1997) Identifying disordered regions in proteins from amino acid sequence. Proceedings of the IEEE International Conference on Neural Networks, 1: 90-95). By now more than twenty other laboratory groups have joined the efforts to improve the prediction of protein disorder. While the various prediction methodologies used for protein intrinsic disorder resemble those methodologies used for secondary structure prediction, the two types of structures are entirely different. For example, the two structural classes have very different dynamic properties, with the irregular secondary structure class being much less mobile than the disorder class. The prediction of secondary structure has been useful. On the other hand, the prediction of intrinsic disorder has been revolutionary, leading to major modifications of the more than 100 year-old views relating protein structure and function. Experimentalists have been providing evidence over many decades that some proteins lack fixed structure or are disordered (or unfolded) under physiological conditions. In addition, experimentalists are also showing that, for many proteins, their functions depend on the unstructured rather than structured state; such results are in marked contrast to the greater than hundred year old views such as the lock and key hypothesis. Despite extensive data on many important examples, including disease-associated proteins, the importance of disorder for protein function has been largely ignored. Indeed, to our knowledge, current biochemistry books don't present even one acknowledged example of a disorder-dependent function, even though some reports of disorder-dependent functions are more than 50 years old. 
The results from genome-wide predictions of intrinsic disorder and the results from other bioinformatics studies of intrinsic disorder are demanding attention for these proteins. RESULTS Disorder prediction has been important for showing that the relatively few experimentally characterized examples are members of a very large collection of related disordered proteins that are wide-spread over all three domains of life. Many significant biological functions are now known to depend directly on, or are importantly associated with, the unfolded or partially folded state. Here our goal is to review the key discoveries and to weave these discoveries together to support novel approaches for understanding sequence-function relationships. CONCLUSION Intrinsically disordered protein is common across the three domains of life, but especially common among the eukaryotic proteomes. Signaling sequences and sites of posttranslational modifications are frequently, or very likely most often, located within regions of intrinsic disorder. Disorder-to-order transitions are coupled with the adoption of different structures with different partners. Also, the flexibility of intrinsic disorder helps different disordered regions to bind to a common binding site on a common partner. Such capacity for binding diversity plays important roles in both protein-protein interaction networks and likely also in gene regulation networks. Such disorder-based signaling is further modulated in multicellular eukaryotes by alternative splicing, for which such splicing events map to regions of disorder much more often than to regions of structure. Associating alternative splicing with disorder rather than structure alleviates theoretical and experimentally observed problems associated with the folding of different length, isomeric amino acid sequences. 
The combination of disorder and alternative splicing is proposed to provide a mechanism for easily "trying out" different signaling pathways, thereby providing the mechanism for generating signaling diversity and enabling the evolution of cell differentiation and multicellularity. Finally, several recent small molecules of interest as potential drugs have been shown to act by blocking protein-protein interactions based on intrinsic disorder of one of the partners. Study of these examples has led to a new approach for drug discovery, and bioinformatics analysis of the human proteome suggests that various disease-associated proteins are very rich in such disorder-based drug discovery targets.
Affiliation(s)
- A Keith Dunker
- Center for Computational Biology and Bioinformatics, Indiana University Schools of Medicine and Informatics, Indianapolis, IN 46202, USA
- Christopher J Oldfield
- Center for Computational Biology and Bioinformatics, Indiana University School of Informatics, Indianapolis, IN 46202, USA
- Jingwei Meng
- Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Pedro Romero
- Center for Computational Biology and Bioinformatics, Indiana University School of Informatics, Indianapolis, IN 46202, USA
- Jack Y Yang
- Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Jessica Walton Chen
- Center for Computational Biology and Bioinformatics, Indiana University School of Informatics, Indianapolis, IN 46202, USA
- Vladimir Vacic
- Center for Computational Biology and Bioinformatics, Indiana University School of Informatics, Indianapolis, IN 46202, USA
- Zoran Obradovic
- Center for Information Science and Technology, Temple University, Philadelphia, PA 19122, USA
- Vladimir N Uversky
- Center for Computational Biology and Bioinformatics, Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Institute for Intrinsically Disordered Protein Research, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Institute for Biological Instrumentation, Russian Academy of Sciences, 142290 Pushchino, Moscow Region, Russia
|
34
|
Dunker AK, Oldfield CJ, Meng J, Romero P, Yang JY, Chen JW, Vacic V, Obradovic Z, Uversky VN. The unfoldomics decade: an update on intrinsically disordered proteins. BMC Genomics 2008. [PMID: 18831774 DOI: 10.1186/1471-2164-9-s2-s1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022]
Affiliation(s)
- A Keith Dunker
- Center for Computational Biology and Bioinformatics, Indiana University Schools of Medicine and Informatics, Indianapolis, IN 46202, USA.
|
35
|
Halperin I, Glazer DS, Wu S, Altman RB. The FEATURE framework for protein function annotation: modeling new functions, improving performance, and extending to novel applications. BMC Genomics 2008; 9 Suppl 2:S2. [PMID: 18831785 PMCID: PMC2559884 DOI: 10.1186/1471-2164-9-s2-s2] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.3] [Indexed: 12/03/2022]
Abstract
Structural genomics efforts contribute new protein structures that often lack significant sequence and fold similarity to known proteins. Traditional sequence and structure-based methods may not be sufficient to annotate the molecular functions of these structures. Techniques that combine structural and functional modeling can be valuable for functional annotation. FEATURE is a flexible framework for modeling and recognition of functional sites in macromolecular structures. Here, we present an overview of the main components of the FEATURE framework, and describe the recent developments in its use. These include automating training sets selection to increase functional coverage, coupling FEATURE to structural diversity generating methods such as molecular dynamics simulations and loop modeling methods to improve performance, and using FEATURE in large-scale modeling and structure determination efforts.
Affiliation(s)
- Inbal Halperin
- Department of Genetics, 318 Campus Drive, Clark Center S240, Stanford, CA 94305, USA.
|
36
|
Watanabe RLA, Morett E, Vallejo EE. Inferring modules of functionally interacting proteins using the Bond Energy Algorithm. BMC Bioinformatics 2008; 9:285. [PMID: 18559112 PMCID: PMC2474619 DOI: 10.1186/1471-2105-9-285] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 02/28/2008] [Accepted: 06/17/2008] [Indexed: 11/10/2022]
Abstract
BACKGROUND Non-homology based methods such as phylogenetic profiles are effective for predicting functional relationships between proteins with no considerable sequence or structure similarity. Those methods rely heavily on traditional similarity metrics defined on pairs of phylogenetic patterns. Proteins do not exclusively interact in pairs, as the final biological function of a protein in the cellular context is often held by a group of proteins. In order to accurately infer modules of functionally interacting proteins, the consideration of not only direct but also indirect relationships is required. In this paper, we used the Bond Energy Algorithm (BEA) to predict functionally related groups of proteins. With BEA we create clusters of phylogenetic profiles based on the associations of the surrounding elements of the analyzed data using a metric that considers linked relationships among elements in the data set. RESULTS Using phylogenetic profiles obtained from the Cluster of Orthologous Groups of Proteins (COG) database, we conducted a series of clustering experiments using BEA to predict (upper level) relationships between profiles. We evaluated our results by comparing with COG's functional categories and, further, with the experimentally determined functional relationships between proteins provided by the DIP and ECOCYC databases. Our results demonstrate that BEA is capable of predicting meaningful modules of functionally related proteins. BEA outperforms traditionally used clustering methods, such as k-means and hierarchical clustering, by predicting functional relationships between proteins with higher accuracy. CONCLUSION This study shows that the linked relationships of phylogenetic profiles obtained by BEA are useful for detecting functional associations between profiles and extending functional modules not found by traditional methods. BEA is capable of detecting relationships among phylogenetic patterns by linking them through a common element shared in a group. Additionally, we discuss how the proposed method may become more powerful if other criteria to classify different levels of protein functional interactions, such as gene neighborhood or protein fusion information, are provided.
Affiliation(s)
- Ryosuke L A Watanabe
- ITESM Campus Estado de México, Carretera Lago de Guadalupe km 3,5, Atizapán de Zaragoza, 52926, México.
|
37
|
Fetrow JS. Active site profiling to identify protein functional sites in sequences and structures using the Deacon Active Site Profiler (DASP). Curr Protoc Bioinformatics 2008; Chapter 8:8.10.1-8.10.16. [PMID: 18428769 DOI: 10.1002/0471250953.bi0810s14] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/08/2022]
Abstract
Methods for the annotation and analysis of functional sites in proteins are an area of active research, and those methods that allow detailed characterization of functional site features are much needed. A Web site application, DASP, which implements a previously described method (Cammer, et al., 2003) to allow users to create an active site profile for any protein family, is described. Two protocols for functional site analysis of protein families using DASP are presented: 1) creation of functional site signatures and a profile from proteins of known structure and 2) utilization of the active site profile to search sequences that contain fragments similar to those found in the functional site signatures. The active site profile produced by Basic Protocol 1 allows the user to analyze the features of the functional site, i.e., those characteristics that are common across the family and those that are unique to one or several members of the family. The characteristics that are unique to a subfamily might be described as specificity determinants i.e., features that impart specificity to a particular function. Basic Protocol 2 provides instructions for searching for sequences that might contain a similar functional site.
|
38
|
Seo JH, Park HY, Kim J, Lee BS, Kim BG. Exploring sequence space: Profile analysis and protein-ligand docking to screen ω-aminotransferases with expanded substrate specificity. Biotechnol J 2008; 3:676-86. [DOI: 10.1002/biot.200700264] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/08/2022]
|
39
|
Goyal K, Mande SC. Exploiting 3D structural templates for detection of metal-binding sites in protein structures. Proteins 2008; 70:1206-18. [PMID: 17847089 DOI: 10.1002/prot.21601] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.7] [Indexed: 11/08/2022]
Abstract
High throughput structural genomics efforts have been making the structures of proteins available even before their function has been fully characterized. Therefore, methods that exploit the structural knowledge to provide evidence about the functions of proteins would be useful. Such methods would be needed to complement the sequence-based function annotation approaches. The current study describes generation of 3D-structural motifs for metal-binding sites from the known metalloproteins. It then scans all the available protein structures in the PDB database for putative metal-binding sites. Our analysis predicted more than 1000 novel metal-binding sites in proteins using three-residue templates, and more than 150 novel metal-binding sites using four-residue templates. Prediction of metal-binding site in a yeast protein YDR533c led to the hypothesis that it might function as metal-dependent amidopeptidase. The structural motifs identified by our method present novel metal-binding sites that reveal newer mechanisms for a few well-known proteins.
Affiliation(s)
- Kshama Goyal
- Laboratory of Structural Biology, Center for DNA Fingerprinting and Diagnostics, Nacharam, Hyderabad 500076, Andhra Pradesh, India
|
40
|
Lanczycki CJ, Chakrabarti S. A tool for the prediction of functionally important sites in proteins using a library of functional templates. Bioinformation 2008; 2:279-83. [PMID: 18478080 PMCID: PMC2374371 DOI: 10.6026/97320630002279] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 02/05/2008] [Accepted: 02/11/2008] [Indexed: 11/23/2022]
Abstract
Understanding and characterizing the biochemical and evolutionary information within the wealth of protein sequence and structural data, particularly at functionally important sites, is very important. A comprehensive analysis of physico-chemical properties and evolutionary conservation patterns at the molecular and biological function level is expected to yield important clues for identifying similar sites in as-yet uncharacterized proteins. We present a library of protein functional templates (PFTs) designed to represent the compositional and evolutionary conservation patterns of functional sites at the molecular and biological function level. Subsequently we developed LIMACS (LInear MAtching of Conservation Scores), a software tool that uses the template library for the prediction of functionally important sites in a multiple sequence alignment, transferring the molecular function annotation from the most-similar functional site in the template library to a predicted site. AVAILABILITY The PFT library, the LIMACS program and source code are available for PC, Mac and Linux operating systems from ftp://ftp.ncbi.nih.gov/pub/lanczyck/limacs.
Affiliation(s)
- Christopher J Lanczycki
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Saikat Chakrabarti
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
41
|
Wu S, Liang MP, Altman RB. The SeqFEATURE library of 3D functional site models: comparison to existing methods and applications to protein function annotation. Genome Biol 2008; 9:R8. [PMID: 18197987 PMCID: PMC2395245 DOI: 10.1186/gb-2008-9-1-r8] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Received: 08/01/2007] [Revised: 11/21/2007] [Accepted: 01/16/2008] [Indexed: 11/10/2022]
Abstract
Structural genomics efforts have led to increasing numbers of novel, uncharacterized protein structures with low sequence identity to known proteins, resulting in a growing need for structure-based function recognition tools. Our method, SeqFEATURE, robustly models protein functions described by sequence motifs using a structural representation. We built a library of models that shows good performance compared to other methods. In particular, SeqFEATURE demonstrates significant improvement over other methods when sequence and structural similarity are low.
Affiliation(s)
- Shirley Wu
- Program in Biomedical Informatics, Stanford University, Stanford, CA, 94305 USA
|
42
|
A threading-based method (FINDSITE) for ligand-binding site prediction and functional annotation. Proc Natl Acad Sci U S A 2007; 105:129-34. [PMID: 18165317 DOI: 10.1073/pnas.0707684105] [Citation(s) in RCA: 240] [Impact Index Per Article: 14.1] [Indexed: 11/18/2022]
Abstract
The detection of ligand-binding sites is often the starting point for protein function identification and drug discovery. Because of inaccuracies in predicted protein structures, extant binding pocket-detection methods are limited to experimentally solved structures. Here, FINDSITE, a method for ligand-binding site prediction and functional annotation based on binding-site similarity across groups of weakly homologous template structures identified from threading, is described. For crystal structures, considering a cutoff distance of 4 Å as the hit criterion, the success rate is 70.9% for identifying the best of top five predicted ligand-binding sites with a ranking accuracy of 76.0%. Both high prediction accuracy and ability to correctly rank identified binding sites are sustained when approximate protein models (<35% sequence identity to the closest template structure) are used, showing a 67.3% success rate with 75.5% ranking accuracy. In practice, FINDSITE tolerates structural inaccuracies in protein models up to a rmsd from the crystal structure of 8-10 Å. This is because analysis of weakly homologous protein models reveals that about half have a rmsd from the native binding site <2 Å. Furthermore, the chemical properties of template-bound ligands can be used to select ligand templates associated with the binding site. In most cases, FINDSITE can accurately assign a molecular function to the protein model.
|
43
|
Abstract
Metals play a variety of roles in biological processes, and hence their presence in a protein structure can yield vital functional information. Because the residues that coordinate a metal often undergo conformational changes upon binding, detection of binding sites based on simple geometric criteria in proteins without bound metal is difficult. However, aspects of the physicochemical environment around a metal binding site are often conserved even when this structural rearrangement occurs. We have developed a Bayesian classifier using known zinc binding sites as positive training examples and nonmetal binding regions that nonetheless contain residues frequently observed in zinc sites as negative training examples. In order to allow variation in the exact positions of atoms, we average a variety of biochemical and biophysical properties in six concentric spherical shells around the site of interest. At a specificity of 99.8%, this method achieves 75.5% sensitivity in unbound proteins at a positive predictive value of 73.6%. We also test its accuracy on predicted protein structures obtained by homology modeling using templates with 30%-50% sequence identity to the target sequences. At a specificity of 99.8%, we correctly identify at least one zinc binding site in 65.5% of modeled proteins. Thus, in many cases, our model is accurate enough to identify metal binding sites in proteins of unknown structure for which no high sequence identity homologs of known structure exist. Both the source code and a Web interface are available to the public at http://feature.stanford.edu/metals.
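The shell-averaging idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `shell_averages`, the 1.25 Å shell width, and the single scalar property per atom are assumptions made only for illustration. The abstract itself specifies only that properties are averaged in six concentric spherical shells around the site.

```python
import math

def shell_averages(site, atoms, n_shells=6, shell_width=1.25):
    """Average one atomic property in concentric spherical shells around a site.

    site        -- (x, y, z) coordinates of the candidate site
    atoms       -- iterable of ((x, y, z), value) pairs; `value` is any scalar
                   property (e.g. partial charge). Hypothetical input format.
    n_shells    -- number of shells (six in the abstract)
    shell_width -- shell thickness in angstroms (assumed value, not from the source)
    """
    sums = [0.0] * n_shells
    counts = [0] * n_shells
    for coord, value in atoms:
        distance = math.dist(site, coord)
        index = int(distance // shell_width)  # which shell the atom falls in
        if index < n_shells:                  # atoms beyond the outer shell are ignored
            sums[index] += value
            counts[index] += 1
    # Empty shells are reported as 0.0 so the feature vector has a fixed length.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


atoms = [((1.0, 0.0, 0.0), 2.0), ((1.2, 0.0, 0.0), 4.0),
         ((3.0, 0.0, 0.0), 1.0), ((10.0, 0.0, 0.0), 9.0)]
features = shell_averages((0.0, 0.0, 0.0), atoms)
```

Averaging within shells, rather than recording exact atom positions, is what lets such a descriptor tolerate small rearrangements of the coordinating residues; the resulting fixed-length vector could then be passed to any classifier.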
Affiliation(s)
- Jessica C Ebert
- Department of Genetics, Stanford University, Stanford, California 94305, USA
|
44
|
Su D, Berndt C, Fomenko DE, Holmgren A, Gladyshev VN. A Conserved cis-Proline Precludes Metal Binding by the Active Site Thiolates in Members of the Thioredoxin Family of Proteins. Biochemistry 2007; 46:6903-10. [PMID: 17503777 DOI: 10.1021/bi700152b] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.2] [Indexed: 11/28/2022]
Abstract
Many thioredoxin-fold proteins possess a conserved cis-proline located in their C-terminal portions. This residue, as well as catalytic and resolving cysteines, is a key functional group in the active sites of these thiol-disulfide oxidoreductases. However, the specific function of the proline is poorly understood, and some thioredoxin-fold proteins lack this residue. Herein, we found that mutation of a cis-proline, Pro75, in human thioredoxin to serine, threonine, or alanine leads to the formation of an Fe2-S2 cluster in this protein. Further mutagenesis studies revealed that the first cysteine in the CxxC motif and a cysteine in the C-terminal region of the protein were responsible for metal binding. Replacement of Pro75 with arginine, a residue that occurs in place of Pro in peroxiredoxins, also led to the formation of the cluster in the thioredoxin. In addition, we found that mutation of the TxxC active site in a peroxiredoxin to the CxxC form could lead to coordination of an Fe2-S2 cluster in these proteins in vitro. Sco1, a distantly related thioredoxin-fold protein, has histidine in place of the cis-proline, and this residue binds copper. The Pro75His mutation led to increased copper binding by human thioredoxin when cells were grown in the presence of this trace element. Taken together, our data suggest that an important function of Pro75 in human thioredoxin, and likely other members of this superfamily, is to prevent metal binding by the reactive thiolate-based active site.
Affiliation(s)
- Dan Su
- Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, Nebraska 68588-0664, USA
|
45
|
Abstract
MOTIVATION All residues in a protein are not equally important. Some are essential for the proper structure and function of the protein, whereas others can be readily replaced. Conservation analysis is one of the most widely used methods for predicting these functionally important residues in protein sequences. RESULTS We introduce an information-theoretic approach for estimating sequence conservation based on Jensen-Shannon divergence. We also develop a general heuristic that considers the estimated conservation of sequentially neighboring sites. In large-scale testing, we demonstrate that our combined approach outperforms previous conservation-based measures in identifying functionally important residues; in particular, it is significantly better than the commonly used Shannon entropy measure. We find that considering conservation at sequential neighbors improves the performance of all methods tested. Our analysis also reveals that many existing methods that attempt to incorporate the relationships between amino acids do not lead to better identification of functionally important sites. Finally, we find that while conservation is highly predictive in identifying catalytic sites and residues near bound ligands, it is much less effective in identifying residues in protein-protein interfaces. AVAILABILITY Data sets and code for all conservation measures evaluated are available at http://compbio.cs.princeton.edu/conservation/
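A conservation score based on Jensen-Shannon divergence, as named in this abstract, can be sketched as follows. This is a hedged illustration and not the authors' released code: the uniform background distribution, the pseudocount, the function names, and the simple neighbor-averaging window in `window_smooth` are all assumptions chosen for brevity.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_distribution(column, pseudocount=1e-6):
    """Amino acid frequencies for one alignment column (gaps are skipped)."""
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return [(counts.get(a, 0) + pseudocount) / total for a in AMINO_ACIDS]

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def conservation_scores(columns, background=None):
    """Score each column by its divergence from a background distribution."""
    if background is None:
        # Assumption: uniform background; a real analysis would likely use
        # empirical amino acid frequencies instead.
        background = [1.0 / len(AMINO_ACIDS)] * len(AMINO_ACIDS)
    return [js_divergence(column_distribution(col), background) for col in columns]

def window_smooth(scores, half_width=3, weight=0.5):
    """Blend each score with the mean of its sequence neighbors (assumed heuristic)."""
    smoothed = []
    for i, score in enumerate(scores):
        lo, hi = max(0, i - half_width), min(len(scores), i + half_width + 1)
        neighbors = scores[lo:i] + scores[i + 1:hi]
        if neighbors:
            mean = sum(neighbors) / len(neighbors)
            smoothed.append(weight * score + (1 - weight) * mean)
        else:
            smoothed.append(score)
    return smoothed


scores = conservation_scores(["AAAA", "ACDE"])  # conserved column, then variable column
```

A fully conserved column diverges strongly from the background and so scores higher than a variable one; the window step reflects the abstract's observation that taking sequential neighbors into account improves performance.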
Affiliation(s)
- John A Capra
- Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA
|
46
|
Mirkovic N, Li Z, Parnassa A, Murray D. Strategies for high-throughput comparative modeling: applications to leverage analysis in structural genomics and protein family organization. Proteins 2007; 66:766-77. [PMID: 17154423 DOI: 10.1002/prot.21191] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Indexed: 11/09/2022]
Abstract
The technological breakthroughs in structural genomics were designed to facilitate the solution of a sufficient number of structures, so that as many protein sequences as possible can be structurally characterized with the aid of comparative modeling. The leverage of a solved structure is the number and quality of the models that can be produced using the structure as a template for modeling and may be viewed as the "currency" with which the success of a structural genomics endeavor can be measured. Moreover, the models obtained in this way should be valuable to all biologists. To this end, at the Northeast Structural Genomics Consortium (NESG), a modular computational pipeline for automated high-throughput leverage analysis was devised and used to assess the leverage of the 186 unique NESG structures solved during the first phase of the Protein Structure Initiative (January 2000 to July 2005). Here, the results of this analysis are presented. The number of sequences in the nonredundant protein sequence database covered by quality models produced by the pipeline is approximately 39,000, so that the average leverage is approximately 210 models per structure. Interestingly, only 7900 of these models fulfill the stringent modeling criterion of being at least 30% sequence-identical to the corresponding NESG structures. This study shows how high-throughput modeling increases the efficiency of structure determination efforts by providing enhanced coverage of protein structure space. In addition, the approach is useful in refining the boundaries of structural domains within larger protein sequences, subclassifying sequence diverse protein families, and defining structure-based strategies specific to a particular family.
Affiliation(s)
- Nebojsa Mirkovic
- Department of Microbiology and Immunology, Weill Medical College of Cornell University, New York, New York 10021, USA
|
47
|
Selective prediction of interaction sites in protein structures with THEMATICS. BMC Bioinformatics 2007; 8:119. [PMID: 17419878 PMCID: PMC1877815 DOI: 10.1186/1471-2105-8-119] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/01/2006] [Accepted: 04/09/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Methods are now available for the prediction of interaction sites in protein 3D structures. While many of these methods report high success rates for site prediction, often these predictions are not very selective and have low precision. Precision in site prediction is addressed using Theoretical Microscopic Titration Curves (THEMATICS), a simple computational method for the identification of active sites in enzymes. Recall and precision are measured and compared with other methods for the prediction of catalytic sites. RESULTS: Using a test set of 169 enzymes from the original Catalytic Residue Dataset (CatRes) it is shown that THEMATICS can deliver precise, localised site predictions. Furthermore, adjustment of the cut-off criteria can improve the recall rates for catalytic residues with only a small sacrifice in precision. Recall rates for CatRes/CSA annotated catalytic residues are 41.1%, 50.4%, and 54.2% for Z score cut-off values of 1.00, 0.99, and 0.98, respectively. The corresponding precision rates are 19.4%, 17.9%, and 16.4%. The success rate for catalytic sites is higher, with correct or partially correct predictions for 77.5%, 85.8%, and 88.2% of the enzymes in the test set, corresponding to the same respective Z score cut-offs, if only the CatRes annotations are used as the reference set. Incorporation of additional literature annotations into the reference set gives total success rates of 89.9%, 92.9%, and 94.1%, again for corresponding cut-off values of 1.00, 0.99, and 0.98. False positive rates for a 75-protein test set are 1.95%, 2.60%, and 3.12% for Z score cut-offs of 1.00, 0.99, and 0.98, respectively. CONCLUSION: With a preferred cut-off value of 0.99, THEMATICS achieves a high success rate of interaction site prediction, about 86% correct or partially correct using CatRes/CSA annotations only and about 93% with an expanded reference set. Success rates for catalytic residue prediction are similar to those of other structure-based methods, but with substantially better precision and lower false positive rates. THEMATICS performs well across the spectrum of E.C. classes. The method requires only the structure of the query protein as input. THEMATICS predictions may be obtained via the web from structures in PDB format at: http://pfweb.chem.neu.edu/thematics/submit.html.
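The recall and precision figures in this abstract are residue-level metrics. As a minimal sketch of the standard definitions, the Python snippet below compares a set of predicted residues with a set of annotated catalytic residues. The residue labels are invented for illustration, and the abstract does not say whether the authors average per protein or pool residues across the test set, so this should not be read as their evaluation protocol.

```python
def precision_recall(predicted, annotated):
    """Return (precision, recall) for predicted vs. annotated residue sets."""
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    # Precision: fraction of predicted residues that are annotated.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of annotated residues that were predicted.
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall


# Hypothetical example: three annotated catalytic residues, five predictions.
annotated = {"HIS57", "ASP102", "SER195"}
predicted = {"HIS57", "ASP102", "LYS60", "GLU70", "ARG145"}

precision, recall = precision_recall(predicted, annotated)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Loosening a score cut-off enlarges the predicted set, which tends to raise recall and lower precision. That is the trade-off the abstract reports across the Z score cut-offs of 1.00, 0.99, and 0.98.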
|
48
|
Pettit FK, Bare E, Tsai A, Bowie JU. HotPatch: a statistical approach to finding biologically relevant features on protein surfaces. J Mol Biol 2007; 369:863-79. [PMID: 17451744 PMCID: PMC2034327 DOI: 10.1016/j.jmb.2007.03.036] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.2] [Received: 12/29/2006] [Revised: 03/10/2007] [Accepted: 03/15/2007] [Indexed: 10/23/2022]
Abstract
We describe a fully automated algorithm for finding functional sites on protein structures. Our method finds surface patches of unusual physicochemical properties on protein structures, and estimates the patches' probability of overlapping functional sites. Other methods for predicting the locations of specific types of functional sites exist, but in previous analyses, it has been difficult to compare methods when they are applied to different types of sites. Thus, we introduce a new statistical framework that enables rigorous comparisons of the usefulness of different physicochemical properties for predicting virtually any kind of functional site. The program's statistical models were trained for 11 individual properties (electrostatics, concavity, hydrophobicity, etc.) and for 15 neural network combination properties, all optimized and tested on 15 diverse protein functions. To simulate what to expect if the program were run on proteins of unknown function, as might arise from structural genomics, we tested it on 618 proteins of diverse mixed functions. In the higher-scoring top half of all predictions, a functional residue could typically be found within the first 1.7 residues chosen at random. The program may or may not use partial information about the protein's function type as an input, depending on which statistical model the user chooses to employ. If function type is used as an additional constraint, prediction accuracy usually increases, and is particularly good for enzymes, DNA-interacting sites, and oligomeric interfaces. The program can be accessed online (at http://hotpatch.mbi.ucla.edu).
Affiliation(s)
- Frank K. Pettit
- UCLA-DOE Institute for Genomics and Proteomics, Molecular Biology Institute, UCLA, Los Angeles, CA.
- Emiko Bare
- Department of Biology, Massachusetts Institute of Technology, Cambridge, MA.
- Albert Tsai
- Department of Biochemistry & Molecular Biology, Keck School of Medicine, University of Southern California, Los Angeles, CA.
- James U. Bowie
- Department of Chemistry and Biochemistry, UCLA, Los Angeles, CA.
|
49
|
Huff RG, Bayram E, Tan H, Knutson ST, Knaggs MH, Richon AB, Santago P, Fetrow JS. Chemical and structural diversity in cyclooxygenase protein active sites. Chem Biodivers 2007; 2:1533-52. [PMID: 17191953 DOI: 10.1002/cbdv.200590125] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.0] [Indexed: 11/10/2022]
Abstract
A major pharmaceutical problem is designing diverse and selective lead compounds. The human genome sequence provides opportunities to discover compounds that are protein selective if we can develop methods to identify specificity determinants from sequence alone. We have analyzed sequence and structural diversity of sheep COX-1 and mouse COX-2 proteins by Active Site Profiling (ASP). Eleven residues that should serve as specificity determinants between COX-1 and COX-2 were identified; however, the literature suggests that only one has been utilized in structure-based discovery. ASP was used to create a position-specific scoring matrix, which was used to identify possible cross-reacting proteins from the human sequences. This method proved selective for cyclooxygenases, comparing well with results using BLAST. The methods identify a probable misannotation of a cyclooxygenase in which there are high sequence similarity scores using BLAST, but ASP shows it does not contain the residues necessary for cyclooxygenase function. ASP analysis of human COX proteins suggests that some specificity determinants that distinguish COX-1 and COX-2 proteins are similar between sheep COX-1/mouse COX-2 and human COX-1/COX-2; however, residue identities at those positions are not necessarily conserved. Our results lay the groundwork for development of family-specific pattern recognition methods to selectively match compounds with proteins.
Affiliation(s)
- Ryan G Huff
- Department of Computer Science, Wake Forest University, Winston-Salem, NC, USA
|
50
|
Abstract
The rapidly increasing volume of sequence and structure information available for proteins poses the daunting task of determining their functional importance. Computational methods can prove to be very useful in understanding and characterizing the biochemical and evolutionary information contained in this wealth of data, particularly at functionally important sites. Therefore, we perform a detailed survey of compositional and evolutionary constraints at the molecular and biological function level for a large set of known functionally important sites extracted from a wide range of protein families. We compare the degree of conservation across different functional categories and provide detailed statistical insight to decipher the varying evolutionary constraints at functionally important sites. The compositional and evolutionary information at functionally important sites has been compiled into a library of functional templates. We developed a module that predicts functionally important columns (FIC) of an alignment based on the detection of a significant "template match score" to a library template. Our template match score measures an alignment column's similarity to a library template and combines a term explicitly representing a column's residue composition with various evolutionary conservation scores (information content and position-specific scoring matrix-derived statistics). Our benchmarking studies show good sensitivity/specificity for the prediction of functional sites and high accuracy in attributing correct molecular function type to the predicted sites. This prediction method is based on information derived from homologous sequences and no structural information is required. Therefore, this method could be extremely useful for large-scale functional annotation.
Collapse
Affiliation(s)
- Saikat Chakrabarti
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
|