1
Abstract
An overwhelming array of structural variants has evolved from a comparatively small number of protein structural domains, which has in turn facilitated an expanse of functional derivatives. Herein, I review the primary mechanisms that have contributed to the vastness of our existing, and expanding, protein repertoires. Protein function prediction strategies, both sequence and structure based, are also discussed and their associated strengths and weaknesses assessed.
Affiliation(s)
- Roy D Sleator
- Department of Biological Sciences, Cork Institute of Technology, Cork, Ireland.
2
Abstract
The recent explosion in the number and diversity of novel proteins identified by the large-scale "omics" technologies poses new and important questions to the blossoming field of systems biology: what are all these proteins, how did they come about, and most importantly, what do they do? From a comparatively small number of protein structural domains a staggering array of structural variants has evolved, which has in turn facilitated an expanse of functional derivatives. This review considers the primary mechanisms that have contributed to the vastness of our existing, and expanding, protein repertoires, while also outlining the protocols available for elucidating their true biological function. The various function prediction programs available, both sequence and structure based, are discussed and their associated strengths and weaknesses outlined.
Affiliation(s)
- Roy D Sleator
- Department of Biological Sciences, Cork Institute of Technology, Bishopstown, Cork, Ireland.
3
Wohlkönig A, Huet J, Looze Y, Wintjens R. Structural relationships in the lysozyme superfamily: significant evidence for glycoside hydrolase signature motifs. PLoS One 2010; 5:e15388. [PMID: 21085702 PMCID: PMC2976769 DOI: 10.1371/journal.pone.0015388] [Citation(s) in RCA: 79] [Impact Index Per Article: 5.6] [Received: 08/06/2010] [Accepted: 08/31/2010] [Indexed: 11/19/2022]
Abstract
BACKGROUND: Chitin is a polysaccharide that forms the hard, outer shell of arthropods and the cell walls of fungi and some algae. Peptidoglycan is a polymer of sugars and amino acids constituting the cell walls of most bacteria. Enzymes that are able to hydrolyze these polymers generally play important roles in protecting plants and animals against insects and pathogens. A particular group of such glycoside hydrolase enzymes share some common features in their three-dimensional structure and in their molecular mechanism, forming the lysozyme superfamily.
RESULTS: Besides having a similar fold, all known catalytic domains of glycoside hydrolase proteins of the lysozyme superfamily (families and subfamilies GH19, GH22, GH23, GH24 and GH46) share two structural elements: the central helix of the all-α domain, which invariably contains the catalytic glutamate residue acting as general-acid catalyst, and a β-hairpin pointed towards the substrate binding cleft. The invariant β-hairpin structure displays the highest amino acid conservation in aligned sequences of a given family, which allows signature motifs to be defined for each GH family. Most of these signature motifs show promising performance in sequence database searches. Our structural analysis further indicates that the GH motifs participate in enzymatic catalysis essentially by containing the residue that positions the catalytic water in the inverting mechanism.
CONCLUSIONS: The seven families and subfamilies of the lysozyme superfamily all have in common a β-hairpin structure which displays a family-specific sequence motif. These GH β-hairpin motifs contain potentially important residues for the catalytic activity, thereby suggesting the participation of the GH motif in catalysis and also revealing a common catalytic scheme used by enzymes of the lysozyme superfamily.
Affiliation(s)
- Alexandre Wohlkönig
- Structural Biology Brussels and Molecular and Cellular Interactions, VIB, Brussels, Belgium
- Joëlle Huet
- Laboratoire de Chimie Générale, Institut de Pharmacie, Université Libre de Bruxelles, Brussels, Belgium
- Yvan Looze
- Laboratoire de Chimie Générale, Institut de Pharmacie, Université Libre de Bruxelles, Brussels, Belgium
- René Wintjens
- Laboratoire de Chimie Générale, Institut de Pharmacie, Université Libre de Bruxelles, Brussels, Belgium
- Interdisciplinary Research Institute, USR 3078 CNRS, Villeneuve d'Ascq, France
4
Min H, Yu S, Lee T, Yoon S. Support vector machine based classification of 3-dimensional protein physicochemical environments for automated function annotation. Arch Pharm Res 2010; 33:1451-9. [PMID: 20945145 DOI: 10.1007/s12272-010-0920-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 07/16/2010] [Revised: 08/10/2010] [Accepted: 08/15/2010] [Indexed: 10/19/2022]
Abstract
Knowledge of protein functions as well as structures is critical for drug discovery and development. The FEATURE system developed at Stanford is an effective tool for characterizing and classifying local environments in proteins. FEATURE uses vectors of a fixed dimension to represent the physicochemical properties around a residue. Functional sites and non-sites are identified by classifying such vectors using the Naïve Bayes classifier. In this paper, we improve the FEATURE framework in several ways to make it more flexible, robust and accurate. The new tool can handle vectors of a user-specified dimension and can suppress noise effectively, with little loss of important signals, by employing dimensionality reduction. Furthermore, our approach uses the support vector machine for more accurate classification. According to the results of our thorough experiments, the proposed new approach outperformed the original tool by 20.13% and 13.42% with respect to true and false positive rates, respectively.
Affiliation(s)
- Hyeyoung Min
- College of Pharmacy, Chung-Ang University, Seoul, 156-756, Korea
5
An overview of in silico protein function prediction. Arch Microbiol 2010; 192:151-5. [DOI: 10.1007/s00203-010-0549-9] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.4] [Received: 12/02/2009] [Revised: 01/08/2010] [Accepted: 01/10/2010] [Indexed: 12/12/2022]
6
Hvidsten TR, Lægreid A, Kryshtafovych A, Andersson G, Fidelis K, Komorowski J. A comprehensive analysis of the structure-function relationship in proteins based on local structure similarity. PLoS One 2009; 4:e6266. [PMID: 19603073 PMCID: PMC2705683 DOI: 10.1371/journal.pone.0006266] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.1] [Received: 10/07/2008] [Accepted: 06/10/2009] [Indexed: 12/22/2022]
Abstract
Background: Sequence similarity to characterized proteins provides testable functional hypotheses for less than 50% of the proteins identified by genome sequencing projects. With structural genomics it is believed that structural similarities may give functional hypotheses for many of the remaining proteins.
Methodology/Principal Findings: We provide a systematic analysis of the structure-function relationship in proteins using the novel concept of local descriptors of protein structure. A local descriptor is a small substructure of a protein which includes both short- and long-range interactions. We employ a library of commonly reoccurring local descriptors general enough to assemble most existing protein structures. We then model the relationship between these local shapes and Gene Ontology using rule-based learning. Our IF-THEN rule model offers legible, high resolution descriptions that combine local substructures and is able to discriminate functions even for functionally versatile folds such as the frequently occurring TIM barrel and Rossmann fold. By evaluating the predictive performance of the model, we provide a comprehensive quantification of the structure-function relationship based only on local structure similarity. Our findings are, among others, that conserved structure is a stronger prerequisite for enzymatic activity than for binding specificity, and that structure-based predictions complement sequence-based predictions. The model is capable of generating correct hypotheses, as confirmed by a literature study, even when no significant sequence similarity to characterized proteins exists.
Conclusions/Significance: Our approach offers a new and complete description and quantification of the structure-function relationship in proteins. By demonstrating how our predictions offer higher sensitivity than using global structure, and complement the use of sequence, we show that the presented ideas could advance the development of meta-servers in function prediction.
Affiliation(s)
- Torgeir R. Hvidsten
- The Linnaeus Centre for Bioinformatics, Uppsala University and The Swedish University for Agricultural Sciences, Uppsala, Sweden
- Umeå Plant Science Centre, Department of Plant Physiology, Umeå University, Umeå, Sweden
- Astrid Lægreid
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, St. Olavs Hospital HF, Trondheim, Norway
- Gunnar Andersson
- The Linnaeus Centre for Bioinformatics, Uppsala University and The Swedish University for Agricultural Sciences, Uppsala, Sweden
- Department of Chemistry, Environment and Feed Hygiene, National Veterinary Institute, Uppsala, Sweden
- Jan Komorowski
- The Linnaeus Centre for Bioinformatics, Uppsala University and The Swedish University for Agricultural Sciences, Uppsala, Sweden
- Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Warszawa, Poland
7
Qiu JD, Luo SH, Huang JH, Liang RP. Using support vector machines to distinguish enzymes: approached by incorporating wavelet transform. J Theor Biol 2008; 256:625-31. [PMID: 19049810 DOI: 10.1016/j.jtbi.2008.10.026] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 07/09/2008] [Revised: 09/26/2008] [Accepted: 10/20/2008] [Indexed: 10/21/2022]
Abstract
The enzymatic attributes of newly found protein sequences are usually determined either by biochemical analysis of eukaryotic and prokaryotic genomes or by microarray chips. These experimental methods are both time-consuming and costly. With the explosion of protein sequences registered in the databanks, it is highly desirable to develop an automated method to identify whether a given new sequence is an enzyme or a non-enzyme. The discrete wavelet transform (DWT) and support vector machine (SVM) have been used in this study for distinguishing enzyme structures from non-enzymes. The models have been trained and tested on two datasets of proteins with different wavelet basis functions, decomposition scales and hydrophobicity data types. Maximum accuracy has been obtained using SVM with a wavelet function of Bior2.4, a decomposition scale j=5, and Kyte-Doolittle hydrophobicity scales. The results obtained by the self-consistency test, jackknife test and independent dataset test are encouraging, which indicates that the proposed method can be employed as a useful assistant technique for distinguishing enzymes from non-enzymes.
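The feature-extraction idea in this abstract (turn a sequence into a Kyte-Doolittle hydrophobicity signal, then decompose it with a discrete wavelet transform) can be illustrated with a minimal sketch. This is not the authors' pipeline: it substitutes the simple Haar wavelet for Bior2.4 to stay dependency-free, uses 3 levels rather than j=5, and the function names and the choice of summary statistics are assumptions made for illustration. The resulting fixed-length vector is the kind of input that could then be passed to an SVM.

```python
import math

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_profile(sequence):
    """Map a protein sequence to a numeric Kyte-Doolittle signal.

    Raises KeyError for non-standard residue codes.
    """
    return [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]


def haar_step(signal):
    """One Haar DWT level: return (approximation, detail) coefficients."""
    if len(signal) % 2:  # pad odd-length input by repeating the last value
        signal = signal + [signal[-1]]
    scale = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail


def wavelet_features(sequence, levels=3):
    """Hypothetical fixed-length features: mean |detail| per level,
    followed by the mean of the final approximation coefficients."""
    approx = hydrophobicity_profile(sequence)
    features = []
    for _ in range(levels):
        if len(approx) < 2:  # signal too short to decompose further
            features.append(0.0)
            continue
        approx, detail = haar_step(approx)
        features.append(sum(abs(d) for d in detail) / len(detail))
    features.append(sum(approx) / len(approx))
    return features


if __name__ == "__main__":
    print(wavelet_features("MKTAYIAKQRQISFVKSHFSRQ"))
```

Summarising each level with a single statistic keeps the vector length independent of sequence length, which is one common way to feed variable-length sequences to a classifier.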
Affiliation(s)
- Jian-Ding Qiu
- Department of Chemistry, Nanchang University, Nanchang 330031, PR China.
8
Xu JR, Zhang JX, Han BC, Liang L, Ji ZL. CytoSVM: an advanced server for identification of cytokine-receptor interactions. Nucleic Acids Res 2007; 35:W538-42. [PMID: 17526528 PMCID: PMC1933174 DOI: 10.1093/nar/gkm254] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]
Abstract
The interactions between cytokines and their complementary receptors are the gateways to properly understanding a large variety of cytokine-specific cellular activities such as immunological responses and cell differentiation. To discover novel cytokine-receptor interactions, an advanced support vector machine (SVM) model, CytoSVM, was constructed in this study. This model was iteratively trained using 449 mammalian (excluding rat) cytokine-receptor interactions and about 1 million virtually generated positive and negative vectors in an enriched way. Final independent evaluation on rat data yielded a sensitivity of 97.4%, a specificity of 99.2% and a Matthews correlation coefficient (MCC) of 0.89. This performance is better than that of normal SVM-based models. Based on this well-optimized model, a web-based server was created to accept a primary protein sequence and present its probabilities of interacting with one or several cytokines. Moreover, this model was applied to identify putative cytokine-receptor pairs in the whole genomes of human and mouse. Excluding currently known cytokine-receptor interactions, a total of 1609 novel cytokine-receptor pairs were discovered in the human genome with probability ∼80% after further transmembrane analysis. These cover 220 novel receptors (excluding their isoforms) for 126 human cytokines. The screening results have been deposited in a database. Both the server and the database can be freely accessed at http://bioinf.xmu.edu.cn/software/cytosvm/cytosvm.php.
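For readers unfamiliar with the three evaluation measures quoted above, the following minimal sketch shows how sensitivity, specificity and the Matthews correlation coefficient are computed from a binary confusion matrix. The function name and the counts in the usage example are hypothetical; they are not the CytoSVM evaluation data.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts.

    Each measure falls back to 0.0 when its denominator is zero.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


if __name__ == "__main__":
    # Made-up counts for illustration only.
    sens, spec, mcc = binary_metrics(tp=8, fp=10, tn=90, fn=2)
    print(f"sensitivity={sens:.3f} specificity={spec:.3f} MCC={mcc:.3f}")
```

Unlike sensitivity and specificity, the MCC uses all four cells of the confusion matrix, so it remains informative when positives and negatives are heavily imbalanced, as they typically are when screening whole genomes for interaction partners.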
Affiliation(s)
- Jin-Rui Xu
- Key Laboratory for Cell Biology & Tumor Cell Engineering, the Ministry of Education of China, School of Life Sciences and The Key Laboratory for Chemical Biology of Fujian Province, Xiamen University, Xiamen 361005, FuJian Province, P R China
- Jing-Xian Zhang
- Key Laboratory for Cell Biology & Tumor Cell Engineering, the Ministry of Education of China, School of Life Sciences and The Key Laboratory for Chemical Biology of Fujian Province, Xiamen University, Xiamen 361005, FuJian Province, P R China
- Bu-Cong Han
- Key Laboratory for Cell Biology & Tumor Cell Engineering, the Ministry of Education of China, School of Life Sciences and The Key Laboratory for Chemical Biology of Fujian Province, Xiamen University, Xiamen 361005, FuJian Province, P R China
- Liang Liang
- Key Laboratory for Cell Biology & Tumor Cell Engineering, the Ministry of Education of China, School of Life Sciences and The Key Laboratory for Chemical Biology of Fujian Province, Xiamen University, Xiamen 361005, FuJian Province, P R China
- Zhi-Liang Ji
- Key Laboratory for Cell Biology & Tumor Cell Engineering, the Ministry of Education of China, School of Life Sciences and The Key Laboratory for Chemical Biology of Fujian Province, Xiamen University, Xiamen 361005, FuJian Province, P R China
- *To whom correspondence should be addressed. 86-0592-2182897; 86-0592-2181015
9
Yoon S, Ebert JC, Chung EY, De Micheli G, Altman RB. Clustering protein environments for function prediction: finding PROSITE motifs in 3D. BMC Bioinformatics 2007; 8 Suppl 4:S10. [PMID: 17570144 PMCID: PMC1892080 DOI: 10.1186/1471-2105-8-s4-s10] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Indexed: 11/24/2022]
Abstract
BACKGROUND: Structural genomics initiatives are producing increasing numbers of three-dimensional (3D) structures for which there is little functional information. Structure-based annotation of molecular function is therefore becoming critical. We previously presented FEATURE, a method for describing microenvironments around functional sites in proteins. However, FEATURE uses supervised machine learning and so is limited to building models for sites of known importance and location. We hypothesized that there are a large number of sites in proteins that are associated with function that have not yet been recognized. Toward that end, we have developed a method for clustering protein microenvironments in order to evaluate the potential for discovering novel sites that have not been previously identified.
RESULTS: We have prototyped a computational method for rapid clustering of millions of microenvironments in order to discover residues whose surrounding environments are similar and which may therefore share a functional or structural role. We clustered nearly 2,000,000 environments from 9,600 protein chains and defined 4,550 clusters. As a preliminary validation, we asked whether known 3D environments associated with PROSITE motifs were "rediscovered". We found examples of clusters highly enriched for residues that share PROSITE sequence motifs.
CONCLUSION: Our results demonstrate that we can cluster protein environments successfully using a simplified representation and the K-means clustering algorithm. The rediscovery of known 3D motifs allows us to calibrate the size and intercluster distances that characterize useful clusters. This information will then allow us to find new clusters with similar characteristics that represent novel structural or functional sites.
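The clustering step named in this abstract can be sketched in a few lines. The following is a generic, plain-Python K-means (Lloyd's algorithm) over fixed-length feature vectors, offered only as an illustration of the technique: the vector contents, the function names, the seeded random initialisation and the stopping rule are assumptions, and nothing here reflects the authors' simplified microenvironment representation or their large-scale implementation.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=50, seed=0):
    """Cluster vectors into k groups; return (assignments, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = [None] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:  # converged
            break
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


if __name__ == "__main__":
    # Toy 2-D "environments"; real descriptors would have many more dimensions.
    toy = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
           [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]]
    labels, centres = kmeans(toy, k=2)
    print(labels, centres)
```

Once clusters exist, a validation in the spirit of the abstract would check whether residues annotated with a known motif are over-represented in particular clusters.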
Affiliation(s)
- Sungroh Yoon
- Computer Systems Laboratory, Stanford University, Stanford, CA 94305, USA
- Intel Corporation, 2200 Mission College Blvd., Santa Clara, CA 95054, USA
- Jessica C Ebert
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Eui-Young Chung
- School of Electrical and Electronic Engineering, Yonsei University, Seoul 120-749, Republic of Korea
- Giovanni De Micheli
- Integrated Systems Center, Swiss Federal Institute of Technology (EPFL), Lausanne, CH-1015, Switzerland
- Russ B Altman
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
10
Espadaler J, Querol E, Aviles FX, Oliva B. Identification of function-associated loop motifs and application to protein function prediction. Bioinformatics 2006; 22:2237-43. [PMID: 16870939 DOI: 10.1093/bioinformatics/btl382] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.6] [Indexed: 11/14/2022]
Abstract
MOTIVATION: The detection of function-related local 3D-motifs in protein structures can provide insights into protein function in the absence of sequence or fold similarity. Protein loops are known to play important roles in protein function and several loop classifications have been described, but the automated identification of putative functional 3D-motifs in such classifications has not yet been addressed. Such identification can be applied to sequence annotation.
RESULTS: We evaluated three different scoring methods for their ability to identify known motifs from the PROSITE database in ArchDB. More than 500 new putative function-related motifs not reported in PROSITE were identified. Sequence patterns derived from these motifs were especially useful at predicting precise annotations. The number of reliable sequence annotations could be increased by up to 100% with respect to standard BLAST.
CONTACT: boliva@imim.es
SUPPLEMENTARY INFORMATION: Supplementary Data are available at Bioinformatics online.
Affiliation(s)
- Jordi Espadaler
- Group de Bioinformàtica Estructural (GRIB-IMIM), Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra 08003 Barcelona, Catalonia, Spain
11
Wang K, Samudrala R. Automated functional classification of experimental and predicted protein structures. BMC Bioinformatics 2006; 7:278. [PMID: 16749925 PMCID: PMC1513613 DOI: 10.1186/1471-2105-7-278] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 03/16/2006] [Accepted: 06/02/2006] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Proteins that are similar in sequence or structure may perform different functions in nature. In such cases, function cannot be inferred from sequence or structural similarity.
RESULTS: We analyzed experimental structures belonging to the Structural Classification of Proteins (SCOP) database and showed that about half of them belong to multi-functional fold families for which protein similarity alone is not adequate to assign function. We also analyzed predicted structures from the LiveBench and the PDB-CAFASP experiments and showed that accurate homology-based functional assignments cannot be achieved approximately one third of the time, when the protein is a member of a multi-functional fold family. We then conducted extended performance evaluation and comparisons on both experimental and predicted structures using our Functional Signatures from Structural Alignments (FSSA) algorithm that we previously developed to handle the problem of classifying proteins belonging to multi-functional fold families.
CONCLUSION: The results indicate that the FSSA algorithm has better accuracy when compared to homology-based approaches for functional classification of both experimental and predicted protein structures, in part due to its use of local, as opposed to global, information for classifying function. The FSSA algorithm has also been implemented as a webserver and is available at http://protinfo.compbio.washington.edu/fssa.
Affiliation(s)
- Kai Wang
- Computational Genomics Group, Department of Microbiology, University of Washington, Seattle, WA 98195, USA
- Ram Samudrala
- Computational Genomics Group, Department of Microbiology, University of Washington, Seattle, WA 98195, USA
12
Abstract
MOTIVATION: Current projects for the massive characterization of proteomes are generating protein sequences and structures with unknown function. The difficulty of experimentally determining functionally important sites calls for the development of computational methods. The first techniques, based on the search for fully conserved positions in multiple sequence alignments (MSAs), were followed by methods for locating family-dependent conserved positions. These rely on the functional classification implicit in the alignment for locating the positions related to functional specificity. The next obvious step, still scarcely explored, is to detect these positions using a functional classification different from the one implicit in the sequence relationships between the proteins. Here, we present two new methods for locating functional positions which can incorporate an arbitrary external functional classification which may or may not coincide with the one implicit in the MSA. The Xdet method is able to use a functional classification with an associated hierarchy or similarity between functions to locate positions related to that classification. The MCdet method uses multivariate statistical analysis to locate positions responsible for each one of the functions within a multifunctional family.
RESULTS: We applied the methods to different cases, illustrating scenarios where there is a disagreement between the functional and the phylogenetic relationships, and demonstrated their usefulness for the phylogeny-independent prediction of functional positions.
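The two earlier generations of methods that this abstract builds on can be illustrated with a minimal sketch: finding fully conserved MSA columns, and finding columns that are conserved within each class of an externally supplied functional classification but differ between classes. This is a deliberately naive illustration and is not the Xdet or MCdet method; the function names, the strict "one residue per class" criterion and the toy alignment are assumptions.

```python
def fully_conserved_positions(alignment):
    """Column indices where every sequence carries the same non-gap residue."""
    return [
        i for i, column in enumerate(zip(*alignment))
        if len(set(column)) == 1 and column[0] != "-"
    ]


def class_specific_positions(alignment, labels):
    """Column indices conserved inside each functional class, with a
    different residue in every class (labels are an external classification)."""
    groups = {}
    for sequence, label in zip(alignment, labels):
        groups.setdefault(label, []).append(sequence)

    positions = []
    for i in range(len(alignment[0])):
        residues = []
        for members in groups.values():
            column = {seq[i] for seq in members}
            if len(column) != 1 or "-" in column:
                break  # not conserved within this class
            residues.append(column.pop())
        else:
            # Conserved in every class; require a distinct residue per class.
            if len(residues) > 1 and len(set(residues)) == len(residues):
                positions.append(i)
    return positions


if __name__ == "__main__":
    # Toy alignment of equal-length sequences with hypothetical class labels.
    msa = ["ACDK", "ACDR", "AGEK", "AGER"]
    classes = ["x", "x", "y", "y"]
    print(fully_conserved_positions(msa))
    print(class_specific_positions(msa, classes))
```

Because the class labels are passed in separately from the alignment, the same code works whether or not the functional grouping matches the grouping implied by sequence similarity, which is the situation the abstract highlights.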
Affiliation(s)
- Florencio Pazos
- Protein Design Group, National Centre for Biotechnology (CNB-CSIC) C/Darwin, 3. Campus U. Autónoma, 28049 Cantoblanco, Madrid, Spain.
13
Ofran Y, Punta M, Schneider R, Rost B. Beyond annotation transfer by homology: novel protein-function prediction methods to assist drug discovery. Drug Discov Today 2006; 10:1475-82. [PMID: 16243268 DOI: 10.1016/s1359-6446(05)03621-4] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.5] [Indexed: 11/22/2022]
Abstract
Every entirely sequenced genome reveals hundreds to thousands of protein sequences for which the only annotation available is 'hypothetical protein'. Thus, in the human genome and in the genomes of pathogenic agents there could be thousands of potential, unexplored drug targets. Computational prediction of protein function can play a role in studying these targets. We review the challenges, research approaches and recently developed tools in the field of computational function prediction, and discuss the ways these issues can change the process of drug discovery.
Affiliation(s)
- Yanay Ofran
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA.
14
Binkowski TA, Joachimiak A, Liang J. Protein surface analysis for function annotation in high-throughput structural genomics pipeline. Protein Sci 2006; 14:2972-81. [PMID: 16322579 PMCID: PMC2253251 DOI: 10.1110/ps.051759005] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.4] [Indexed: 10/25/2022]
Abstract
Structural genomics (SG) initiatives are expanding the universe of protein fold space by rapidly determining structures of proteins that were intentionally selected on the basis of low sequence similarity to proteins of known structure. Often these proteins have no associated biochemical or cellular functions. The SG success has resulted in an accelerated deposition of novel structures. In some cases the structural bioinformatics analysis applied to these novel structures has provided specific functional assignment. However, this approach has also uncovered limitations in the functional analysis of uncharacterized proteins using traditional sequence and backbone structure methodologies. A novel method, named pvSOAR (pocket and void Surface of Amino Acid Residues), of comparing the protein surfaces of geometrically defined pockets and voids was developed. pvSOAR was able to detect previously unrecognized and novel functional relationships between surface features of proteins. In this study, pvSOAR is applied to several structural genomics proteins. We examined the surfaces of YecM, BioH, and RpiB from Escherichia coli as well as the CBS domains from inosine-5'-monophosphate dehydrogenase from Streptococcus pyogenes, conserved hypothetical protein Ta549 from Thermoplasma acidophilum, and CBS domain protein mt1622 from Methanobacterium thermoautotrophicum with the goal of inferring information about their biochemical function.
Affiliation(s)
- T Andrew Binkowski
- Department of Bioengineering, The University of Illinois, 851 South Morgan St., Room 218, Chicago, IL 60607, USA.
15
Cui J, Han LY, Cai CZ, Zheng CJ, Ji ZL, Chen YZ. Prediction of functional class of novel bacterial proteins without the use of sequence similarity by a statistical learning method. J Mol Microbiol Biotechnol 2006; 9:86-100. [PMID: 16319498 DOI: 10.1159/000088839] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
Abstract
A substantial percentage of the putative protein-encoding open reading frames (ORFs) in bacterial genomes have no homolog of known function, and their function cannot be confidently assigned on the basis of sequence similarity. Methods not based on sequence similarity are needed and are being developed. One method, SVMProt (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi), predicts protein functional family irrespective of sequence similarity (Nucleic Acids Res. 2003;31:3692-3697). While it has been tested on a large number of proteins, its capability for non-homologous proteins has so far been evaluated for a relatively small number of proteins, and additional tests are needed to more fully assess SVMProt. In this work, 90 novel bacterial proteins (non-homologous to known proteins) are used to evaluate the capability of SVMProt. These proteins are such that none of their homologs are in the Swiss-Prot database, their functions are not clearly described in the literature, and they themselves and their homologs are not included in the training sets of SVMProt. They represent proteins whose function cannot be confidently predicted by sequence similarity methods at present. The predicted functional class of 76.7% of these proteins shows various levels of consistency with the literature-described function, compared to the overall accuracy of 87% for the SVMProt functional class assignment of 34,582 proteins that have at least one homolog of known function. Our study suggests that SVMProt is capable of assigning functional class for novel bacterial proteins at a level not much lower than that of sequence alignment methods for homologous proteins.
Affiliation(s)
- J Cui
- Bioinformatics and Drug Design Group, Department of Computational Science, National University of Singapore, Singapore
16
Han LY, Zheng CJ, Lin HH, Cui J, Li H, Zhang HL, Tang ZQ, Chen YZ. Prediction of functional class of novel plant proteins by a statistical learning method. New Phytol 2005; 168:109-21. [PMID: 16159326 DOI: 10.1111/j.1469-8137.2005.01482.x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 05/04/2023]
Abstract
In plant genomes, the function of a substantial percentage of the putative protein-coding open reading frames (ORFs) is unknown. These ORFs have no significant sequence similarity to known proteins, which complicates the task of functional study of these proteins. Efforts are being made to explore methods that are complementary to, or may be used in combination with, sequence alignment and clustering methods. A web-based protein functional class prediction software, SVMProt, has shown some capability for predicting functional class of distantly related proteins. Here the usefulness of SVMProt for functional study of novel plant proteins is evaluated. To test SVMProt, 49 plant proteins (without a sequence homolog in the Swiss-Prot protein database, not in the SVMProt training set, and with functional indications provided in the literature) were selected from a comprehensive search of MEDLINE abstracts and Swiss-Prot databases in 1999-2004. These represent unique proteins the function of which, at present, cannot be confidently predicted by sequence alignment and clustering methods. The predicted functional class of 31 proteins was consistent, and that of four other proteins was weakly consistent, with published functions. Overall, the functional class of 71.4% of these proteins was consistent, or weakly consistent, with functional indications described in the literature. SVMProt shows a certain level of ability to provide useful hints about the functions of novel plant proteins with no similarity to known proteins.
Affiliation(s)
- L Y Han
- Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543
|
17
|
Wang K, Samudrala R. FSSA: a novel method for identifying functional signatures from structural alignments. Bioinformatics 2005; 21:2969-77. [PMID: 15860561 DOI: 10.1093/bioinformatics/bti471] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Indexed: 11/13/2022]
Abstract
MOTIVATION: It is commonly believed that sequence determines structure, which in turn determines function. However, the presence of many proteins with the same structural fold but different functions suggests that global structure and function do not always correlate well. RESULTS: We propose a method for accurate functional annotation, based on identification of functional signatures from structural alignments (FSSA) using the Structural Classification of Proteins (SCOP) database. The FSSA method is superior at function discrimination and classification compared with several methods that directly inherit functional annotation information from homology inference, such as Smith-Waterman, PSI-BLAST, hidden Markov models and structure comparison methods, for a large number of structural fold families. Our results indicate that the contributions of amino acid residue types and positions to structure and function are largely separable for proteins in multi-functional fold families.
Affiliation(s)
- Kai Wang
- Computational Genomics Group, Department of Microbiology, University of Washington, Seattle, WA 98195, USA
|
18
|
Guo T, Shi Y, Sun Z. A novel statistical ligand-binding site predictor: application to ATP-binding sites. Protein Eng Des Sel 2005; 18:65-70. [PMID: 15799998 DOI: 10.1093/protein/gzi006] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022]
Abstract
Structural genomics initiatives are leading to rapid growth in newly determined protein 3D structures, the functional characterization of which may still be inadequate. As an attempt to provide insights into the possible roles of the emerging proteins whose structures are available and/or to complement biochemical research, a variety of computational methods have been developed for the screening and prediction of ligand-binding sites in raw structural data, including statistical pattern classification techniques. In this paper, we report a novel statistical descriptor (the Oriented Shell Model) for protein ligand-binding sites, which utilizes the distance and angular position distribution of various structural and physicochemical features present in immediate proximity to the center of a binding site. Using the support vector machine (SVM) as the classifier, our model identified 69% of the ATP-binding sites in whole-protein scanning tests; in eukaryotic proteins the accuracy is particularly high. We propose that this feature extraction and machine learning procedure can screen out ligand-binding-capable protein candidates and can yield valuable biochemical information for individual proteins.
Affiliation(s)
- Ting Guo
- Institute of Bioinformatics, MOE Key Laboratory of Bioinformatics, State Key Laboratory of Biomembrane and Membrane Biotechnology, Department of Biological Sciences and Biotechnology, Beijing 100084, China
|
19
|
Han L, Cai C, Ji Z, Chen Y. Prediction of functional class of novel viral proteins by a statistical learning method irrespective of sequence similarity. Virology 2005; 331:136-43. [PMID: 15582660 PMCID: PMC7111859 DOI: 10.1016/j.virol.2004.10.020] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Received: 07/17/2004] [Revised: 09/15/2004] [Accepted: 10/09/2004] [Indexed: 11/19/2022]
Abstract
The function of a substantial percentage of the putative protein-coding open reading frames (ORFs) in viral genomes is unknown. As their sequences are not similar to those of proteins of known function, the function of these ORFs cannot be assigned on the basis of sequence similarity. Methods that complement, or can be used in combination with, sequence similarity-based approaches are being explored. The web-based software SVMProt (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi) to some extent assigns protein functional family irrespective of sequence similarity and has been found to be useful for studying distantly related proteins [Cai, C.Z., Han, L.Y., Ji, Z.L., Chen, X., Chen, Y.Z., 2003. SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. 31(13): 3692–3697]. Here 25 novel viral proteins are selected to test the capability of SVMProt for functional family assignment of viral proteins whose function cannot at present be confidently predicted by sequence similarity methods. These proteins have no sequence homolog in the Swiss-Prot database, have their precise function described in the literature, and are not included in the training sets of SVMProt. The predicted functional classes of 72% of these proteins match the literature-described function, compared with an overall accuracy of 87% for SVMProt functional class assignment of 34 582 proteins. This suggests that SVMProt is to some extent capable of functional class assignment irrespective of sequence similarity and is potentially useful for facilitating functional study of novel viral proteins.
Affiliation(s)
- L.Y. Han
- Bioinformatics and Drug Design Group, Department of Computational Science, National University of Singapore, Block SOC1, Level 7, 3 Science Drive 2, Singapore 117543, Singapore
- C.Z. Cai
- Bioinformatics and Drug Design Group, Department of Computational Science, National University of Singapore, Block SOC1, Level 7, 3 Science Drive 2, Singapore 117543, Singapore
- Department of Applied Physics, Chongqing University, Chongqing 400044, PR China
- Z.L. Ji
- Department of Biology, School of Life Sciences, Xiamen University, Xiamen 361000, FuJian Province, PR China
- Y.Z. Chen
- Bioinformatics and Drug Design Group, Department of Computational Science, National University of Singapore, Block SOC1, Level 7, 3 Science Drive 2, Singapore 117543, Singapore
- Corresponding author. Fax: +65 6774 6756.
|
20
|
Skolnick J, Kihara D, Zhang Y. Development and large scale benchmark testing of the PROSPECTOR_3 threading algorithm. Proteins 2004; 56:502-18. [PMID: 15229883 DOI: 10.1002/prot.20106] [Citation(s) in RCA: 118] [Impact Index Per Article: 5.9] [Indexed: 11/10/2022]
Abstract
This article describes the PROSPECTOR_3 threading algorithm, which combines various scoring functions designed to match structurally related target/template pairs. Each variant described was found to have a Z-score above which most identified templates have good structural (threading) alignments, Z(struct) (Z(good)). 'Easy' targets with accurate threading alignments are identified as single templates with Z > Z(good) or two templates, each with Z > Z(struct), having a good consensus structure in mutually aligned regions. 'Medium' targets have a pair of templates lacking a consensus structure, or a single template for which Z(struct) < Z < Z(good). PROSPECTOR_3 was applied to a comprehensive Protein Data Bank (PDB) benchmark composed of 1491 single domain proteins, 41-200 residues long and no more than 30% identical to any threading template. Of the proteins, 878 were found to be easy targets, with 761 having a root mean square deviation (RMSD) from native of less than 6.5 Å. The average contact prediction accuracy was 46%, and on average 17.6-residue continuous fragments were predicted with RMSD values of 2.0 Å. There were 606 medium targets identified, 87% (31%) of which had good structural (threading) alignments. On average, 9.1-residue continuous fragments with RMSD of 2.5 Å were predicted. Combining easy and medium sets, 63% (91%) of the targets had good threading (structural) alignments compared to native; the average target/template sequence identity was 22%. Only nine targets lacked matched templates. Moreover, PROSPECTOR_3 consistently outperforms PSIBLAST. Similar results were predicted for open reading frames (ORFs) ≤200 residues in the M. genitalium, E. coli and S. cerevisiae genomes. Thus, progress has been made in identification of weakly homologous/analogous proteins, with very high alignment coverage, both in a comprehensive PDB benchmark as well as in genomes.
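The RMSD figures quoted in this abstract can be read against a minimal sketch of the measure itself. The function below is an illustration only, not code from PROSPECTOR_3: the name `rmsd` is hypothetical, and it assumes the two coordinate sets are already superposed and matched residue by residue, so no optimal rotation or translation is computed.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length lists of (x, y, z) points.

    Assumption: the structures are already superposed and the i-th point in
    coords_a corresponds to the i-th point in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared Euclidean distance between corresponding atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

In practice a superposition step (for example the Kabsch algorithm) would normally precede this calculation; it is omitted here to keep the sketch short.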
Affiliation(s)
- Jeffrey Skolnick
- Center of Excellence in Bioinformatics, University at Buffalo, 901 Washington St., Suite 300, Buffalo, NY 14203, USA.
|
21
|
Chelliah V, Chen L, Blundell TL, Lovell SC. Distinguishing structural and functional restraints in evolution in order to identify interaction sites. J Mol Biol 2004; 342:1487-504. [PMID: 15364576 DOI: 10.1016/j.jmb.2004.08.022] [Citation(s) in RCA: 82] [Impact Index Per Article: 4.1] [Received: 11/27/2003] [Revised: 07/20/2004] [Accepted: 08/09/2004] [Indexed: 11/29/2022]
Abstract
Structural genomics projects are producing many three-dimensional structures of proteins that have been identified only from their gene sequences. It is therefore important to develop computational methods that will predict sites involved in productive intermolecular interactions that might give clues about functions. Techniques based on evolutionary conservation of amino acids have the advantage over physicochemical methods in that they are more general. However, the majority of techniques neither use all available structural and sequence information, nor are able to distinguish between evolutionary restraints that arise from the need to maintain structure and those that arise from function. Three methods to identify evolutionary restraints on protein sequence and structure are described here. The first identifies those residues that have a higher degree of conservation than expected: this is achieved by comparing for each amino acid position the sequence conservation observed in the homologous family of proteins with the degree of conservation predicted on the basis of amino acid type and local environment. The second uses information theory to identify those positions where environment-specific substitution tables make poor predictions of the overall amino acid substitution pattern. The third method identifies those residues that have highly conserved positions when three-dimensional structures of proteins in a homologous family are superposed. The scores derived from these methods are mapped onto the protein three-dimensional structures and contoured, allowing identification of clusters of residues with strong evolutionary restraints that are sites of interaction in proteins involved in a variety of functions.
Our method differs from other published techniques by making use of structural information to identify restraints that arise from the structure of the protein and differentiating these restraints from others that derive from intermolecular interactions that mediate functions in the whole organism.
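As a rough illustration of what "observed sequence conservation" at an alignment position can mean, the sketch below scores each column of a multiple sequence alignment by Shannon entropy. This is a generic, commonly used measure and not the authors' method: it does not use environment-specific substitution tables or any structural information, and the function names are hypothetical. Gap characters, if present, are simply counted as another symbol.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def conservation_scores(alignment):
    """Per-column conservation, 1 - H / log2(20).

    alignment: list of equal-length aligned sequences (strings).
    A score of 1.0 means the column is fully conserved; lower values
    mean more variable positions.
    """
    max_entropy = math.log2(20)  # 20 standard amino acids
    return [1.0 - column_entropy(col) / max_entropy for col in zip(*alignment)]
```

Comparing such an observed score with an expected one (which the abstract derives from amino acid type and local environment) is the step this sketch leaves out.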
Affiliation(s)
- Vijayalakshmi Chelliah
- Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, UK
|
22
|
Pazos F, Sternberg MJE. Automated prediction of protein function and detection of functional sites from structure. Proc Natl Acad Sci U S A 2004; 101:14754-9. [PMID: 15456910 PMCID: PMC522026 DOI: 10.1073/pnas.0404569101] [Citation(s) in RCA: 139] [Impact Index Per Article: 7.0] [Received: 06/25/2004] [Indexed: 11/18/2022]
Abstract
Current structural genomics projects are yielding structures for proteins whose functions are unknown. Accordingly, there is a pressing requirement for computational methods for function prediction. Here we present PHUNCTIONER, an automatic method for structure-based function prediction using automatically extracted functional sites (residues associated to functions). The method relates proteins with the same function through structural alignments and extracts 3D profiles of conserved residues. Functional features to train the method are extracted from the Gene Ontology (GO) database. The method extracts these features from the entire GO hierarchy and hence is applicable across the whole range of function specificity. 3D profiles associated with 121 GO annotations were extracted. We tested the power of the method both for the prediction of function and for the extraction of functional sites. The success of function prediction by our method was compared with the standard homology-based method. In the zone of low sequence similarity (approximately 15%), our method assigns the correct GO annotation in 90% of the protein structures considered, approximately 20% higher than inheritance of function from the closest homologue.
Affiliation(s)
- Florencio Pazos
- Structural Bioinformatics Group, Biochemistry Building, Department of Biological Sciences, Imperial College London, London SW7 2AZ, UK
|
23
|
Reinhardt A, Eisenberg D. DPANN: Improved sequence to structure alignments following fold recognition. Proteins 2004; 56:528-38. [PMID: 15229885 DOI: 10.1002/prot.20144] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/05/2022]
Abstract
In fold recognition (FR) a protein sequence of unknown structure is assigned to the closest known three-dimensional (3D) fold. Although FR programs can often identify among all possible folds the one a sequence adopts, they frequently fail to align the sequence to the equivalent residue positions in that fold. Such failures frustrate the next step in structure prediction, protein model building. Hence it is desirable to improve the quality of the alignments between the sequence and the identified structure. We have used artificial neural networks (ANN) to derive a substitution matrix to create alignments between a protein sequence and a protein structure through dynamic programming (DPANN: Dynamic Programming meets Artificial Neural Networks). The matrix is based on the amino acid type and the secondary structure state of each residue. In a database of protein pairs that have the same fold but lack sequence similarity, DPANN aligns over 30% of all sequences to the paired structure, resembling closely the structural superposition of the pair. In over half of these cases the DPANN alignment is close to the structural superposition, although the initial alignment from the step of fold recognition is not close. Conversely, the alignment created during fold recognition outperforms DPANN in only 10% of all cases. Thus application of DPANN after fold recognition leads to substantial improvements in alignment accuracy, which in turn provides more useful templates for the modeling of protein structures. In the artificial case of using actual instead of predicted secondary structures for the probe protein, over 50% of the alignments are successful.
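The dynamic programming step described above can be sketched with a standard Needleman-Wunsch global alignment in which each position is an (amino acid, secondary structure state) pair. This is a generic illustration, not the DPANN implementation: `global_align`, `toy_score`, the linear gap penalty and the scoring values are all assumptions made for the example, and `toy_score` merely stands in for the neural-network-derived substitution matrix.

```python
def global_align(query, template, score, gap=-2.0):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    query, template: sequences of positions (here (amino_acid, ss_state) pairs).
    score: function(query_pos, template_pos) -> float.
    Returns (best_score, list of (query_index, template_index) aligned pairs).
    """
    n, m = len(query), len(template)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(query[i - 1], template[j - 1]),  # match
                dp[i - 1][j] + gap,  # gap in template
                dp[i][j - 1] + gap,  # gap in query
            )
    # Trace back from the bottom-right corner to recover aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + score(query[i - 1], template[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs


def toy_score(query_pos, template_pos):
    """Hypothetical scoring: reward identical residue type and identical
    secondary-structure state (H = helix, E = strand, C = coil)."""
    (aa_q, ss_q), (aa_t, ss_t) = query_pos, template_pos
    return (2.0 if aa_q == aa_t else -1.0) + (1.0 if ss_q == ss_t else 0.0)
```

Passing the scoring function as a parameter keeps the alignment routine independent of how the substitution scores were obtained, so a learned matrix could be substituted for `toy_score` without changing the recursion.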
|
24
|
Bhaduri A, Ravishankar R, Sowdhamini R. Conserved spatially interacting motifs of protein superfamilies: application to fold recognition and function annotation of genome data. Proteins 2004; 54:657-70. [PMID: 14997562 DOI: 10.1002/prot.10638] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Indexed: 11/10/2022]
Abstract
Limitations in techniques for the elucidation of protein function have led to an increasing gap between the annotated proteins and those encoded in a genome. The functional selection and three-dimensional structural constraints of proteins in nature often relate to the retention of significant sequence similarity between proteins of similar fold and function despite poor sequence identity. We identify spatially interacting conserved regions, or motifs, within protein superfamilies that are critical for structure and/or function. A search in sequence databases using these descriptors as additional constraints is an approach to identifying putative additional members of superfamilies. Such constrained searches have been tested against proteins of known structure and demonstrate high specificity (93%) with a low error rate of 0.0004. This approach has been compared with other sensitive sequence search methods (e.g., PSI-BLAST, HMMsearch, and IMPALA). It has been extended to analyze the distribution of 11 superfamilies in 93 genomes, including the human genome.
Affiliation(s)
- Anirban Bhaduri
- National Centre for Biological Sciences, Tata Institute of Fundamental Research, UAS-GKVK Campus, Bangalore, India
|
25
|
Abstract
One approach for facilitating protein function prediction is to classify proteins into functional families. Recent studies on the classification of G-protein coupled receptors and other proteins suggest that a statistical learning method, support vector machines (SVM), may be potentially useful for protein classification into functional families. In this work, SVM is applied and tested on the classification of enzymes into functional families defined by the Enzyme Nomenclature Committee of IUBMB. The SVM classification system for each family is trained from representative enzymes of that family and seed proteins of Pfam curated protein families. The classification accuracy for enzymes from 46 families and for non-enzymes is in the range of 50.0% to 95.7% and 79.0% to 100% respectively. The corresponding Matthews correlation coefficient is in the range of 54.1% to 96.1%. Moreover, 80.3% of the 8,291 correctly classified enzymes are uniquely classified into a specific enzyme family by using a scoring function, indicating that SVM may have a certain level of unique prediction capability. Testing results also suggest that SVM in some cases is capable of classifying distantly related enzymes and homologous enzymes of different functions. Effort is being made to use a more comprehensive set of enzymes as training sets and to incorporate multi-class SVM classification systems to further enhance the unique prediction accuracy. Our results suggest the potential of SVM for enzyme family classification and for facilitating protein function prediction. Our software is accessible at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.
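For readers unfamiliar with the two evaluation measures named in this abstract, the following minimal sketch computes classification accuracy and the Matthews correlation coefficient from the four confusion-matrix counts of a binary classifier. The function names are hypothetical and the counts in any call are the caller's own; none of this is taken from the SVMProt software. The abstract quotes the coefficient as a percentage, which presumably corresponds to the value below multiplied by 100.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in the range [-1, 1].

    tp/tn/fp/fn: counts of true positives, true negatives,
    false positives and false negatives.
    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Unlike accuracy, the coefficient stays informative when the positive and negative classes are very unequal in size, which is typical when one enzyme family is tested against all other proteins.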
Affiliation(s)
- C Z Cai
- Department of Applied Physics, Chongqing University, Chongqing, People's Republic of China
|
26
|
Abstract
We show that three-dimensional signatures consisting of only a few functionally important residues can be diagnostic of membership in superfamilies of enzymes. Using the enolase superfamily as a model system, we demonstrate that such a signature, or template, can identify superfamily members in structural databases with high sensitivity and specificity. This is remarkable because superfamilies can be highly diverse, with members catalyzing many different overall reactions; the unifying principle can be a conserved partial reaction or chemical capability. Our definition of a superfamily thus hinges on the disposition of residues involved in a conserved function, rather than on fold similarity alone. A clear advantage of basing structure searches on such active site templates rather than on fold similarity is the specificity with which superfamilies with distinct functional characteristics can be identified within a large set of proteins with the same fold, such as the (beta/alpha)8 barrels. Preliminary results are presented for an additional group of enzymes with a different fold, the haloacid dehalogenase superfamily, suggesting that this approach may be generally useful for assigning reading frames of unknown function to specific superfamilies and thereby allowing inference of some of their functional properties.
Affiliation(s)
- Elaine C Meng
- Department of Pharmaceutical Chemistry, University of California, Genentech Hall, 600 Sixteenth Street, San Francisco, CA 94143-2240, USA
|
27
|
|
28
|
Baxter SM, Rosenblum JS, Knutson S, Nelson MR, Montimurro JS, Di Gennaro JA, Speir JA, Burbaum JJ, Fetrow JS. Synergistic Computational and Experimental Proteomics Approaches for More Accurate Detection of Active Serine Hydrolases in Yeast. Mol Cell Proteomics 2004; 3:209-25. [PMID: 14645503 DOI: 10.1074/mcp.m300082-mcp200] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.3] [Indexed: 11/06/2022]
Abstract
An analysis of the structurally and catalytically diverse serine hydrolase protein family in the Saccharomyces cerevisiae proteome was undertaken using two independent but complementary, large-scale approaches. The first approach is based on computational analysis of serine hydrolase active site structures; the second utilizes the chemical reactivity of the serine hydrolase active site in complex mixtures. These proteomics approaches share the ability to fractionate the complex proteome into functional subsets. Each method identified a significant number of sequences, but 15 proteins were identified by both methods. Eight of these were unannotated in the Saccharomyces Genome Database at the time of this study and are thus novel serine hydrolase identifications. Three of the previously uncharacterized proteins are members of a eukaryotic serine hydrolase family, designated as Fsh (family of serine hydrolase), identified here for the first time. OVCA2, a potential human tumor suppressor, and DYR-SCHPO, a dihydrofolate reductase from Schizosaccharomyces pombe, are members of this family. Comparing the combined results to results of other proteomic methods showed that only four of the 15 proteins were identified in a recent large-scale, "shotgun" proteomic analysis and eight were identified using a related, but similar, approach (neither identifies function). Only 10 of the 15 were annotated using alternate motif-based computational tools. The results demonstrate the precision derived from combining complementary, function-based approaches to extract biological information from complex proteomes. The chemical proteomics technology indicates that a functional protein is being expressed in the cell, while the computational proteomics technology adds details about the specific type of function and residue that is likely being labeled. 
The combination of synergistic methods facilitates analysis, enriches true positive results, and increases confidence in novel identifications. This work also highlights the risks inherent in annotation transfer and the use of scoring functions for determination of correct annotations.
Affiliation(s)
- Susan M Baxter
- GeneFormatics, Inc., 5830 Oberlin Drive, Suite 200, San Diego, CA 92121, USA
|
29
|
Affiliation(s)
- Carol A Rohl
- Department of Biochemistry and Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
|
30
|
Cammer SA, Hoffman BT, Speir JA, Canady MA, Nelson MR, Knutson S, Gallina M, Baxter SM, Fetrow JS. Structure-based active site profiles for genome analysis and functional family subclassification. J Mol Biol 2003; 334:387-401. [PMID: 14623182 DOI: 10.1016/j.jmb.2003.09.062] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.1] [Indexed: 01/07/2023]
Abstract
In previous work, structure-based functional site descriptors, fuzzy functional forms (FFFs), were developed to recognize structurally conserved active sites in proteins. These descriptors identify members of protein families according to active-site structural similarity, rather than overall sequence or structure similarity. FFFs are defined by a minimal number of highly conserved residues and their three-dimensional arrangement. This approach is advantageous for function assignment across broad families, but is limited when applied to detailed subclassification within these families. In the work described here, we developed a method of three-dimensional, or structure-based, active-site profiling that utilizes FFFs to identify residues located in the spatial environment around the active site. Three-dimensional active-site profiling reveals similarities and differences among active sites across protein families. Using this approach, active-site profiles were constructed from known structures for 193 functional families, and these profiles were verified as distinct and characteristic. To achieve this result, a scoring function was developed that discriminates between true functional sites and those that are geometrically most similar, but do not perform the same function. In a large-scale retrospective analysis of human genome sequences, this profile score was shown to identify specific functional families correctly. The method is effective at recognizing the likely subtype of structurally uncharacterized members of the diverse family of protein kinases, categorizing sequences correctly that were misclassified by global sequence alignment methods. Subfamily information provided by this three-dimensional active-site profiling method yields key information for specific and selective inhibitor design for use in the pharmaceutical industry.
Affiliation(s)
- Stephen A Cammer
- GeneFormatics Inc., 5830 Oberlin Drive, San Diego, CA 92121, USA
|
31
|
Binkowski TA, Adamian L, Liang J. Inferring functional relationships of proteins from local sequence and spatial surface patterns. J Mol Biol 2003; 332:505-26. [PMID: 12948498 DOI: 10.1016/s0022-2836(03)00882-9] [Citation(s) in RCA: 129] [Impact Index Per Article: 6.1] [Indexed: 11/17/2022]
Abstract
We describe a novel approach for inferring functional relationships of proteins by detecting sequence and spatial patterns of protein surfaces. Well-formed concave surface regions in the form of pockets and voids are examined to identify similarity relationships that might be directly related to protein function. We first exhaustively identify and measure analytically all 910,379 surface pockets and interior voids on 12,177 protein structures from the Protein Data Bank. The similarity of patterns of residues forming pockets and voids is then assessed in sequence, in spatial arrangement, and in orientational arrangement. Statistical significance in the form of E and p-values is then estimated for each of the three types of similarity measurements. Our method is fully automated without human intervention and can be used without input of query patterns. It does not assume any prior knowledge of functional residues of a protein, and can detect similarity based on surface patterns small and large. It also tolerates, to some extent, conformational flexibility of functional sites. We show with examples that this method can detect functional relationships with specificity for members of the same protein family and superfamily, as well as remotely related functional surfaces from proteins of different fold structures. We envision that this method can be used for discovering novel functional relationships of protein surfaces, for functional annotation of protein structures with unknown biological roles, and for further inquiries on evolutionary origins of structural elements important for protein function.
Affiliation(s)
- T Andrew Binkowski
- Department of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607-7052, USA
|
32
|
Campbell SJ, Gold ND, Jackson RM, Westhead DR. Ligand binding: functional site location, similarity and docking. Curr Opin Struct Biol 2003; 13:389-95. [PMID: 12831892 DOI: 10.1016/s0959-440x(03)00075-7] [Citation(s) in RCA: 139] [Impact Index Per Article: 6.6] [Indexed: 11/30/2022]
Abstract
Computational methods for the detection and characterisation of protein ligand-binding sites have increasingly become an area of interest now that large amounts of protein structural information are becoming available prior to any knowledge of protein function. There have been particularly interesting recent developments in the following areas: first, functional site detection, whereby protein evolutionary information has been used to locate binding sites on the protein surface; second, functional site similarity, whereby structural similarity and three-dimensional templates can be used to compare and classify and potentially locate new binding sites; and third, ligand docking, which is being used to find and validate functional sites, in addition to having more conventional uses in small-molecule lead discovery.
Affiliation(s)
- Stephen J Campbell
- School of Biochemistry and Molecular Biology, University of Leeds, Leeds, LS2 9JT, UK
|
33
|
Betz SF, Baxter SM, Fetrow JS. Function first: a powerful approach to post-genomic drug discovery. Drug Discov Today 2002; 7:865-71. [PMID: 12546953 DOI: 10.1016/s1359-6446(02)02398-x] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022]
Abstract
In the post-genomic era, pharmaceutical researchers must evaluate vast numbers of protein sequences and formulate novel, intelligent strategies for identifying valid targets and discovering leads against them. The identification of small molecules that selectively target proteins or protein families will be aided by knowing the function and/or the structure of the target(s). By identifying protein function first, efficiencies are gained that allow subsequent focus of resources on particular protein families of interest. This article reviews current proteomic-scale approaches to identifying function as a way of accelerating lead discovery.
Affiliation(s)
- Stephen F Betz
- GeneFormatics, 5830 Oberlin Drive, Suite 200, San Diego, CA 92121, USA
|
34
|
Bradley P, Kim PS, Berger B. TRILOGY: Discovery of sequence-structure patterns across diverse proteins. Proc Natl Acad Sci U S A 2002; 99:8500-5. [PMID: 12084910 PMCID: PMC124288 DOI: 10.1073/pnas.112221999] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.0] [Accepted: 04/12/2002] [Indexed: 11/18/2022]
Abstract
We describe a new computer program, TRILOGY, for the automated discovery of sequence-structure patterns in proteins. TRILOGY implements a pattern discovery algorithm that begins with an exhaustive analysis of flexible three-residue patterns; a subset of these patterns is selected as seeds for an extension process in which longer patterns are identified. A key feature of the method is explicit treatment of both the sequence and structure components of these motifs: each TRILOGY pattern is a pair consisting of a sequence pattern and a structure pattern. Matches to both these component patterns are identified independently, allowing the program to assign a significance score to each sequence-structure pattern that assesses the degree of correlation between the corresponding sequence and structure motifs. TRILOGY identifies several thousand high-scoring patterns that occur across protein families. These include both previously identified and potentially novel motifs. We expect that these sequence-structure patterns will be useful in predicting protein structure from sequence, annotating newly determined protein structures, and identifying novel motifs of potential functional or structural significance. Further details on 7,768 significant patterns identified by TRILOGY can be found at http://theory.lcs.mit.edu/trilogy.
Affiliation(s)
- Philip Bradley
- Department of Mathematics and Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
|
35
|
Abstract
The genomes of over 60 organisms from all three kingdoms of life are now entirely sequenced. In many respects, the inventory of proteins used in different kingdoms appears surprisingly similar. However, eukaryotes differ from other kingdoms in that they use many long proteins, and have more proteins with coiled-coil helices and with regions abundant in regular secondary structure. Particular structural domains are used in many pathways. Nevertheless, one domain tends to occur only once in one particular pathway. Many proteins do not have close homologues in different species (orphans) and there could even be folds that are specific to one species. This view implies that protein fold space is discrete. An alternative model suggests that structure space is continuous and that modern proteins evolved by aggregating fragments of ancient proteins. Either way, after having harvested proteomes by applying standard tools, the challenge now seems to be to develop better methods for comparative proteomics.
Affiliation(s)
- Burkhard Rost
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street, BB217, New York, NY 10032, USA.
|
36
|
Schmidt S, Bork P, Dandekar T. A versatile structural domain analysis server using profile weight matrices. Journal of Chemical Information and Computer Sciences 2002; 42:405-7. [PMID: 11911710 DOI: 10.1021/ci010374r] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.4] [Indexed: 11/28/2022]
Abstract
The web tool "AnDom" assigns to a given protein sequence all experimentally determined structural domains contained within it, including multidomain and large proteins. The server uses profile-specific matrices from custom-generated multiple sequence alignments of all known SCOP domains (SCOP version 1.50). Prediction time is short, allowing numerous applications for structural genomics, including investigation of complex eucaryotic protein families. The WWW server is at http://www.bork.embl-heidelberg.de/AnDom, and profiles can be downloaded at ftp.bork.embl-heidelberg.de/pub/users/schmidt/AnDom.
|
37
|
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2002. [PMCID: PMC2447253 DOI: 10.1002/cfg.117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/05/2022] Open
|
38
|
Joachimiak MP, Cohen FE. JEvTrace: refinement and variations of the evolutionary trace in JAVA. Genome Biol 2002; 3:RESEARCH0077. [PMID: 12537566 PMCID: PMC151179 DOI: 10.1186/gb-2002-3-12-research0077] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.7] [Received: 04/24/2002] [Revised: 07/11/2002] [Accepted: 10/21/2002] [Indexed: 12/03/2022] Open
Abstract
BACKGROUND: Details of functional speciation within gene families can be difficult to identify using standard multiple sequence alignment (MSA) methods. The evolutionary trace (ET) was developed as a visualization tool to combine MSA, phylogenetic and structural data for identification of functional sites in proteins. The method has been successful in extracting evolutionary details of functional surfaces in a number of biological systems and modifications of the method are useful in creating hypotheses about the function of previously unannotated genes. We wish to facilitate the graphical interpretation of disparate data types through the creation of flexible software implementations. RESULTS: We have implemented the ET method in a JAVA graphical interface, JEvTrace. Users can analyze and visualize ET input and output with respect to protein phylogeny, sequence and structure. Function discovery with JEvTrace is demonstrated on two proteins with recently determined crystal structures: YlxR from Streptococcus pneumoniae with a predicted RNA-binding function, and a Haemophilus influenzae protein of unknown function, YbaK. To facilitate analysis and storage of results we propose a MSA coloring data structure. The sequence coloring format readily captures evolutionary, biological, functional and structural features of MSAs. CONCLUSIONS: Protein families and phylogeny represent complex data with statistical outliers and special cases. The JEvTrace implementation of the ET method allows detailed mining and graphical visualization of evolutionary sequence relationships.
Affiliation(s)
- Marcin P Joachimiak
- Graduate Group in Biophysics, University of California San Francisco, San Francisco, CA 94143-0450, USA
- Department of Cellular and Molecular Pharmacology, University of California San Francisco, San Francisco, CA 94143-0450, USA
- Fred E Cohen
- Graduate Group in Biophysics, University of California San Francisco, San Francisco, CA 94143-0450, USA
- Department of Cellular and Molecular Pharmacology, University of California San Francisco, San Francisco, CA 94143-0450, USA
|