151
|
Wrzeszczynski KO, Rost B. Cell cycle kinases predicted from conserved biophysical properties. Proteins 2009; 74:655-68. [PMID: 18704950 DOI: 10.1002/prot.22181] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/07/2022]
Abstract
Machine-learning techniques can classify functionally related proteins where homology-transfer as well as sequence and structure motifs fail. Here, we present a method aimed at complementing homology-transfer in the identification of cell cycle control kinases from sequence alone. First, we identified functionally significant residues in cell cycle proteins through their high sequence conservation and biophysical properties. We then incorporated these residues and their features into support vector machines (SVM) to identify new kinases and more specifically to differentiate cell cycle kinases from other kinases and other proteins. As expected, the most informative residues tend to be highly conserved and tend to localize in the ATP binding regions of the kinases. Another observation confirmed that ATP binding regions are typically not found on the surface but in partially buried sites, and that this fact is correctly captured by accessibility predictions. Using these highly conserved, semi-buried residues and their biophysical properties, we could distinguish cell cycle S/T kinases from other kinase families at levels around 70-80% accuracy and 62-81% coverage. An application to the entire human proteome predicted at least 97 human proteins with limited previous annotations to be candidates for cell cycle kinases.
Affiliation(s)
- Kazimierz O Wrzeszczynski
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA
|
152
|
Hawkins T, Chitale M, Luban S, Kihara D. PFP: Automated prediction of gene ontology functional annotations with confidence scores using protein sequence data. Proteins 2009; 74:566-82. [PMID: 18655063 DOI: 10.1002/prot.22172] [Citation(s) in RCA: 79] [Impact Index Per Article: 5.3] [Indexed: 11/12/2022]
Abstract
Protein function prediction is a central problem in bioinformatics, increasing in importance recently due to the rapid accumulation of biological data awaiting interpretation. Sequence data represents the bulk of this new stock and is the obvious target for consideration as input, as newly sequenced organisms often lack any other type of biological characterization. We have previously introduced PFP (Protein Function Prediction) as our sequence-based predictor of Gene Ontology (GO) functional terms. PFP interprets the results of a PSI-BLAST search by extracting and scoring individual functional attributes, searching a wide range of E-value sequence matches, and utilizing conventional data mining techniques to fill in missing information. We have shown it to be effective in predicting both specific and low-resolution functional attributes when sufficient data is unavailable. Here we describe (1) significant improvements to the PFP infrastructure, including the addition of prediction significance and confidence scores, (2) a thorough benchmark of performance and comparisons to other related prediction methods, and (3) applications of PFP predictions to genome-scale data. We applied PFP predictions to uncharacterized protein sequences from 15 organisms. Among these sequences, 60-90% could be annotated with a GO molecular function term at high confidence (≥80%). We also applied our predictions to the protein-protein interaction network of the malaria parasite (Plasmodium falciparum). High confidence GO biological process predictions (≥90%) from PFP increased the number of fully enriched interactions in this dataset from 23% of interactions to 94%. Our benchmark comparison shows significant performance improvement of PFP relative to GOtcha, InterProScan, and PSI-BLAST predictions. This is consistent with the performance of PFP as the overall best predictor in both the AFP-SIG '05 and CASP7 function (FN) assessments.
PFP is available as a web service at http://dragon.bio.purdue.edu/pfp/.
Affiliation(s)
- Troy Hawkins
- Department of Biological Sciences, College of Science, Purdue University, West Lafayette, Indiana 47907, USA
|
153
|
Standley DM, Toh H, Nakamura H. Functional annotation by sequence-weighted structure alignments: statistical analysis and case studies from the Protein 3000 structural genomics project in Japan. Proteins 2009; 72:1333-51. [PMID: 18384072 DOI: 10.1002/prot.22015] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Abstract
A method to functionally annotate structural genomics targets, based on a novel structural alignment scoring function, is proposed. In the proposed score, position-specific scoring matrices are used to weight structurally aligned residue pairs to highlight evolutionarily conserved motifs. The functional form of the score is first optimized for discriminating domains belonging to the same Pfam family from domains belonging to different families but the same CATH or SCOP superfamily. In the optimization stage, we consider four standard weighting functions as well as our own, the "maximum substitution probability," and combinations of these functions. The optimized score achieves an area of 0.87 under the receiver-operating characteristic curve with respect to identifying Pfam families within a sequence-unique benchmark set of domain pairs. Confidence measures are then derived from the benchmark distribution of true-positive scores. The alignment method is next applied to the task of functionally annotating 230 query proteins released to the public as part of the Protein 3000 structural genomics project in Japan. Of these queries, 78 were found to align to templates with the same Pfam family as the query or had sequence identities ≥ 30%. Another 49 queries were found to match more distantly related templates. Within this group, the template predicted by our method to be the closest functional relative was often not the most structurally similar. Several nontrivial cases are discussed in detail. Finally, 103 queries matched templates at the fold level, but not the family or superfamily level, and remain functionally uncharacterized.
Affiliation(s)
- Daron M Standley
- Research Center for Structural and Functional Proteomics, Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka 565-0871, Japan.
|
154
|
Shevchenko A, Valcu CM, Junqueira M. Tools for exploring the proteomosphere. J Proteomics 2009; 72:137-44. [PMID: 19167528 DOI: 10.1016/j.jprot.2009.01.012] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.5] [Received: 01/12/2009] [Accepted: 01/13/2009] [Indexed: 11/29/2022]
Abstract
Homology-driven proteomics aims at exploring the proteomes of organisms with unsequenced genomes that, despite rapid genomic sequencing progress, still represent the overwhelming majority of species in the biosphere. Methodologies have been developed to enable automated LC-MS/MS identifications of unknown proteins, which rely on the sequence similarity between the fragmented peptides and reference database sequences from phylogenetically related species. However, because full sequences of matched proteins are not available and matching specificity is reduced, estimating protein abundances should become the obligatory element of homology-driven proteomics pipelines to circumvent the interpretation bias towards proteins from evolutionarily conserved families.
Affiliation(s)
- Andrej Shevchenko
- Max Planck Institute of Molecular Cell Biology and Genetics, 01307 Dresden, Germany.
|
155
|
Tseng YY, Dundas J, Liang J. Predicting protein function and binding profile via matching of local evolutionary and geometric surface patterns. J Mol Biol 2009; 387:451-64. [PMID: 19154742 DOI: 10.1016/j.jmb.2008.12.072] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.1] [Received: 09/16/2008] [Revised: 12/19/2008] [Accepted: 12/23/2008] [Indexed: 11/25/2022]
Abstract
Inferring protein functions from structures is a challenging task, as a large number of orphan protein structures from structural genomics projects are now solved without their biochemical functions characterized. For proteins binding to similar substrates or ligands and carrying out similar functions, their binding surfaces are under similar physicochemical constraints, and hence the sets of allowed and forbidden residue substitutions are similar. However, it is difficult to isolate such selection pressure due to protein function from selection pressure due to protein folding, and the evolutionary relationship reflected by global sequence and structure similarities between proteins is often unreliable for inferring protein function. We have developed a method, called pevoSOAR (pocket-based evolutionary search of amino acid residues), for predicting protein functions by solving the problem of uncovering the amino acid residue substitution pattern due to protein function and separating it from the amino acid substitution pattern due to protein folding. We incorporate evolutionary information specific to an individual binding region and match local surfaces on a large scale with millions of precomputed protein surfaces to identify those with similar functions. Our pevoSOAR method also generates a probabilistic model called the computed binding profile that characterizes protein-binding activities that may involve multiple substrates or ligands. We show that our method can be used to predict enzyme functions with accuracy. Our method can also assess enzyme binding specificity and promiscuity. In an objective large-scale test of 100 enzyme families with thousands of structures, our predictions are found to be sensitive and specific: at the stringent specificity level of 99.98%, we can correctly predict enzyme functions for 80.55% of the proteins. The overall area under the receiver operating characteristic curve measuring the performance of our prediction is 0.955, close to the perfect value of 1.00. The best Matthews coefficient is 86.6%. Our method also works well in predicting the biochemical functions of orphan proteins from structural genomics projects.
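The abstract above reports sensitivity, specificity, and a Matthews coefficient. As a reminder of how these quantities relate to a confusion matrix, here is a minimal Python sketch. The function name and the example counts are hypothetical and are not taken from the paper; the sketch assumes at least one positive and one negative example.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, Matthews correlation coefficient).

    Assumes tp + fn > 0 and tn + fp > 0.
    """
    sensitivity = tp / (tp + fn)  # fraction of true functions recovered
    specificity = tn / (tn + fp)  # fraction of non-members correctly rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Hypothetical counts for illustration only.
sens, spec, mcc = confusion_metrics(tp=8, fp=1, tn=9, fn=2)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} MCC={mcc:.3f}")
```

Unlike accuracy, the Matthews coefficient stays informative when negatives vastly outnumber positives, which is typically the case when one enzyme family is searched against a large structure database.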
Affiliation(s)
- Yan Yuan Tseng
- Department of Bioengineering, University of Illinois at Chicago, 851 S. Morgan Street, Room 218, SEO, MC-063, Chicago, IL 60607-7052, USA
|
156
|
Abstract
Enzymes play central roles in metabolic pathways, and the prediction of metabolic pathways in newly sequenced genomes usually starts with the assignment of genes to enzymatic reactions. However, genes with similar catalytic activity are not necessarily similar in sequence, and therefore the traditional sequence similarity-based approach often fails to identify the relevant enzymes, thus hindering efforts to map the metabolome of an organism. Here we study the direct relationship between basic protein properties and their function. Our goal is to develop a new tool for functional prediction (e.g., prediction of Enzyme Commission number), which can be used to complement and support other techniques based on sequence or structure information. In order to define this mapping we collected a set of 453 features and properties that characterize proteins and are believed to be related to structural and functional aspects of proteins. We introduce a mixture model of stochastic decision trees to learn the set of potentially complex relationships between features and function. To study these correlations, trees are created and tested on the Pfam classification of proteins, which is based on sequence, and the EC classification, which is based on enzymatic function. The model is very effective in learning highly diverged protein families or families that are not defined on the basis of sequence. The resulting tree structures highlight the properties that are strongly correlated with structural and functional aspects of protein families, and can be used to suggest a concise definition of a protein family.
Affiliation(s)
- Umar Syed
- Department of Computer Science, Princeton University, Princeton, NJ, USA
|
157
|
Addou S, Rentzsch R, Lee D, Orengo CA. Domain-based and family-specific sequence identity thresholds increase the levels of reliable protein function transfer. J Mol Biol 2008; 387:416-30. [PMID: 19135455 DOI: 10.1016/j.jmb.2008.12.045] [Citation(s) in RCA: 67] [Impact Index Per Article: 4.2] [Received: 08/13/2008] [Revised: 12/12/2008] [Accepted: 12/17/2008] [Indexed: 11/24/2022]
Abstract
Divergence in function of homologous proteins is based on both sequence and structural changes. Overall enzyme function has been reported to diverge earlier (50% sequence identity) than overall structure (35%). We herein study the functional conservation of enzymes and non-enzyme sequences using the protein domain families in CATH-Gene3D. Despite the rapid increase in sequence data since the last comprehensive study by Tian and Skolnick, our findings suggest that generic thresholds of 40% and 60% aligned sequence identity are still sufficient to safely inherit third-level and full Enzyme Commission numbers, respectively. This increases to 50% and 70% on the domain level, unless the multi-domain architecture matches. Assignments from the Kyoto Encyclopedia of Genes and Genomes and the Munich Information Center for Protein Sequences Functional Catalogue seem to be less conserved with sequence, probably due to a more pathway-centric view: 80% domain sequence identity is required for safe function transfer. Comparing domains (more pairwise relationships) and using family-specific thresholds (varying evolutionary speeds) yield the highest coverage rates when transferring functions to model proteomes. An average twofold increase in enzyme annotations is seen for 523 proteomes in Gene3D. As simple 'rules of thumb', sequence identity thresholds do not require a bioinformatics background. We will provide and update this information with future releases of CATH-Gene3D.
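The generic thresholds quoted in this abstract can be written down as a small rule of thumb. The Python sketch below is an illustration only: the function names are hypothetical, and it assumes that when the multi-domain architecture matches, the whole-sequence thresholds (40% and 60%) apply at the domain level too, which is one reading of the abstract rather than a confirmed detail of the authors' procedure. The family-specific thresholds the paper recommends are not modeled.

```python
def ec_digits_transferable(identity_pct, domain_level=False,
                           architecture_matches=False):
    """Return how many EC digits (0, 3, or 4) may be inherited.

    Generic thresholds from the abstract: 40%/60% aligned sequence
    identity, raised to 50%/70% for domain-level comparisons unless
    the multi-domain architecture matches (assumed interpretation).
    """
    if domain_level and not architecture_matches:
        third_level, full = 50.0, 70.0
    else:
        third_level, full = 40.0, 60.0
    if identity_pct >= full:
        return 4
    if identity_pct >= third_level:
        return 3
    return 0


def truncate_ec(ec_number, digits):
    """Mask the EC digits that should not be transferred, e.g. 2.7.11.-"""
    if digits == 0:
        return None
    parts = ec_number.split(".")
    return ".".join(parts[:digits] + ["-"] * (4 - digits))


# Hypothetical hit: 55% identity at the domain level, architecture differs.
digits = ec_digits_transferable(55.0, domain_level=True)
print(truncate_ec("2.7.11.22", digits))
```

Returning the number of transferable digits, rather than a yes/no answer, mirrors the abstract's distinction between inheriting a third-level and a full Enzyme Commission number.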
Affiliation(s)
- Sarah Addou
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
|
158
|
Punta M, Ofran Y. The rough guide to in silico function prediction, or how to use sequence and structure information to predict protein function. PLoS Comput Biol 2008; 4:e1000160. [PMID: 18974821 PMCID: PMC2518264 DOI: 10.1371/journal.pcbi.1000160] [Citation(s) in RCA: 66] [Impact Index Per Article: 4.1] [Indexed: 12/22/2022] Open
Affiliation(s)
- Marco Punta
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York, United States of America
- Columbia University Center for Computational Biology and Bioinformatics (C2B2), New York, New York, United States of America
- Northeast Structural Genomics Consortium (NESG), Columbia University, New York, New York, United States of America
- Yanay Ofran
- The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan, Israel
|
159
|
Mazumder R, Vasudevan S. Structure-guided comparative analysis of proteins: principles, tools, and applications for predicting function. PLoS Comput Biol 2008; 4:e1000151. [PMID: 18818720 PMCID: PMC2515338 DOI: 10.1371/journal.pcbi.1000151] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/18/2022] Open
Affiliation(s)
- Raja Mazumder
- Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, D.C., United States of America
- Sona Vasudevan
- Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, D.C., United States of America
|
160
|
Karimpour-Fard A, Leach SM, Gill RT, Hunter LE. Predicting protein linkages in bacteria: which method is best depends on task. BMC Bioinformatics 2008; 9:397. [PMID: 18816389 PMCID: PMC2570368 DOI: 10.1186/1471-2105-9-397] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 02/14/2008] [Accepted: 09/24/2008] [Indexed: 01/06/2023] Open
Abstract
Background: Applications of computational methods for predicting protein functional linkages are increasing. In recent years, several bacteria-specific methods for predicting linkages have been developed. The four major genomic context methods are: Gene cluster, Gene neighbor, Rosetta Stone, and Phylogenetic profiles. These methods have been shown to be powerful tools and this paper provides guidelines for when each method is appropriate by exploring different features of each method and potential improvements offered by their combination. We also review many previous treatments of these prediction methods, use the latest available annotations, and offer a number of new observations.
Results: Using Escherichia coli K12 and Bacillus subtilis, linkage predictions made by each of these methods were evaluated against three benchmarks: functional categories defined by COG and KEGG, known pathways listed in EcoCyc, and known operons listed in RegulonDB. Each evaluated method had strengths and weaknesses, with no one method dominating all aspects of predictive ability studied. For functional categories, as previous studies have shown, the Rosetta Stone method was individually best at detecting linkages and predicting functions among proteins with shared KEGG categories while the Phylogenetic profile method was best for linkage detection and function prediction among proteins with common COG functions. Differences in performance under COG versus KEGG may be attributable to the presence of paralogs. Better function prediction was observed when using a weighted combination of linkages based on reliability versus using a simple unweighted union of the linkage sets. For pathway reconstruction, 99 complete metabolic pathways in E. coli K12 (out of the 209 known, non-trivial pathways) and 193 pathways with 50% of their proteins were covered by linkages from at least one method. Gene neighbor was most effective individually on pathway reconstruction, with 48 complete pathways reconstructed. For operon prediction, Gene cluster completely predicted 59% of the known operons in E. coli K12 and 88% (333/418) in B. subtilis. Comparing two versions of the E. coli K12 operon database, many of the unannotated predictions in the earlier version were updated to true predictions in the later version. Using only linkages found by both Gene Cluster and Gene Neighbor improved the precision of operon predictions. Additionally, as previous studies have shown, combining features based on intergenic region and protein function improved the specificity of operon prediction.
Conclusion: A common problem for computational methods is the generation of a large number of false positives that might be caused by an incomplete source of validation. By comparing two versions of a database, we demonstrated the dramatic differences in reported results. We used several benchmarks on which we have shown the comparative effectiveness of each prediction method, as well as provided guidelines as to which method is most appropriate for a given prediction task.
Affiliation(s)
- Anis Karimpour-Fard
- Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado 80045, USA.
|
161
|
Redfern OC, Dessailly B, Orengo CA. Exploring the structure and function paradigm. Curr Opin Struct Biol 2008; 18:394-402. [PMID: 18554899 DOI: 10.1016/j.sbi.2008.05.007] [Citation(s) in RCA: 88] [Impact Index Per Article: 5.5] [Received: 03/18/2008] [Revised: 04/16/2008] [Accepted: 05/07/2008] [Indexed: 11/29/2022]
Abstract
Advances in protein structure determination, led by the structural genomics initiatives, have increased the proportion of novel folds deposited in the Protein Data Bank. However, these structures are often not accompanied by functional annotations with experimental confirmation. In this review, we reassess the meaning of structural novelty and examine its relevance to the complexity of the structure-function paradigm. Recent advances in the prediction of protein function from structure are discussed, as well as new sequence-based methods for partitioning large, diverse superfamilies into biologically meaningful clusters. Obtaining structural data for these functionally coherent groups of proteins will allow us to better understand the relationship between structure and function.
Affiliation(s)
- Oliver C Redfern
- Department of Structural and Molecular Biology, University College London, London WC1E 6BT, United Kingdom
|
162
|
Gómez A, Cedano J, Espadaler J, Hermoso A, Piñol J, Querol E. Prediction of protein function improving sequence remote alignment search by a fuzzy logic algorithm. Protein J 2008; 27:130-9. [PMID: 18066655 DOI: 10.1007/s10930-007-9116-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 10/22/2022]
Abstract
The functional annotation of new protein sequences remains a major bottleneck in genomic science. The best way to suggest the function of a protein from its sequence is by finding a related one for which biological information is available. Current alignment algorithms display a list of protein sequence stretches presenting significant similarity to different protein targets, ordered by their respective mathematical scores. However, statistical and biological significance do not always coincide; therefore, reordering the program output according to biological characteristics rather than mathematical scores alone would help functional annotation. A new method is described that predicts the putative function of a protein by integrating the results from the PSI-BLAST program with a fuzzy logic algorithm. Several protein sequence characteristics were tested for their ability to reorder a PSI-BLAST profile in closer agreement with biological function. Four of them (amino acid content, matched segment length, and hydropathic and flexibility profiles) contributed positively, once integrated by a fuzzy logic algorithm into a program, BYPASS, to the accurate prediction of the function of a protein from its sequence.
Affiliation(s)
- Antonio Gómez
- Institut de Biotecnologia i Biomedicina, Departament de Bioquímica i Biologia Molecular de la, Universitat Autònoma de Barcelona, Bellaterra, Barcelona, 08193, Spain
|
163
|
Local function conservation in sequence and structure space. PLoS Comput Biol 2008; 4:e1000105. [PMID: 18604264 PMCID: PMC2427199 DOI: 10.1371/journal.pcbi.1000105] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 11/08/2007] [Accepted: 05/28/2008] [Indexed: 11/19/2022] Open
Abstract
We assess the variability of protein function in protein sequence and structure space. Various regions in this space exhibit considerable difference in the local conservation of molecular function. We analyze and capture local function conservation by means of logistic curves. Based on this analysis, we propose a method for predicting molecular function of a query protein with known structure but unknown function. The prediction method is rigorously assessed and compared with a previously published function predictor. Furthermore, we apply the method to 500 functionally unannotated PDB structures and discuss selected examples. The proposed approach provides a simple yet consistent statistical model for the complex relations between protein sequence, structure, and function. The GOdot method is available online (http://godot.bioinf.mpi-inf.mpg.de).
|
164
|
Watanabe RLA, Morett E, Vallejo EE. Inferring modules of functionally interacting proteins using the Bond Energy Algorithm. BMC Bioinformatics 2008; 9:285. [PMID: 18559112 PMCID: PMC2474619 DOI: 10.1186/1471-2105-9-285] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 02/28/2008] [Accepted: 06/17/2008] [Indexed: 11/10/2022] Open
Abstract
Background: Non-homology based methods such as phylogenetic profiles are effective for predicting functional relationships between proteins with no considerable sequence or structure similarity. Those methods rely heavily on traditional similarity metrics defined on pairs of phylogenetic patterns. Proteins do not exclusively interact in pairs, as the final biological function of a protein in the cellular context is often held by a group of proteins. In order to accurately infer modules of functionally interacting proteins, the consideration of not only direct but also indirect relationships is required. In this paper, we used the Bond Energy Algorithm (BEA) to predict functionally related groups of proteins. With BEA we create clusters of phylogenetic profiles based on the associations of the surrounding elements of the analyzed data using a metric that considers linked relationships among elements in the data set.
Results: Using phylogenetic profiles obtained from the Cluster of Orthologous Groups of Proteins (COG) database, we conducted a series of clustering experiments using BEA to predict (upper level) relationships between profiles. We evaluated our results by comparing them with COG's functional categories and, furthermore, with the experimentally determined functional relationships between proteins provided by the DIP and ECOCYC databases. Our results demonstrate that BEA is capable of predicting meaningful modules of functionally related proteins. BEA outperforms traditionally used clustering methods, such as k-means and hierarchical clustering, by predicting functional relationships between proteins with higher accuracy.
Conclusion: This study shows that the linked relationships of phylogenetic profiles obtained by BEA are useful for detecting functional associations between profiles and extending functional modules not found by traditional methods. BEA is capable of detecting relationships among phylogenetic patterns by linking them through a common element shared in a group. Additionally, we discuss how the proposed method may become more powerful if other criteria to classify different levels of protein functional interactions, such as gene neighborhood or protein fusion information, are provided.
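To give a feel for how a bond-energy style ordering groups similar phylogenetic profiles, here is a minimal Python sketch. It is a simplified greedy insertion variant in the spirit of the Bond Energy Algorithm, not the authors' implementation: the "bond" between two binary profiles is taken to be their inner product, each profile is inserted at the position that most increases the total bond between neighbors, and the function names and toy profiles are hypothetical.

```python
def bond(a, b):
    """Inner product of two equal-length presence/absence profiles."""
    return sum(x * y for x, y in zip(a, b))


def bea_order(profiles):
    """Greedily order profiles so that similar ones end up adjacent.

    Returns a list of indices into `profiles`.
    """
    if not profiles:
        return []
    order = [0]
    for idx in range(1, len(profiles)):
        new = profiles[idx]
        best_pos, best_gain = 0, float("-inf")
        # Try every insertion slot, including both ends.
        for pos in range(len(order) + 1):
            left = profiles[order[pos - 1]] if pos > 0 else None
            right = profiles[order[pos]] if pos < len(order) else None
            gain = 0
            if left is not None:
                gain += bond(left, new)
            if right is not None:
                gain += bond(new, right)
            if left is not None and right is not None:
                # Inserting between two neighbors breaks their bond.
                gain -= bond(left, right)
            if gain > best_gain:
                best_pos, best_gain = pos, gain
        order.insert(best_pos, idx)
    return order


# Toy profiles over four genomes (1 = ortholog present); illustrative only.
profiles = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
print(bea_order(profiles))
```

In this toy input the two pairs of identical profiles end up next to each other; cutting the resulting ordering wherever the bond between neighbors is weak is one simple way to read off candidate modules.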
Affiliation(s)
- Ryosuke L A Watanabe
- ITESM Campus Estado de México, Carretera Lago de Guadalupe km 3,5, Atizapán de Zaragoza, 52926, México.
|
165
|
Powers R, Mercier KA, Copeland JC. The application of FAST-NMR for the identification of novel drug discovery targets. Drug Discov Today 2008; 13:172-9. [PMID: 18275915 DOI: 10.1016/j.drudis.2007.11.001] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Received: 08/02/2007] [Revised: 10/30/2007] [Accepted: 11/01/2007] [Indexed: 10/22/2022]
Abstract
The continued success of genome sequencing projects has resulted in a wealth of information, but 40-50% of identified genes correspond to hypothetical proteins or proteins of unknown function. The functional annotation screening technology by NMR (FAST-NMR) screen was developed to assign a biological function for these unannotated proteins with a structure solved by the protein structure initiative. FAST-NMR is based on the premise that a biological function can be described by a similarity in binding sites and ligand interactions with proteins of known function. The resulting co-structure and functional assignment may provide a starting point for a drug discovery effort.
Affiliation(s)
- Robert Powers
- Department of Chemistry, University of Nebraska-Lincoln, Lincoln, NE 68522, USA.
|
166
|
Stevens FJ. Possible evolutionary links between immunoglobulin light chains and other proteins involved in amyloidosis. Amyloid 2008; 15:96-107. [PMID: 18484336 DOI: 10.1080/13506120802005973] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/22/2022]
Abstract
With limited exceptions, proteins that account for the amyloidoses appear to be evolutionarily unrelated. Transthyretin is classified as having an "immunoglobulin-like" fold as found in light chain variable and constant domains. Thus, these amyloidogenic proteins have significant conformational similarity. In the absence of primary structure similarity sufficient to justify an inference of an evolutionary relationship, transthyretin is considered an analog of immunoglobulin domains having accrued the immunoglobulin-like fold by some form of convergent evolution of structure. Improvements in sequence comparison tools and strategies, coupled with recent logarithmic increases in the availability of primary structure data, now make it possible to suggest that transthyretin and immunoglobulins may have a common evolutionary origin. In addition, lactadherin, the medin fragment of which accounts for the most common form of human amyloid, also appears to be evolutionarily linked to transthyretin and immunoglobulins.
Affiliation(s)
- Fred J Stevens
- Biosciences Division, Argonne National Laboratory, Argonne, IL 60439, USA.
|
167
|
Espadaler J, Eswar N, Querol E, Avilés FX, Sali A, Marti-Renom MA, Oliva B. Prediction of enzyme function by combining sequence similarity and protein interactions. BMC Bioinformatics 2008; 9:249. [PMID: 18505562 PMCID: PMC2430716 DOI: 10.1186/1471-2105-9-249] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.4] [Received: 07/31/2007] [Accepted: 05/27/2008] [Indexed: 11/18/2022] Open
Abstract
Background: A number of studies have used protein interaction data alone for protein function prediction. Here, we introduce a computational approach for annotation of enzymes, based on the observation that similar protein sequences are more likely to perform the same function if they share similar interacting partners.
Results: The method has been tested against the PSI-BLAST program using a set of 3,890 protein sequences for which interaction data were available. For protein sequences that align with at least 40% sequence identity to a known enzyme, the specificity of our method in predicting the first three EC digits increased from 80% to 90% at 80% coverage when compared to PSI-BLAST.
Conclusion: Our method can also be used for proteins for which homologous sequences with known interacting partners can be detected. Thus, our method could increase by 10% the specificity of genome-wide enzyme predictions based on sequence matching by PSI-BLAST alone.
Affiliation(s)
- Jordi Espadaler
- Laboratori de Bioinformàtica Estructural (GRIB), Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra-IMIM, 08003-Barcelona, Catalonia, Spain.
|
168
|
Fetrow JS. Active site profiling to identify protein functional sites in sequences and structures using the Deacon Active Site Profiler (DASP). CURRENT PROTOCOLS IN BIOINFORMATICS 2008; Chapter 8:8.10.1-8.10.16. [PMID: 18428769 DOI: 10.1002/0471250953.bi0810s14] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
Methods for the annotation and analysis of functional sites in proteins are an area of active research, and those methods that allow detailed characterization of functional site features are much needed. A Web site application, DASP, which implements a previously described method (Cammer, et al., 2003) to allow users to create an active site profile for any protein family, is described. Two protocols for functional site analysis of protein families using DASP are presented: 1) creation of functional site signatures and a profile from proteins of known structure and 2) utilization of the active site profile to search sequences that contain fragments similar to those found in the functional site signatures. The active site profile produced by Basic Protocol 1 allows the user to analyze the features of the functional site, i.e., those characteristics that are common across the family and those that are unique to one or several members of the family. The characteristics that are unique to a subfamily might be described as specificity determinants, i.e., features that impart specificity to a particular function. Basic Protocol 2 provides instructions for searching for sequences that might contain a similar functional site.
|
169
|
Lobley AE, Nugent T, Orengo CA, Jones DT. FFPred: an integrated feature-based function prediction server for vertebrate proteomes. Nucleic Acids Res 2008; 36:W297-302. [PMID: 18463141 PMCID: PMC2447771 DOI: 10.1093/nar/gkn193] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022] Open
Abstract
One of the challenges of the post-genomic era is to provide accurate function annotations for large volumes of data resulting from genome sequencing projects. Most function prediction servers utilize methods that transfer existing database annotations between orthologous sequences. In contrast, there are few methods that are independent of homology and can annotate distant and orphan protein sequences. The FFPred server adopts a machine-learning approach to perform function prediction in protein feature space using feature characteristics predicted from amino acid sequence. The features are scanned against a library of support vector machines representing over 300 Gene Ontology (GO) classes and probabilistic confidence scores returned for each annotation term. The GO term library has been modelled on human protein annotations; however, benchmark performance testing showed robust performance across higher eukaryotes. FFPred offers important advantages over traditional function prediction servers in its ability to annotate distant homologues and orphan protein sequences, and achieves greater coverage and classification accuracy than other feature-based prediction servers. A user may upload an amino acid sequence and receive annotation predictions via email. Feature information is provided as easy to interpret graphics displayed on the sequence of interest, allowing for back-interpretation of the associations between features and function classes.
Affiliation(s)
- A E Lobley
- Department of Computer Science, University College London, London WC1E 6BT, United Kingdom
|
170
|
Toyota CG, Berthold CL, Gruez A, Jónsson S, Lindqvist Y, Cambillau C, Richards NGJ. Differential substrate specificity and kinetic behavior of Escherichia coli YfdW and Oxalobacter formigenes formyl coenzyme A transferase. J Bacteriol 2008; 190:2556-64. [PMID: 18245280 PMCID: PMC2293189 DOI: 10.1128/jb.01823-07] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2007] [Accepted: 01/25/2008] [Indexed: 01/29/2023] Open
Abstract
The yfdXWUVE operon appears to encode proteins that enhance the ability of Escherichia coli MG1655 to survive under acidic conditions. Although the molecular mechanisms underlying this phenotypic behavior remain to be elucidated, findings from structural genomic studies have shown that the structure of YfdW, the protein encoded by the yfdW gene, is homologous to that of the enzyme that mediates oxalate catabolism in the obligate anaerobe Oxalobacter formigenes, O. formigenes formyl coenzyme A transferase (FRC). We now report the first detailed examination of the steady-state kinetic behavior and substrate specificity of recombinant, wild-type YfdW. Our studies confirm that YfdW is a formyl coenzyme A (formyl-CoA) transferase, and YfdW appears to be more stringent than the corresponding enzyme (FRC) in Oxalobacter in employing formyl-CoA and oxalate as substrates. We also report the effects of replacing Trp-48 in the FRC active site with the glutamine residue that occupies an equivalent position in the E. coli protein. The results of these experiments show that Trp-48 precludes oxalate binding to a site that mediates substrate inhibition for YfdW. In addition, the replacement of Trp-48 by Gln-48 yields an FRC variant for which oxalate-dependent substrate inhibition is modified to resemble that seen for YfdW. Our findings illustrate the utility of structural homology in assigning enzyme function and raise the question of whether oxalate catabolism takes place in E. coli upon the up-regulation of the yfdXWUVE operon under acidic conditions.
Affiliation(s)
- Cory G Toyota
- Department of Chemistry, University of Florida, Gainesville, FL 32611-7200, USA
|
171
|
Pugalenthi G, Kumar KK, Suganthan P, Gangal R. Identification of catalytic residues from protein structure using support vector machine with sequence and structural features. Biochem Biophys Res Commun 2008; 367:630-4. [DOI: 10.1016/j.bbrc.2008.01.038] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2008] [Accepted: 01/10/2008] [Indexed: 11/16/2022]
|
172
|
|
173
|
Bujnicki JM, Droogmans L, Grosjean H, Purushothaman SK, Lapeyre B. Bioinformatics-Guided Identification and Experimental Characterization of Novel RNA Methyltransferases. ACTA ACUST UNITED AC 2008. [DOI: 10.1007/978-3-540-74268-5_7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/28/2023]
|
174
|
Lee D, Redfern O, Orengo C. Predicting protein function from sequence and structure. Nat Rev Mol Cell Biol 2007; 8:995-1005. [PMID: 18037900 DOI: 10.1038/nrm2281] [Citation(s) in RCA: 359] [Impact Index Per Article: 21.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
|
175
|
Yeats C, Lees J, Reid A, Kellam P, Martin N, Liu X, Orengo C. Gene3D: comprehensive structural and functional annotation of genomes. Nucleic Acids Res 2007; 36:D414-8. [PMID: 18032434 PMCID: PMC2238970 DOI: 10.1093/nar/gkm1019] [Citation(s) in RCA: 62] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Gene3D provides comprehensive structural and functional annotation of most available protein sequences, including the UniProt, RefSeq and Integr8 resources. The main structural annotation is generated through scanning these sequences against the CATH structural domain database profile-HMM library. CATH is a database of manually derived PDB-based structural domains, placed within a hierarchy reflecting topology, homology and conservation and is able to infer more ancient and divergent homology relationships than sequence-based approaches. This data is supplemented with Pfam-A, other non-domain structural predictions (i.e. coiled coils) and experimental data from UniProt. In order to enhance the investigations possible with this data, we have also incorporated a variety of protein annotation resources, including protein-protein interaction data, GO functional assignments, KEGG pathways, FUNCAT functional descriptions and links to microarray expression data. All of this data can be accessed through a newly re-designed website that has a focus on flexibility and clarity, with searches that can be restricted to a single genome or across the entire sequence database. Currently Gene3D contains over 3.5 million domain assignments for nearly 5 million proteins including 527 completed genomes. This is available at: http://gene3d.biochem.ucl.ac.uk/
Affiliation(s)
- Corin Yeats
- UCL, Department of Molecular Biology & Biochemistry, Darwin Building, Gower St, London, UK.
|
176
|
Investigation of factors affecting prediction of protein-protein interaction networks by phylogenetic profiling. BMC Genomics 2007; 8:393. [PMID: 17967189 PMCID: PMC2204017 DOI: 10.1186/1471-2164-8-393] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2007] [Accepted: 10/29/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The use of computational methods for predicting protein interaction networks will continue to grow with the number of fully sequenced genomes available. The Co-Conservation method, also known as the Phylogenetic profiles method, is a well-established computational tool for predicting functional relationships between proteins. RESULTS Here, we examined how various aspects of this method affect the accuracy and topology of protein interaction networks. We have shown that the choice of reference genome influences the number of predictions involving proteins of previously unknown function, the accuracy of predicted interactions, and the topology of predicted interaction networks. We show that while such results are relatively insensitive to the E-value threshold used in defining homologs, predicted interactions are influenced by the similarity metric that is employed. We show that differences in predicted protein interactions are biologically meaningful, where judicious selection of reference genomes, or use of a new scoring scheme that explicitly considers reference genome relatedness, produces known protein interactions as well as predicted protein interactions involving coordinated biological processes that are not accessible using currently available databases. CONCLUSION These studies should prove valuable for future studies seeking to further improve phylogenetic profiling methodologies as well as for efforts to efficiently employ such methods to develop new biological insights.
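The phylogenetic profile idea described in this abstract can be illustrated with a minimal sketch. It assumes each protein is represented by the set of reference genomes in which a homolog was detected, and it uses Jaccard similarity as the comparison metric. The metric, the 0.75 threshold, the function names, and the toy data are all assumptions for illustration; the abstract does not say which metrics or scoring scheme the authors used.

```python
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity between two sets of genome identifiers."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_links(profiles: dict, threshold: float = 0.75) -> list:
    """Return (protein, protein, score) for pairs with similar profiles.

    `profiles` maps a protein name to the set of reference genomes
    that contain a homolog of it (hypothetical input format).
    """
    links = []
    for (p, gp), (q, gq) in combinations(sorted(profiles.items()), 2):
        score = jaccard(gp, gq)
        if score >= threshold:
            links.append((p, q, score))
    return links


# Hypothetical toy data, not taken from the paper.
profiles = {
    "protA": frozenset({"g1", "g2", "g3", "g4"}),
    "protB": frozenset({"g1", "g2", "g3"}),
    "protC": frozenset({"g5"}),
}
links = predict_links(profiles)
print(links)
```

Changing which genomes appear in the profiles, or swapping `jaccard` for another metric, changes which pairs pass the threshold, which is the kind of sensitivity the abstract reports.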
|
177
|
Shah PK, Tripathi LP, Jensen LJ, Gahnim M, Mason C, Furlong EE, Rodrigues V, White KP, Bork P, Sowdhamini R. Enhanced function annotations for Drosophila serine proteases: a case study for systematic annotation of multi-member gene families. Gene 2007; 407:199-215. [PMID: 17996400 DOI: 10.1016/j.gene.2007.10.012] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/08/2007] [Revised: 09/09/2007] [Accepted: 10/07/2007] [Indexed: 12/30/2022]
Abstract
Systematically annotating function of enzymes that belong to large protein families encoded in a single eukaryotic genome is a very challenging task. We carried out such an exercise to annotate function for serine-protease family of the trypsin fold in Drosophila melanogaster, with an emphasis on annotating serine-protease homologues (SPHs) that may have lost their catalytic function. Our approach involves data mining and data integration to provide function annotations for 190 Drosophila gene products containing serine-protease-like domains, of which 35 are SPHs. This was accomplished by analysis of structure-function relationships, gene-expression profiles, large-scale protein-protein interaction data, literature mining and bioinformatic tools. We introduce functional residue clustering (FRC), a method that performs hierarchical clustering of sequences using properties of functionally important residues and utilizes correlation co-efficient as a quantitative similarity measure to transfer in vivo substrate specificities to proteases. We show that the efficiency of transfer of substrate-specificity information using this method is generally high. FRC was also applied on Drosophila proteases to assign putative competitive inhibitor relationships (CIRs). Microarray gene-expression data were utilized to uncover a large-scale and dual involvement of proteases in development and in immune response. We found specific recruitment of SPHs and proteases with CLIP domains in immune response, suggesting evolution of a new function for SPHs. We also suggest existence of separate downstream protease cascades for immune response against bacterial/fungal infections and parasite/parasitoid infections. We verify quality of our annotations using information from RNAi screens and other evidence types. Utilization of such multi-fold approaches results in 10-fold increase of function annotation for Drosophila serine proteases and demonstrates value in increasing annotations in multiple genomes.
Affiliation(s)
- Parantu K Shah
- European Molecular Biology Laboratory, Meyerhofstrasse 1, Heidelberg, Germany
|
178
|
Functional differentiation of proteins: implications for structural genomics. Structure 2007; 15:405-15. [PMID: 17437713 DOI: 10.1016/j.str.2007.02.005] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/27/2006] [Revised: 02/15/2007] [Accepted: 02/16/2007] [Indexed: 01/06/2023]
Abstract
Structural genomics is a broad initiative of various centers aiming to provide complete coverage of protein structure space. Because it is not feasible to experimentally determine the structures of all proteins, it is generally agreed that the only viable strategy to achieve such coverage is to carefully select specific proteins (targets), determine their structure experimentally, and then use comparative modeling techniques to model the rest. Here we suggest that structural genomics centers refine the structure-driven approach in target selection by adopting function-based criteria. We suggest targeting functionally divergent superfamilies within a given structural fold so that each function receives a structural characterization. We have developed a method to do so, and an itemized survey of several functionally rich folds shows that they are only partially functionally characterized. We call upon structural genomics centers to consider this approach and upon computational biologists to further develop function-based targeting methods.
|
179
|
Shen HB, Chou KC. EzyPred: a top-down approach for predicting enzyme functional classes and subclasses. Biochem Biophys Res Commun 2007; 364:53-9. [PMID: 17931599 DOI: 10.1016/j.bbrc.2007.09.098] [Citation(s) in RCA: 163] [Impact Index Per Article: 9.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2007] [Accepted: 09/22/2007] [Indexed: 11/25/2022]
Abstract
Given a protein sequence, how can we identify whether it is an enzyme or non-enzyme? If it is, to which main functional class does it belong? What about its sub-functional class? It is important to address these problems because they are closely correlated with the biological function of an uncharacterized protein and its acting object and process. Particularly, with the avalanche of protein sequences generated in the Post Genomic Age and the much slower progress in determining their functions by experiments, it is highly desirable to develop an automated method by which one can get a fast and accurate answer to these questions. Here, a top-down predictor, called EzyPred, is developed by fusing the results derived from the functional domain and evolution information. EzyPred is a 3-layer predictor: the 1st layer prediction engine is for identifying a query protein as enzyme or non-enzyme; the 2nd layer for the main functional class; and the 3rd layer for the sub-functional class. The overall success rates for all three layers are higher than 90%, as obtained through rigorous cross-validation tests on very stringent benchmark datasets in which no protein has ≥40% sequence identity to any other in the same class or subclass. EzyPred is freely accessible at http://chou.med.harvard.edu/bioinf/EzyPred/, where one can get the desired 3-level results for a query protein sequence in less than 90 s.
Affiliation(s)
- Hong-Bin Shen
- Gordon Life Science Institute, San Diego, CA 92130, USA.
|
180
|
Fuhrer T, Chen L, Sauer U, Vitkup D. Computational prediction and experimental verification of the gene encoding the NAD+/NADP+-dependent succinate semialdehyde dehydrogenase in Escherichia coli. J Bacteriol 2007; 189:8073-8. [PMID: 17873044 PMCID: PMC2168661 DOI: 10.1128/jb.01027-07] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/18/2022] Open
Abstract
Although NAD(+)-dependent succinate semialdehyde dehydrogenase activity was first described in Escherichia coli more than 25 years ago, the responsible gene has remained elusive so far. As an experimental proof of concept for a gap-filling algorithm for metabolic networks developed earlier, we demonstrate here that the E. coli gene yneI is responsible for this activity. Our biochemical results demonstrate that the yneI-encoded succinate semialdehyde dehydrogenase can use either NAD(+) or NADP(+) to oxidize succinate semialdehyde to succinate. The gene is induced by succinate semialdehyde, and expression data indicate that yneI plays a unique physiological role in the general nitrogen metabolism of E. coli. In particular, we demonstrate using mutant growth experiments that the yneI gene has an important, but not essential, role during growth on arginine and probably has an essential function during growth on putrescine as the nitrogen source. The NADP(+)-dependent succinate semialdehyde dehydrogenase activity encoded by the functional homolog gabD appears to be important for nitrogen metabolism under N limitation conditions. The yneI-encoded activity, in contrast, functions primarily as a valve to prevent toxic accumulation of succinate semialdehyde. Analysis of available genome sequences demonstrated that orthologs of both yneI and gabD are broadly distributed across phylogenetic space.
Affiliation(s)
- Tobias Fuhrer
- Institute of Molecular Systems Biology, ETH Zurich, CH-8093 Zurich, Switzerland
|
181
|
Affiliation(s)
- Dmitrij Frishman
- Department of Genome Oriented Bioinformatics, Technische Universität München, Wissenschaftszentrum Weihenstephan, 85350 Freising, Germany
|
182
|
Kunik V, Meroz Y, Solan Z, Sandbank B, Weingart U, Ruppin E, Horn D. Functional representation of enzymes by specific peptides. PLoS Comput Biol 2007; 3:e167. [PMID: 17722976 PMCID: PMC1950953 DOI: 10.1371/journal.pcbi.0030167] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2007] [Accepted: 07/10/2007] [Indexed: 11/19/2022] Open
Abstract
Predicting the function of a protein from its sequence is a long-standing goal of bioinformatic research. While sequence similarity is the most popular tool used for this purpose, sequence motifs may also subserve this goal. Here we develop a motif-based method consisting of applying an unsupervised motif extraction algorithm (MEX) to all enzyme sequences, and filtering the results by the four-level classification hierarchy of the Enzyme Commission (EC). The resulting motifs serve as specific peptides (SPs), appearing on single branches of the EC. In contrast to previous motif-based methods, the new method does not require any preprocessing by multiple sequence alignment, nor does it rely on over-representation of motifs within EC branches. The SPs obtained comprise on average 8.4 ± 4.5 amino acids, and specify the functions of 93% of all enzymes, which is much higher than the coverage of 63% provided by ProSite motifs. The SP classification thus compares favorably with previous function annotation methods and successfully demonstrates an added value in extreme cases where sequence similarity fails. Interestingly, SPs cover most of the annotated active and binding site amino acids, and occur in active-site neighboring 3-D pockets in a highly statistically significant manner. The latter are assumed to have strong biological relevance to the activity of the enzyme. Further filtering of SPs by biological functional annotations results in reduced small subsets of SPs that possess very large enzyme coverage. Overall, SPs both form a very useful tool for enzyme functional classification and bear responsibility for the catalytic biological function carried out by enzymes. Sequence motifs are known to provide information about functional properties of proteins. In the past, many approaches have looked for deterministic motifs in protein sequences, by searching for functionally over-represented k-mers, with moderate levels of success. 
Here we revisit and renew the utility of deterministic motifs, by searching for them in a partially unsupervised and context-dependent manner. Using a novel motif extraction algorithm, MEX, deterministic sequence motifs are extracted from Swiss Prot data containing more than 50,000 enzymes. They are then filtered by the Enzyme Commission classification hierarchy to produce sets of specific peptides (SPs). The latter specify enzyme function for 93% of the data, comparing well with existing approaches for enzyme classification. Importantly, SPs are found to have biological significance. A majority of all known active and binding sites of enzymes are covered by SPs, and many SPs are found to lie within spatial pockets in the neighborhood of the active sites. Both these results have extremely high statistical significance. A user-friendly tool that displays the hits of SPs for any protein sequence that is presented as a query, together with the EC assignments due to these SPs, is available at http://adios.tau.ac.il/SPSearch.
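The filtering step described above, which keeps only motifs that appear on a single branch of the EC hierarchy, can be sketched as follows. This is an illustration only. It substitutes fixed-length k-mers for the motifs produced by the MEX algorithm, which is not reproduced here, and the function names, the choice of `k`, and the toy sequences are hypothetical.

```python
from collections import defaultdict


def kmers(seq: str, k: int) -> set:
    """All distinct substrings of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def specific_peptides(enzymes: list, k: int = 4, level: int = 1) -> dict:
    """Map each k-mer seen on exactly one EC branch to that branch.

    `enzymes` is a list of (ec_number, sequence) pairs; `level` is the
    number of leading EC digits that define a branch.
    """
    branches = defaultdict(set)
    for ec, seq in enzymes:
        branch = ".".join(ec.split(".")[:level])
        for pep in kmers(seq, k):
            branches[pep].add(branch)
    # Keep only peptides that never occur outside a single branch.
    return {pep: next(iter(b)) for pep, b in branches.items() if len(b) == 1}


# Hypothetical toy sequences, not real enzymes.
enzymes = [
    ("1.1.1.1", "MKTAYIAK"),
    ("1.1.1.2", "GGTAYIAQ"),
    ("3.4.21.4", "MKTAGGSW"),
]
sps = specific_peptides(enzymes)
print(sps["TAYI"])    # shared by both EC 1 sequences only
print("MKTA" in sps)  # occurs in EC 1 and EC 3, so it is filtered out
```

Raising `level` makes the branches narrower, so a peptide shared across sub-branches at that depth is filtered out; peptides that survive assign correspondingly more detailed EC annotations.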
Affiliation(s)
- Vered Kunik
- School of Computer Science, Tel Aviv University, Tel Aviv, Israel
- Yasmine Meroz
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
- Zach Solan
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
- Ben Sandbank
- School of Computer Science, Tel Aviv University, Tel Aviv, Israel
- Uri Weingart
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
- Eytan Ruppin
- School of Computer Science, Tel Aviv University, Tel Aviv, Israel
- Sackler School of Medicine, Tel Aviv University, Tel Aviv, Israel
- David Horn
- School of Physics and Astronomy, Tel Aviv University, Tel Aviv, Israel
- * To whom correspondence should be addressed.
|
183
|
Chen L, Vitkup D. Distribution of orphan metabolic activities. Trends Biotechnol 2007; 25:343-8. [PMID: 17580095 DOI: 10.1016/j.tibtech.2007.06.001] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2007] [Revised: 04/17/2007] [Accepted: 06/01/2007] [Indexed: 10/23/2022]
Abstract
A significant fraction (30-40%) of known metabolic activities is currently orphan. Although orphan activities have been biochemically characterized, we do not know a single gene responsible for these reactions in any organism. The problem of orphan activities represents one of the major challenges of modern biochemistry. We analyze the distribution of orphans across biochemical space, through years of enzymatic characterization, and by biological organisms. We find that orphan metabolic activities have been accumulating for many decades. They are widely distributed across enzymatic functional space and metabolic network neighborhoods. Although orphans are relatively more abundant in less studied species, over half of orphan reactions have been experimentally characterized in more than one organism. Shrinking the space of orphan activities will likely require a close collaboration between computational and experimental laboratories.
Affiliation(s)
- Lifeng Chen
- Center for Computational Biology and Bioinformatics and Department of Biomedical Informatics, Columbia University, 1130 Nicholas Ave., Irving Cancer Research Center, New York, NY 10032, USA
|
184
|
Moriya Y, Itoh M, Okuda S, Yoshizawa AC, Kanehisa M. KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res 2007; 35:W182-5. [PMID: 17526522 PMCID: PMC1933193 DOI: 10.1093/nar/gkm321] [Citation(s) in RCA: 2843] [Impact Index Per Article: 167.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open
Abstract
The number of complete and draft genomes has grown rapidly in recent years, and it has become increasingly important to automate the identification of functional properties and biological roles of genes in these genomes. In the KEGG database, genes in complete genomes are annotated with the KEGG orthology (KO) identifiers, or the K numbers, based on the best hit information using Smith-Waterman scores as well as by manual curation. Each K number represents an ortholog group of genes, and it is directly linked to an object in the KEGG pathway map or the BRITE functional hierarchy. Here, we have developed a web-based server called KAAS (KEGG Automatic Annotation Server: http://www.genome.jp/kegg/kaas/), an implementation of a rapid method to automatically assign K numbers to genes in the genome, enabling reconstruction of KEGG pathways and BRITE hierarchies. The method is based on sequence similarities, bi-directional best hit information and some heuristics, and has achieved a high degree of accuracy when compared with the manually curated KEGG GENES database.
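The bi-directional best hit criterion mentioned in this abstract can be sketched in a few lines. This is a simplified illustration, not the KAAS implementation: the server's score thresholds and additional heuristics are not described here and are omitted, and the function names, input format, and scores are hypothetical.

```python
def best_hits(scores: dict) -> dict:
    """Map each query to its highest-scoring target.

    `scores` maps (query, target) pairs to an alignment score.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {q: t for q, (t, _) in best.items()}


def bidirectional_best_hits(forward: dict, reverse: dict) -> list:
    """Pairs (query, target) that are each other's best hit."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((q, t) for q, t in fwd.items() if rev.get(t) == q)


# Hypothetical scores: query genes against reference genes and back.
forward = {
    ("geneA", "ref1"): 310, ("geneA", "ref2"): 120,
    ("geneB", "ref1"): 200, ("geneB", "ref2"): 95,
}
reverse = {
    ("ref1", "geneA"): 305, ("ref1", "geneB"): 190,
    ("ref2", "geneA"): 110, ("ref2", "geneB"): 90,
}
pairs = bidirectional_best_hits(forward, reverse)
print(pairs)
```

In this toy data `geneB` also hits `ref1` best, but `ref1` prefers `geneA`, so only the reciprocal pair is kept. Under this criterion only `geneA` would inherit the ortholog annotation attached to `ref1`.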
Affiliation(s)
- Minoru Kanehisa
- *To whom correspondence should be addressed. +81 774 38 3270; +81 774 38 3269
|
185
|
Marti-Renom MA, Rossi A, Al-Shahrour F, Davis FP, Pieper U, Dopazo J, Sali A. The AnnoLite and AnnoLyze programs for comparative annotation of protein structures. BMC Bioinformatics 2007; 8 Suppl 4:S4. [PMID: 17570147 PMCID: PMC1892083 DOI: 10.1186/1471-2105-8-s4-s4] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Background Advances in structural biology, including structural genomics, have resulted in a rapid increase in the number of experimentally determined protein structures. However, about half of the structures deposited by the structural genomics consortia have little or no information about their biological function. Therefore, there is a need for tools for automatically and comprehensively annotating the function of protein structures. We aim to provide such tools by applying comparative protein structure annotation that relies on detectable relationships between protein structures to transfer functional annotations. Here we introduce two programs, AnnoLite and AnnoLyze, which use the structural alignments deposited in the DBAli database. Description AnnoLite predicts the SCOP, CATH, EC, InterPro, PfamA, and GO terms with an average sensitivity of ~90% and average precision of ~80%. AnnoLyze predicts ligand binding site and domain interaction patches with an average sensitivity of ~70% and average precision of ~30%, correctly localizing binding sites for small molecules in ~95% of its predictions. Conclusion The AnnoLite and AnnoLyze programs for comparative annotation of protein structures can reliably and automatically annotate new protein structures. The programs are fully accessible via the Internet as part of the DBAli suite of tools at .
Affiliation(s)
- Marc A Marti-Renom
- Structural Genomics Unit, Bioinformatics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain
- Andrea Rossi
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry, and California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, CA 94143, USA
- Fátima Al-Shahrour
- Functional Genomics Unit, Bioinformatics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain
- Fred P Davis
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry, and California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, CA 94143, USA
- Ursula Pieper
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry, and California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, CA 94143, USA
- Joaquín Dopazo
- Functional Genomics Unit, Bioinformatics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain
- Andrej Sali
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry, and California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, CA 94143, USA
|
186
|
Mirkovic N, Li Z, Parnassa A, Murray D. Strategies for high-throughput comparative modeling: applications to leverage analysis in structural genomics and protein family organization. Proteins 2007; 66:766-77. [PMID: 17154423 DOI: 10.1002/prot.21191] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
The technological breakthroughs in structural genomics were designed to facilitate the solution of a sufficient number of structures, so that as many protein sequences as possible can be structurally characterized with the aid of comparative modeling. The leverage of a solved structure is the number and quality of the models that can be produced using the structure as a template for modeling and may be viewed as the "currency" with which the success of a structural genomics endeavor can be measured. Moreover, the models obtained in this way should be valuable to all biologists. To this end, at the Northeast Structural Genomics Consortium (NESG), a modular computational pipeline for automated high-throughput leverage analysis was devised and used to assess the leverage of the 186 unique NESG structures solved during the first phase of the Protein Structure Initiative (January 2000 to July 2005). Here, the results of this analysis are presented. The number of sequences in the nonredundant protein sequence database covered by quality models produced by the pipeline is approximately 39,000, so that the average leverage is approximately 210 models per structure. Interestingly, only 7900 of these models fulfill the stringent modeling criterion of being at least 30% sequence-identical to the corresponding NESG structures. This study shows how high-throughput modeling increases the efficiency of structure determination efforts by providing enhanced coverage of protein structure space. In addition, the approach is useful in refining the boundaries of structural domains within larger protein sequences, subclassifying sequence diverse protein families, and defining structure-based strategies specific to a particular family.
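The average leverage quoted in this abstract follows directly from the two counts it gives. The short calculation below reproduces that arithmetic from the rounded figures in the abstract. The variable names are illustrative, and the share of models meeting the 30% identity criterion is derived here for convenience rather than stated in the abstract.

```python
# Figures as reported (approximately) in the abstract.
nesg_structures = 186        # unique NESG structures, PSI phase one
modeled_sequences = 39_000   # sequences covered by quality models
stringent_models = 7_900     # models at >= 30% sequence identity

# Average leverage: quality models produced per solved structure.
average_leverage = modeled_sequences / nesg_structures

# Derived here: fraction of models meeting the stringent criterion.
stringent_fraction = stringent_models / modeled_sequences

print(round(average_leverage))      # 210
print(f"{stringent_fraction:.0%}")  # 20%
```

On these figures, roughly four out of five models rely on templates below 30% sequence identity.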
Affiliation(s)
- Nebojsa Mirkovic
- Department of Microbiology and Immunology, Weill Medical College of Cornell University, New York, New York 10021, USA
|
187
|
|
188
|
Marsden RL, Lewis TA, Orengo CA. Towards a comprehensive structural coverage of completed genomes: a structural genomics viewpoint. BMC Bioinformatics 2007; 8:86. [PMID: 17349043 PMCID: PMC1829165 DOI: 10.1186/1471-2105-8-86] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.8] [Received: 10/31/2006] [Accepted: 03/09/2007] [Indexed: 11/25/2022] Open
Abstract
Background: Structural genomics initiatives were established with the aim of solving protein structures on a large scale. For many initiatives, such as the Protein Structure Initiative (PSI), the primary aim of target selection is to structurally characterise protein families which, so far, lack a structural representative. It is therefore of considerable interest to gain insights into the number and distribution of these families, and what efforts may be required to achieve a comprehensive structural coverage across all protein families. Results: In this analysis we have derived a comprehensive domain annotation of the genomes using CATH, Pfam-A and Newfam domain families. We consider what proportions of structurally uncharacterised families are accessible to high-throughput structural genomics pipelines, specifically those targeting families containing multiple prokaryotic orthologues. In measuring the domain coverage of the genomes, we show the benefits of selecting targets from structurally uncharacterised domain families whilst also pursuing additional targets from large structurally characterised protein superfamilies. Conclusion: This work suggests that such a combined approach to target selection is essential if structural genomics is to achieve a comprehensive structural coverage of the genomes, leading to greater insights into structure and the mechanisms that underlie protein evolution.
Affiliation(s)
- Russell L Marsden
- Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK
- Tony A Lewis
- Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK
- Christine A Orengo
- Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK
|
189
|
Abstract
Motivation: Large-scale experiments reveal pairs of interacting proteins but leave the residues involved in the interactions unknown. These interface residues are essential for understanding the mechanism of interaction and are often desired drug targets. Reliable identification of residues that reside in protein-protein interfaces typically requires analysis of protein structure. Therefore, for the vast majority of proteins, for which there is no high-resolution structure, there is no effective way of identifying interface residues. Results: Here we present a machine learning-based method that identifies interacting residues from sequence alone. Although the method is developed using transient protein-protein interfaces from complexes of experimentally known 3D structures, it never explicitly uses 3D information. Instead, we combine predicted structural features with evolutionary information. The strongest predictions of the method reached over 90% accuracy in a cross-validation experiment. Our results suggest that despite the significant diversity in the nature of protein-protein interactions, they all share common basic principles and that these principles are identifiable from sequence alone.
Affiliation(s)
- Yanay Ofran
- CUBIC & North-East Structural Genomics Consortium, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA.
|
190
|
Yuan Tseng Y, Lian J. Estimating evolutionary rate of local protein binding surfaces: a Bayesian Monte Carlo approach. CONFERENCE PROCEEDINGS : ... ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL CONFERENCE 2007; 2006:739-42. [PMID: 17282289 DOI: 10.1109/iembs.2005.1616520] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/13/2023]
Abstract
To infer protein function by matching local surface patterns, an effective scoring matrix for evaluating surface similarity is critical. In this study, we develop an evolution model of binding surfaces using a continuous time Markov process. We develop a Bayesian Markov chain Monte Carlo method to estimate the substitution rates of amino acid residues with specialized move sets. We then develop scoring matrices of residue similarity specific to a functional site and show how they can be used to identify similar binding surfaces, and how such information can be used for predicting biological roles of proteins. Our method is especially effective in extracting evolutionary information from the phylogeny of sequences homologous to a protein structure, all of which may be of unknown function.
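In a continuous time Markov process, substitution probabilities over an interval t follow from the rate matrix Q as P(t) = exp(Qt). The sketch below illustrates only that relationship on a hypothetical two-state rate matrix with made-up rates; it is not the authors' 20-state amino acid model or their Bayesian estimation procedure, and the function name is an assumption introduced here.

```python
import math

def transition_matrix(rate_matrix, t, terms=60):
    """Approximate P(t) = exp(Q t) with a truncated Taylor series."""
    n = len(rate_matrix)
    # Start from the identity matrix (the k = 0 term of the series).
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term_k = term_{k-1} @ Q * t / k
        term = [[sum(term[i][m] * rate_matrix[m][j] for m in range(n)) * t / k
                 for j in range(n)] for i in range(n)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

# Hypothetical two-state rate matrix (rows sum to zero); rates are made up.
a, b = 0.3, 0.1
Q = [[-a, a], [b, -b]]
t = 2.0
P = transition_matrix(Q, t)

# Closed form for the two-state case, used here as a cross-check.
p00_closed_form = b / (a + b) + a / (a + b) * math.exp(-(a + b) * t)
print(P[0][0], p00_closed_form)
```

Each row of P(t) sums to one, and the series result agrees with the closed form available for two states.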
Affiliation(s)
- Yan Yuan Tseng
- Dept of Bioengineering, University of Illinois at Chicago, Chicago, IL 60607, USA.
|
191
|
Keiser MJ, Roth BL, Armbruster BN, Ernsberger P, Irwin JJ, Shoichet BK. Relating protein pharmacology by ligand chemistry. Nat Biotechnol 2007; 25:197-206. [PMID: 17287757 DOI: 10.1038/nbt1284] [Citation(s) in RCA: 1392] [Impact Index Per Article: 81.9] [Indexed: 11/09/2022]
Abstract
The identification of protein function based on biological information is an area of intense research. Here we consider a complementary technique that quantitatively groups and relates proteins based on the chemical similarity of their ligands. We began with 65,000 ligands annotated into sets for hundreds of drug targets. The similarity score between each set was calculated using ligand topology. A statistical model was developed to rank the significance of the resulting similarity scores, which are expressed as a minimum spanning tree to map the sets together. Although these maps are connected solely by chemical similarity, biologically sensible clusters nevertheless emerged. Links among unexpected targets also emerged, among them that methadone, emetine and loperamide (Imodium) may antagonize muscarinic M3, alpha2 adrenergic and neurokinin NK2 receptors, respectively. These predictions were subsequently confirmed experimentally. Relating receptors by ligand chemistry organizes biology to reveal unexpected relationships that may be assayed using the ligands themselves.
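The abstract describes expressing pairwise set similarities as a minimum spanning tree. A minimal sketch of that last step is shown below using Kruskal's algorithm. The target names and scores are invented, and converting similarity to a dissimilarity weight (1 - similarity) is an assumption made here for illustration, not a detail taken from the paper.

```python
def minimum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm: return the edges of a minimum spanning tree."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the set representative, halving paths as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for weight, u, v in sorted(weighted_edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # edge joins two separate components
            parent[root_u] = root_v
            tree.append((u, v, weight))
    return tree

# Hypothetical ligand sets (one per drug target) and made-up dissimilarities.
targets = ["A", "B", "C", "D"]
edges = [
    (0.2, "A", "B"), (0.9, "A", "C"), (0.8, "A", "D"),
    (0.4, "B", "C"), (0.7, "B", "D"), (0.3, "C", "D"),
]

tree = minimum_spanning_tree(targets, edges)
for u, v, weight in tree:
    print(f"{u} - {v}: {weight}")
```

The tree keeps only the strongest links needed to connect every set, which is what makes the resulting map readable.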
Affiliation(s)
- Michael J Keiser
- Department of Pharmaceutical Chemistry, University of California San Francisco, 1700 4th St, San Francisco California 94143-2550, USA
|
192
|
Shahbaba B, Neal RM. Gene function classification using Bayesian models with hierarchy-based priors. BMC Bioinformatics 2006; 7:448. [PMID: 17038174 PMCID: PMC1618412 DOI: 10.1186/1471-2105-7-448] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 07/02/2006] [Accepted: 10/12/2006] [Indexed: 11/10/2022] Open
Abstract
Background: We investigate whether annotation of gene function can be improved using a classification scheme that is aware that functional classes are organized in a hierarchy. The classifiers look at phylogenetic descriptors, sequence-based attributes, and predicted secondary structure. We discuss three Bayesian models and compare their performance in terms of predictive accuracy. These models are the ordinary multinomial logit (MNL) model, a hierarchical model based on a set of nested MNL models, and an MNL model with a prior that introduces correlations between the parameters for classes that are nearby in the hierarchy. We also provide a new scheme for combining different sources of information. We use these models to predict the functional class of Open Reading Frames (ORFs) from the E. coli genome. Results: The results from all three models show substantial improvement over previous methods, which were based on the C5 decision tree algorithm. The MNL model using a prior based on the hierarchy outperforms both the non-hierarchical MNL model and the nested MNL model. In contrast to previous attempts at combining the three sources of information in this dataset, our new approach to combining data sources produces a higher accuracy rate than applying our models to each data source alone. Conclusion: Together, these results show that gene function can be predicted with higher accuracy than previously achieved, using Bayesian models that incorporate suitable prior information.
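For readers unfamiliar with the multinomial logit model, the sketch below computes class probabilities as a softmax over linear scores. To illustrate how a hierarchy could correlate the parameters of nearby classes, each class's coefficient vector is built by summing hypothetical node-level vectors along its path in the tree. That construction, the class labels, and all numbers are assumptions made for illustration and may differ from the formulation in the paper.

```python
import math

def mnl_probabilities(x, class_coefficients):
    """Multinomial logit: softmax over the linear scores beta_c . x."""
    scores = {c: sum(b * xi for b, xi in zip(beta, x))
              for c, beta in class_coefficients.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exp_scores = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exp_scores.values())
    return {c: e / total for c, e in exp_scores.items()}

def path_sum_coefficients(paths, node_params):
    """Class coefficients as the sum of node vectors along each class's path."""
    coefficients = {}
    for cls, path in paths.items():
        vectors = [node_params[node] for node in path]
        coefficients[cls] = [sum(values) for values in zip(*vectors)]
    return coefficients

# Hypothetical two-level hierarchy: classes 1.1 and 1.2 share the parent "1".
paths = {"1.1": ["1", "1.1"], "1.2": ["1", "1.2"], "2.1": ["2", "2.1"]}
node_params = {
    "1": [0.5, -0.2], "2": [-0.4, 0.3],
    "1.1": [0.1, 0.0], "1.2": [-0.1, 0.2], "2.1": [0.0, 0.1],
}

coefficients = path_sum_coefficients(paths, node_params)
probs = mnl_probabilities([1.0, 2.0], coefficients)
print(probs)
```

Because classes 1.1 and 1.2 share the vector of their common parent, their coefficients move together, which is the kind of correlation the abstract attributes to the hierarchy-based prior.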
Affiliation(s)
- Babak Shahbaba
- Dept. of Public Health Sciences, Biostatistics, University of Toronto, Toronto, Ontario, Canada
- Radford M Neal
- Dept. of Statistics and Dept. of Computer Science, University of Toronto, Toronto, Ontario, Canada
|
193
|
Abstract
Because a protein's function is usually related to its subcellular localization, the ability to predict subcellular localization directly from protein sequences will be useful for inferring protein functions. Recent years have seen a surging interest in the development of novel computational tools to predict subcellular localization. At present, these approaches, based on a wide range of algorithms, have achieved varying degrees of success for specific organisms and for certain localization categories. A number of authors have noticed that sequence similarity is useful in predicting subcellular localization. For example, Nair and Rost (Protein Sci 2002;11:2836-2847) have carried out extensive analysis of the relation between sequence similarity and identity in subcellular localization, and have found a close relationship between them above a certain similarity threshold. However, many existing benchmark data sets used for the prediction accuracy assessment contain highly homologous sequences, with some data sets comprising sequences of up to 80-90% sequence identity. Using these benchmark test data will surely lead to overestimation of the performance of the methods considered. Here, we develop an approach based on a two-level support vector machine (SVM) system: the first level comprises a number of SVM classifiers, each based on a specific type of feature vectors derived from sequences; the second level SVM classifier functions as the jury machine to generate the probability distribution of decisions for possible localizations. We compare our approach with a global sequence alignment approach and other existing approaches for two benchmark data sets, one comprising prokaryotic sequences and the other eukaryotic sequences. Furthermore, we carried out all-against-all sequence alignment for several data sets to investigate the relationship between sequence homology and subcellular localization.
Our results, which are consistent with previous studies, indicate that the homology search approach performs well down to 30% sequence identity, although its performance deteriorates considerably for sequences sharing lower sequence identity. A data set of high homology levels will undoubtedly lead to biased assessment of the performances of the predictive approaches, especially those relying on homology search or sequence annotations. Our two-level classification system based on SVM does not rely on homology search; therefore, its performance remains relatively unaffected by sequence homology. When compared with other approaches, our approach performed significantly better. Furthermore, we also develop a practical hybrid method, which combines the two-level SVM classifier and the homology search method, as a general tool for the sequence annotation of subcellular localization.
Affiliation(s)
- Chin-Sheng Yu
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan, Republic of China
|
194
|
Donini S, Percudani R, Credali A, Montanini B, Sartori A, Peracchi A. A threonine synthase homolog from a mammalian genome. Biochem Biophys Res Commun 2006; 350:922-8. [PMID: 17034760 DOI: 10.1016/j.bbrc.2006.09.112] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Received: 09/19/2006] [Accepted: 09/21/2006] [Indexed: 11/19/2022]
Abstract
The genomes of several vertebrates contain two genes encoding proteins highly similar to threonine synthase (TS), even though the biosynthesis of L-threonine (L-Thr) is not known to occur in these animals. We report a bioinformatic analysis of the two TS-like genes, the recombinant expression of one murine TS homolog (mTSH2) and its initial biochemical characterization. Recombinant mTSH2 contained bound pyridoxal-5'-phosphate (PLP), but did not synthesize L-Thr. The enzyme did, however, bind O-phospho-homoserine (PHS; the actual TS substrate) and degraded it to alpha-ketobutyrate, phosphate, and ammonia, a known side reaction of microbial TSs. mTSH2 also degraded O-phospho-threonine (PThr) to alpha-ketobutyrate, showing that it can act as a catabolic phospho-lyase on both gamma- and beta-phosphorylated substrates. These findings suggest an unusual evolutionary origin for mTSH2, whereby an original TS enzyme became 'recycled' into a phospho-lyase upon dismissal, in metazoa, of the L-Thr biosynthetic pathway.
Affiliation(s)
- Stefano Donini
- Department of Biochemistry and Molecular Biology, University of Parma, 43100 Parma, Italy
|
195
|
Fernandez-Fuentes N, Fiser A. Saturating representation of loop conformational fragments in structure databanks. BMC STRUCTURAL BIOLOGY 2006; 6:15. [PMID: 16820050 PMCID: PMC1574324 DOI: 10.1186/1472-6807-6-15] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.9] [Received: 01/27/2006] [Accepted: 07/04/2006] [Indexed: 11/30/2022]
Abstract
Background: Short fragments of proteins are fundamental starting points in various structure prediction applications, such as in fragment-based loop modeling methods but also in various full structure build-up procedures. The applicability and performance of these approaches depend on the availability of short fragments in structure databanks. Results: We studied the representation of protein loop fragments up to 14 residues in length. All possible query fragments found in sequence databases (Sequence Space) were clustered and cross-referenced with available structural fragments in the Protein Data Bank (Structure Space). We found that the expansion of the PDB in the last few years resulted in a dense coverage of loop conformational fragments. For every loop of length 8 in the current Sequence Space there is at least one loop in Structure Space with 50% or higher sequence identity. By correlating sequence and structure clusters of loops we found that a 50% sequence identity generally guarantees structural similarity. The coverage at the 50% sequence identity cutoff drops to 96, 94, 68, 53, 33 and 13% for loops of length 9, 10, 11, 12, 13, and 14, respectively. There is not a single loop in the current Sequence Space at any length up to 14 residues that is not matched with a conformational segment that shares at least 20% sequence identity. This minimum observed identity is 40% for loops of 12 residues or shorter and is as high as 50% for loops of 10 residues or shorter. We also assessed the impact of rapidly growing sequence databanks on the estimated number of new loop conformations and found that while the number of sequentially unique sequence segments increased about sixfold during the last five years, there are almost no unique conformational segments among these fragments of up to 12 residues.
Conclusion: The results suggest that fragment-based prediction approaches are no longer limited by the completeness of fragments in databanks but rather by the effectiveness of the scoring and search algorithms used to locate them. The current favorable coverage and trends observed will be further accentuated with the progress of the Protein Structure Initiative, which targets new protein folds and ultimately aims at providing an exhaustive coverage of the structure space.
Affiliation(s)
- Narcis Fernandez-Fuentes
- Department of Biochemistry and Seaver Foundation Center for Bioinformatics, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA
- András Fiser
- Department of Biochemistry and Seaver Foundation Center for Bioinformatics, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA
|
196
|
Han L, Cui J, Lin H, Ji Z, Cao Z, Li Y, Chen Y. Recent progresses in the application of machine learning approach for predicting protein functional class independent of sequence similarity. Proteomics 2006; 6:4023-37. [PMID: 16791826 DOI: 10.1002/pmic.200500938] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.8] [Indexed: 01/22/2023]
Abstract
A protein's sequence contains clues to its function. Functional prediction from sequence presents a challenge, particularly for proteins that have low or no sequence similarity to proteins of known function. Recently, machine learning methods have been explored for predicting the functional class of proteins from sequence-derived properties independent of sequence similarity, and these have shown promising potential for low- and non-homologous proteins. These methods can thus be explored as potential tools to complement alignment- and clustering-based methods for predicting protein function. This article reviews the strategies, current progress, and underlying difficulties in using machine learning methods for predicting the functional class of proteins. The relevant software and web-servers are described. The reported prediction performances in the application of these methods are also presented; these need to be interpreted with caution, as they depend on such factors as the datasets used and the choice of parameters.
Affiliation(s)
- Lianyi Han
- Department of Computational Science, National University of Singapore, Singapore, Singapore
|
197
|
Abstract
The ability to predict the function of a protein, given its sequence and/or 3D structure, is an essential requirement for exploiting the wealth of data made available by genomics and structural genomics projects and is therefore raising increasing interest in the computational biology community. To foster developments in the area as well as to establish the state of the art of present methods, a function prediction category was tentatively introduced in the 6th edition of the Critical Assessment of Techniques for Protein Structure Prediction (CASP) worldwide experiment. The assessment of the performance of the methods was made difficult by at least two factors: (a) the experimentally determined function of the targets was not available at the time of assessment; (b) the experiment is run blindly, preventing verification of whether the convergence of different predictions towards the same functional annotation was due to the similarity of the methods or to a genuine signal detectable by different methodologies. In this work, we collected information about the methods used by the various predictors and revisited the results of the experiment by verifying how often and in which cases a convergent prediction was obtained by methods based on different rationale. We propose a method for classifying the type and redundancy of the methods. We also analyzed the cases in which a function for the target protein has become available. Our results show that predictions derived from a consensus of different methods can reach an accuracy as high as 80%. It follows that some of the predictions submitted to CASP6, once reanalyzed taking into account the type of converging methods, can provide very useful information to researchers interested in the function of the target proteins.
|
198
|
Abstract
In the CASP6 experiment, the new "Function Prediction" category was tentatively introduced. Predictors were asked to provide functional information on the CASP targets, many of which were of unknown function. This article describes the setup of the experiment and its results, highlighting what was learned from it, and suggesting modifications to its format for the next rounds. The obvious limitation of such an experiment is that the results cannot be assessed in the standard CASP fashion, as all targets remain of unknown function. Furthermore, we had to face the expected difficulties due to the novelty of the experiment and to the problems connected with function definition. Nevertheless, and even with a limited number of participating groups, we believe that the results of the experiment can be useful both for its future and for experimentalists working on the functional assignment of the CASP6 targets. We found that, in a few cases, a consensus functional prediction could be derived for targets of unknown function. However, our analysis suggests that a general description of the method used should be made available together with the predictions so that a higher reliability can be assigned to cases where completely independent methods give the same or similar predictions.
Affiliation(s)
- Simonetta Soro
- Department of Biochemical Sciences, University of Rome, La Sapienza, Rome, Italy
|
199
|
Mika S, Rost B. Protein-protein interactions more conserved within species than across species. PLoS Comput Biol 2006; 2:e79. [PMID: 16854211 PMCID: PMC1513270 DOI: 10.1371/journal.pcbi.0020079] [Citation(s) in RCA: 82] [Impact Index Per Article: 4.6] [Received: 11/18/2005] [Indexed: 11/21/2022] Open
Abstract
Experimental high-throughput studies of protein–protein interactions are beginning to provide enough data for comprehensive computational studies. Today, about ten large data sets, each with thousands of interacting pairs, coarsely sample the interactions in fly, human, worm, and yeast. About 55,000 additional pairs of interacting proteins have been identified by more careful, detailed biochemical experiments. Most interactions are experimentally observed in prokaryotes and simple eukaryotes; very few interactions are observed in higher eukaryotes such as mammals. It is commonly assumed that pathways in mammals can be inferred through homology to model organisms, e.g. the experimental observation that two yeast proteins interact is transferred to infer that the two corresponding proteins in human also interact. Two pairs for which the interaction is conserved are often described as interologs. The goal of this investigation was a large-scale comprehensive analysis of such inferences, i.e. of the evolutionary conservation of interologs. Here, we introduced a novel score for measuring the overlap between protein–protein interaction data sets. This measure appeared to reflect the overall quality of the data and was the basis for the two surprising results of our large-scale analysis. Firstly, homology-based inferences of physical protein–protein interactions appeared far less successful than expected. In fact, such inferences were accurate only for extremely high levels of sequence similarity. Secondly, and most surprisingly, the identification of interacting partners through sequence similarity was significantly more reliable for protein pairs within the same organism than for pairs between species. Our analysis underlined that the discrepancies between different datasets are large, even when using the same type of experiment on the same organism. This reality considerably constrains the power of homology-based transfer of interactions. 
In particular, the experimental probing of interactions in distant model organisms has to be undertaken with some caution. More comprehensive images of protein–protein networks will require the combination of many high-throughput methods, including in silico inferences and predictions. http://www.rostlab.org/results/2006/ppi_homology/
The IntAct database contains about ten large-scale data sets of protein–protein interactions. Each set contains thousands of experimentally observed pair interactions. Most pairs were observed in yeast (Saccharomyces cerevisiae), fly (Drosophila melanogaster), and worm (Caenorhabditis elegans). These species are often regarded as model organisms in the sense that one can infer that two mouse proteins interact if one experimentally observes the two corresponding proteins in worm to interact. Here, the authors analyzed in detail how the sequence signals of physical protein–protein interactions are conserved. It is a common assumption that protein–protein interactions can easily be inferred through homology transfer from one model organism to another organism of interest. Here, the authors demonstrated that such homology transfers are only accurate at unexpectedly high levels of sequence identity. Even more surprisingly, homology transfers of protein–protein interactions are significantly more reliable for protein pairs from the same species than for two protein pairs from different organisms. The observation that interactions were much more conserved within than across species was valid for all levels of sequence similarity, i.e. for very similar as well as for more diverged interologs.
Affiliation(s)
- Sven Mika
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York, USA.
|
200
|
Glasner ME, Fayazmanesh N, Chiang RA, Sakai A, Jacobson MP, Gerlt JA, Babbitt PC. Evolution of structure and function in the o-succinylbenzoate synthase/N-acylamino acid racemase family of the enolase superfamily. J Mol Biol 2006; 360:228-50. [PMID: 16740275 DOI: 10.1016/j.jmb.2006.04.055] [Citation(s) in RCA: 62] [Impact Index Per Article: 3.4] [Received: 02/28/2006] [Revised: 04/22/2006] [Accepted: 04/25/2006] [Indexed: 11/30/2022]
Abstract
Understanding how proteins evolve to provide both exquisite specificity and proficient activity is a fundamental problem in biology that has implications for protein function prediction and protein engineering. To study this problem, we analyzed the evolution of structure and function in the o-succinylbenzoate synthase/N-acylamino acid racemase (OSBS/NAAAR) family, part of the mechanistically diverse enolase superfamily. Although all characterized members of the family catalyze the OSBS reaction, this family is extraordinarily divergent, with some members sharing <15% identity. In addition, a member of this family, Amycolatopsis OSBS/NAAAR, is promiscuous, catalyzing both dehydration and racemization. Although the OSBS/NAAAR family appears to have a single evolutionary origin, no sequence or structural motifs unique to this family could be identified; all residues conserved in the family are also found in enolase superfamily members that have different functions. Based on their species distribution, several uncharacterized proteins similar to Amycolatopsis OSBS/NAAAR appear to have been transmitted by lateral gene transfer. Like Amycolatopsis OSBS/NAAAR, these might have additional or alternative functions to OSBS because many are from organisms lacking the pathway in which OSBS is an intermediate. In addition to functional differences, the OSBS/NAAAR family exhibits surprising structural variations, including large differences in orientation between the two domains. These results offer several insights into protein evolution. First, orthologous proteins can exhibit significant structural variation, and specificity can be maintained with little conservation of ligand-contacting residues. Second, the discovery of a set of proteins similar to Amycolatopsis OSBS/NAAAR supports the hypothesis that new protein functions evolve through promiscuous intermediates. 
Finally, a combination of evolutionary, structural, and sequence analyses identified characteristics that might prime proteins, such as Amycolatopsis OSBS/NAAAR, for the evolution of new activities.
Affiliation(s)
- Margaret E Glasner
- Department of Biopharmaceutical Sciences, University of California, San Francisco, CA 94143, USA
|