1
|
Linial M. Fishing with (Proto)Net-a principled approach to protein target selection. Comp Funct Genomics 2008; 4:542-8. [PMID: 18629007 PMCID: PMC2447289 DOI: 10.1002/cfg.328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2003] [Revised: 08/05/2003] [Accepted: 08/05/2003] [Indexed: 12/02/2022] Open
Abstract
Structural genomics strives to represent the entire protein space. The first step towards achieving this goal is to rationally select proteins whose structures have not been determined, but that represent an as yet unknown structural superfamily or fold. Once such a structure is solved, it can be used as a template for modelling homologous proteins. This will aid in unveiling the structural diversity of the protein space. Currently, no reliable method for accurate 3D structural prediction is available when a sequence or a structure homologue is not available. Here we present a systematic methodology for selecting target proteins whose structure is likely to adopt a new, as yet unknown superfamily or fold. Our method takes advantage of a global classification of the sequence space as presented by ProtoNet-3D, which is a hierarchical agglomerative clustering of the proteins of interest (the proteins in Swiss-Prot) along with all solved structures (taken from the PDB). By navigating in the scaffold of ProtoNet-3D, we obtain a prioritized list of proteins that are not yet structurally solved, along with the probability of each of the proteins belonging to a new superfamily or fold. The sorted list has been self-validated against real structural data that was not available when the predictions were made. The practical application of using our computational–statistical method to determine novel superfamilies for structural genomics projects is also discussed.
Affiliation(s)
- Michal Linial
- Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University, Jerusalem 91904, Israel.
|
2
|
Functional differentiation of proteins: implications for structural genomics. Structure 2007; 15:405-15. [PMID: 17437713 DOI: 10.1016/j.str.2007.02.005] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 12/27/2006] [Revised: 02/15/2007] [Accepted: 02/16/2007] [Indexed: 01/06/2023]
Abstract
Structural genomics is a broad initiative of various centers aiming to provide complete coverage of protein structure space. Because it is not feasible to experimentally determine the structures of all proteins, it is generally agreed that the only viable strategy to achieve such coverage is to carefully select specific proteins (targets), determine their structure experimentally, and then use comparative modeling techniques to model the rest. Here we suggest that structural genomics centers refine the structure-driven approach in target selection by adopting function-based criteria. We suggest targeting functionally divergent superfamilies within a given structural fold so that each function receives a structural characterization. We have developed a method to do so, and an itemized survey of several functionally rich folds shows that they are only partially functionally characterized. We call upon structural genomics centers to consider this approach and upon computational biologists to further develop function-based targeting methods.
|
3
|
Mirkovic N, Li Z, Parnassa A, Murray D. Strategies for high-throughput comparative modeling: applications to leverage analysis in structural genomics and protein family organization. Proteins 2007; 66:766-77. [PMID: 17154423 DOI: 10.1002/prot.21191] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Indexed: 11/09/2022]
Abstract
The technological breakthroughs in structural genomics were designed to facilitate the solution of a sufficient number of structures, so that as many protein sequences as possible can be structurally characterized with the aid of comparative modeling. The leverage of a solved structure is the number and quality of the models that can be produced using the structure as a template for modeling and may be viewed as the "currency" with which the success of a structural genomics endeavor can be measured. Moreover, the models obtained in this way should be valuable to all biologists. To this end, at the Northeast Structural Genomics Consortium (NESG), a modular computational pipeline for automated high-throughput leverage analysis was devised and used to assess the leverage of the 186 unique NESG structures solved during the first phase of the Protein Structure Initiative (January 2000 to July 2005). Here, the results of this analysis are presented. The number of sequences in the nonredundant protein sequence database covered by quality models produced by the pipeline is approximately 39,000, so that the average leverage is approximately 210 models per structure. Interestingly, only 7900 of these models fulfill the stringent modeling criterion of being at least 30% sequence-identical to the corresponding NESG structures. This study shows how high-throughput modeling increases the efficiency of structure determination efforts by providing enhanced coverage of protein structure space. In addition, the approach is useful in refining the boundaries of structural domains within larger protein sequences, subclassifying sequence diverse protein families, and defining structure-based strategies specific to a particular family.
Affiliation(s)
- Nebojsa Mirkovic
- Department of Microbiology and Immunology, Weill Medical College of Cornell University, New York, New York 10021, USA
|
4
|
Marsden RL, Lewis TA, Orengo CA. Towards a comprehensive structural coverage of completed genomes: a structural genomics viewpoint. BMC Bioinformatics 2007; 8:86. [PMID: 17349043 PMCID: PMC1829165 DOI: 10.1186/1471-2105-8-86] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.8] [Received: 10/31/2006] [Accepted: 03/09/2007] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND: Structural genomics initiatives were established with the aim of solving protein structures on a large scale. For many initiatives, such as the Protein Structure Initiative (PSI), the primary aim of target selection is focussed towards structurally characterising protein families which, so far, lack a structural representative. It is therefore of considerable interest to gain insights into the number and distribution of these families, and what efforts may be required to achieve a comprehensive structural coverage across all protein families. RESULTS: In this analysis we have derived a comprehensive domain annotation of the genomes using CATH, Pfam-A and Newfam domain families. We consider what proportions of structurally uncharacterised families are accessible to high-throughput structural genomics pipelines, specifically those targeting families containing multiple prokaryotic orthologues. In measuring the domain coverage of the genomes, we show the benefits of selecting targets from structurally uncharacterised domain families whilst, in addition, pursuing additional targets from large structurally characterised protein superfamilies. CONCLUSION: This work suggests that such a combined approach to target selection is essential if structural genomics is to achieve a comprehensive structural coverage of the genomes, leading to greater insights into structure and the mechanisms that underlie protein evolution.
Affiliation(s)
- Russell L Marsden
- Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK
- Tony A Lewis
- Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK
- Christine A Orengo
- Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK
|
5
|
Riboldi-Tunnicliffe A, Isaacs NW, Mitchell TJ. 1.2 Angstroms crystal structure of the S. pneumoniae PhtA histidine triad domain a novel zinc binding fold. FEBS Lett 2005; 579:5353-60. [PMID: 16194532 DOI: 10.1016/j.febslet.2005.08.066] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.2] [Received: 06/17/2005] [Revised: 08/30/2005] [Accepted: 08/31/2005] [Indexed: 11/29/2022]
Abstract
The recently described pneumococcal histidine triad protein family has been shown to be highly conserved within the pneumococcus. As part of our structural genomics effort on proteins from Streptococcus pneumoniae, we have expressed, crystallised and solved the structure of PhtA-166-220 at 1.2 Angstroms using remote SAD with zinc. The structure of PhtA-166-220 shows no similarity to any protein structure. The overall fold contains 3 beta-strands and a single short alpha-helix. The structure appears to contain a novel zinc binding motif. The remaining 4 histidine triad repeats from PhtA have been modelled based on the crystal structure of the PhtA histidine triad repeat 2. From this modelling work, we speculate that only three of the five histidine triad repeats contain the residues in the correct geometry to allow the binding of a zinc ion.
Affiliation(s)
- A Riboldi-Tunnicliffe
- University of Glasgow, Division of Infection and Immunity, IBLS Joseph Black Building, UK
|
6
|
Rubin SM, Pelton JG, Yokota H, Kim R, Wemmer DE. Solution structure of a putative ribosome binding protein from Mycoplasma pneumoniae and comparison to a distant homolog. J Struct Funct Genomics 2005; 4:235-43. [PMID: 15185964 DOI: 10.1023/b:jsfg.0000016127.57320.82] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022]
Abstract
The solution structure of MPN156, a ribosome-binding factor A (RBFA) protein family member from Mycoplasma pneumoniae, is presented. The structure, solved by nuclear magnetic resonance, has a type II KH fold typical of RNA binding proteins. Despite only approximately 20% sequence identity between MPN156 and another family member from Escherichia coli, the two proteins have high structural similarity. The comparison demonstrates that many of the conserved residues correspond to conserved elements in the structures. Compared to a structure based alignment, standard alignment methods based on sequence alone mispair a majority of amino acids in the two proteins. Implications of these discrepancies for sequence based structural modeling are discussed.
Affiliation(s)
- Seth M Rubin
- Department of Chemistry, University of California, Berkeley, CA 94720, USA
|
7
|
Kifer I, Sasson O, Linial M. Predicting fold novelty based on ProtoNet hierarchical classification. Bioinformatics 2004; 21:1020-7. [PMID: 15539447 DOI: 10.1093/bioinformatics/bti135] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Structural genomics projects aim to solve a large number of protein structures with the ultimate objective of representing the entire protein space. The computational challenge is to identify and prioritize a small set of proteins with new, currently unknown, superfamilies or folds. RESULTS: We develop a method that assigns each protein a likelihood of it belonging to a new, yet undetermined, structural superfamily. The method relies on a variant of ProtoNet, an automatic hierarchical classification scheme of all protein sequences from SwissProt. Our results show that proteins that are remote from solved structures in the ProtoNet hierarchy are more likely to belong to new superfamilies. The results are validated against SCOP releases from recent years that account for about half of the solved structures known to date. We show that our new method and the representation of ProtoNet are superior in detecting new targets, compared to our previous method using ProtoMap classification. Furthermore, our method outperforms PSI-BLAST search in detecting potential new superfamilies.
Affiliation(s)
- Ilona Kifer
- Department of Biological Chemistry, Institute of Life Sciences Jerusalem 91904, Israel
|
8
|
Liu J, Hegyi H, Acton TB, Montelione GT, Rost B. Automatic target selection for structural genomics on eukaryotes. Proteins 2004; 56:188-200. [PMID: 15211504 DOI: 10.1002/prot.20012] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.0] [Indexed: 11/06/2022]
Abstract
A central goal of structural genomics is to experimentally determine representative structures for all protein families. At least 14 structural genomics pilot projects are currently investigating the feasibility of high-throughput structure determination; the National Institutes of Health funded nine of these in the United States. Initiatives differ in the particular subset of "all families" on which they focus. At the NorthEast Structural Genomics consortium (NESG), we target eukaryotic protein domain families. The automatic target selection procedure has three aims: 1) identify all protein domain families from currently five entirely sequenced eukaryotic target organisms based on their sequence homology, 2) discard those families that can be modeled on the basis of structural information already present in the PDB, and 3) target representatives of the remaining families for structure determination. To guarantee that all members of one family share a common foldlike region, we had to begin by dissecting proteins into structural domain-like regions before clustering. Our hierarchical approach, CHOP, utilizing homology to PrISM, Pfam-A, and SWISS-PROT chopped the 103,796 eukaryotic proteins/ORFs into 247,222 fragments. Of these fragments, 122,999 appeared suitable targets that were grouped into >27,000 singletons and >18,000 multifragment clusters. Thus, our results suggested that it might be necessary to determine >40,000 structures to minimally cover the subset of five eukaryotic proteomes.
Affiliation(s)
- Jinfeng Liu
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA
|
9
|
Abstract
Guessing the boundaries of structural domains has been an important and challenging problem in experimental and computational structural biology. Predictions were based on intuition, biochemical properties, statistics, sequence homology and other aspects of predicted protein structure. Here, we introduced CHOPnet, a de novo method that predicts structural domains in the absence of homology to known domains. Our method was based on neural networks and relied exclusively on information available for all proteins. Evaluating sustained performance through rigorous cross-validation on proteins of known structure, we correctly predicted the number of domains in 69% of all proteins. For 50% of the two-domain proteins the centre of the predicted boundary was closer than 20 residues to the boundary assigned from three-dimensional (3D) structures; this was about eight percentage points better than predictions by 'equal split'. Our results appeared to compare favourably with those from previously published methods. CHOPnet may be useful to restrict the experimental testing of different fragments for structure determination in the context of structural genomics.
Affiliation(s)
- Jinfeng Liu
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA.
|
10
|
Raymond S, O'Toole N, Cygler M. A data management system for structural genomics. Proteome Sci 2004; 2:4. [PMID: 15210054 PMCID: PMC449731 DOI: 10.1186/1477-5956-2-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 03/09/2004] [Accepted: 06/21/2004] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Structural genomics (SG) projects aim to determine thousands of protein structures by the development of high-throughput techniques for all steps of the experimental structure determination pipeline. Crucial to the success of such endeavours is the careful tracking and archiving of experimental and external data on protein targets. RESULTS: We have developed a sophisticated data management system for structural genomics. Central to the system is an Oracle-based, SQL-interfaced database. The database schema deals with all facets of the structure determination process, from target selection to data deposition. Users access the database via any web browser. Experimental data is input by users with pre-defined web forms. Data can be displayed according to numerous criteria. A list of all current target proteins can be viewed, with links for each target to associated entries in external databases. To avoid unnecessary work on targets, our data management system matches protein sequences weekly using BLAST to entries in the Protein Data Bank and to targets of other SG centers worldwide. CONCLUSION: Our system is a working, effective and user-friendly data management tool for structural genomics projects. In this report we present a detailed summary of the various capabilities of the system, using real target data as examples, and indicate our plans for future enhancements.
Affiliation(s)
- Stéphane Raymond
- Biotechnology Research Institute, National Research Council, 6100 Royalmount Avenue, Montréal, Québec H4P 2R2, Canada
- Montréal Joint Centre for Structural Biology, Montréal, Québec, Canada
- Nicholas O'Toole
- Department of Biochemistry, McGill University, Montréal, Québec H3G 1Y6, Canada
- Montréal Joint Centre for Structural Biology, Montréal, Québec, Canada
- Miroslaw Cygler
- Biotechnology Research Institute, National Research Council, 6100 Royalmount Avenue, Montréal, Québec H4P 2R2, Canada
- Department of Biochemistry, McGill University, Montréal, Québec H3G 1Y6, Canada
- Montréal Joint Centre for Structural Biology, Montréal, Québec, Canada
|
11
|
Friedberg I, Jaroszewski L, Ye Y, Godzik A. The interplay of fold recognition and experimental structure determination in structural genomics. Curr Opin Struct Biol 2004; 14:307-12. [PMID: 15193310 DOI: 10.1016/j.sbi.2004.04.005] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.4] [Indexed: 11/22/2022]
Abstract
Achieving the goals of structural genomics initiatives depends on the outcomes of two groups of factors: the number and distribution of experimentally determined protein structures, and our ability to assign novel proteins to known structures (fold recognition) and use them to build models (modeling). The quality of the tools used for fold recognition defines the scope of experimental effort - the more distant the templates that can be recognized, the smaller the number of proteins that have to be solved. Recent improvements in fold recognition may have suggested that the goals of structural genomics initiatives are getting closer. However, problems that surfaced during the first few years of active work have put many of the early estimates in doubt and new ones are still slow in coming.
Affiliation(s)
- Iddo Friedberg
- The Burnham Institute, 10901 North Torrey Pines Road, La Jolla, California 92037, USA
|
12
|
Abstract
We developed a method, CHOP, that dissects proteins into domain-like fragments. The basic idea was to cut proteins beginning from very reliable experimental information (PDB), proceeding to expert annotations of domain-like regions (Pfam-A), and completing through cuts based on termini of known proteins. In this way, CHOP dissected more than two thirds of all proteins from 62 proteomes. Analysis of our structural domain-like fragments revealed four surprising results. First, >70% of all dissected proteins contained more than one fragment. Second, most domains spanned on average approximately 100 residues. This average was similar for eukaryotic and prokaryotic proteins, and it is also valid, although previously not described, for all proteins in the PDB. Third, single-domain proteins were significantly longer than most domains in multidomain proteins. Fourth, three fourths of all domains appeared shorter than 210 residues. We believe that our CHOP fragments constituted an important resource for functional and structural genomics. Nevertheless, our main motivation to develop CHOP was that the single-linkage clustering method failed to adequately group full-length proteins. In contrast, CLUP, the simple clustering scheme introduced here, largely succeeded in grouping the CHOP fragments from 62 proteomes such that all members of one cluster shared a basic structural core. CLUP found >63,000 multi- and >118,000 single-member clusters. Although most fragments were restricted to a particular cluster, approximately 24% of the fragments were duplicated in at least two clusters. Our thresholds for grouping two fragments into the same cluster were rather conservative. Nevertheless, our results suggested that structural genomics initiatives have to target >30,000 fragments to at least cover the multimember clusters in 62 proteomes.
Affiliation(s)
- Jinfeng Liu
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York, USA
|
13
|
Frishman D. What we have learned about prokaryotes from structural genomics. OMICS: A Journal of Integrative Biology 2004; 7:211-24. [PMID: 14506850 DOI: 10.1089/153623103322246601] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]
Abstract
Five years ago systematic determination and theoretical analysis of all protein structures encoded in model prokaryotic organisms was proposed as a powerful way to obtain new insights into protein function and the variety of protein folds. What has been the pay-off from studying structures in genomic context? Have we learned anything new about protein structure? Can we now predict protein function better? In this contribution, I summarize the status of large-scale structure determination projects on prokaryotes and provide an overview of the main results obtained from experimental and theoretical studies in this dynamic research field.
Affiliation(s)
- Dmitrij Frishman
- Department of Genome Oriented Bioinformatics, Technical University of Munich, Freising-Weihenstephan, Germany.
|
14
|
Grigoriev IV, Choi IG. Target selection for structural genomics: a single genome approach. OMICS: A Journal of Integrative Biology 2003; 6:349-62. [PMID: 12626094 DOI: 10.1089/153623102321112773] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022]
Abstract
We describe our strategy for selecting targets for protein structure determination in the context of structural genomics of a single genome. In the course of target selection, we have studied two of the smallest microbial genomes, Mycoplasma genitalium and Mycoplasma pneumoniae. To our surprise, we found that only 71 Mycoplasma genes or their orthologues can be considered as easy targets for high-throughput structural studies, far fewer than expected. We discuss the methods and criteria used for target selection and the reasons explaining the rarity of easy targets. First, despite the common opinion that protein folds can be predicted for only 30-50% of genes, the number of "truly unknown" structures is less than one-third. Second, due to the different codon usage, two thirds of Mycoplasma proteins cannot be directly expressed in E. coli in a high-throughput manner and require substitution by their homologues from other organisms. Third, membrane or large multi-domain proteins are difficult targets because of solubility and size issues and often require identification and structure determination of protein domains. Finally, we propose different approaches to address the difficult targets.
Affiliation(s)
- Igor V Grigoriev
- Department of Chemistry and E.O. Lawrence Berkeley National Laboratory, University of California, Berkeley, CA, USA.
|
15
|
Kennedy MA, Montelione GT, Arrowsmith CH, Markley JL. Role for NMR in structural genomics. JOURNAL OF STRUCTURAL AND FUNCTIONAL GENOMICS 2003; 2:155-69. [PMID: 12836706 DOI: 10.1023/a:1021261026670] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.3] [Indexed: 11/12/2022]
Abstract
The 2nd EMSL Workshop on Structural Genomics was held on 28th and 29th July 2000 at the Environmental Molecular Sciences Laboratory at the Department of Energy's Pacific Northwest National Laboratory in Richland, WA. The workshop focused on four topics: 1. The role for NMR in structural and functional genomics; 2. The technical challenges NMR faces for structural and functional genomics; 3. The potential need for a national NMR center for structural and functional genomics in the United States; and 4. Organization of the NMR community. This report summarizes the workshop proceedings and conclusions reached regarding the role of NMR in the emerging fields of structural and functional genomics.
Affiliation(s)
- Michael A Kennedy
- Pacific Northwest National Laboratory, Environmental Molecular Sciences Laboratory, Richland, WA 99352, USA.
|
16
|
Abstract
High-throughput sequencing of human genomes and those of important model organisms (mouse, Drosophila melanogaster, Caenorhabditis elegans, fungi, archaea) and bacterial pathogens has laid the foundation for another "big science" initiative in biology. Together, X-ray crystallographers, nuclear magnetic resonance (NMR) spectroscopists, and computational biologists are pursuing high-throughput structural studies aimed at developing a comprehensive three-dimensional view of the protein structure universe. The new science of structural genomics promises more than 10,000 experimental protein structures and millions of calculated homology models of related proteins. The evolutionary underpinnings and technological challenges of automating target selection, protein expression and purification, sample preparation, NMR and X-ray data measurement/analysis, homology modeling, and structure/function annotation are discussed in detail. An informative case study from one of the structural genomics centers funded by the National Institutes of Health and the National Institute of General Medical Sciences (NIH/NIGMS) demonstrates how this experimental/computational pipeline will reveal important links between form and function in biology and provide new insights into evolution and human health and disease.
Affiliation(s)
- Stephen K Burley
- Howard Hughes Medical Institute, Laboratories of Molecular Biophysics, The Rockefeller University, New York, New York 10021, USA.
|
17
|
Marchler-Bauer A, Panchenko AR, Ariel N, Bryant SH. Comparison of sequence and structure alignments for protein domains. Proteins 2002; 48:439-46. [PMID: 12112669 DOI: 10.1002/prot.10163] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.5] [Indexed: 11/12/2022]
Abstract
Profile search methods based on protein domain alignments have proven to be useful tools in comparative sequence analysis. Domain alignments used by currently available search methods have been computed by sequence comparison. With the growth of the protein structure database, however, alignments of many domain pairs have also been computed by structure comparison. Here, we examine the extent to which information from these two sources agrees. We measure agreement with respect to identification of homologous regions in each protein, that is, with respect to the location of domain boundaries. We also measure agreement with respect to identification of homologous residue sites by comparing alignments and assessing the accuracy of the molecular models they predict. We find that domain alignments in publicly available collections based on sequence and structure comparison are largely consistent. However, the homologous regions identified by sequence comparison are often shorter than those identified by 3D structure comparison. In addition, when overall sequence similarity is low, alignments from sequence comparison produce less accurate molecular models, suggesting that they less accurately identify homologous sites. These observations suggest that structure comparison results might be used to improve the overall accuracy of domain alignment collections and the performance of profile search methods based on them.
Affiliation(s)
- Aron Marchler-Bauer
- Computational Biology Branch, National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland 20894, USA
|
18
|
Abstract
The genomes of over 60 organisms from all three kingdoms of life are now entirely sequenced. In many respects, the inventory of proteins used in different kingdoms appears surprisingly similar. However, eukaryotes differ from other kingdoms in that they use many long proteins, and have more proteins with coiled-coil helices and with regions abundant in regular secondary structure. Particular structural domains are used in many pathways. Nevertheless, one domain tends to occur only once in one particular pathway. Many proteins do not have close homologues in different species (orphans) and there could even be folds that are specific to one species. This view implies that protein fold space is discrete. An alternative model suggests that structure space is continuous and that modern proteins evolved by aggregating fragments of ancient proteins. Either way, after having harvested proteomes by applying standard tools, the challenge now seems to be to develop better methods for comparative proteomics.
Affiliation(s)
- Burkhard Rost
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street, BB217, New York, NY 10032, USA.
|
19
|
Osipiuk J, Górnicki P, Maj L, Dementieva I, Laskowski R, Joachimiak A. Streptococcus pneumonia YlxR at 1.35 A shows a putative new fold. ACTA CRYSTALLOGRAPHICA. SECTION D, BIOLOGICAL CRYSTALLOGRAPHY 2001; 57:1747-51. [PMID: 11679764 PMCID: PMC2792016 DOI: 10.1107/s0907444901014019] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.8] [Received: 05/01/2001] [Accepted: 08/23/2001] [Indexed: 11/10/2022]
Abstract
The structure of the YlxR protein of unknown function from Streptococcus pneumoniae was determined to 1.35 Å. YlxR is expressed from the nusA/infB operon in bacteria and belongs to a small protein family (COG2740) that shares a conserved sequence motif GRGA(Y/W). The family shows no significant amino-acid sequence similarity with other proteins. Three-wavelength diffraction MAD data were collected to 1.7 Å from orthorhombic crystals using synchrotron radiation and the structure was determined using a semi-automated approach. The YlxR structure resembles a two-layer alpha/beta sandwich with the overall shape of a cylinder and shows no structural homology to proteins of known structure. Structural analysis revealed that the YlxR structure represents a new protein fold that belongs to the alpha-beta plait superfamily. The distribution of the electrostatic surface potential shows a large positively charged patch on one side of the protein, a feature often found in nucleic acid-binding proteins. Three sulfate ions bind to this positively charged surface. Analysis of potential binding sites uncovered several substantial clefts, with the largest spanning 3/4 of the protein. A similar distribution of binding sites and a large sharply bent cleft are observed in RNA-binding proteins that are unrelated in sequence and structure. It is proposed that YlxR is an RNA-binding protein.
Affiliation(s)
- Jerzy Osipiuk
- Argonne National Laboratory, Structural Biology Center, Biosciences Division, 9700 South Cass Avenue, Argonne, IL 60439, USA
- Piotr Górnicki
- Department of Molecular Genetics and Cell Biology, University of Chicago, 920 East 58th Street, Chicago, IL 60637, USA
- Luke Maj
- Argonne National Laboratory, Structural Biology Center, Biosciences Division, 9700 South Cass Avenue, Argonne, IL 60439, USA
- Irina Dementieva
- Argonne National Laboratory, Structural Biology Center, Biosciences Division, 9700 South Cass Avenue, Argonne, IL 60439, USA
- Roman Laskowski
- The Department of Crystallography, Birkbeck College, Malet Street, London WC1E 7HX, England
- Andrzej Joachimiak
- Argonne National Laboratory, Structural Biology Center, Biosciences Division, 9700 South Cass Avenue, Argonne, IL 60439, USA
|
20
|
Smit JW, Romijn JA. Structural genomics in endocrinology. Pharmacogenomics 2001; 2:353-60. [PMID: 11722285 DOI: 10.1517/14622416.2.4.353] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022] Open
Abstract
Traditionally, endocrine research evolved from the phenotypical characterisation of endocrine disorders to the identification of underlying molecular pathophysiology. This approach has been, and still is, extremely successful. The introduction of genomics and proteomics has resulted in a reversal of this sequence of endocrine research: reverse endocrinology. This approach has provided endocrinology with powerful tools to dissect novel molecular pathways involved in health and disease and to identify new drug targets, like the peroxisome-proliferator activated receptor (PPAR) nuclear receptor family. The reiterative combination of innovative genomics and proteomics, and classical endocrine approaches will enable maintenance of endocrinology as a front-runner in biological research and innovate therapeutical approaches in a continuing interaction between bed and bench.
Affiliation(s)
- J W Smit
- Department of Endocrinology and Metabolic Diseases, Leiden University Medical Center, C4-R, PO Box 9600, 3500 RC Leiden, The Netherlands
|
21
|
Abstract
Structural genomics projects aim to provide an experimental or computational three-dimensional model structure for all of the tractable macromolecules that are encoded by complete genomes. To this end, pilot centres worldwide are now exploring the feasibility of large-scale structure determination. Their experimental structures and computational models are expected to yield insight into the molecular function and mechanism of thousands of proteins. The pervasiveness of this information is likely to change the use of structure in molecular biology and biochemistry.
Affiliation(s)
- S E Brenner
- Department of Plant and Microbial Biology, University of California, 461A Koshland Hall, Berkeley, California 94720-3102, USA.
|
22
|
Abstract
Following the complete genome sequencing of an increasing number of organisms, structural biology is engaging in a systematic approach of high-throughput structure determination called structural genomics to create a complete inventory of protein folds/structures that will help predict functions for all proteins. First results show that structural genomics will be highly effective in finding functional annotations for proteins of unknown function.
Affiliation(s)
- P R Mittl
- Institute of Biochemistry, University of Zürich, Winterthurerstrasse 190, 8057 Zürich, Switzerland
|
23
|
Blundell TL, Mizuguchi K. Structural genomics: an overview. PROGRESS IN BIOPHYSICS AND MOLECULAR BIOLOGY 2001; 73:289-95. [PMID: 11063776 DOI: 10.1016/s0079-6107(00)00008-0] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Affiliation(s)
- T L Blundell
- Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, UK.
|
24
|
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2001. [PMCID: PMC2447210 DOI: 10.1002/cfg.57] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022] Open
|
25
|
Portugaly E, Linial M. Estimating the probability for a protein to have a new fold: A statistical computational model. Proc Natl Acad Sci U S A 2000; 97:5161-6. [PMID: 10792051 PMCID: PMC25799 DOI: 10.1073/pnas.090559497] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Structural genomics aims to solve a large number of protein structures that represent the protein space. Currently an exhaustive solution for all structures seems prohibitively expensive, so the challenge is to define a relatively small set of proteins with new, currently unknown folds. This paper presents a method that assigns each protein a probability of having an unsolved fold. The method makes extensive use of ProtoMap, a sequence-based classification, and SCOP, a structure-based classification. According to ProtoMap, the protein space encodes the relationship among proteins as a graph whose vertices correspond to 13,354 clusters of proteins. A representative fold for a cluster with at least one solved protein is determined after superposition of all SCOP (release 1.37) folds onto ProtoMap clusters. Distances within the ProtoMap graph are computed from each representative fold to the neighboring folds. The distribution of these distances is used to create a statistical model for distances among those folds that are already known and those that have yet to be discovered. The distribution of distances for solved/unsolved proteins is significantly different. This difference makes it possible to use Bayes' rule to derive a statistical estimate that any protein has a yet undetermined fold. Proteins that score the highest probability to represent a new fold constitute the target list for structural determination. Our predicted probabilities for unsolved proteins correlate very well with the proportion of new folds among recently solved structures (new SCOP 1.39 records) that are disjoint from our original training set.
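The Bayes' rule step mentioned in this abstract can be illustrated with a minimal sketch. It assumes that, for a protein's observed graph distance, the likelihood of that distance is already known under a "new fold" model and under a "known fold" model. The function name `posterior_new_fold`, the likelihood values, and the prior are hypothetical placeholders, not figures or code from the paper.

```python
def posterior_new_fold(p_dist_given_new, p_dist_given_known, prior_new):
    """Bayes' rule: P(new fold | observed distance).

    p_dist_given_new   -- likelihood of the observed distance if the fold is new
    p_dist_given_known -- likelihood of the observed distance if the fold is known
    prior_new          -- prior probability that a protein has a new fold
    All three inputs are illustrative assumptions.
    """
    if not 0.0 <= prior_new <= 1.0:
        raise ValueError("prior_new must lie in [0, 1]")
    numerator = p_dist_given_new * prior_new
    evidence = numerator + p_dist_given_known * (1.0 - prior_new)
    if evidence == 0.0:
        raise ValueError("observed distance has zero likelihood under both models")
    return numerator / evidence


# Hypothetical candidates: (likelihood under "new", likelihood under "known").
candidates = {
    "protein_A": (0.30, 0.05),
    "protein_B": (0.10, 0.20),
}
prior = 0.2  # assumed prior fraction of proteins with a new fold

# Rank candidates by posterior probability, highest first, to form a target list.
ranked = sorted(
    ((name, posterior_new_fold(p_new, p_known, prior))
     for name, (p_new, p_known) in candidates.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, prob in ranked:
    print(f"{name}: {prob:.3f}")
```

Sorting by the posterior mirrors the abstract's idea that the highest-scoring proteins make up the target list. How the likelihoods are estimated from the distance distributions is not shown here.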
Affiliation(s)
- E Portugaly
- Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University, Jerusalem 91904, Israel
|