51
Redfern OC, Harrison A, Dallman T, Pearl FMG, Orengo CA. CATHEDRAL: a fast and effective algorithm to predict folds and domain boundaries from multidomain protein structures. PLoS Comput Biol 2008; 3:e232. [PMID: 18052539 PMCID: PMC2098860 DOI: 10.1371/journal.pcbi.0030232] [Citation(s) in RCA: 69] [Impact Index Per Article: 4.1] [Received: 03/02/2007] [Accepted: 10/11/2007] [Indexed: 11/19/2022]
Abstract
We present CATHEDRAL, an iterative protocol for determining the location of previously observed protein folds in novel multidomain protein structures. CATHEDRAL builds on the features of a fast secondary-structure–based method (using graph theory) to locate known folds within a multidomain context and a residue-based, double-dynamic programming algorithm, which is used to align members of the target fold groups against the query protein structure to identify the closest relative and assign domain boundaries. To increase the fidelity of the assignments, a support vector machine is used to provide an optimal scoring scheme. Once a domain is verified, it is excised, and the search protocol is repeated in an iterative fashion until all recognisable domains have been identified. We have performed an initial benchmark of CATHEDRAL against other publicly available structure comparison methods using a consensus dataset of domains derived from the CATH and SCOP domain classifications. CATHEDRAL shows superior performance in fold recognition and alignment accuracy when compared with many equivalent methods. If a novel multidomain structure contains a known fold, CATHEDRAL will locate it in 90% of cases, with <1% false positives. For nearly 80% of assigned domains in a manually validated test set, the boundaries were correctly delineated within a tolerance of ten residues. For the remaining cases, previously classified domains were very remotely related to the query chain so that embellishments to the core of the fold caused significant differences in domain sizes and manual refinement of the boundaries was necessary. To put this performance in context, a well-established sequence method based on hidden Markov models was only able to detect 65% of domains, with 33% of the subsequent boundaries assigned within ten residues. 
Since, on average, 50% of newly determined protein structures contain more than one domain unit, and typically 90% or more of these domains are already classified in CATH, CATHEDRAL will considerably facilitate the automation of protein structure classification.
Proteins comprise individual folding units known as domains, and a significant proportion contain two or more (multidomain structures). Each domain is thought to represent a unit of evolution and adopts a specific fold. Detecting domains is often the first step in classifying proteins into evolutionary families for studying the relationship between sequence, structure, and function. Automatically identifying domains from structural data is problematic because domains vary substantially in their compactness and geometric separation from one another in the whole protein. We present a novel method, CATHEDRAL, which iteratively identifies each domain by comparing a query structure against a library of manually verified domains in the CATH domain database through computational structure comparison. We find that CATHEDRAL outperforms the majority of popular structure comparison methods for finding structural relatives. Furthermore, it accurately identifies domain boundaries and outperforms other methods of structure-based domain prediction for the majority of proteins. CATHEDRAL is available as a web server to provide domain annotations for the community and hence aid in the structural and functional characterisation of newly solved protein structures.
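The ten-residue tolerance used in the abstract to judge domain boundaries can be made concrete with a short sketch. The function names, the representation of a domain as a list of (start, end) segments, and the rule that every segment end must match are assumptions made for illustration; they are not taken from the CATHEDRAL paper.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[int, int]  # (first_residue, last_residue), inclusive


def boundaries_within_tolerance(
    predicted: List[Segment],
    reference: List[Segment],
    tolerance: int = 10,
) -> bool:
    """Return True if every predicted segment boundary lies within
    `tolerance` residues of the corresponding reference boundary.

    Assumption: segments are paired in sequence order, and a differing
    number of segments counts as a failure.
    """
    if len(predicted) != len(reference):
        return False
    for (p_start, p_end), (r_start, r_end) in zip(sorted(predicted), sorted(reference)):
        if abs(p_start - r_start) > tolerance or abs(p_end - r_end) > tolerance:
            return False
    return True


def fraction_correct(
    pairs: Sequence[Tuple[List[Segment], List[Segment]]],
    tolerance: int = 10,
) -> float:
    """Fraction of (predicted, reference) domain pairs judged correct."""
    if not pairs:
        return 0.0
    hits = sum(boundaries_within_tolerance(p, r, tolerance) for p, r in pairs)
    return hits / len(pairs)
```

A benchmark figure such as "nearly 80% of assigned domains" would correspond to `fraction_correct` evaluated over a validated test set, under whatever pairing rule the authors actually used.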
Affiliation(s)
- Oliver C Redfern
- Department of Biochemistry and Molecular Biology, University College London, London, United Kingdom.
53
Carter P, Lee D, Orengo C. Chapter 1. Target selection in structural genomics projects to increase knowledge of protein structure and function space. Advances in Protein Chemistry and Structural Biology 2008; 75:1-52. [PMID: 20731988 DOI: 10.1016/s0065-3233(07)75001-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/19/2022]
Abstract
Structural genomics aims to solve the three-dimensional structures of proteins at a rapid rate and in a cost-effective manner, with the hope of significantly impacting the life sciences, biotechnology, and drug discovery in the long term. Structural genomics initiatives started in Japan in 1997 with the advent of the Protein Folds Project. Since then, many new initiatives have begun worldwide, with diverse aims motivating the selection of proteins for structure determination. In this chapter, we consider the biological goals of high-throughput structural biology, focusing on the Protein Structure Initiative in the United States. This is the most productive of the structural genomics initiatives, having solved 3,363 new structures between September 2000 and October 2008.
Affiliation(s)
- Phil Carter
- Department of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
55
Abstract
Metals play a variety of roles in biological processes, and hence their presence in a protein structure can yield vital functional information. Because the residues that coordinate a metal often undergo conformational changes upon binding, detection of binding sites based on simple geometric criteria in proteins without bound metal is difficult. However, aspects of the physicochemical environment around a metal binding site are often conserved even when this structural rearrangement occurs. We have developed a Bayesian classifier using known zinc binding sites as positive training examples and nonmetal binding regions that nonetheless contain residues frequently observed in zinc sites as negative training examples. In order to allow variation in the exact positions of atoms, we average a variety of biochemical and biophysical properties in six concentric spherical shells around the site of interest. At a specificity of 99.8%, this method achieves 75.5% sensitivity in unbound proteins at a positive predictive value of 73.6%. We also test its accuracy on predicted protein structures obtained by homology modeling using templates with 30%-50% sequence identity to the target sequences. At a specificity of 99.8%, we correctly identify at least one zinc binding site in 65.5% of modeled proteins. Thus, in many cases, our model is accurate enough to identify metal binding sites in proteins of unknown structure for which no high sequence identity homologs of known structure exist. Both the source code and a Web interface are available to the public at http://feature.stanford.edu/metals.
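The shell-averaging step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the single scalar property per atom, and the shell width of 1.25 Å are assumptions, and only the count of six concentric shells comes from the abstract.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def shell_averages(
    site: Point,
    atoms: Sequence[Tuple[Point, float]],
    n_shells: int = 6,
    shell_width: float = 1.25,  # assumed width in angstroms, for illustration only
) -> List[Optional[float]]:
    """Average a per-atom property in concentric spherical shells around `site`.

    Shell i covers distances [i * shell_width, (i + 1) * shell_width).
    Atoms beyond the outermost shell are ignored; empty shells yield None.
    """
    sums = [0.0] * n_shells
    counts = [0] * n_shells
    for position, value in atoms:
        index = int(math.dist(site, position) // shell_width)
        if index < n_shells:
            sums[index] += value
            counts[index] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

Averaging within shells rather than recording exact atom positions is what lets the description tolerate the conformational changes mentioned above: an atom can move within its shell without changing the feature vector passed to a classifier.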
Affiliation(s)
- Jessica C Ebert
- Department of Genetics, Stanford University, Stanford, California 94305, USA
56
In silico elucidation of the molecular mechanism defining the adverse effect of selective estrogen receptor modulators. PLoS Comput Biol 2007; 3:e217. [PMID: 18052534 PMCID: PMC2098847 DOI: 10.1371/journal.pcbi.0030217] [Citation(s) in RCA: 73] [Impact Index Per Article: 4.1] [Received: 07/04/2007] [Accepted: 09/26/2007] [Indexed: 12/12/2022]
Abstract
Early identification of the adverse effects of preclinical and commercial drugs is crucial in developing highly efficient therapeutics, since unexpected adverse drug effects account for one-third of all drug failures in drug development. To correlate protein–drug interactions at the molecular level with their clinical outcomes at the organism level, we have developed an integrated approach to studying protein–ligand interactions on a structural proteome-wide scale by combining protein functional site similarity search, small molecule screening, and protein–ligand binding affinity profile analysis. By applying this methodology, we have elucidated a possible molecular mechanism for the previously observed, but molecularly uncharacterized, side effect of selective estrogen receptor modulators (SERMs). The side effect involves the inhibition of the Sarcoplasmic Reticulum Ca2+ ion channel ATPase protein (SERCA) transmembrane domain. The prediction provides molecular insight into reducing the adverse effect of SERMs and is supported by clinical and in vitro observations. The strategy used in this case study is being applied to discover off-targets for other commercially available pharmaceuticals. The process can be included in a drug discovery pipeline in an effort to optimize drug leads and reduce unwanted side effects.
Early identification of the side effects of preclinical and commercial drugs is crucial in developing highly efficient therapeutics, as unexpected side effects account for one-third of all drug failures in drug development and lead to drugs being withdrawn from the market. Compared with the experimental identification of off-target proteins that cause side effects, computational approaches not only save time and costs by providing a candidate list of potential off-targets, but also provide insight into the molecular mechanisms of protein–drug interactions. In this paper we describe an integrated approach to identifying similar drug binding pockets across protein families that have different global shapes. In a case study, we elucidate a possible molecular mechanism for the observed side effects of selective estrogen receptor modulators (SERMs), which are widely used to treat and prevent breast cancer and other diseases. The prediction provides molecular insight into reducing the side effects of SERMs and is supported by clinical and biochemical observations. The strategy used in this case study is being applied to discover off-targets for other commercially available pharmaceuticals and to repurpose existing safe pharmaceuticals to treat different diseases. The process can be included in a drug discovery pipeline in an effort to optimize drug leads, reduce unwanted side effects, and accelerate development of new drugs.
57
Kim SM, Bowers PM, Pal D, Strong M, Terwilliger TC, Kaufmann M, Eisenberg D. Functional linkages can reveal protein complexes for structure determination. Structure 2007; 15:1079-89. [PMID: 17850747 DOI: 10.1016/j.str.2007.06.021] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 11/18/2005] [Revised: 05/25/2007] [Accepted: 06/01/2007] [Indexed: 11/19/2022]
Abstract
In the study of protein complexes, is there a computational method for inferring which combinations of proteins in an organism are likely to form a crystallizable complex? Here we attempt to answer this question, using the Protein Data Bank (PDB) to assess the usefulness of inferred functional protein linkages from the Prolinks database. We find that of the 242 nonredundant prokaryotic protein complexes shared between the current PDB and Prolinks, 44% (107/242) contain proteins linked at high confidence by one or more methods of computed functional linkages. Similarly, high-confidence linkages detect 47% of known Escherichia coli protein complexes, with 45% accuracy. Together these findings suggest that functional linkages will be useful in defining protein complexes for structural studies, including for structural genomics. We offer a database of inferred linkages corresponding to likely protein complexes for some 629,952 pairs of proteins in 154 prokaryotes and archaea.
Affiliation(s)
- Sul-Min Kim
- Department of Chemistry and Biochemistry, University of California-Los Angeles, Los Angeles, CA 90095, USA
58
Qi Y, Sadreyev RI, Wang Y, Kim BH, Grishin NV. A comprehensive system for evaluation of remote sequence similarity detection. BMC Bioinformatics 2007; 8:314. [PMID: 17725841 PMCID: PMC2031906 DOI: 10.1186/1471-2105-8-314] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Received: 04/04/2007] [Accepted: 08/28/2007] [Indexed: 11/25/2022]
Abstract
Background: Accurate and sensitive performance evaluation is crucial both for the effective development of better structure prediction methods based on sequence similarity and for the comparative analysis of existing methods. To date, there has been no satisfactory comprehensive evaluation method that (i) is based on a large and statistically unbiased set of proteins with clearly defined relationships; and (ii) covers all performance aspects of sequence-based structure predictors, such as sensitivity and specificity, alignment accuracy and coverage, and structure template quality.
Results: With the aim of designing such a method, we (i) select a statistically balanced set of divergent protein domains from SCOP, and define similarity relationships for the majority of these domains by complementing the best information available in SCOP with a rigorous SVM-based algorithm; and (ii) develop protocols for the assessment of similarity detection and alignment quality from several complementary perspectives. The evaluation of similarity detection is based on ROC-like curves and includes several complementary approaches to the definition of true/false positives. Reference-dependent approaches use the 'gold standard' of pre-defined domain relationships and structure-based alignments. Reference-independent approaches assess the quality of the structural match predicted by the sequence alignment, with respect to the whole domain length (global mode) or to the aligned region only (local mode). Similarly, the evaluation of alignment quality includes several reference-dependent and -independent measures, in global and local modes. As an illustration, we use our benchmark to compare the performance of several methods for the detection of remote sequence similarities, and show that different aspects of evaluation reveal different properties of the evaluated methods, highlighting their advantages, weaknesses, and potential for further development.
Conclusion: The presented benchmark provides a new tool for a statistically unbiased assessment of methods for remote sequence similarity detection, from various complementary perspectives. This tool should be useful both for users choosing the best method for a given purpose, and for developers designing new, more powerful methods. The benchmark set, reference alignments, and evaluation codes can be downloaded from .
Affiliation(s)
- Yuan Qi
- Department of Biochemistry, University of Texas Southwestern Medical Center, 5323, Harry Hines Blvd, Dallas, TX 75390-9050, USA
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Ruslan I Sadreyev
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, 5323, Harry Hines Blvd, Dallas, TX 75390-9050, USA
- Yong Wang
- Department of Biochemistry, University of Texas Southwestern Medical Center, 5323, Harry Hines Blvd, Dallas, TX 75390-9050, USA
- Bong-Hyun Kim
- Department of Biochemistry, University of Texas Southwestern Medical Center, 5323, Harry Hines Blvd, Dallas, TX 75390-9050, USA
- Nick V Grishin
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, 5323, Harry Hines Blvd, Dallas, TX 75390-9050, USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, 5323, Harry Hines Blvd, Dallas, TX 75390-9050, USA
59
Dugan JM, Altman RB. Using surface envelopes to constrain molecular modeling. Protein Sci 2007; 16:1266-73. [PMID: 17586766 PMCID: PMC2206696 DOI: 10.1110/ps.062733407] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
Abstract
Molecular density information (as measured by electron microscopic reconstructions or crystallographic density maps) can be a powerful source of information for molecular modeling. Molecular density constrains models by specifying where atoms should and should not be. Low-resolution density information can often be obtained relatively quickly, and there is a need for methods that use it effectively. We have previously described a method for scoring molecular models with surface envelopes to discriminate between plausible and implausible fits. We showed that we could successfully filter out models with the wrong shape based on this discrimination power. Ideally, however, surface information should be used during the modeling process to constrain the conformations that are sampled. In this paper, we describe an extension of our method for using shape information during computational modeling. We use the envelope scoring metric as part of an objective function in a global optimization that also optimizes distances and angles while avoiding collisions. We systematically tested surface representations of proteins (using all nonhydrogen heavy atoms) with different abundance of distance information and showed that the root mean square deviation (RMSD) of models built with envelope information is consistently improved, particularly in data sets with relatively small sets of short-range distances.
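The RMSD used above as the quality measure can be computed as in the following minimal sketch. It assumes the two coordinate sets list the same atoms in the same order and are already superimposed; the abstract does not say how models were aligned before comparison, and the function name is illustrative.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(model: Sequence[Point], reference: Sequence[Point]) -> float:
    """Root mean square deviation between two equally sized coordinate sets.

    Assumption: atoms are paired by index and no superposition is performed.
    """
    if len(model) != len(reference) or not model:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(model, reference))
    return math.sqrt(squared / len(model))
```

A lower value means the model's atoms sit closer, on average, to the reference positions, which is the sense in which envelope information "improves" RMSD.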
60
Jenney FE, Adams MWW. The impact of extremophiles on structural genomics (and vice versa). Extremophiles 2007; 12:39-50. [PMID: 17563834 DOI: 10.1007/s00792-007-0087-9] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.3] [Received: 12/22/2006] [Accepted: 04/19/2007] [Indexed: 11/24/2022]
Abstract
The advent of the complete genome sequences of various organisms in the mid-1990s raised the issue of how one could determine the function of hypothetical proteins. While insight might be obtained from a 3D structure, the chances of being able to predict such a structure are limited for the deduced amino acid sequence of any uncharacterized gene. A template for modeling is required, but there was only a low probability of finding a protein closely related in sequence with an available structure. Thus, in the late 1990s, an international effort known as structural genomics (SG) was initiated, its primary goal being to "fill sequence-structure space" by determining the 3D structures of representatives of all known protein families. This was to be achieved mainly by X-ray crystallography, and it was estimated that at least 5,000 new structures would be required. While the proteins (genes) for SG have subsequently been derived from hundreds of different organisms, extremophiles and particularly thermophiles have been specifically targeted due to the increased stability and ease of handling of their proteins, relative to those from mesophiles. This review summarizes the significant impact that extremophiles and proteins derived from them have had on SG projects worldwide. The extent to which SG has influenced the field of extremophile research is also discussed.
Affiliation(s)
- Francis E Jenney
- Department of Biochemistry and Molecular Biology, University of Georgia, Davison Life Sciences Complex, Green Street, Athens, GA 30602-7229, USA
61
Tung CH, Huang JW, Yang JM. Kappa-alpha plot derived structural alphabet and BLOSUM-like substitution matrix for rapid search of protein structure database. Genome Biol 2007; 8:R31. [PMID: 17335583 PMCID: PMC1868941 DOI: 10.1186/gb-2007-8-3-r31] [Citation(s) in RCA: 65] [Impact Index Per Article: 3.6] [Received: 11/21/2006] [Revised: 01/05/2007] [Accepted: 03/03/2007] [Indexed: 11/23/2022]
Abstract
3D-BLAST, a novel protein structure database search tool, is useful for analysing novel structures and can return a list of aligned structures ordered according to E-values. We present a novel protein structure database search tool, 3D-BLAST, that is useful for analyzing novel structures and can return a ranked list of alignments. This tool has the features of BLAST (for example, robust statistical basis, and effective and reliable search capabilities) and employs a kappa-alpha (κ, α) plot derived structural alphabet and a new substitution matrix. 3D-BLAST searches more than 12,000 protein structures in 1.2 s and yields good results in zones with low sequence similarity.
Affiliation(s)
- Chi-Hua Tung
- Institute of Bioinformatics, National Chiao Tung University, 75 Po-Ai Street, Hsinchu, 30050, Taiwan
- Jhang-Wei Huang
- Institute of Bioinformatics, National Chiao Tung University, 75 Po-Ai Street, Hsinchu, 30050, Taiwan
- Jinn-Moon Yang
- Institute of Bioinformatics, National Chiao Tung University, 75 Po-Ai Street, Hsinchu, 30050, Taiwan
- Department of Biological Science and Technology, National Chiao Tung University, 75 Po-Ai Street, Hsinchu, 30050, Taiwan
- Core Facility for Structural Bioinformatics, National Chiao Tung University, 75 Po-Ai Street, Hsinchu, Taiwan
62
Mareuil F, Sizun C, Perez J, Schoenauer M, Lallemand JY, Bontems F. A simple genetic algorithm for the optimization of multidomain protein homology models driven by NMR residual dipolar coupling and small angle X-ray scattering data. European Biophysics Journal: EBJ 2007; 37:95-104. [PMID: 17522855 DOI: 10.1007/s00249-007-0170-2] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.3] [Received: 02/12/2007] [Revised: 04/10/2007] [Accepted: 04/21/2007] [Indexed: 11/24/2022]
Abstract
Most proteins comprise several domains and/or participate in functional complexes. Owing to ongoing structural genomics projects, it is likely that it will soon be possible to predict, with reasonable accuracy, the conserved regions of most structural domains. Under these circumstances, it will be important to have methods, based on simple-to-acquire experimental data, that allow structures of multi-domain proteins or of protein complexes to be built and refined from homology models of the individual domains/proteins. It has recently been shown that small angle X-ray scattering (SAXS) and NMR residual dipolar coupling (RDC) data can be combined to determine the architecture of such objects when the X-ray structures of the domains are known and can be considered as rigid objects. We developed a simple genetic algorithm to achieve the same goal, but using homology models of the domains considered as deformable objects. We applied it to two model systems, an S1KH bi-domain of the NusA protein and the gammaS-crystallin protein. Despite its simplicity, our algorithm is able to generate good solutions when driven by SAXS and RDC data.
Affiliation(s)
- Fabien Mareuil
- ICSN-RMN, Institut de Chimie des Substances Naturelles 91190 Gif-sur-Yvette and Ecole Polytechnique, 91128 Palaiseau, France
63
Marti-Renom MA, Rossi A, Al-Shahrour F, Davis FP, Pieper U, Dopazo J, Sali A. The AnnoLite and AnnoLyze programs for comparative annotation of protein structures. BMC Bioinformatics 2007; 8 Suppl 4:S4. [PMID: 17570147 PMCID: PMC1892083 DOI: 10.1186/1471-2105-8-s4-s4] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.8] [Indexed: 11/10/2022]
Abstract
Background: Advances in structural biology, including structural genomics, have resulted in a rapid increase in the number of experimentally determined protein structures. However, about half of the structures deposited by the structural genomics consortia have little or no information about their biological function. Therefore, there is a need for tools for automatically and comprehensively annotating the function of protein structures. We aim to provide such tools by applying comparative protein structure annotation that relies on detectable relationships between protein structures to transfer functional annotations. Here we introduce two programs, AnnoLite and AnnoLyze, which use the structural alignments deposited in the DBAli database.
Description: AnnoLite predicts the SCOP, CATH, EC, InterPro, PfamA, and GO terms with an average sensitivity of ~90% and average precision of ~80%. AnnoLyze predicts ligand binding sites and domain interaction patches with an average sensitivity of ~70% and average precision of ~30%, correctly localizing binding sites for small molecules in ~95% of its predictions.
Conclusion: The AnnoLite and AnnoLyze programs for comparative annotation of protein structures can reliably and automatically annotate new protein structures. The programs are fully accessible via the Internet as part of the DBAli suite of tools at .
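The sensitivity and precision figures quoted for AnnoLite and AnnoLyze follow the standard definitions, sketched below. The function names and the choice to return 0.0 when a denominator is zero are assumptions for illustration; how the authors averaged these values across annotation types is not stated in the abstract.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of real annotations that were recovered: TP / (TP + FN)."""
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of predicted annotations that were correct: TP / (TP + FP)."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0
```

Under these definitions, a tool with high sensitivity but low precision, as reported for AnnoLyze, recovers most true sites while also proposing many that are not confirmed.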
Affiliation(s)
- Marc A Marti-Renom
- Structural Genomics Unit, Bioinformatics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain
- Andrea Rossi
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry, and California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, CA 94143, USA
- Fátima Al-Shahrour
- Functional Genomics Unit, Bioinformatics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain
- Fred P Davis
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry, and California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, CA 94143, USA
- Ursula Pieper
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry, and California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, CA 94143, USA
- Joaquín Dopazo
- Functional Genomics Unit, Bioinformatics Department, Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain
- Andrej Sali
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry, and California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, CA 94143, USA
64
Marsden RL, Lewis TA, Orengo CA. Towards a comprehensive structural coverage of completed genomes: a structural genomics viewpoint. BMC Bioinformatics 2007; 8:86. [PMID: 17349043 PMCID: PMC1829165 DOI: 10.1186/1471-2105-8-86] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.7] [Received: 10/31/2006] [Accepted: 03/09/2007] [Indexed: 11/25/2022]
Abstract
Background: Structural genomics initiatives were established with the aim of solving protein structures on a large scale. For many initiatives, such as the Protein Structure Initiative (PSI), the primary aim of target selection is to structurally characterise protein families which, so far, lack a structural representative. It is therefore of considerable interest to gain insights into the number and distribution of these families, and what efforts may be required to achieve a comprehensive structural coverage across all protein families.
Results: In this analysis we have derived a comprehensive domain annotation of the genomes using CATH, Pfam-A and Newfam domain families. We consider what proportions of structurally uncharacterised families are accessible to high-throughput structural genomics pipelines, specifically those targeting families containing multiple prokaryotic orthologues. In measuring the domain coverage of the genomes, we show the benefits of selecting targets from structurally uncharacterised domain families whilst also pursuing additional targets from large structurally characterised protein superfamilies.
Conclusion: This work suggests that such a combined approach to target selection is essential if structural genomics is to achieve a comprehensive structural coverage of the genomes, leading to greater insights into structure and the mechanisms that underlie protein evolution.
Affiliation(s)
- Russell L Marsden
- Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK
- Tony A Lewis
- Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK
- Christine A Orengo
- Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK
65
Abstract
Contrary to popular assumption, the rate of growth of structural data has slowed, and the Protein Data Bank (PDB) has not been growing exponentially since 1995. Reaching such a dramatic conclusion requires careful measurement of growth of novel structures, which can be achieved by clustering entry sequences, or by using a novel index to down-weight entries with a higher number of sequence neighbors. These measures agree, and growth rates are very similar for entire PDB files, clusters, and weighted chains. The overall sizes of Structural Classification of Proteins (SCOP) categories (number of families, superfamilies, and folds) appear to be directly proportional to the number of deposited PDB files. Using our weighted chain count, which is most correlated to the change in the size of each SCOP category in any time period, shows that the rate of increase of SCOP categories is actually slowing down. This enables the final size of each of these SCOP categories to be predicted without examining or comparing protein structures. In the last 3 years, structures solved by structural genomics (SG) initiatives, especially the United States National Institutes of Health Protein Structure Initiative, have begun to redress the slowing growth of the PDB. Structures solved by SG are 3.8 times less sequence-redundant than typical PDB structures. Since mid-2004, SG programs have contributed half the novel structures measured by weighted chain counts. Our analysis does not rely on visual inspection of coordinate sets: it is done automatically, providing an accurate, up-to-date measure of the growth of novel protein structural data.
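The idea of down-weighting entries that have many sequence neighbors can be illustrated with a small sketch. The weighting rule used here, 1 / (1 + number of neighbors), is a hypothetical choice made only for illustration; the abstract does not give the formula for the authors' index, and the function name and input format are likewise assumed.

```python
from typing import Dict, Set


def weighted_chain_count(neighbors: Dict[str, Set[str]]) -> float:
    """Sum of per-chain weights over all chains.

    `neighbors` maps each chain identifier to the set of other chains
    considered its sequence neighbors. Hypothetical rule: a chain with
    k neighbors contributes 1 / (1 + k), so redundant chains count less.
    """
    return sum(1.0 / (1 + len(others)) for others in neighbors.values())
```

With this rule, a group of k mutually neighboring chains contributes a total of about 1, which is the same count a sequence-clustering approach would give that group. That is one simple way the two measures described above could agree.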
Affiliation(s)
- Michael Levitt
- Department of Structural Biology, Stanford University School of Medicine, Stanford, CA 94305-5126, USA.
66
Lundstrom K. Structural genomics: the ultimate approach for rational drug design. Mol Biotechnol 2007; 34:205-12. [PMID: 17172666 DOI: 10.1385/mb:34:2:205] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 11/30/1999] [Revised: 11/30/1999] [Accepted: 11/30/1999] [Indexed: 11/11/2022]
Abstract
Structural genomics can be defined as structural biology on a large number of target proteins in parallel. This approach plays an important role in modern structure-based drug design. Although a number of structural genomics initiatives have been launched, relatively few are associated with integral membrane proteins. This reflects the difficulties in expression, purification, and crystallization of membrane proteins, which are also evident from the existence of only some 100 high-resolution structures of membrane proteins among the more than 30,000 entries in public databases. Paradoxically, membrane proteins represent 60-70% of current drug targets, and structural knowledge could both improve and speed up the drug discovery process. To improve the success rate for structure resolution of membrane proteins, structural genomics networks have been established.
Affiliation(s)
- Kenneth Lundstrom
- Flamel Technologies, 33, Avenue du Georges Levy, 69693 Venisseux, France.
|
67
|
Mueller M, Martens L, Apweiler R. Annotating the human proteome: Beyond establishing a parts list. Biochim Biophys Acta Proteins Proteom 2007; 1774:175-91. [PMID: 17223395 DOI: 10.1016/j.bbapap.2006.11.011] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.4] [Received: 07/03/2006] [Revised: 11/16/2006] [Accepted: 11/21/2006] [Indexed: 12/31/2022]
Abstract
The completion of the human genome has shifted the attention from deciphering the sequence to the identification and characterisation of the functional components, including genes. Improved gene prediction algorithms, together with the existing transcript and protein information, have enabled the identification of most exons in a genome. Availability of the 'parts list' has fostered the development of experimental approaches to systematically interrogate gene function on the genome, transcriptome and proteome level. Studying gene function at the protein level is vital to the understanding of how cells perform their functions as variations in protein isoforms and protein quantity which may underlie a change in phenotype can often not be deduced from sequence or transcript level genomics experiments alone. Recent advancements in proteomics have afforded technologies capable of measuring protein expression, post-translational modifications of these proteins, their subcellular localisation and assembly into complexes and pathways. Although an enormous amount of data already exists on the function of many human proteins, much of it is scattered over multiple resources. Public domain databases are therefore required to manage and collate this information and present it to the user community in both a human and machine readable manner. Of special importance here is the integration of heterogeneous data to facilitate the creation of resources that go beyond a mere parts list.
Affiliation(s)
- Michael Mueller
- EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK
|
68
|
Mayer KL, Qu Y, Bansal S, LeBlond PD, Jenney FE, Brereton PS, Adams MWW, Xu Y, Prestegard JH. Structure determination of a new protein from backbone-centered NMR data and NMR-assisted structure prediction. Proteins 2006; 65:480-9. [PMID: 16927360 DOI: 10.1002/prot.21119] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 12/19/2022]
Abstract
Targeting of proteins for structure determination in structural genomic programs often includes the use of threading and fold recognition methods to exclude proteins belonging to well-populated fold families, but such methods can still fail to recognize preexisting folds. The authors illustrate here a method in which limited amounts of structural data are used to improve an initial homology search and the data are subsequently used to produce a structure by data-constrained refinement of an identified structural template. The data used are primarily NMR-based residual dipolar couplings, but they also include additional chemical shift and backbone-nuclear Overhauser effect data. Using this methodology, a backbone structure was efficiently produced for a 10 kDa protein (PF1455) from Pyrococcus furiosus. Its relationship to existing structures and its probable function are discussed.
Affiliation(s)
- K L Mayer
- Complex Carbohydrate Research Center, University of Georgia, Athens, Georgia 30602, USA
|
69
|
Puri M, Robin G, Cowieson N, Forwood JK, Listwan P, Hu SH, Guncar G, Huber T, Kellie S, Hume DA, Kobe B, Martin JL. Focusing in on structural genomics: The University of Queensland structural biology pipeline. Biomol Eng 2006; 23:281-9. [PMID: 17097918 DOI: 10.1016/j.bioeng.2006.09.002] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 07/14/2006] [Revised: 09/22/2006] [Accepted: 09/25/2006] [Indexed: 10/24/2022]
Abstract
The flood of new genomic sequence information together with technological innovations in protein structure determination have led to worldwide structural genomics (SG) initiatives. The goals of SG initiatives are to accelerate the process of protein structure determination, to fill in protein fold space and to provide information about the function of uncharacterized proteins. In the long-term, these outcomes are likely to impact on medical biotechnology and drug discovery, leading to a better understanding of disease as well as the development of new therapeutics. Here we describe the high throughput pipeline established at the University of Queensland in Australia. In this focused pipeline, the targets for structure determination are proteins that are expressed in mouse macrophage cells and that are inferred to have a role in innate immunity. The aim is to characterize the molecular structure and the biochemical and cellular function of these targets by using a parallel processing pipeline. The pipeline is designed to work with tens to hundreds of target gene products and comprises target selection, cloning, expression, purification, crystallization and structure determination. The structures from this pipeline will provide insights into the function of previously uncharacterized macrophage proteins and could lead to the validation of new drug targets for chronic obstructive pulmonary disease and arthritis.
Affiliation(s)
- Munish Puri
- Institute for Molecular Bioscience, University of Queensland, Brisbane, Queensland, Australia.
|
70
|
Greene LH, Lewis TE, Addou S, Cuff A, Dallman T, Dibley M, Redfern O, Pearl F, Nambudiry R, Reid A, Sillitoe I, Yeats C, Thornton JM, Orengo CA. The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Res 2006; 35:D291-7. [PMID: 17135200 PMCID: PMC1751535 DOI: 10.1093/nar/gkl959] [Citation(s) in RCA: 239] [Impact Index Per Article: 12.6] [Indexed: 11/12/2022] Open
Abstract
We report the latest release (version 3.0) of the CATH protein domain database. There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps, which have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.
Affiliation(s)
- Alison Cuff
- To whom correspondence should be addressed: Tel: +1 44 207 679 3890; Fax: +1 44 207 679 7193.
- Janet M. Thornton
- European Bioinformatics Institute, Hinxton Hall, Hinxton, Cambridge CB10 1RQ, UK
|
71
|
Hausrath AC, Goriely A. Continuous representations of proteins: construction of coordinate models from curvature profiles. J Struct Biol 2006; 158:267-81. [PMID: 17222563 DOI: 10.1016/j.jsb.2006.11.003] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 04/15/2006] [Revised: 10/27/2006] [Accepted: 11/07/2006] [Indexed: 12/01/2022]
Abstract
A representation of proteins based on the geometry of space curves is described. This representation enables the application of continuum methods to the analysis of macromolecular structure and form that cannot be applied to the more familiar discrete atomic coordinate models. It is shown that the continuous modeling method defines the geometry of the protein fold very efficiently. An analytical solution for curve construction is employed from which both continuous and coordinate models can be obtained. The method is applied to five representative test proteins which are used to assess the accuracy and efficiency of the modeling procedure.
Affiliation(s)
- A C Hausrath
- Department of Biochemistry and Molecular Biophysics, University of Arizona, 1041 E. Lowell, Tucson, AZ 85721, USA.
|
72
|
Shih ESC, Gan RCR, Hwang MJ. OPAAS: a web server for optimal, permuted, and other alternative alignments of protein structures. Nucleic Acids Res 2006; 34:W95-8. [PMID: 16845117 PMCID: PMC1538888 DOI: 10.1093/nar/gkl264] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Indexed: 11/16/2022] Open
Abstract
The large number of experimentally determined protein 3D structures is a rich resource for studying protein function and evolution, and protein structure comparison (PSC) is a key method for such studies. When comparing two protein structures, almost all currently available PSC servers report a single and sequential (i.e. topological) alignment, whereas the existence of good alternative alignments, including those involving permutations (i.e. non-sequential or non-topological alignments), is well known. We have recently developed a novel PSC method that can detect alternative alignments of statistical significance (alignment similarity P-value < 10⁻⁵), including structural permutations at all levels of complexity. OPAAS, the server for this PSC method, is freely accessible at our website and provides an easy-to-read hierarchical layout of output to display detailed information on all of the significant alternative alignments detected. Because these alternative alignments can offer a more complete picture of the structural, evolutionary and functional relationship between two proteins, OPAAS can be used in structural bioinformatics research to gain additional insight that is not readily provided by existing PSC servers.
Affiliation(s)
- Ming-Jing Hwang
- To whom correspondence should be addressed. Tel: +886 2 2789 9033; Fax: +886 2 2788 7641.
|
73
|
Tsuchiya Y, Kinoshita K, Ito N, Nakamura H. PreBI: prediction of biological interfaces of proteins in crystals. Nucleic Acids Res 2006; 34:W320-4. [PMID: 16844993 PMCID: PMC1538861 DOI: 10.1093/nar/gkl267] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Indexed: 11/14/2022] Open
Abstract
PreBI is a server that predicts biological interfaces in protein crystal structures, according to the complementarity and the area of the interface. The server accepts a coordinate file in the PDB format, and all of the possible interfaces are generated automatically, according to the symmetry operations given in the coordinate file. For all of the interfaces generated, the complementarities of the electrostatic potential, hydrophobicity and shape of the interfaces are analyzed, and the most probable biological interface is identified according to the combination of the degree of complementarity derived from the database analyses and the area of the interface. The results can be checked through an interactive viewer, and the most probable complex can be downloaded as atomic coordinates in the PDB format. PreBI is available at .
Affiliation(s)
- Yuko Tsuchiya
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka, 565-0871, Japan
- Kengo Kinoshita
- Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minatoku, Tokyo, 108-8639, Japan
- Structure and Function of Biomolecules, SORST, JST, 4-1-8 Honcho, Kawaguchi, Saitama, 332-0012, Japan
- To whom correspondence should be addressed at Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokane-dai, Minato-ku, Tokyo, 108-8639, Japan. Tel: +81 3 5449 5131; Fax: +81 3 5449 5133.
- Nobutoshi Ito
- School of Biomedical Science, Tokyo Medical and Dental University, 1-5-45 Yushima, Bunkyo-ku, Tokyo, 113-8510, Japan
- Haruki Nakamura
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka, 565-0871, Japan
|
74
|
Mestres J. Representativity of target families in the Protein Data Bank: impact for family-directed structure-based drug discovery. Drug Discov Today 2006; 10:1629-37. [PMID: 16376823 DOI: 10.1016/s1359-6446(05)03593-2] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.4] [Indexed: 11/24/2022]
Abstract
Analysis of the population of enzyme structures in the Protein Data Bank across all levels of the functional classification based on enzyme commission (EC) numbers reveals that, in spite of the almost exponential growth in the number of structures deposited, progress in achieving complete occupancy at all EC levels is relatively slow. Moreover, inspection of the distribution of the population among the members of the different enzyme families uncovers a strong bias towards enzymes widely recognized as therapeutically relevant targets. The low representativity levels identified in some target families warn on the current scope and applicability of structure-based approaches to family-directed strategies in drug discovery.
Affiliation(s)
- Jordi Mestres
- Chemogenomics Laboratory, Research Unit on Biomedical Informatics, Institut Municipal d'Investigació Mèdica and Universitat Pompeu Fabra, 08003 Barcelona, Catalonia, Spain.
|
75
|
Rigden DJ. Understanding the cell in terms of structure and function: insights from structural genomics. Curr Opin Biotechnol 2006; 17:457-64. [PMID: 16890423 DOI: 10.1016/j.copbio.2006.07.004] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 06/02/2006] [Revised: 06/21/2006] [Accepted: 07/25/2006] [Indexed: 10/24/2022]
Abstract
Structural genomics programs are only now moving into the large-scale production phase, yet have already produced around 2000 protein structures. Through a widespread if not exclusive emphasis on structural novelty, our knowledge of the protein fold universe is improving rapidly. With this information comes the challenge of structure-based function annotation for the many target proteins about which little or nothing is known. Recent years have therefore seen the emergence of impressively diverse bioinformatics approaches to predict the function of a protein structure. Attention is now turning to means of combining these predictions with information from various other sources.
Affiliation(s)
- Daniel J Rigden
- School of Biological Sciences, University of Liverpool, Biosciences Building, Crown Street, Liverpool L69 7ZB, UK.
|
76
|
Yang JM, Tung CH. Protein structure database search and evolutionary classification. Nucleic Acids Res 2006; 34:3646-59. [PMID: 16885238 PMCID: PMC1540718 DOI: 10.1093/nar/gkl395] [Citation(s) in RCA: 72] [Impact Index Per Article: 3.8] [Received: 03/08/2006] [Revised: 05/06/2006] [Accepted: 05/09/2006] [Indexed: 11/14/2022] Open
Abstract
As more protein structures become available and structural genomics efforts provide structural models in a genome-wide strategy, there is a growing need for fast and accurate methods for discovering homologous proteins and evolutionary classifications of newly determined structures. We have developed 3D-BLAST, in part, to address these issues. 3D-BLAST is as fast as BLAST and calculates the statistical significance (E-value) of an alignment to indicate the reliability of the prediction. Using this method, we first identified 23 states of the structural alphabet that represent pattern profiles of the backbone fragments and then used them to represent protein structure databases as structural alphabet sequence databases (SADB). Our method enhanced BLAST as a search method, using a new structural alphabet substitution matrix (SASM) to find the longest common substructures with high-scoring structured segment pairs from an SADB database. Using personal computers with Intel Pentium4 (2.8 GHz) processors, our method searched more than 10 000 protein structures in 1.3 s and achieved a good agreement with search results from detailed structure alignment methods. [3D-BLAST is available at http://3d-blast.life.nctu.edu.tw].
Affiliation(s)
- Jinn-Moon Yang
- Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, 30050, Taiwan.
|
77
|
Gu J, Gribskov M, Bourne PE. Wiggle-predicting functionally flexible regions from primary sequence. PLoS Comput Biol 2006; 2:e90. [PMID: 16839194 PMCID: PMC1500818 DOI: 10.1371/journal.pcbi.0020090] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Received: 02/22/2006] [Accepted: 06/02/2006] [Indexed: 11/18/2022] Open
Abstract
The Wiggle series are support vector machine-based predictors that identify regions of functional flexibility using only protein sequence information. Functionally flexible regions are defined as regions that can adopt different conformational states and are assumed to be necessary for bioactivity. Many advances have been made in understanding the relationship between protein sequence and structure. This work contributes to those efforts by making strides to understand the relationship between protein sequence and flexibility. A coarse-grained protein dynamic modeling approach was used to generate the dataset required for support vector machine training. We define our regions of interest based on the participation of residues in correlated large-scale fluctuations. Even with this structure-based approach to computationally define regions of functional flexibility, predictors successfully extract sequence-flexibility relationships that have been experimentally confirmed to be functionally important. Thus, a sequence-based tool to identify flexible regions important for protein function has been created. The ability to identify functional flexibility using a sequence based approach complements structure-based definitions and will be especially useful for the large majority of proteins with unknown structures. The methodology offers promise to identify structural genomics targets amenable to crystallization and the possibility to engineer more flexible or rigid regions within proteins to modify their bioactivity.
Affiliation(s)
- Jenny Gu
- Department of Pharmacology and Biomedical Sciences Graduate Program, University of California San Diego, La Jolla, California, USA.
|
78
|
Yura K, Yamaguchi A, Go M. Coverage of whole proteome by structural genomics observed through protein homology modeling database. J Struct Funct Genomics 2006; 7:65-76. [PMID: 17146617 PMCID: PMC1769342 DOI: 10.1007/s10969-006-9010-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 05/11/2006] [Accepted: 08/08/2006] [Indexed: 11/07/2022]
Abstract
We have been developing FAMSBASE, a protein homology-modeling database of whole ORFs predicted from genome sequences. The latest update of FAMSBASE (http://daisy.nagahama-i-bio.ac.jp/Famsbase/), which is based on the protein three-dimensional (3D) structures released by November 2003, contains modeled 3D structures for 368,724 open reading frames (ORFs) derived from genomes of 276 species, namely 17 archaebacterial, 130 eubacterial, 18 eukaryotic and 111 phage genomes. Those 276 genomes are predicted to have 734,193 ORFs in total, and the current FAMSBASE contains protein 3D structures for approximately 50% of the ORF products. However, cases in which a modeled 3D structure covers the whole of an ORF product are rare. When the portion of an ORF with 3D structure is compared across the three kingdoms of life, approximately 60% of the ORFs in archaebacteria and eubacteria have modeled 3D structures covering almost the entire amino acid sequence, whereas the percentage falls to about 30% in eukaryotes. When annual differences in the number of ORFs with modeled 3D structure are calculated, the fraction of modeled 3D structures of soluble proteins has increased by 5% for archaebacteria and by 7% for eubacteria in the last 3 years. Assuming that this rate is maintained and that determination of 3D structures for predicted disordered regions is unattainable, whole soluble protein model structures of prokaryotes without the putative disordered regions will be in hand within 15 years. For eukaryotic proteins, they will be in hand within 25 years. The 3D structures we will have at those times are not the 3D structures of the entire proteins encoded in single ORFs, but the 3D structures of separate structural domains. Measuring or predicting the spatial arrangements of structural domains in an ORF will then be the next issue for structural genomics.
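As a quick consistency check on the figures quoted in this abstract, the sketch below recomputes the genome total and the overall ORF coverage. The variable names are illustrative only; the numbers are taken directly from the abstract.

```python
# Genome counts by group, as quoted in the abstract.
genomes = {
    "archaebacterial": 17,
    "eubacterial": 130,
    "eukaryotic": 18,
    "phage": 111,
}

modeled_orfs = 368_724  # ORFs with a modeled 3D structure
total_orfs = 734_193    # ORFs predicted across all genomes

total_genomes = sum(genomes.values())
coverage = modeled_orfs / total_orfs

print(total_genomes)      # 276
print(f"{coverage:.1%}")  # 50.2%
```

The four groups sum to the stated 276 genomes, and the ratio of modeled to predicted ORFs matches the "approximately 50%" quoted in the abstract.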
Affiliation(s)
- Kei Yura
- Quantum Bioinformatics Team, Center for Computational Science and Engineering, Japan Atomic Energy Agency, Kyoto 619-0215, Japan.
|
79
|
García-Serna R, Opatowski L, Mestres J. FCP: functional coverage of the proteome by structures. Bioinformatics 2006; 22:1792-3. [PMID: 16705012 DOI: 10.1093/bioinformatics/btl188] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Tools and resources for translating the remarkable growth witnessed in recent years in the number of protein structures determined experimentally into actual gain in the functional coverage of the proteome are becoming increasingly necessary. We introduce FCP, a publicly accessible web tool dedicated to analyzing the current state and trends of the population of structures within protein families. FCP offers both graphical and quantitative data on the degree of functional coverage of enzymes and nuclear receptors by existing structures, as well as on the bias observed in the distribution of structures along their respective functional classification schemes. AVAILABILITY: http://cgl.imim.es/fcp CONTACT: jmestres@imim.es
Affiliation(s)
- Ricard García-Serna
- Chemogenomics Laboratory, Research Unit on Biomedical Informatics, Institut Municipal d'Investigació Mèdica and Universitat Pompeu Fabra, Dr Aiguader 88, 08003 Barcelona, Catalonia, Spain
|
80
|
Marsden RL, Ranea JAG, Sillero A, Redfern O, Yeats C, Maibaum M, Lee D, Addou S, Reeves GA, Dallman TJ, Orengo CA. Exploiting protein structure data to explore the evolution of protein function and biological complexity. Philos Trans R Soc Lond B Biol Sci 2006; 361:425-40. [PMID: 16524831 PMCID: PMC1609337 DOI: 10.1098/rstb.2005.1801] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022] Open
Abstract
New directions in biology are being driven by the complete sequencing of genomes, which has given us the protein repertoires of diverse organisms from all kingdoms of life. In tandem with this accumulation of sequence data, worldwide structural genomics initiatives, advanced by the development of improved technologies in X-ray crystallography and NMR, are expanding our knowledge of structural families and increasing our fold libraries. Methods for detecting remote sequence similarities have also been made more sensitive and this means that we can map domains from these structural families onto genome sequences to understand how these families are distributed throughout the genomes and reveal how they might influence the functional repertoires and biological complexities of the organisms. We have used robust protocols to assign sequences from completed genomes to domain structures in the CATH database, allowing up to 60% of domain sequences in these genomes, depending on the organism, to be assigned to a domain family of known structure. Analysis of the distribution of these families throughout bacterial genomes identified more than 300 universal families, some of which had expanded significantly in proportion to genome size. These highly expanded families are primarily involved in metabolism and regulation and appear to make major contributions to the functional repertoire and complexity of bacterial organisms. When comparisons are made across all kingdoms of life, we find a smaller set of universal domain families (approx. 140), of which families involved in protein biosynthesis are the largest conserved component. Analysis of the behaviour of other families reveals that some (e.g. those involved in metabolism, regulation) have remained highly innovative during evolution, making it harder to trace their evolutionary ancestry. 
Structural analyses of metabolic families provide some insights into the mechanisms of functional innovation, which include changes in domain partnerships and significant structural embellishments leading to modulation of active sites and protein interactions.
Affiliation(s)
- Russell L Marsden
- Department of Biochemistry, University College London Gower Street, London WC1E 6BT, UK.
|
81
|
Svensson AKE, Zitzewitz JA, Matthews C, Smith VF. The relationship between chain connectivity and domain stability in the equilibrium and kinetic folding mechanisms of dihydrofolate reductase from E.coli. Protein Eng Des Sel 2006; 19:175-85. [PMID: 16452118 PMCID: PMC5441858 DOI: 10.1093/protein/gzj017] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Received: 01/05/2006] [Accepted: 01/06/2006] [Indexed: 11/14/2022] Open
Abstract
The role of domains in defining the equilibrium and kinetic folding properties of dihydrofolate reductase (DHFR) from Escherichia coli was probed by examining the thermodynamic and kinetic properties of a set of variants in which the chain connectivity in the discontinuous loop domain (DLD) and the adenosine-binding domain (ABD) was altered by permutation. To test the concept that chain cleavage can selectively destabilize the domain in which the N- and C-termini are resident, permutations were introduced at one position within the ABD, one within the DLD and one at a boundary between the domains. The results demonstrated that a continuous ABD is required for a stable thermal intermediate and a continuous DLD is required for a stable urea intermediate. The permutation at the domain interface had both a thermal and urea intermediate. Strikingly, the observable kinetic folding responses of all three permuted proteins were very similar to the wild-type protein. These results demonstrate a crucial role for stable domains in defining the energy surface for the equilibrium folding reaction of DHFR. If domain connectivity affects the kinetic mechanism, the effects must occur in the sub-millisecond time range.
|
82
|
Chandonia JM, Kim SH. Structural proteomics of minimal organisms: conservation of protein fold usage and evolutionary implications. BMC Struct Biol 2006; 6:7. [PMID: 16566839 PMCID: PMC1488858 DOI: 10.1186/1472-6807-6-7] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 10/20/2005] [Accepted: 03/28/2006] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Determining the complete repertoire of protein structures for all soluble, globular proteins in a single organism has been one of the major goals of several structural genomics projects in recent years. RESULTS: We report that this goal has nearly been reached for several "minimal organisms" (parasites or symbionts with reduced genomes), for which over 95% of the soluble, globular proteins may now be assigned folds, that is, overall 3-D backbone structures. We analyze the structures of these proteins as they relate to cellular functions, and compare conservation of fold usage between functional categories. We also compare patterns in the conservation of folds among minimal organisms and those observed between minimal organisms and other bacteria. CONCLUSION: We find that proteins performing essential cellular functions closely related to transcription and translation exhibit a higher degree of conservation in fold usage than proteins in other functional categories. Folds related to transcription and translation functional categories were also overrepresented in minimal organisms compared to other bacteria.
Affiliation(s)
- John-Marc Chandonia
- Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Sung-Hou Kim
- Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Department of Chemistry, University of California, Berkeley, CA 94720, USA
|
83
|
Sadreyev RI, Grishin NV. Exploring dynamics of protein structure determination and homology-based prediction to estimate the number of superfamilies and folds. BMC Struct Biol 2006; 6:6. [PMID: 16549009 PMCID: PMC1444916 DOI: 10.1186/1472-6807-6-6] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Received: 12/27/2005] [Accepted: 03/20/2006] [Indexed: 11/10/2022]
Abstract
BACKGROUND: As tertiary structure is currently available only for a fraction of known protein families, it is important to assess what parts of sequence space have been structurally characterized. We consider protein domains whose structure can be predicted by sequence similarity to proteins with solved structure and address the following questions. Do these domains represent an unbiased random sample of all sequence families? Do targets solved by structural genomic initiatives (SGI) provide such a sample? What are approximate total numbers of structure-based superfamilies and folds among soluble globular domains? RESULTS: To make these assessments, we combine two approaches: (i) sequence analysis and homology-based structure prediction for proteins from complete genomes; and (ii) monitoring dynamics of the assigned structure set in time, with the accumulation of experimentally solved structures. In the Clusters of Orthologous Groups (COG) database, we map the growing population of structurally characterized domain families onto the network of sequence-based connections between domains. This mapping reveals a systematic bias suggesting that target families for structure determination tend to be located in highly populated areas of sequence space. In contrast, the subset of domains whose structure is initially inferred by SGI is similar to a random sample from the whole population. To accommodate the observed bias, we propose a new non-parametric approach to the estimation of the total numbers of structural superfamilies and folds, which does not rely on a specific model of the sampling process. Based on dynamics of robust distribution-based parameters in the growing set of structure predictions, we estimate the total numbers of superfamilies and folds among soluble globular proteins in the COG database. CONCLUSION: The set of currently solved protein structures allows for structure prediction in approximately a third of sequence-based domain families. The choice of targets for structure determination is biased towards domains with many sequence-based homologs. The growing SGI output in the future should further contribute to the reduction of this bias. The total numbers of structural superfamilies and folds in the COG database are estimated as ~4000 and ~1700. These numbers are respectively four and three times higher than the numbers of superfamilies and folds that can currently be assigned to COG proteins.
Affiliation(s)
- Ruslan I Sadreyev
- Howard Hughes Medical Institute/Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390-8816, USA
- Nick V Grishin
- Howard Hughes Medical Institute/Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390-8816, USA
|
84
|
Abstract
The Protein Model Database (PMDB) is a public resource aimed at storing manually built 3D models of proteins. The database is designed to provide access to models published in the scientific literature, together with validating experimental data. It is a relational database and it currently contains >74 000 models for ∼240 proteins. The system is accessible at and allows predictors to submit models along with related supporting evidence and users to download them through a simple and intuitive interface. Users can navigate in the database and retrieve models referring to the same target protein or to different regions of the same protein. Each model is assigned a unique identifier that allows interested users to directly access the data.
Affiliation(s)
- Domenico Cozzetto
- Department of Biochemical Sciences, University ‘La Sapienza’, P.le Aldo Moro, 5, I-00185 Rome, Italy
- Ivano Giuseppe Talamo
- Department of Biochemical Sciences, University ‘La Sapienza’, P.le Aldo Moro, 5, I-00185 Rome, Italy
- Anna Tramontano
- Department of Biochemical Sciences, University ‘La Sapienza’, P.le Aldo Moro, 5, I-00185 Rome, Italy
- Istituto Pasteur—Fondazione Cenci Bolognetti, University ‘La Sapienza’, P.le Aldo Moro, 5, I-00185 Rome, Italy
- To whom correspondence should be addressed. Tel: +39 0649910556; Fax: +39 0649910717
|
85
|
Abstract
The SWISS-MODEL Repository is a database of annotated 3D protein structure models generated by the SWISS-MODEL homology-modelling pipeline. As of September 2005, the repository contained 675,000 models for 604,000 different protein sequences of the UniProt database. Regular updates ensure that the content of the repository reflects the current state of sequence and structure databases, integrating new or modified target sequences, and making use of new template structures. Each Repository entry consists of one or more 3D models accompanied by detailed information about the target protein and the model building process: functional annotation, a detailed template selection log, target-template alignment, summary of the model building and model quality assessment. The SWISS-MODEL Repository is freely accessible at http://swissmodel.expasy.org/repository/.
Affiliation(s)
- Torsten Schwede
- To whom correspondence should be addressed. Tel: +41 61 267 15 81; Fax: +41 61 267 15 84
|
86
|
Marsden RL, Lee D, Maibaum M, Yeats C, Orengo CA. Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space. Nucleic Acids Res 2006; 34:1066-80. [PMID: 16481312 PMCID: PMC1373602 DOI: 10.1093/nar/gkj494] [Citation(s) in RCA: 54] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022] Open
Abstract
We present an analysis of 203 completed genomes in the Gene3D resource (including 17 eukaryotes), which demonstrates that the number of protein families is continually expanding over time and that singleton-sequences appear to be an intrinsic part of the genomes. A significant proportion of the proteomes can be assigned to fewer than 6000 well-characterized domain families with the remaining domain-like regions belonging to a much larger number of small uncharacterized families that are largely species specific. Our comprehensive domain annotation of 203 genomes enables us to provide more accurate estimates of the number of multi-domain proteins found in the three kingdoms of life than previous calculations. We find that 67% of eukaryotic sequences are multi-domain compared with 56% of sequences in prokaryotes. By measuring the domain coverage of genome sequences, we show that the structural genomics initiatives should aim to provide structures for less than a thousand structurally uncharacterized Pfam families to achieve reasonable structural annotation of the genomes. However, in large families, additional structures should be determined as these would reveal more about the evolution of the family and enable a greater understanding of how function evolves.
Affiliation(s)
- Russell L Marsden
- Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK.
|
87
|
Casbon JA, Saqi MAS. On single and multiple models of protein families for the detection of remote sequence relationships. BMC Bioinformatics 2006; 7:48. [PMID: 16448555 PMCID: PMC1397874 DOI: 10.1186/1471-2105-7-48] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/22/2005] [Accepted: 01/31/2006] [Indexed: 11/23/2022] Open
Abstract
Background: The detection of relationships between a protein sequence of unknown function and a sequence whose function has been characterised enables the transfer of functional annotation. However, in many cases these relationships cannot be identified easily from direct comparison of the two sequences. Methods which compare sequence profiles have been shown to improve the detection of these remote sequence relationships. However, the best method for building a profile of a known set of sequences has not been established. Here we examine how the type of profile built affects its performance, both in detecting remote homologs and in the resulting alignment accuracy. In particular, we consider whether it is better to model a protein superfamily using a single structure-based alignment that is representative of all known cases of the superfamily, or to use multiple sequence-based profiles each representing an individual member of the superfamily. Results: Using profile-profile methods for remote homolog detection, we benchmark the performance of single structure-based superfamily models and multiple domain models. On average, over all superfamilies, using a truncated receiver operator characteristic (ROC5), we find that multiple domain models outperform single superfamily models, except at low error rates where the two models behave in a similar way. However, there is a wide range of performance depending on the superfamily. For 12% of all superfamilies the ROC5 value for superfamily models is more than 0.2 above that of the domain models, and for 10% of superfamilies the domain models show a similar improvement in performance over the superfamily models. Conclusion: Using a sensitive profile-profile method, we have investigated the performance of single structure-based models and multiple sequence models (domain models) in detecting remote superfamily members. We find that, overall, multiple models perform better in recognition, although single structure-based models display better alignment accuracy.
Affiliation(s)
- James A Casbon
- Bioinformatics Group, Institute of Cell and Molecular Science, The Genome Centre, Queen Mary's School of Medicine and Dentistry, Charterhouse Square, London, EC1M 6BQ, UK
- Mansoor AS Saqi
- Bioinformatics Group, Institute of Cell and Molecular Science, The Genome Centre, Queen Mary's School of Medicine and Dentistry, Charterhouse Square, London, EC1M 6BQ, UK
|
88
|
Abstract
Structural genomics (SG) projects aim to expand our structural knowledge of biological macromolecules while lowering the average costs of structure determination. We quantitatively analyzed the novelty, cost, and impact of structures solved by SG centers, and we contrast these results with traditional structural biology. The first structure identified in a protein family enables inference of the fold and of ancient relationships to other proteins; in the year ending 31 January 2005, about half of such structures were solved at a SG center rather than in a traditional laboratory. Furthermore, the cost of solving a structure at the most efficient SG center in the United States has dropped to one-quarter of the estimated cost of solving a structure by traditional methods. However, the efficiency of the top structural biology laboratories, even though they work on very challenging structures, is comparable to that of SG centers; moreover, traditional structural biology papers are cited significantly more often, suggesting greater current impact.
Affiliation(s)
- John-Marc Chandonia
- Berkeley Structural Genomics Center, Physical Biosciences Division, Lawrence Berkeley National Laboratory, and Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
|
89
|
Singh J, Deng Z, Narale G, Chuaqui C. Structural Interaction Fingerprints: A New Approach to Organizing, Mining, Analyzing, and Designing Protein-Small Molecule Complexes. Chem Biol Drug Des 2006; 67:5-12. [PMID: 16492144 DOI: 10.1111/j.1747-0285.2005.00323.x] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Abstract
Advances in structure-based drug design in the pharmaceutical industry, in parallel with structural genomics initiatives in the public domain, have led to an explosion in the number of structures of protein-small molecule complexes. This information is critically important both for understanding the structural basis of molecular recognition in biological systems and for designing better drugs. A significant challenge exists in managing this vast amount of data and fully leveraging it. Here, we review our work to develop a simple, fast way to store, organize, mine, and analyze large numbers of protein-small molecule complexes. We illustrate the utility of the approach in managing inhibitor complexes from the protein kinase family. Finally, we describe our recent efforts in applying this method to the design of target-focused chemical libraries.
Affiliation(s)
- Juswinder Singh
- Computational Drug Design Group, Department of Research Informatics, Biogen Idec, 12 Cambridge Center, Cambridge, MA 02142, USA
|
90
|
Gutteridge A, Thornton JM. Understanding nature's catalytic toolkit. Trends Biochem Sci 2005; 30:622-9. [PMID: 16214343 DOI: 10.1016/j.tibs.2005.09.006] [Citation(s) in RCA: 133] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/15/2005] [Revised: 08/24/2005] [Accepted: 09/15/2005] [Indexed: 11/25/2022]
Abstract
Enzymes catalyse numerous reactions in nature, often causing spectacular accelerations in the catalysis rate. One aspect of understanding how enzymes achieve these feats is to explore how they use the limited set of residue side chains that form their 'catalytic toolkit'. Combinations of different residues form 'catalytic units' that are found repeatedly in different unrelated enzymes. Most catalytic units facilitate rapid catalysis in the enzyme active site either by providing charged groups to polarize substrates and to stabilize transition states, or by modifying the pKa values of other residues to provide more effective acids and bases. Given recent efforts to design novel enzymes, the rise of structural genomics and subsequent efforts to predict the function of enzymes from their structure, these units provide a simple framework to describe how nature uses the tools at her disposal, and might help to improve techniques for designing and predicting enzyme function.
Affiliation(s)
- Alex Gutteridge
- EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
|
91
|
Xie L, Bourne PE. Functional coverage of the human genome by existing structures, structural genomics targets, and homology models. PLoS Comput Biol 2005; 1:e31. [PMID: 16118666 PMCID: PMC1188274 DOI: 10.1371/journal.pcbi.0010031] [Citation(s) in RCA: 71] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2005] [Accepted: 07/18/2005] [Indexed: 11/23/2022] Open
Abstract
The bias in protein structure and function space resulting from experimental limitations and targeting of particular functional classes of proteins by structural biologists has long been recognized, but never continuously quantified. Using the Enzyme Commission and the Gene Ontology classifications as a reference frame, and integrating structure data from the Protein Data Bank (PDB), target sequences from the structural genomics projects, structure homology derived from the SUPERFAMILY database, and genome annotations from Ensembl and NCBI, we provide a quantified view, both at the domain and whole-protein levels, of the current and projected coverage of protein structure and function space relative to the human genome. Protein structures currently provide at least one domain that covers 37% of the functional classes identified in the genome; whole structure coverage exists for 25% of the genome. If all the structural genomics targets were solved (twice the current number of structures in the PDB), it is estimated that structures of one domain would cover 69% of the functional classes identified and complete structure coverage would be 44%. Homology models from existing experimental structures extend the 37% coverage to 56% of the genome as single domains and 25% to 31% for complete structures. Coverage from homology models is not evenly distributed by protein family, reflecting differing degrees of sequence and structure divergence within families. While these data provide coverage, conversely, they also systematically highlight functional classes of proteins for which structures should be determined. Current key functional families without structure representation are highlighted here; updated information on the “most wanted list” that should be solved is available on a weekly basis from http://function.rcsb.org:8080/pdb/function_distribution/index.html. 
The sequencing of the human genome provides biologists with new opportunities to understand the molecular basis of physiological processes and disease states. To take full advantage of these opportunities, the three-dimensional structures of the gene products are needed to provide the appropriate level of detail. Since protein structure determination lags behind protein sequence determination, an important and ongoing question becomes: what degree of coverage of the human proteome do we have from experimental structures, and what can we infer by modeling? Or, turning the question around: what structures do we need to determine (the “most wanted list”) to further our understanding of the human condition? This paper addresses these questions through integration of existing data resources correlated using comparative functional features, namely the Gene Ontology, which describes biochemical process, molecular function, and cellular location for all types of proteins, and the Enzyme Commission classification for enzymes. Genetic disease states are linked through the Online Mendelian Inheritance in Man resource. Readers can ask their own questions of the resource at http://function.rcsb.org:8080/pdb/function_distribution/index.html. The resource should prove particularly useful to the structural genomics community as it strives to undertake large-scale structure determination with a goal of improving the understanding of protein functional space.
Affiliation(s)
- Lei Xie
- San Diego Supercomputer Center and Department of Pharmacology, University of California, San Diego, California, United States of America
- Philip E Bourne
- San Diego Supercomputer Center and Department of Pharmacology, University of California, San Diego, California, United States of America
- *To whom correspondence should be addressed.
|
92
|
|
93
|
Ferrer-Costa C, Shanahan HP, Jones S, Thornton JM. HTHquery: a method for detecting DNA-binding proteins with a helix-turn-helix structural motif. Bioinformatics 2005; 21:3679-80. [PMID: 16030074 DOI: 10.1093/bioinformatics/bti575] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Summary: HTHquery is a web-based service to determine whether a protein structure has a helix-turn-helix structural motif that could bind to DNA. It is based on similarity to a set of structural templates, the accessibility of a putative structural motif, and a positive electrostatic potential in the neighbourhood of the putative motif. A set of scores is computed, based on each template, using a linear predictor. On the training set used, the predictor has a true positive rate of 83.5% and a false positive rate of 0.8%. The emphasis for the website is on providing a straightforward interface which can be easily used by a bench-based scientist. Availability: HTHquery is implemented using a set of Perl scripts and a C program and can be accessed freely on the website http://www.ebi.ac.uk/thornton-srv/databases/HTHquery.
Affiliation(s)
- C Ferrer-Costa
- Molecular Modelling and Bioinformatics, IRBB-Parc Cientific de Barcelona, UB, Josep Samitier, 1-5 08028 Barcelona, Catalonia, Spain
|
94
|
Büssow K, Scheich C, Sievert V, Harttig U, Schultz J, Simon B, Bork P, Lehrach H, Heinemann U. Structural genomics of human proteins--target selection and generation of a public catalogue of expression clones. Microb Cell Fact 2005; 4:21. [PMID: 15998469 PMCID: PMC1250228 DOI: 10.1186/1475-2859-4-21] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2005] [Accepted: 07/05/2005] [Indexed: 11/12/2022] Open
Abstract
Background: The availability of suitable recombinant protein is still a major bottleneck in protein structure analysis. The Protein Structure Factory, part of the international structural genomics initiative, targets human proteins for structure determination. It has implemented high throughput procedures for all steps from cloning to structure calculation. This article describes the selection of human target proteins for structure analysis, our high throughput cloning strategy, and the expression of human proteins in Escherichia coli host cells. Results and Conclusion: Protein expression and sequence data of 1414 E. coli expression clones representing 537 different proteins are presented. 139 human proteins (18%) could be expressed and purified in soluble form and with the expected size. All E. coli expression clones are publicly available to facilitate further functional characterisation of this set of human proteins.
Affiliation(s)
- Konrad Büssow
- Protein Structure Factory, Heubnerweg 6, 14059 Berlin, Germany
- Max-Planck-Institut für Molekulare Genetik, Ihnestr. 73, 14195 Berlin, Germany
- Christoph Scheich
- Protein Structure Factory, Heubnerweg 6, 14059 Berlin, Germany
- Max-Planck-Institut für Molekulare Genetik, Ihnestr. 73, 14195 Berlin, Germany
- Volker Sievert
- Protein Structure Factory, Heubnerweg 6, 14059 Berlin, Germany
- Max-Planck-Institut für Molekulare Genetik, Ihnestr. 73, 14195 Berlin, Germany
- Ulrich Harttig
- Protein Structure Factory, Heubnerweg 6, 14059 Berlin, Germany
- RZPD German Resource Center for Genome Research GmbH, Heubnerweg 6, 14059 Berlin, Germany
- DIFE, Arthur-Scheunert-Allee 114–116, 14558 Nuthetal, Germany
- Jörg Schultz
- EMBL Heidelberg, Meyerhofstr. 1, 69117 Heidelberg, Germany
- Department of Bioinformatics, University of Würzburg, Biocenter, Am Hubland, 97074 Würzburg, Germany
- Bernd Simon
- EMBL Heidelberg, Meyerhofstr. 1, 69117 Heidelberg, Germany
- Peer Bork
- EMBL Heidelberg, Meyerhofstr. 1, 69117 Heidelberg, Germany
- Hans Lehrach
- Protein Structure Factory, Heubnerweg 6, 14059 Berlin, Germany
- Max-Planck-Institut für Molekulare Genetik, Ihnestr. 73, 14195 Berlin, Germany
- Udo Heinemann
- Protein Structure Factory, Heubnerweg 6, 14059 Berlin, Germany
- Max-Delbrück-Centrum für Molekulare Medizin, Robert-Rössle-Str. 10, 13092 Berlin, Germany
- Institut für Chemie/Kristallographie, Freie Universität, Takustr. 6, 14195 Berlin, Germany
|