51
|
Abstract
The recognition of remote protein homologies is a major aspect of the structural and functional annotation of newly determined genomes. Here we benchmark the coverage and error rate of genome annotation using the widely used homology-searching program PSI-BLAST (position-specific iterated basic local alignment search tool). This study evaluates the one-to-many success rate for recognition, as often there are several homologues in the database and only one needs to be identified for annotating the sequence. In contrast, previous benchmarks considered one-to-one recognition in which a single query was required to find a particular target. The benchmark constructs a model genome from the full sequences of the Structural Classification of Proteins (SCOP) database and searches against a target library of remote homologous domains (<20 % identity). The structural benchmark provides a reliable list of correct and false homology assignments. PSI-BLAST successfully annotated 40 % of the domains in the model genome that had at least one homologue in the target library. This coverage is more than three times that obtained when one-to-one recognition is evaluated (11 % coverage of domains). Although a structural benchmark was used, the results apply equally to purely sequence-based homology searches. Accordingly, structural and sequence assignments were made to the sequences of Mycoplasma genitalium and Mycobacterium tuberculosis (see http://www.bmm.icnet.uk). The extent of missed assignments and of new superfamilies can be estimated for these genomes for both structural and functional annotations.
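The distinction between one-to-one and one-to-many coverage in the abstract above can be illustrated with a minimal sketch. The function names and the toy hit lists are hypothetical and are not taken from the paper; the sketch only assumes that true homologous (query, target) pairs and detected pairs are available as sets.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (query domain, target domain)


def one_to_one_coverage(true_pairs: Set[Pair], found_pairs: Set[Pair]) -> float:
    """Fraction of all true homologous pairs that were detected."""
    if not true_pairs:
        return 0.0
    return len(true_pairs & found_pairs) / len(true_pairs)


def one_to_many_coverage(true_pairs: Set[Pair], found_pairs: Set[Pair]) -> float:
    """Fraction of queries (with at least one true homologue) for which
    at least one true homologue was detected."""
    queries = {query for query, _ in true_pairs}
    if not queries:
        return 0.0
    annotated = {query for query, _ in true_pairs & found_pairs}
    return len(annotated) / len(queries)


# Toy data (hypothetical): three queries with six true homologous pairs.
TRUE_PAIRS = {
    ("q1", "t1"), ("q1", "t2"), ("q1", "t3"),
    ("q2", "t4"), ("q2", "t5"),
    ("q3", "t6"),
}
# Detected pairs; ("q3", "t9") is a false positive and counts for neither measure.
FOUND_PAIRS = {("q1", "t2"), ("q3", "t6"), ("q3", "t9")}

if __name__ == "__main__":
    print("one-to-one :", one_to_one_coverage(TRUE_PAIRS, FOUND_PAIRS))
    print("one-to-many:", one_to_many_coverage(TRUE_PAIRS, FOUND_PAIRS))
```

In this toy case one hit per query is enough for annotation, so the one-to-many figure is higher than the one-to-one figure, which mirrors the direction of the difference reported in the abstract.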
Affiliation(s)
- A Müller
- Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, 44 Lincoln's Inn Fields, London, WC2A 3PX, England
|
52
|
Burley SK, Almo SC, Bonanno JB, Capel M, Chance MR, Gaasterland T, Lin D, Sali A, Studier FW, Swaminathan S. Structural genomics: beyond the human genome project. Nat Genet 1999; 23:151-7. [PMID: 10508510 DOI: 10.1038/13783] [Citation(s) in RCA: 275] [Impact Index Per Article: 11.0] [Indexed: 11/09/2022]
Abstract
With access to whole genome sequences for various organisms and imminent completion of the Human Genome Project, the entire process of discovery in molecular and cellular biology is poised to change. Massively parallel measurement strategies promise to revolutionize how we study and ultimately understand the complex biochemical circuitry responsible for controlling normal development, physiologic homeostasis and disease processes. This information explosion is also providing the foundation for an important new initiative in structural biology. We are about to embark on a program of high-throughput X-ray crystallography aimed at developing a comprehensive mechanistic understanding of normal and abnormal human and microbial physiology at the molecular level. We present the rationale for creation of a structural genomics initiative, recount the efforts of ongoing structural genomics pilot studies, and detail the lofty goals, technical challenges and pitfalls facing structural biologists.
Affiliation(s)
- S K Burley
- Howard Hughes Medical Institute, 1230 York Avenue, New York, New York 10021, USA.
|
53
|
Stohl EA, Brady SF, Clardy J, Handelsman J. ZmaR, a novel and widespread antibiotic resistance determinant that acetylates zwittermicin A. J Bacteriol 1999; 181:5455-60. [PMID: 10464220 PMCID: PMC94055 DOI: 10.1128/jb.181.17.5455-5460.1999] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.7] [Indexed: 11/20/2022]
Abstract
ZmaR is a resistance determinant of unusual abundance in the environment and confers on gram-positive and gram-negative bacteria resistance to zwittermicin A, a novel broad-spectrum antibiotic produced by species of Bacillus. The ZmaR protein has no sequence similarity to proteins of known function; thus, the purpose of the present study was to determine the function of ZmaR in vitro. Cell extracts of E. coli containing zmaR inactivated zwittermicin A by covalent modification. Chemical analysis of inactivated zwittermicin A by 1H NMR, 13C NMR, and high- and low-resolution mass spectrometry demonstrated that the inactivated zwittermicin A was acetylated. Purified ZmaR protein inactivated zwittermicin A, and biochemical assays for acetyltransferase activity with [14C]acetyl coenzyme A demonstrated that ZmaR catalyzes the acetylation of zwittermicin A with acetyl coenzyme A as a donor group, suggesting that ZmaR may constitute a new class of acetyltransferases. Our results allow us to assign a biochemical function to a resistance protein that has no sequence similarity to proteins of known function, contributing fundamental knowledge to the fields of antibiotic resistance and protein function.
Affiliation(s)
- E A Stohl
- Department of Plant Pathology, University of Wisconsin, Madison, Wisconsin 53706, USA
|
54
|
Abstract
Spectacular achievements in whole genome sequencing open up new possibilities for structural research. Protein structures can now be studied in their natural genomic context. On the other hand, structure prediction algorithms can be improved using species-specific tendencies in folding patterns. Finally, efficient strategies to select targets for structure determination can be devised. In this review we consider new computational approaches and results in protein structure analysis stemming from the availability of complete genomes.
Affiliation(s)
- D Frishman
- GSF-Forschungszentrum fuer Umwelt und Gesundheit, Munich Information Center for Protein Sequences, am Max-Planck-Institut für Biochemie, Martinsried, Germany.
|
55
|
Pawłowski K, Zhang B, Rychlewski L, Godzik A. The Helicobacter pylori genome: From sequence analysis to structural and functional predictions. Proteins 1999. [DOI: 10.1002/(sici)1097-0134(19990701)36:1<20::aid-prot2>3.0.co;2-x] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.5] [Indexed: 11/09/2022]
|
56
|
Zhang B, Rychlewski L, Pawłowski K, Fetrow JS, Skolnick J, Godzik A. From fold predictions to function predictions: automation of functional site conservation analysis for functional genome predictions. Protein Sci 1999; 8:1104-15. [PMID: 10338021 PMCID: PMC2144342 DOI: 10.1110/ps.8.5.1104] [Citation(s) in RCA: 45] [Impact Index Per Article: 1.8] [Indexed: 10/21/2022]
Abstract
A database of functional sites for proteins with known structures, SITE, is constructed and used in conjunction with a simple pattern matching program SiteMatch to evaluate possible function conservation in a recently constructed database of fold predictions for Escherichia coli proteins (Rychlewski L et al., 1999, Protein Sci 8:614-624). In this and other prediction databases, fold predictions are based on algorithms that can recognize weak sequence similarities and putatively assign new proteins into already characterized protein families. It is not clear whether such sequence similarities arise from distant homologies or general similarity of physicochemical features along the sequence. Leaving aside the important question of the nature of relations within fold superfamilies, it is possible to assess possible function conservation by looking at the pattern of conservation of crucial functional residues. SITE consists of a multilevel function description based on structure annotations and structure analyses. In particular, active site residues, ligand binding residues, and patterns of hydrophobic residues on the protein surface are used to describe different functional features. SiteMatch, a simple pattern matching program, is designed to check the conservation of residues involved in protein activity in alignments generated by any alignment method. Here, this procedure is used to study conservation of functional features in alignments between protein sequences from the E. coli genome and their optimal structural templates. The optimal templates were identified and alignments taken from the database of genomic structural predictions that was described in a previous publication (Rychlewski L et al., 1999, Protein Sci 8:614-624). An automated assessment of function conservation is used to analyze the relation between fold and function similarity for a large number of fold predictions.
For instance, it is shown that identifying low significance predictions with a high level of functional residue conservations can be used to extend the prediction sensitivity for fold prediction methods. Over 100 new fold/function predictions in this class were obtained in the E. coli genome. At the same time, about 30% of our previous fold predictions are not confirmed as function predictions, further highlighting the problem of function divergence in fold superfamilies.
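The kind of check described in the abstract above, testing whether functional residues of a template are conserved in an aligned query, can be sketched as follows. This is not the SiteMatch implementation; the function name, the identity-only conservation rule, and the toy alignment are assumptions made purely for illustration.

```python
from typing import Set


def site_conservation(template_aln: str, query_aln: str, site_positions: Set[int]) -> float:
    """Fraction of functional-site residues that are identical in the query.

    template_aln, query_aln: two rows of a pairwise alignment ('-' marks a gap).
    site_positions: 1-based residue numbers in the ungapped template sequence.
    Assumption: a site residue counts as conserved only on exact identity.
    """
    if len(template_aln) != len(query_aln):
        raise ValueError("alignment rows must have equal length")
    if not site_positions:
        return 0.0
    conserved = 0
    template_pos = 0  # residue counter along the ungapped template
    for t_char, q_char in zip(template_aln, query_aln):
        if t_char == "-":
            continue  # insertion in the query; no template residue here
        template_pos += 1
        if template_pos in site_positions and q_char == t_char:
            conserved += 1
    return conserved / len(site_positions)


# Toy alignment (hypothetical): template residues 3, 4, 6 and 7 form the site.
TEMPLATE = "MKH-DSGC"
QUERY = "MRHADS-C"
SITE = {3, 4, 6, 7}

if __name__ == "__main__":
    print(site_conservation(TEMPLATE, QUERY, SITE))
```

A score of this kind could be used to flag low-significance fold predictions whose functional residues are nevertheless well conserved, which is the use case the abstract describes; a realistic version would probably allow conservative substitutions rather than strict identity.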
Affiliation(s)
- B Zhang
- The Scripps Research Institute, La Jolla, California 92037, USA
|
57
|
Abstract
Methods for protein structure (3D)-sequence (1D) compatibility evaluation (threading) have been developed during the past decade. The protocol in which a sequence can recognize its compatible structure in the structural library (i.e., the fold recognition or the forward-folding search) is available for the structure prediction of new proteins. However, the reverse protocol, in which a structure recognizes its homologous sequences among a sequence database, named the inverse-folding search, is a more difficult application. In this study, we have investigated the feasibility of the latter approach. A structural library, composed of about 400 well-resolved structures with mutually dissimilar sequences, was prepared, and 163 of them had remote homologs in the library. We examined whether they could correctly seek their homologs by both forward- and inverse-folding searches. The results showed that the inverse-folding protocol is more effective than the forward-folding protocol, once the reference states of the compatibility functions are appropriately adjusted. This adjustment only slightly affects the ability of the forward-folding search. We noticed that the scoring, in which a given sequence is re-mounted onto a structure according to the 3D-1D alignment determined by the dynamic programming method, is only effective in the forward-folding protocol and not in the inverse-folding protocol. Namely, the inverse-folding search works significantly better with the score given by the 3D-1D alignment per se, rather than that obtained by the re-mounting. The implications of these results are discussed.
Affiliation(s)
- M Ota
- National Institute of Genetics, Mishima, Shizuoka, Japan.
|
58
|
Jones DT. GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol 1999; 287:797-815. [PMID: 10191147 DOI: 10.1006/jmbi.1999.2583] [Citation(s) in RCA: 614] [Impact Index Per Article: 24.6] [Indexed: 12/23/2022]
Abstract
A new protein fold recognition method is described which is both fast and reliable. The method uses a traditional sequence alignment algorithm to generate alignments which are then evaluated by a method derived from threading techniques. As a final step, each threaded model is evaluated by a neural network in order to produce a single measure of confidence in the proposed prediction. The speed of the method, along with its sensitivity and very low false-positive rate makes it ideal for automatically predicting the structure of all the proteins in a translated bacterial genome (proteome). The method has been applied to the genome of Mycoplasma genitalium, and analysis of the results shows that as many as 46 % of the proteins derived from the predicted protein coding regions have a significant relationship to a protein of known structure. In some cases, however, only one domain of the protein can be predicted, giving a total coverage of 30 % when calculated as a fraction of the number of amino acid residues in the whole proteome.
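The abstract above quotes two different coverage figures, one counted per protein and one counted per residue. A minimal sketch of how the two measures diverge is given below; the function name and the toy proteome are hypothetical and do not reproduce the paper's data.

```python
from typing import List, Tuple


def proteome_coverage(proteins: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (protein-level coverage, residue-level coverage).

    proteins: list of (sequence length, number of residues covered by a match).
    A protein counts as covered at the protein level if any residue is matched.
    """
    if not proteins:
        return 0.0, 0.0
    proteins_hit = sum(1 for _, matched in proteins if matched > 0)
    total_residues = sum(length for length, _ in proteins)
    matched_residues = sum(matched for _, matched in proteins)
    return proteins_hit / len(proteins), matched_residues / total_residues


# Toy proteome (hypothetical): one fully matched protein, one with a single
# matched domain, and two without any match.
TOY_PROTEOME = [(300, 300), (500, 150), (200, 0), (400, 0)]

if __name__ == "__main__":
    by_protein, by_residue = proteome_coverage(TOY_PROTEOME)
    print(f"protein-level: {by_protein:.2f}, residue-level: {by_residue:.2f}")
```

Because a multidomain protein with only one recognized domain counts fully at the protein level but only partially at the residue level, the residue-level figure is the lower of the two, as in the abstract.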
Affiliation(s)
- D T Jones
- Department of Biological Sciences, University of Warwick, Coventry, CV4 7AL, UK.
|
59
|
Salamov AA, Suwa M, Orengo CA, Swindells MB. Genome analysis: Assigning protein coding regions to three-dimensional structures. Protein Sci 1999; 8:771-7. [PMID: 10211823 PMCID: PMC2144302 DOI: 10.1110/ps.8.4.771] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.0] [Indexed: 10/19/2022]
Abstract
We describe the results of a procedure for maximizing the number of sequences that can be reliably linked to a protein of known three-dimensional structure. Unlike other methods, which try to increase sensitivity through the use of fold recognition software, we only use conventional sequence alignment tools, but apply them in a manner that significantly increases the number of relationships detected. We analyzed 11 genomes and found that, depending on the genome, between 23 and 32% of the ORFs had significant matches to proteins of known structure. In all cases, the aligned region consisted of either >100 residues or >50% of the smaller sequence. Slightly higher percentages could be attained if smaller motifs were also included. This is significantly higher than most previously reported methods, even those that have a fold-recognition component. We survey the biochemical and structural characteristics of the most frequently occurring proteins, and discuss the extent to which alignment methods can realistically assign function to gene products.
|
60
|
Rychlewski L, Zhang B, Godzik A. Functional insights from structural predictions: analysis of the Escherichia coli genome. Protein Sci 1999; 8:614-24. [PMID: 10091664 PMCID: PMC2144289 DOI: 10.1110/ps.8.3.614] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.4] [Indexed: 10/19/2022]
Abstract
Fold assignments for proteins from the Escherichia coli genome are carried out using BASIC, a profile-profile alignment algorithm recently tested on fold recognition benchmarks and on the Mycoplasma genitalium genome, and PSI-BLAST, the newest generation of the de facto standard in homology search algorithms. The fold assignments are followed by automated modeling and the resulting three-dimensional models are analyzed for possible function prediction. Close to 30% of the proteins encoded in the E. coli genome can be recognized as homologous to a protein family with known structure. Most of these homologies (23% of the entire genome) can be recognized by both the PSI-BLAST and BASIC algorithms, but the latter recognizes an additional 260 homologies. Previous estimates suggested that only 10-15% of E. coli proteins can be characterized this way. This dramatic increase in the number of recognized homologies between E. coli proteins and structurally characterized protein families is partly due to the rapid increase of the database of known protein structures, but mostly it is due to the significant improvement in prediction algorithms. Knowing protein structure adds a new dimension to our understanding of its function and the predictions presented here can be used to predict function for uncharacterized proteins. Several examples, analyzed in more detail in this paper, include the DPS protein protecting DNA from oxidative damage (predicted to be homologous to ferritin with iron ion acting as a reducing agent) and the ahpC/tsa family of proteins, which provides resistance to various oxidizing agents (predicted to be homologous to glutathione peroxidase).
Affiliation(s)
- L Rychlewski
- Department of Molecular Biology, The Scripps Research Institute, La Jolla, California 92037, USA
|
61
|
Fischer D. Modeling three-dimensional protein structures for amino acid sequences of the CASP3 experiment using sequence-derived predictions. Proteins 1999. [DOI: 10.1002/(sici)1097-0134(1999)37:3+<61::aid-prot9>3.0.co;2-9] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.4] [Indexed: 11/09/2022]
|
62
|
Wolf YI, Brenner SE, Bash PA, Koonin EV. Distribution of Protein Folds in the Three Superkingdoms of Life. Genome Res 1999. [DOI: 10.1101/gr.9.1.17] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.1] [Indexed: 11/25/2022]
Abstract
A sensitive protein-fold recognition procedure was developed on the basis of iterative database search using the PSI-BLAST program. A collection of 1193 position-dependent weight matrices that can be used as fold identifiers was produced. In the completely sequenced genomes, folds could be automatically identified for 20%–30% of the proteins, with 3%–6% more detectable by additional analysis of conserved motifs. The distribution of the most common folds is very similar in bacteria and archaea but distinct in eukaryotes. Within the bacteria, this distribution differs between parasitic and free-living species. In all analyzed genomes, the P-loop NTPases are the most abundant fold. In bacteria and archaea, the next most common folds are ferredoxin-like domains, TIM-barrels, and methyltransferases, whereas in eukaryotes, the second to fourth places belong to protein kinases, β-propellers and TIM-barrels. The observed diversity of protein folds in different proteomes is approximately twice as high as would be expected from a simple stochastic model describing a proteome as a finite sample from an infinite pool of proteins with an exponential distribution of the fold fractions. Distribution of the number of domains with different folds in one protein fits the geometric model, which is compatible with the evolution of multidomain proteins by random combination of domains. [Fold predictions for proteins from 14 proteomes are available on the World Wide Web at ftp://ncbi.nlm.nih.gov/pub/koonin/FOLDS/index.html. The FIDs are available by anonymous ftp at the same location.]
|
63
|
|
64
|
Fischer D, Barret C, Bryson K, Elofsson A, Godzik A, Jones D, Karplus KJ, Kelley LA, MacCallum RM, Pawłowski K, Rost B, Rychlewski L, Sternberg M. CAFASP-1: Critical assessment of fully automated structure prediction methods. Proteins 1999. [DOI: 10.1002/(sici)1097-0134(1999)37:3+<209::aid-prot27>3.0.co;2-y] [Citation(s) in RCA: 107] [Impact Index Per Article: 4.3] [Indexed: 11/10/2022]
|
65
|
Teichmann SA, Park J, Chothia C. Structural assignments to the Mycoplasma genitalium proteins show extensive gene duplications and domain rearrangements. Proc Natl Acad Sci U S A 1998; 95:14658-63. [PMID: 9843945 PMCID: PMC24505 DOI: 10.1073/pnas.95.25.14658] [Citation(s) in RCA: 112] [Impact Index Per Article: 4.3] [Indexed: 11/18/2022]
Abstract
The parasitic bacterium Mycoplasma genitalium has a small, reduced genome with close to a basic set of genes. As a first step toward determining the families of protein domains that form the products of these genes, we have used the multiple sequence programs PSI-BLAST and GEANFAMMER to match the sequences of the 467 gene products of M. genitalium to the sequences of the domains that form proteins of known structure [Protein Data Bank (PDB) sequences]. PDB sequences (274) match all of 106 M. genitalium sequences and some parts of another 85; thus, 41% of its total sequences are matched in all or part. The evolutionary relationships of the PDB domains that match M. genitalium are described in the structural classification of proteins (SCOP) database. Using this information, we show that the domains in the matched M. genitalium sequences come from 114 superfamilies and that 58% of them have arisen by gene duplication. This level of duplication is more than twice that found by using pairwise sequence comparisons. The PDB domain matches also describe the domain structure of the matched sequences: just over a quarter contain one domain and the rest have combinations of two or more domains.
Affiliation(s)
- S A Teichmann
- Medical Research Council Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 2QH, United Kingdom.
|
66
|
Abstract
Eight microbial genomes are compared in terms of protein structure. Specifically, yeast, H. influenzae, M. genitalium, M. jannaschii, Synechocystis, M. pneumoniae, H. pylori, and E. coli are compared in terms of patterns of fold usage, that is, whether a given fold occurs in a particular organism. Of the approximately 340 soluble protein folds currently in the structure databank (PDB), 240 occur in at least one of the eight genomes, and 30 are shared amongst all eight. The shared folds are depleted in all-helical structure and enriched in mixed helix-sheet structure compared to the folds in the PDB. The top-10 most common of the shared 30 are enriched in superfolds, uniting many non-homologous sequence families, and are especially similar in overall architecture, with eight having helices packed onto a central sheet. They are also very different from the common folds in the PDB, highlighting databank biases. Folds can be ranked in terms of expression as well as genome duplication. In yeast the top-10 most highly expressed folds are considerably different from the most highly duplicated folds. A tree can be constructed grouping genomes in terms of their shared folds. This has a remarkably similar topology to more conventional classifications, based on very different measures of relatedness. Finally, folds of membrane proteins can be analyzed through transmembrane-helix (TM) prediction. All the genomes appear to have similar usage patterns for these folds, with the occurrence of a particular fold falling off rapidly with increasing numbers of TM-elements, according to a "Zipf-like" law. This implies there are no marked preferences for proteins with particular numbers of TM-helices (e.g. 7-TM) in microbial genomes.
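The fold-usage comparison described in the abstract above reduces to set operations on presence/absence data. The sketch below is a hedged illustration only: the genome labels, fold names, and the choice of Jaccard distance as the input to a tree are assumptions, not the measures used in the paper.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def fold_usage_summary(usage: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (folds seen in at least one genome, folds shared by all genomes)."""
    if not usage:
        return set(), set()
    fold_sets = list(usage.values())
    return set.union(*fold_sets), set.intersection(*fold_sets)


def jaccard_distances(usage: Dict[str, Set[str]]) -> Dict[Tuple[str, str], float]:
    """Pairwise distance 1 - |A ∩ B| / |A ∪ B| between genome fold sets.

    Assumption: such a distance matrix could feed a standard clustering
    method to group genomes by shared folds.
    """
    distances = {}
    for a, b in combinations(sorted(usage), 2):
        union = usage[a] | usage[b]
        shared = usage[a] & usage[b]
        distances[(a, b)] = 1.0 - len(shared) / len(union) if union else 0.0
    return distances


# Toy presence/absence data (hypothetical genomes and fold labels).
TOY_USAGE = {
    "genome_A": {"TIM-barrel", "P-loop NTPase", "ferredoxin-like"},
    "genome_B": {"TIM-barrel", "P-loop NTPase", "immunoglobulin-like"},
    "genome_C": {"TIM-barrel", "P-loop NTPase", "ferredoxin-like", "Rossmann"},
}

if __name__ == "__main__":
    seen, shared = fold_usage_summary(TOY_USAGE)
    print(len(seen), "folds seen;", sorted(shared), "shared by all")
    print(jaccard_distances(TOY_USAGE))
```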
Affiliation(s)
- M Gerstein
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA.
|
67
|
Abstract
The recent sequencing of the entire genomes of Mycoplasma genitalium and M. pneumoniae has attracted considerable attention to the molecular biology of mycoplasmas, the smallest self-replicating organisms. It appears that we are now much closer to the goal of defining, in molecular terms, the entire machinery of a self-replicating cell. Comparative genomics, based on comparison of the genomic makeup of mycoplasmal genomes with those of other bacteria, has opened new ways of looking at the evolutionary history of the mycoplasmas. There is now solid genetic support for the hypothesis that mycoplasmas have evolved as a branch of gram-positive bacteria by a process of reductive evolution. During this process, the mycoplasmas lost considerable portions of their ancestors' chromosomes but retained the genes essential for life. Thus, the mycoplasmal genomes carry a high percentage of conserved genes, greatly facilitating gene annotation. The significant genome compaction that occurred in mycoplasmas was made possible by adopting a parasitic mode of life. The supply of nutrients from their hosts apparently enabled mycoplasmas to lose, during evolution, the genes for many assimilative processes. During their evolution and adaptation to a parasitic mode of life, the mycoplasmas have developed various genetic systems providing a highly plastic set of variable surface proteins to evade the host immune system. The uniqueness of the mycoplasmal systems is manifested by the presence of highly mutable modules combined with an ability to expand the antigenic repertoire by generating structural alternatives, all compressed into limited genomic sequences. In the absence of a cell wall and a periplasmic space, the majority of surface variable antigens in mycoplasmas are lipoproteins. Apart from providing specific antimycoplasmal defense, the host immune system is also involved in the development of pathogenic lesions and exacerbation of mycoplasma-induced diseases.
Mycoplasmas are able to stimulate as well as suppress lymphocytes in a nonspecific, polyclonal manner, both in vitro and in vivo. As well as affecting various subsets of lymphocytes, mycoplasmas and mycoplasma-derived cell components modulate the activities of monocytes/macrophages and NK cells and trigger the production of a wide variety of up-regulating and down-regulating cytokines and chemokines. Mycoplasma-mediated secretion of proinflammatory cytokines, such as tumor necrosis factor alpha, interleukin-1 (IL-1), and IL-6, by macrophages and of up-regulating cytokines by mitogenically stimulated lymphocytes plays a major role in mycoplasma-induced immune system modulation and inflammatory responses.
Affiliation(s)
- S Razin
- Department of Membrane and Ultrastructure Research, The Hebrew University-Hadassah Medical School, Jerusalem 91120, Israel.
|
68
|
Sánchez R, Sali A. Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc Natl Acad Sci U S A 1998; 95:13597-602. [PMID: 9811845 PMCID: PMC24864 DOI: 10.1073/pnas.95.23.13597] [Citation(s) in RCA: 282] [Impact Index Per Article: 10.8] [Received: 06/22/1998] [Indexed: 11/18/2022]
Abstract
The function of a protein generally is determined by its three-dimensional (3D) structure. Thus, it would be useful to know the 3D structure of the thousands of protein sequences that are emerging from the many genome projects. To this end, fold assignment, comparative protein structure modeling, and model evaluation were automated completely. As an illustration, the method was applied to the proteins in the Saccharomyces cerevisiae (baker's yeast) genome. It resulted in all-atom 3D models for substantial segments of 1,071 (17%) of the yeast proteins, only 40 of which have had their 3D structure determined experimentally. Of the 1,071 modeled yeast proteins, 236 were related clearly to a protein of known structure for the first time; 41 of these previously have not been characterized at all.
Affiliation(s)
- R Sánchez
- Laboratories of Molecular Biophysics, The Rockefeller University, 1230 York Avenue, New York, NY 10021, USA
|
69
|
Bork P, Dandekar T, Diaz-Lazcoz Y, Eisenhaber F, Huynen M, Yuan Y. Predicting function: from genes to genomes and back. J Mol Biol 1998; 283:707-25. [PMID: 9790834 DOI: 10.1006/jmbi.1998.2144] [Citation(s) in RCA: 262] [Impact Index Per Article: 10.1] [Indexed: 11/22/2022]
Abstract
Predicting function from sequence using computational tools is a highly complicated procedure that is generally done for each gene individually. This review focuses on the added value that is provided by completely sequenced genomes in function prediction. Various levels of sequence annotation and function prediction are discussed, ranging from genomic sequence to that of complex cellular processes. Protein function is currently best described in the context of molecular interactions. In the near future it will be possible to predict protein function in the context of higher order processes such as the regulation of gene expression, metabolic pathways and signalling cascades. The analysis of such higher levels of function description uses, besides the information from completely sequenced genomes, also the additional information from proteomics and expression data. The final goal will be to elucidate the mapping between genotype and phenotype.
Affiliation(s)
- P Bork
- European Molecular Biology Laboratory, Meyerhofstr. 1, Heidelberg, PF 10.2209, Germany.
|
70
|
Dubchak I, Muchnik I, Kim SH. Assignment of folds for proteins of unknown function in three microbial genomes. Microbial & Comparative Genomics 1998; 3:171-5. [PMID: 9775387 DOI: 10.1089/omi.1.1998.3.171] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022]
Abstract
Analysis of DNA sequences of several microbial genomes has revealed that a large fraction of predicted coding regions has no known protein function. Information about the three-dimensional folds of these proteins may provide insight into their possible functions. To predict the folds for protein sequences with little or no homology to proteins of known function, we used computational neural networks trained on the database of proteins with known three-dimensional structures. Global descriptions of protein sequences based on physical and structural properties of the constituent amino acids were used as inputs for neural networks. Of the 131, 498, and 868 protein sequences of unknown function from Mycoplasma genitalium, Haemophilus influenzae, and Methanococcus jannaschii (Fleischmann et al. 1995), we have made high-confidence fold assignments for 4, 10, and 19 sequences, respectively.
Affiliation(s)
- I Dubchak
- E. O. Lawrence Berkeley National Laboratory, University of California, Berkeley, USA
|
71
|
Russell RB, Sasieni PD, Sternberg MJ. Supersites within superfolds. Binding site similarity in the absence of homology. J Mol Biol 1998; 282:903-18. [PMID: 9743635 DOI: 10.1006/jmbi.1998.2043] [Citation(s) in RCA: 162] [Impact Index Per Article: 6.2] [Indexed: 01/08/2023]
Abstract
A method is presented to assess the significance of binding site similarities within superimposed protein three-dimensional (3D) structures and applied to all similar structures in the Protein Data Bank. For similarities between 3D structures lacking significant sequence similarity, the important distinction was made between remote homology (an ancient common ancestor) and analogy (likely convergence to a folding motif) according to the structural classification of proteins (SCOP) database. Supersites were defined as structural locations on groups of analogous proteins (i.e. superfolds) showing a statistically significant tendency to bind substrates despite little evidence of a common ancestor for the proteins considered. We identify three potentially new superfolds containing supersites: ferredoxin-like folds, four-helical bundles and double-stranded beta helices. In addition, the method quantifies binding site similarities within homologous proteins and previously identified supersites such as that found in the beta/alpha (TIM) barrels. For the nine superfolds, the accuracy of predictions of binding site locations is assessed. Implications for protein evolution, and the prediction of protein function either through fold recognition or tertiary structure comparison, are discussed.
Affiliation(s)
- R B Russell
- Biomolecular Modelling Laboratory, Lincoln's Inn Fields, PO Box 123, London WC2A 3PX, UK
|
72
|
Abstract
The rapid growth in the number of experimentally determined three-dimensional protein structures has sharpened the need for comprehensive and up-to-date surveys of known structures. Classic work on protein structure classification has made it clear that a structural survey is best carried out at the level of domains, i.e., substructures that recur in evolution as functional units in different protein contexts. We present a method for automated domain identification from protein structure atomic coordinates based on quantitative measures of compactness and, as the new element, recurrence. Compactness criteria are used to recursively divide a protein into a series of successively smaller and smaller substructures. Recurrence criteria are used to select an optimal size level of these substructures, so that many of the chosen substructures are common to different proteins at a high level of statistical significance. The joint application of these criteria automatically yields consistent domain definitions between remote homologs, a result difficult to achieve using compactness criteria alone. The method is applied to a representative set of 1,137 sequence-unique protein families covering 6,500 known structures. Clustering of the resulting set of domains (substructures) yields 594 distinct fold classes (types of substructures). The Dali Domain Dictionary (http://www.embl-ebi.ac.uk/dali/) not only provides a global structural classification, but also a comprehensive description of families of protein sequences grouped around representative proteins of known structure. The classification will be continuously updated and can serve as a basis for improving our understanding of protein evolution and function and for evolving optimal strategies to complete the map of all natural protein structures.
Affiliation(s)
- L Holm
- EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, United Kingdom
|
73
|
Herrmann R, Reiner B. Mycoplasma pneumoniae and Mycoplasma genitalium: a comparison of two closely related bacterial species. Curr Opin Microbiol 1998; 1:572-9. [PMID: 10066529 DOI: 10.1016/s1369-5274(98)80091-x] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.2] [Indexed: 11/15/2022]
Abstract
The rapid progress in sequencing large quantities of DNA will provide an increasing number of complete genome sequences of closely related bacterial species as well as of pairs of isolates from the same species with different features, such as a pathogenic and an apathogenic representative. This opens the way to apply subtractive comparative analysis as a tool to select from the large pool of all bacterial genes a relatively small set of genes that can be correlated with the expression of a certain phenotype. These selected genes can then be the target for further functional analyses.
Affiliation(s)
- R Herrmann
- Zentrum für Molekulare Biologie Heidelberg, Mikrobiologie, Universität Heidelberg, Im Neuenheimer Feld 282, 69120 Heidelberg, Germany.
|
74
|
Gerstein M, Hegyi H. Comparing genomes in terms of protein structure: surveys of a finite parts list. FEMS Microbiol Rev 1998; 22:277-304. [PMID: 10357579 DOI: 10.1111/j.1574-6976.1998.tb00371.x] [Citation(s) in RCA: 67] [Impact Index Per Article: 2.6] [Indexed: 11/29/2022] Open
Abstract
We give an overview of the emerging field of structural genomics, describing how genomes can be compared in terms of protein structure. As the number of genes in a genome and the total number of protein folds are both quite limited, these comparisons take the form of surveys of a finite parts list, similar in respects to demographic censuses. Fold surveys have many similarities with other whole-genome characterizations, e.g., analyses of motifs or pathways. However, structure has a number of aspects that make it particularly suitable for comparing genomes, namely the way it allows for the precise definition of a basic protein module and the fact that it has a better defined relationship to sequence similarity than does protein function. An essential requirement for a structure survey is a library of folds, which groups the known structures into 'fold families.' This library can be built up automatically using a structure comparison program, and we describe how important objective statistical measures are for assessing similarities within the library and between the library and genome sequences. After building the library, one can use it to count the number of folds in genomes, expressing the results in the form of Venn diagrams and 'top-10' statistics for shared and common folds. Depending on the counting methodology employed, these statistics can reflect different aspects of the genome, such as the amount of internal duplication or gene expression. Previous analyses have shown that the common folds shared between very different microorganisms, i.e., in different kingdoms, have a remarkably similar structure, being comprised of repeated strand-helix-strand super-secondary structure units. A major difficulty with this sort of 'fold-counting' is that only a small subset of the structures in a complete genome are currently known and this subset is prone to sampling bias.
One way of overcoming biases is through structure prediction, which can be applied uniformly and comprehensively to a whole genome. Various investigators have, in fact, already applied many of the existing techniques for predicting secondary structure and transmembrane (TM) helices to the recently sequenced genomes. The results have been consistent: microbial genomes have similar fractions of strands and helices even though they have significantly different amino acid composition. The fraction of membrane proteins with a given number of TM helices falls off rapidly with more TM elements, approximately according to a Zipf law. This latter finding indicates that there is no preference for the highly studied 7-TM proteins in microbial genomes. Continuously updated tables and further information pertinent to this review are available over the web at http://bioinfo.mbb.yale.edu/genome.
Affiliation(s)
- M Gerstein
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA.
|
75
|
Rychlewski L, Zhang B, Godzik A. Fold and function predictions for Mycoplasma genitalium proteins. Fold Des 1998; 3:229-38. [PMID: 9710568 DOI: 10.1016/s1359-0278(98)00034-0] [Citation(s) in RCA: 79] [Impact Index Per Article: 3.0] [Indexed: 02/08/2023]
Abstract
BACKGROUND: Uncharacterized proteins from newly sequenced genomes provide perfect targets for fold and function prediction. RESULTS: For 38% of the entire genome of Mycoplasma genitalium, sequence similarity to a protein with a known structure can be recognized using a new sequence alignment algorithm. When comparing genomes of M. genitalium and Escherichia coli, > 80% of M. genitalium proteins have a significant sequence similarity to a protein in E. coli and there are > 40 examples that have not been recognized before. For all cases of proteins with significant profile similarities, there are strong analogies in their functions, if the functions of both proteins are known. The results presented here and other recent results strongly support the argument that such proteins are actually homologous. Assuming this homology allows one to make tentative functional assignments for > 50 previously uncharacterized proteins, including such intriguing cases as the putative beta-lactam antibiotic resistance protein in M. genitalium. CONCLUSIONS: Using a new profile-to-profile alignment algorithm, the three-dimensional fold can be predicted for almost 40% of proteins from a genome of the small bacterium M. genitalium, and tentative function can be assigned to almost 80% of the entire genome. Some predictions lead to new insights about known functions or point to hitherto unexpected features of M. genitalium.
Affiliation(s)
- L Rychlewski
- Department of Molecular Biology, Scripps Research Institute, La Jolla, CA 92037, USA
|
76
|
Kim SH. Shining a light on structural genomics. Nat Struct Biol 1998; 5 Suppl:643-5. [PMID: 9699614 DOI: 10.1038/1334] [Citation(s) in RCA: 94] [Impact Index Per Article: 3.6] [Indexed: 02/08/2023]
Affiliation(s)
- S H Kim
- Department of Chemistry, E.O. Lawrence Berkeley National Laboratory, University of California, Berkeley 94720, USA.
|
77
|
Huynen M, Doerks T, Eisenhaber F, Orengo C, Sunyaev S, Yuan Y, Bork P. Homology-based fold predictions for Mycoplasma genitalium proteins. J Mol Biol 1998; 280:323-6. [PMID: 9665839 DOI: 10.1006/jmbi.1998.1884] [Citation(s) in RCA: 88] [Impact Index Per Article: 3.4] [Indexed: 11/22/2022]
Abstract
Homology search techniques based on the iterative PSI-BLAST method in combination with various filters for low sequence complexity are applied to assign folds to all Mycoplasma genitalium proteins. The resulting procedure (implemented as a web server) is able to predict at least one domain in 37% of these proteins automatically, with an estimated accuracy higher than 98%. Taking structural features such as coiled coil or transmembrane regions aside, folds can be assigned to more than half of the globular proteins in a bacterium just by iterative sequence comparison.
Affiliation(s)
- M Huynen
- EMBL, Max-Delbrück-Center for Molecular Medicine, Meyerhoftstr.1, Heidelberg, 69012, Germany
|
78
|
Beamer LJ, Fischer D, Eisenberg D. Detecting distant relatives of mammalian LPS-binding and lipid transport proteins. Protein Sci 1998; 7:1643-6. [PMID: 9684900 PMCID: PMC2144061 DOI: 10.1002/pro.5560070721] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.2] [Indexed: 11/10/2022]
Abstract
In mammals, a family of four lipid binding proteins has been previously defined that includes two lipopolysaccharide binding proteins and two lipid transfer proteins. The first member of this family to have its three-dimensional structure determined is bactericidal/permeability-increasing protein (BPI). Using both the sequence and structure of BPI, along with recently developed sequence-sequence and sequence-structure similarity search methods, we have identified 13 distant members of the family in a diverse set of eukaryotes, including rat, chicken, Caenorhabditis elegans, and Biomphalaria glabrata. Although the sequence similarity between these 13 new members and any of the 4 original members of the BPI family is well below the "twilight zone," their high sequence-structure compatibility with BPI indicates they are likely to share its fold. These findings broaden the BPI family to include a member found in retina and brain, and suggest that a primitive member may have contained only one of the two similar domains of BPI.
Affiliation(s)
- L J Beamer
- Biochemistry Department, University of Missouri-Columbia, 65211, USA
|
79
|
Koonin EV, Tatusov RL, Galperin MY. Beyond complete genomes: from sequence to structure and function. Curr Opin Struct Biol 1998; 8:355-63. [PMID: 9666332 DOI: 10.1016/s0959-440x(98)80070-5] [Citation(s) in RCA: 114] [Impact Index Per Article: 4.4] [Indexed: 02/08/2023]
Abstract
Computer analysis of complete prokaryotic genomes shows that microbial proteins are in general highly conserved: approximately 70% of them contain ancient conserved regions. This allows us to delineate families of orthologs across a wide phylogenetic range and, in many cases, predict protein functions with considerable precision. Sequence database searches using newly developed, sensitive algorithms result in the unification of such orthologous families into larger superfamilies sharing common sequence motifs. For many of these superfamilies, prediction of the structural fold and specific amino acid residues involved in enzymatic catalysis is possible. Taken together, sequence and structure comparisons provide a powerful methodology that can successfully complement traditional experimental approaches.
Affiliation(s)
- E V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA.
|
80
|
Abstract
Computational biology exploits the evolutionary connectivity between proteins and protein families to predict structural and functional properties of uncharacterized gene products. In the past year, conceptual and statistical refinements have substantially improved algorithms for the detection of remote homologues. In conjunction with the rapid growth of biological databases, the global organization of proteins into sequence families, functional families and structural families has become both pertinent and feasible.
Affiliation(s)
- L Holm
- European Molecular Biology Laboratory-European Bio-informatics Institute, Cambridge, UK
|
81
|
Abstract
The exponential growth of sequence data does not necessarily lead to an increase in knowledge about the functions of genes and their products. Prediction of function using comparative sequence analysis is extremely powerful but, if not performed appropriately, may also lead to the creation and propagation of assignment errors. While current homology detection methods can cope with the data flow, the identification, verification and annotation of functional features need to be drastically improved.
Affiliation(s)
- P Bork
- EMBL, Heidelberg, Germany.
|
82
|
Affiliation(s)
- B Rost
- European Molecular Biology Laboratory, Heidelberg, Germany.
|
83
|
Léonetti JP, Wong K, Geiduschek EP. Core-sigma interaction: probing the interaction of the bacteriophage T4 gene 55 promoter recognition protein with E.coli RNA polymerase core. EMBO J 1998; 17:1467-75. [PMID: 9482743 PMCID: PMC1170494 DOI: 10.1093/emboj/17.5.1467] [Citation(s) in RCA: 24] [Impact Index Per Article: 0.9] [Indexed: 02/06/2023] Open
Abstract
The bacterial RNA polymerase sigma subunits are key participants in the early steps of RNA synthesis, conferring specificity of promoter recognition, facilitating promoter opening and promoter clearance, and responding to diverse transcriptional regulators. The T4 gene 55 protein (gp55), the sigma protein of the bacteriophage T4 late genes, is one of the smallest and most divergent members of this family. Protein footprinting was used to identify segments of gp55 that become buried upon binding to RNA polymerase core, and are therefore likely to constitute its interface with the core enzyme. Site-directed mutagenesis in two parts of this contact surface generated gene 55 proteins that are defective in polymerase-binding to different degrees. Alignment with the sequences of the sigma proteins and with a recently determined structure of a large segment of sigma70 suggests that the gp55 counterpart of sigma70 regions 2.1 and 2.2 is involved in RNA polymerase core binding, and that sigma70 and gp55 may be structurally similar in this region. The diverse phenotypes of the mutants implicate this region of gp55 in multiple aspects of sigma function.
Affiliation(s)
- J P Léonetti
- Department of Biology and Center for Molecular Genetics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0634, USA.
|