1
|
Yu H, Luscombe NM, Lu HX, Zhu X, Xia Y, Han JDJ, Bertin N, Chung S, Vidal M, Gerstein M. Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res 2004; 14:1107-18. [PMID: 15173116 PMCID: PMC419789 DOI: 10.1101/gr.1774904] [Citation(s) in RCA: 399] [Impact Index Per Article: 20.0] [Indexed: 11/24/2022]
Abstract
Proteins function mainly through interactions, especially with DNA and other proteins. While some large-scale interaction networks are now available for a number of model organisms, their experimental generation remains difficult. Consequently, interolog mapping--the transfer of interaction annotation from one organism to another using comparative genomics--is of significant value. Here we quantitatively assess the degree to which interologs can be reliably transferred between species as a function of the sequence similarity of the corresponding interacting proteins. Using interaction information from Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, and Helicobacter pylori, we find that protein-protein interactions can be transferred when a pair of proteins has a joint sequence identity >80% or a joint E-value <10^-70. (These "joint" quantities are the geometric means of the identities or E-values for the two pairs of interacting proteins.) We generalize our interolog analysis to protein-DNA binding, finding that such interactions are conserved at specific thresholds between 30% and 60% sequence identity depending on the protein family. Furthermore, we introduce the concept of a "regulog"--a conserved regulatory relationship between proteins across different species. We map interologs and regulogs from yeast to a number of genomes with limited experimental annotation (e.g., Arabidopsis thaliana) and make these available through an online database at http://interolog.gersteinlab.org. Specifically, we are able to transfer approximately 90,000 potential protein-protein interactions to the worm. We test a number of these in two-hybrid experiments and are able to verify 45 overlaps, which we show to be statistically significant.
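The "joint" quantities defined in this abstract can be illustrated with a minimal Python sketch. The function names and example values are hypothetical and chosen only for illustration. The thresholds (joint identity above 80%, joint E-value below 10^-70) are the ones quoted in the abstract. Computing the geometric mean of E-values in log space is an implementation choice made here to avoid floating-point underflow, not something the abstract specifies.

```python
import math

def joint_identity(id_a: float, id_b: float) -> float:
    """Geometric mean of two percent identities (one per pair of matched proteins)."""
    return math.sqrt(id_a * id_b)

def joint_evalue(e_a: float, e_b: float) -> float:
    """Geometric mean of two E-values, computed in log10 space to avoid underflow."""
    if e_a == 0.0 or e_b == 0.0:
        # An E-value reported as exactly 0 makes the geometric mean 0 as well.
        return 0.0
    return 10.0 ** ((math.log10(e_a) + math.log10(e_b)) / 2.0)

def is_transferable(id_a: float, id_b: float, e_a: float, e_b: float) -> bool:
    """Apply the thresholds quoted in the abstract (hypothetical helper)."""
    return joint_identity(id_a, id_b) > 80.0 or joint_evalue(e_a, e_b) < 1e-70

# Example with made-up values: identities of 90% and 85% pass the identity criterion.
print(is_transferable(90.0, 85.0, 1e-60, 1e-65))
```

Because the geometric mean is dominated by the weaker of the two matches more strongly than an arithmetic mean would be, one poorly conserved partner pulls the joint score down noticeably.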
Affiliation(s)
- Haiyuan Yu
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
|
2
|
|
3
|
Abstract
The level of sequence similarity that implies similarity in protein structure is well established. Recently, many groups proposed thresholds for similarity in sequence implying similarity in enzymatic function. All previous results suggest the strong conservation of enzymatic function above levels of 50% pairwise sequence identity. Here, I argue that all groups substantially overestimated the conservation of enzyme function because their data sets were either too biased or too small. An unbiased analysis suggested that less than 30% of the pair fragments above 50% sequence identity have entirely identical EC numbers. Another surprising finding was that even BLAST E-values below 10^-50 did not suffice to automatically transfer enzyme function without errors. As expected, most misclassifications originated from similarities in relatively short regions and/or from transferring annotations for different domains. Neither problem can be corrected easily by adjusting the thresholds for automatic transfer of genome annotations. A score relating sequence identity to alignment length (distance from HSSP-threshold) outperformed statistical BLAST scores for high sequence similarity. In particular, the distance score allowed error-free transfer of enzyme function for the 10% most similar enzyme pairs. The results illustrated how difficult it is to assess the conservation of protein function and to guarantee error-free genome annotations in general: sets with millions of pair comparisons might not suffice to arrive at statistically significant conclusions. In practice, the revised detailed estimates for the sequence conservation of enzyme function may provide important benchmarks for everyday sequence analysis and for more cautious automatic genome annotations.
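The "distance from HSSP-threshold" score mentioned in this abstract can be sketched as follows. The abstract does not give the formula; the length-dependent threshold curve used here is the one published in Rost's 1999 revision of the HSSP curve (Protein Eng 12:85-94) and is an assumption brought in from that paper. The function names are hypothetical.

```python
import math

def hssp_threshold(length: int) -> float:
    """Percent-identity threshold for an alignment of `length` residues.

    Assumed form (Rost 1999 HSSP curve, with offset n = 0):
      100                                   for L <= 11
      480 * L ** (-0.32 * (1 + e^(-L/1000))) for 11 < L <= 450
      19.5                                  for L > 450
    """
    if length <= 11:
        return 100.0
    if length > 450:
        return 19.5
    exponent = -0.32 * (1.0 + math.exp(-length / 1000.0))
    return 480.0 * length ** exponent

def hssp_distance(percent_identity: float, length: int) -> float:
    """Signed distance of an alignment from the threshold curve (positive = above)."""
    return percent_identity - hssp_threshold(length)

# Example with made-up values: 50% identity over a 100-residue alignment.
print(round(hssp_distance(50.0, 100), 1))
```

The point of the score is that the same percent identity counts for more over a long alignment than over a short one, which addresses the short-region misclassifications the abstract describes.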
Affiliation(s)
- Burkhard Rost
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA.
|
4
|
Karmirantzou M, Hamodrakas SJ. A Web-based classification system of DNA-binding protein families. Protein Engineering 2001; 14:465-72. [PMID: 11522919 DOI: 10.1093/protein/14.7.465] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/14/2022]
Abstract
Rational classification of proteins encoded in sequenced genomes is critical for making the genome sequences maximally useful for functional and evolutionary studies. The family of DNA-binding proteins is one of the most populated and studied amongst the various genomes of bacteria, archaea and eukaryotes and the Web-based system presented here is an approach to their classification. The DnaProt resource is an annotated and searchable collection of protein sequences for the families of DNA-binding proteins. The database contains 3238 full-length sequences (retrieved from the SWISS-PROT database, release 38) that include, at least, a DNA-binding domain. Sequence entries are organized into families defined by PROSITE patterns, PRINTS motifs and de novo excised signatures. Combining global similarities and functional motifs into a single classification scheme, DNA-binding proteins are classified into 33 unique classes, which helps to reveal comprehensive family relationships. To maximize family information retrieval, DnaProt contains a collection of multiple alignments for each DNA-binding family while the recognized motifs can be used as diagnostically functional fingerprints. All available structural class representatives have been referenced. The resource was developed as a Web-based management system for online free access of customized data sets. Entries are fully hyperlinked to facilitate easy retrieval of the original records from the source databases while functional and phylogenetic annotation will be applied to newly sequenced genomes. The database is freely available for online search of a library containing specific patterns of the identified DNA-binding protein classes and retrieval of individual entries from our WWW server (http://kronos.biol.uoa.gr/~mariak/dbDNA.html).
Affiliation(s)
- M Karmirantzou
- Faculty of Biology, Department of Cell Biology and Biophysics, University of Athens, Panepistimiopolis, Athens 157 01, Greece
|
5
|
Wilson CA, Kreychman J, Gerstein M. Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J Mol Biol 2000; 297:233-49. [PMID: 10704319 DOI: 10.1006/jmbi.2000.3550] [Citation(s) in RCA: 277] [Impact Index Per Article: 11.5] [Indexed: 11/22/2022]
Abstract
Measuring in a quantitative, statistical sense the degree to which structural and functional information can be "transferred" between pairs of related protein sequences at various levels of similarity is an essential prerequisite for robust genome annotation. To this end, we performed pairwise sequence, structure and function comparisons on approximately 30,000 pairs of protein domains with known structure and function. Our domain pairs, which are constructed according to the SCOP fold classification, range in similarity from just sharing a fold, to being nearly identical. Our results show that traditional scores for sequence and structure similarity have the same basic exponential relationship as observed previously, with structural divergence, measured in RMS, being exponentially related to sequence divergence, measured in percent identity. However, as the scale of our survey is much larger than any previous investigations, our results have greater statistical weight and precision. We have been able to express the relationship of sequence and structure similarity using more "modern scores," such as Smith-Waterman alignment scores and probabilistic P-values for both sequence and structure comparison. These modern scores address some of the problems with traditional scores, such as determining a conserved core and correcting for length dependency; they enable us to phrase the sequence-structure relationship in more precise and accurate terms. We found that the basic exponential sequence-structure relationship is very general: the same essential relationship is found in the different secondary-structure classes and is evident in all the scoring schemes. To relate function to sequence and structure we assigned various levels of functional similarity to the domain pairs, based on a simple functional classification scheme. 
This scheme was constructed by combining and augmenting annotations in the enzyme and fly functional classifications and comparing subsets of these to the Escherichia coli and yeast classifications. We found sigmoidal relationships between similarity in function and sequence, with clear thresholds for different levels of functional conservation. For pairs of domains that share the same fold, precise function appears to be conserved down to approximately 40% sequence identity, whereas broad functional class is conserved to approximately 25%. Interestingly, percent identity is more effective at quantifying functional conservation than the more modern scores (e.g. P-values). Results of all the pairwise comparisons and our combined functional classification scheme for protein structures can be accessed from a web database at http://bioinfo.mbb.yale.edu/align. Copyright 2000 Academic Press.
Affiliation(s)
- C A Wilson
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
|
6
|
Abstract
The recent growth in structural data, and ensuing analyses, have revealed the structural and functional versatility of protein families. With respect to enzymes, local active-site mutations, variations in surface loops and recruitment of additional domains accommodate the diverse substrate specificities and catalytic activities observed within several superfamilies. Conversely, some functions have more than one structural solution, having evolved independently several times during evolution. Combined with the existence of multi-functional genes, which have arisen by gene recruitment, these phenomena must be considered in the process of genome annotation.
Affiliation(s)
- A E Todd
- Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, Gower Street, London, WC1E 6BT, UK
|
7
|
Hegyi H, Gerstein M. The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J Mol Biol 1999; 288:147-64. [PMID: 10329133 DOI: 10.1006/jmbi.1999.2661] [Citation(s) in RCA: 269] [Impact Index Per Article: 10.8] [Indexed: 11/22/2022]
Abstract
For most proteins in the genome databases, function is predicted via sequence comparison. In spite of the popularity of this approach, the extent to which it can be reliably applied is unknown. We address this issue by systematically investigating the relationship between protein function and structure. We focus initially on enzymes functionally classified by the Enzyme Commission (EC) and relate these to domains structurally classified by the SCOP database. We find that the major SCOP fold classes have different propensities to carry out certain broad categories of functions. For instance, alpha/beta folds are disproportionately associated with enzymes, especially transferases and hydrolases, and all-alpha and small folds with non-enzymes, while alpha+beta folds have an equal tendency either way. These observations for the database overall are largely true for specific genomes. We focus, in particular, on yeast, analyzing it with many classifications in addition to SCOP and EC (i.e. COGs, CATH, MIPS), and find clear tendencies for fold-function association across a broad spectrum of functions. Analysis with the COGs scheme also suggests that the functions of the most ancient proteins are more evenly distributed among different structural classes than those of more modern ones. For the database overall, we identify the most versatile functions, i.e. those that are associated with the most folds, and the most versatile folds, associated with the most functions. The two most versatile enzymatic functions (hydro-lyases and O-glycosyl glucosidases) are associated with seven folds each. The five most versatile folds (TIM-barrel, Rossmann, ferredoxin, alpha-beta hydrolase, and P-loop NTP hydrolase) are all mixed alpha-beta structures. They stand out as generic scaffolds, accommodating from six to as many as 16 functions (for the exceptional TIM-barrel). At the conclusion of our analysis we are able to construct a graph giving the chance that a functional annotation can be reliably transferred at different degrees of sequence and structural similarity. Supplemental information is available from http://bioinfo.mbb.yale.edu/genome/foldfunc.
Affiliation(s)
- H Hegyi
- Department of Molecular Biophysics & Biochemistry, Yale University, 266 Whitney Avenue, New Haven, CT 06520, USA
|
8
|
Abstract
We have developed and implemented a method for computational gene identification called GIN (gene identification using neural nets and homology information) that has been particularly designed to avoid false positive predictions. It correctly predicts 55% of all genes tested with a specificity of 99%, while reaching an overall accuracy of 92% on a benchmark set of 570 vertebrate genes constructed by Burset and Guigo. The method combines homology searches in protein and expressed sequence tag databases with several neural networks designed to recognize start codons, Poly(A) signals, stop codons, and splice sites. Predicted exons are assembled into genes using a homology-based scoring function. GIN is able to recognize multiple genes within genomic DNA, as demonstrated by the identification of a globin gene (gamma-globin-1(G)) that has not been annotated as a coding region in the widely used test set of Burset and Guigo. Furthermore, GIN identifies more than 107 other protein hits in noncoding regions and classifies them into possible pseudogenes or splice variants.
Affiliation(s)
- Y Cai
- EMBL, Meyerhofstrasse 1, Heidelberg, 69012, Germany
|
9
|
Maftahi M, Gaillardin C, Nicaud JM. Sticky-end polymerase chain reaction method for systematic gene disruption in Saccharomyces cerevisiae. Yeast 1998. [DOI: 10.1002/(sici)1097-0061(199607)12:9<859::aid-yea978>3.0.co;2-q] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.2] [Indexed: 11/09/2022]
|
10
|
Abstract
Eight microbial genomes are compared in terms of protein structure. Specifically, yeast, H. influenzae, M. genitalium, M. jannaschii, Synechocystis, M. pneumoniae, H. pylori, and E. coli are compared in terms of patterns of fold usage--whether a given fold occurs in a particular organism. Of the approximately 340 soluble protein folds currently in the structure databank (PDB), 240 occur in at least one of the eight genomes, and 30 are shared amongst all eight. The shared folds are depleted in all-helical structure and enriched in mixed helix-sheet structure compared to the folds in the PDB. The top-10 most common of the shared 30 are enriched in superfolds, uniting many non-homologous sequence families, and are especially similar in overall architecture--eight having helices packed onto a central sheet. They are also very different from the common folds in the PDB, highlighting databank biases. Folds can be ranked in terms of expression as well as genome duplication. In yeast the top-10 most highly expressed folds are considerably different from the most highly duplicated folds. A tree can be constructed grouping genomes in terms of their shared folds. This has a remarkably similar topology to more conventional classifications, based on very different measures of relatedness. Finally, folds of membrane proteins can be analyzed through transmembrane-helix (TM) prediction. All the genomes appear to have similar usage patterns for these folds, with the occurrence of a particular fold falling off rapidly with increasing numbers of TM-elements, according to a "Zipf-like" law. This implies there are no marked preferences for proteins with particular numbers of TM-helices (e.g. 7-TM) in microbial genomes.
Affiliation(s)
- M Gerstein
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA.
|
11
|
Sahasrabudhe PV, Tejero R, Kitao S, Furuichi Y, Montelione GT. Homology modeling of an RNP domain from a human RNA-binding protein: Homology-constrained energy optimization provides a criterion for distinguishing potential sequence alignments. Proteins 1998. [DOI: 10.1002/(sici)1097-0134(19981201)33:4<558::aid-prot8>3.0.co;2-z] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.3] [Indexed: 11/08/2022]
|
12
|
Bork P, Dandekar T, Diaz-Lazcoz Y, Eisenhaber F, Huynen M, Yuan Y. Predicting function: from genes to genomes and back. J Mol Biol 1998; 283:707-25. [PMID: 9790834 DOI: 10.1006/jmbi.1998.2144] [Citation(s) in RCA: 262] [Impact Index Per Article: 10.1] [Indexed: 11/22/2022]
Abstract
Predicting function from sequence using computational tools is a highly complicated procedure that is generally done for each gene individually. This review focuses on the added value that is provided by completely sequenced genomes in function prediction. Various levels of sequence annotation and function prediction are discussed, ranging from genomic sequence to that of complex cellular processes. Protein function is currently best described in the context of molecular interactions. In the near future it will be possible to predict protein function in the context of higher order processes such as the regulation of gene expression, metabolic pathways and signalling cascades. The analysis of such higher levels of function description uses, besides the information from completely sequenced genomes, also the additional information from proteomics and expression data. The final goal will be to elucidate the mapping between genotype and phenotype.
Affiliation(s)
- P Bork
- European Molecular Biology Laboratory, Meyerhofstr. 1, Heidelberg, PF 10.2209, Germany.
|
13
|
Gerstein M, Hegyi H. Comparing genomes in terms of protein structure: surveys of a finite parts list. FEMS Microbiol Rev 1998; 22:277-304. [PMID: 10357579 DOI: 10.1111/j.1574-6976.1998.tb00371.x] [Citation(s) in RCA: 67] [Impact Index Per Article: 2.6] [Indexed: 11/29/2022]
Abstract
We give an overview of the emerging field of structural genomics, describing how genomes can be compared in terms of protein structure. As the number of genes in a genome and the total number of protein folds are both quite limited, these comparisons take the form of surveys of a finite parts list, similar in respects to demographic censuses. Fold surveys have many similarities with other whole-genome characterizations, e.g., analyses of motifs or pathways. However, structure has a number of aspects that make it particularly suitable for comparing genomes, namely the way it allows for the precise definition of a basic protein module and the fact that it has a better defined relationship to sequence similarity than does protein function. An essential requirement for a structure survey is a library of folds, which groups the known structures into 'fold families.' This library can be built up automatically using a structure comparison program, and we describe how important objective statistical measures are for assessing similarities within the library and between the library and genome sequences. After building the library, one can use it to count the number of folds in genomes, expressing the results in the form of Venn diagrams and 'top-10' statistics for shared and common folds. Depending on the counting methodology employed, these statistics can reflect different aspects of the genome, such as the amount of internal duplication or gene expression. Previous analyses have shown that the common folds shared between very different microorganisms, i.e., in different kingdoms, have a remarkably similar structure, being comprised of repeated strand-helix-strand super-secondary structure units. A major difficulty with this sort of 'fold-counting' is that only a small subset of the structures in a complete genome are currently known and this subset is prone to sampling bias. 
One way of overcoming biases is through structure prediction, which can be applied uniformly and comprehensively to a whole genome. Various investigators have, in fact, already applied many of the existing techniques for predicting secondary structure and transmembrane (TM) helices to the recently sequenced genomes. The results have been consistent: microbial genomes have similar fractions of strands and helices even though they have significantly different amino acid composition. The fraction of membrane proteins with a given number of TM helices falls off rapidly with more TM elements, approximately according to a Zipf law. This latter finding indicates that there is no preference for the highly studied 7-TM proteins in microbial genomes. Continuously updated tables and further information pertinent to this review are available over the web at http://bioinfo.mbb.yale.edu/genome.
Affiliation(s)
- M Gerstein
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA.
|
14
|
Sowdhamini R, Burke DF, Huang JF, Mizuguchi K, Nagarajaram HA, Srinivasan N, Steward RE, Blundell TL. CAMPASS: a database of structurally aligned protein superfamilies. Structure 1998; 6:1087-94. [PMID: 9753697 DOI: 10.1016/s0969-2126(98)00110-5] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.0] [Indexed: 11/26/2022]
Affiliation(s)
- R Sowdhamini
- Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge, CB2 1GA, UK
|
15
|
Frishman D, Mewes HW. Protein structural classes in five complete genomes. Nature Structural Biology 1997; 4:626-8. [PMID: 9253410 DOI: 10.1038/nsb0897-626] [Citation(s) in RCA: 59] [Impact Index Per Article: 2.2] [Indexed: 02/05/2023]
Abstract
The predicted distribution of globular proteins over folding types in five complete genomes differs from the tendencies observed in known protein structures. The ratio between the number of predicted membrane and globular proteins is conserved.
|
16
|
Abstract
An ever increasing number of protein sequences are being compared, partly because of the availability of full sets of protein sequences from several completed genome-sequencing projects. The resulting problem of scale has shifted the emphasis of sequence analysis method development from sensitivity and flexibility, which relies on manual intervention and interpretation, to the automatic generation of results of known reliability.
Affiliation(s)
- T J Hubbard
- Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK.
|
17
|
Krause DC, Proft T, Hedreyda CT, Hilbert H, Plagens H, Herrmann R. Transposon mutagenesis reinforces the correlation between Mycoplasma pneumoniae cytoskeletal protein HMW2 and cytadherence. J Bacteriol 1997; 179:2668-77. [PMID: 9098066 PMCID: PMC179017 DOI: 10.1128/jb.179.8.2668-2677.1997] [Citation(s) in RCA: 70] [Impact Index Per Article: 2.6] [Indexed: 02/04/2023]
Abstract
A new genetic locus associated with Mycoplasma pneumoniae cytadherence was previously identified by transposon mutagenesis with Tn4001. This locus maps approximately 160 kbp from the genes encoding cytadherence-associated proteins HMW1 and HMW3, and yet insertions therein result in loss of these proteins and a hemadsorption-negative (HA-) phenotype, prompting the designation cytadherence-regulatory locus (crl). In the current study, passage of transformants in the absence of antibiotic selection resulted in loss of the transposon, a wild-type protein profile, and a HA+ phenotype, underscoring the correlation between crl and M. pneumoniae cytadherence. Nucleotide sequence analysis of crl revealed open reading frames (ORFs) orfp65, orfp216, orfp41, and orfp24, arranged in tandem and flanked by a promoter-like and a terminator-like sequence, suggesting a single transcriptional unit, the P65 operon. The 5' end of orfp65 mRNA was mapped by primer extension, and a likely promoter was identified just upstream. The product of each ORF was identified by using antisera prepared against fusion proteins. The previously characterized surface protein P65 is encoded by orfp65, while the 190,000 Mr cytadherence-associated protein HMW2 is a product of orfp216. Proteins with sizes of 47,000 and 41,000 Mr and unknown function were identified for orfp41 and orfp24, respectively. Structural analyses of HMW2 predict a periodicity highly characteristic of a coiled-coil conformation and five leucine zipper motifs, indicating that HMW2 probably forms dimers in vivo, which is consistent with a structural role in cytadherence. Each transposon insertion mapped to orfp216 but affected the levels of all products of the P65 operon. HMW2 is thought to form a disulfide-linked dimer, formerly designated HMW5, and examination of an hmw2 deletion mutant confirms that HMW5 is a product of the hmw2 gene.
Affiliation(s)
- D C Krause
- Department of Microbiology, University of Georgia, Athens 30602, USA.
|
18
|
Cedano J, Aloy P, Pérez-Pons JA, Querol E. Relation between amino acid composition and cellular location of proteins. J Mol Biol 1997; 266:594-600. [PMID: 9067612 DOI: 10.1006/jmbi.1996.0804] [Citation(s) in RCA: 282] [Impact Index Per Article: 10.4] [Indexed: 02/03/2023]
Abstract
A correlation analysis of the amino acid composition and the cellular location of a protein is presented. The statistical analysis discriminates among the following five protein classes: integral membrane proteins, anchored membrane proteins, extracellular proteins, intracellular proteins and nuclear proteins. This segregation into protein classes related to their location can help researchers to design experimental work for testing hypotheses in order to find out the functionality of a reading frame in search of function. A program (ProtLock) to predict the cellular location of a protein has been designed.
Affiliation(s)
- J Cedano
- Departament de Bioquímica i Biologia Molecular, Universitat Autonoma de Barcelona, Bellaterra, Spain
|
19
|
Bazan JF, Koch-Nolte F. Sequence and structural links between distant ADP-ribosyltransferase families. Advances in Experimental Medicine and Biology 1997; 419:99-107. [PMID: 9193642 DOI: 10.1007/978-1-4419-8632-0_12] [Citation(s) in RCA: 39] [Impact Index Per Article: 1.4] [Indexed: 02/04/2023]
Abstract
The low resolution structure of the Pseudomonas aeruginosa exotoxin A (ETA) presented in 1986 provided the first tantalizing three-dimensional view of an ADP-ribosyl-transferase (ADPRT) catalytic domain. The major features of this protein fold have recurred in the more recently solved crystal structures of the cholera toxin-related heat-labile enterotoxin (LT), diphtheria toxin (DT) and pertussis toxin (PT). A core set of alpha + beta elements define a minimal, conserved scaffold with remarkably plastic sequence requirements--only a single glutamic acid residue critical to catalytic activity is invariant. Other interchangeable residues in locations important for catalysis and binding are suggested by the cocrystal structures of DT with the inhibitor ApUp, ETA with bound AMP and nicotinamide, and DT with substrate NAD--in close accord with labeling and mutagenic data. Faint sequence resemblances that were earlier noticed among prokaryotic ADPRTs have now been securely extended by the structural concordance between toxin folds; more recently, eukaryotic ADPRTs have surfaced and their sequences can be reliably threaded into the conserved core fold. We will briefly summarize efforts in Palo Alto and Hamburg to explore these latter relationships, and to mount a rigorous search for new ADPRT families in the growing sequence databases.
Affiliation(s)
- J F Bazan
- Department of Molecular Biology, DNAX Research Institute, Palo Alto, California 94304-1104, USA
|
20
|
Maftahi M, Gaillardin C, Nicaud JM. Sticky-end polymerase chain reaction method for systematic gene disruption in Saccharomyces cerevisiae. Yeast 1996; 12:859-68. [PMID: 8840503 DOI: 10.1002/(sici)1097-0061(199607)12:9<859::aid-yea978>3.0.co;2-q] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.3] [Indexed: 02/02/2023]
Abstract
We describe a new procedure for the generation of plasmids containing a large promoter and terminator region of a gene of interest, useful for gene disruption. In a two-step polymerase chain reaction (PCR), a fragment, corresponding to the terminator and promoter regions separated by a 16 bp sequence containing a rare restriction site (e.g. AscI), is synthesized (T-P fragment). This PCR fragment is cloned in vectors presenting a rare blunt-end cloning site and a yeast marker for selection in Saccharomyces cerevisiae (TRP1, HIS3 and KanMX). The final plasmids are used directly for gene disruption after linearization by the enzyme (e.g. AscI) specific for the rare restriction site. This approach was used to disrupt three open reading frames identified during the sequencing of COS14-1 from chromosome XIV of S. cerevisiae.
Affiliation(s)
- M Maftahi
- Institut National Agronomique Paris-Grignon, Laboratoire de Génétique Moléculaire et Cellulaire, INRA CNRS, Thiverval-Grignon, France
|
21
|
Guimarães MJ, Bazan JF, Castagnola J, Diaz S, Copeland NG, Gilbert DJ, Jenkins NA, Varki A, Zlotnik A. Molecular cloning and characterization of lysosomal sialic acid O-acetylesterase. J Biol Chem 1996; 271:13697-705. [PMID: 8662838 DOI: 10.1074/jbc.271.23.13697] [Citation(s) in RCA: 23] [Impact Index Per Article: 0.8] [Indexed: 02/01/2023]
Abstract
O-Acetylation and de-O-acetylation of sialic acids have been implicated in the regulation of a variety of biological phenomena, including endogenous lectin recognition, tumor antigenicity, virus binding, and complement activation. Applying a strategy designed to identify genes preferentially expressed in active sites of embryonic hematopoiesis, we isolated a novel cDNA from the pluripotent hematopoietic cell line FDCPmixA4 whose open reading frame contained sequences homologous to peptide fragments of a lysosomal sialic acid O-acetylesterase (Lse) previously purified from rat liver, but with no evident similarity to endoplasmic reticulum-derived acetylesterases. The expressed Lse protein exhibits sialic-acid O-acetylesterase activity that is not attributable to a typical serine esterase active site. lse expression is spatially and temporally restricted during embryogenesis, and its mRNA levels correlate with differences in O-acetylesterase activity described in adult tissues and blood cell types. Using interspecific backcross analysis, we further mapped the lse gene to the central region of mouse chromosome 9. This constitutes the first report on the molecular cloning of a sialic acid-specific O-acetylesterase in vertebrates and suggests novel roles for the 9-O-acetyl modification of sialic acids during the development and differentiation of mammalian organisms.
Affiliation(s)
- M J Guimarães
- DNAX Research Institute of Molecular and Cellular Biology, Palo Alto, California 94304, USA
|
22
|
Abstract
Every sequence comparison method requires a set of scores. For aligning protein sequences, substitution scores are based on models of amino acid conservation and properties, and matrices of these scores have substantially improved in recent years. Position-specific scoring matrices provide representations of sequence families that are capable of detecting subtle similarities. Comprehensive evaluations can effectively guide the choice of scores for sequence alignment and searching applications, including those that aid in the prediction of protein structures.
Affiliation(s)
- S Henikoff
- Howard Hughes Medical Institute, Basic Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98104, USA.
|
23
|
Hilbert H, Himmelreich R, Plagens H, Herrmann R. Sequence analysis of 56 kb from the genome of the bacterium Mycoplasma pneumoniae comprising the dnaA region, the atp operon and a cluster of ribosomal protein genes. Nucleic Acids Res 1996; 24:628-39. [PMID: 8604303 PMCID: PMC145699 DOI: 10.1093/nar/24.4.628] [Citation(s) in RCA: 39] [Impact Index Per Article: 1.4] [Indexed: 01/31/2023]
Abstract
To sequence the entire 800 kilobase pair genome of the bacterium Mycoplasma pneumoniae, a plasmid library was established which contained the majority of the EcoRI fragments from M. pneumoniae. The EcoRI fragments were subcloned from an ordered cosmid library comprising the complete M. pneumoniae genome. Individual plasmid clones were sequenced in an ordered fashion mainly by primer walking. We report here the initial results from the sequence analysis of approximately 56 kb comprising the dnaA region as a potential origin of replication, the ATPase operon and a region coding for a cluster of ribosomal protein genes. The data were compared with the corresponding genes/operons from Bacillus subtilis, Escherichia coli, Mycoplasma capricolum and Mycoplasma gallisepticum.
Affiliation(s)
- H Hilbert
- Zentrum für Molekulare Biologie Heidelberg, Universität Heidelberg, Germany
|
24
|
Abstract
An adequate set of computer procedures tailored to address the task of genome-scale analysis of protein sequences will greatly increase the beneficial impact of the genome sequencing projects on the progress of biological research. This is especially pertinent given the fact that, for model organisms, one-half or more of the putative gene products have not been functionally characterized. Here we describe several programs that may comprise the core of such a set and their application to the analysis of about 3000 proteins comprising 75% of the E. coli gene products. We find that the protein sequences encoded in this model genome are a rich source of information, with biologically relevant similarities detected for more than 80% of them. In the majority of cases, these similarities become evident directly from the results of BLAST searches. However, methods for motif analysis provide for a significant increase in search sensitivity and are particularly important for the detection of ancient conserved regions. As a result of sequence similarity analysis, generalized functional predictions can be made for the majority of uncharacterized ORF products, allowing efficient focusing of experimental effort. Clustering of the E. coli proteins on the basis of sequence similarity shows that almost one-half of the bacterial proteins have at least one paralog and that the likelihood that a protein belongs to a small or a large cluster depends on the function of this particular protein.
Affiliation(s)
- E V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
|
25
|
Affiliation(s)
- P Bork
- European Molecular Biology Laboratory, Heidelberg, Germany
|
26
|
Abstract
The analysis of the 269 open reading frames of yeast chromosome VIII by computational methods has yielded 24 new significant sequence similarities to proteins of known function. The resulting predicted functions include three particularly interesting cases of translation-associated proteins: peptidyl-tRNA hydrolase, a ribosome recycling factor homologue, and a protein similar to cytochrome b translational activator CBS2. The methodological limits of the meaningful transfer of functional information between distant homologues are discussed.
Affiliation(s)
- C Ouzounis
- European Molecular Biology Laboratory-Heidelberg, Germany
|
27
|
Lynch AS, Briggs D, Hope IA. Developmental expression pattern screen for genes predicted in the C. elegans genome sequencing project. Nat Genet 1995; 11:309-13. [PMID: 7581455 DOI: 10.1038/ng1195-309] [Citation(s) in RCA: 48] [Impact Index Per Article: 1.7] [Indexed: 01/26/2023]
Abstract
Maximum use should be made of information generated in the genome sequencing projects. Toward this end, we have initiated a genome sequence-based, expression pattern screen of genes predicted from the Caenorhabditis elegans genome sequence data. We examined beta-galactosidase expression patterns in C. elegans lines transformed with lacZ reporter gene fusions constructed using predicted C. elegans gene promoter regions. Of the predicted genes in the cosmids analysed so far, 67% are amenable to the approach and 54% of examined genes yielded a developmental expression pattern. Expression pattern information is being made generally available using computer databases.
Affiliation(s)
- A S Lynch
- Department of Pure and Applied Biology, University of Leeds, UK
|
28
|
Bork P, Holm L, Koonin EV, Sander C. The cytidylyltransferase superfamily: identification of the nucleotide-binding site and fold prediction. Proteins 1995; 22:259-66. [PMID: 7479698 DOI: 10.1002/prot.340220306] [Citation(s) in RCA: 72] [Impact Index Per Article: 2.5] [Indexed: 01/25/2023]
Abstract
The crystal structure of glycerol-3-phosphate cytidylyltransferase from B. subtilis (TagD) is about to be solved. Here, we report a testable structure prediction based on the identification by sequence analysis of a superfamily of functionally diverse but structurally similar nucleotide-binding enzymes. We predict that TagD is a member of this family. The most conserved region in this superfamily resembles the ATP-binding HiGH motif of class I aminoacyl-tRNA synthetases. The predicted secondary structure of cytidylyltransferase and its homologues is compatible with the alpha/beta topography of the class I aminoacyl-tRNA synthetases. The hypothesis of similarity of fold is strengthened by sequence-structure alignment and 3D model building using the known structure of tyrosyl tRNA synthetase as template. The proposed 3D model of TagD is plausible both structurally, with a well packed hydrophobic core, and functionally, as the most conserved residues cluster around the putative nucleotide binding site. If correct, the model would imply a very ancient evolutionary link between class I tRNA synthetases and the novel cytidylyltransferase superfamily.
Affiliation(s)
- P Bork
- EMBL, Heidelberg, Germany
|
29
|
Bork P, Ouzounis C, Casari G, Schneider R, Sander C, Dolan M, Gilbert W, Gillevet PM. Exploring the Mycoplasma capricolum genome: a minimal cell reveals its physiology. Mol Microbiol 1995; 16:955-67. [PMID: 7476192 DOI: 10.1111/j.1365-2958.1995.tb02321.x] [Citation(s) in RCA: 68] [Impact Index Per Article: 2.3] [Indexed: 01/25/2023]
Abstract
We report on the analysis of 214 kb of the parasitic eubacterium Mycoplasma capricolum sequenced by genomic walking techniques. The 287 putative proteins detected to date represent about half of the estimated total number of 500 predicted for this organism. A large fraction of these (75%) can be assigned a likely function as a result of similarity searches. Several important features of the functional organization of this small genome are already apparent. Among these are (i) the expected relatively large number of enzymes involved in metabolic transport and activation, for efficient use of host cell nutrients; (ii) the presence of anabolic enzymes; (iii) the unexpected diversity of enzymes involved in DNA replication and repair; and (iv) a sizeable number of orthologues (82 so far) in Escherichia coli. This survey is beginning to provide a detailed view of how M. capricolum manages to maintain essential cellular processes with a genome much smaller than that of its bacterial relatives.
Affiliation(s)
- P Bork
- Max-Delbrück-Centre for Molecular Medicine, Berlin-Buch, Germany
|
30
|
Polycystic kidney disease: the complete structure of the PKD1 gene and its protein. The International Polycystic Kidney Disease Consortium. Cell 1995; 81:289-98. [PMID: 7736581 DOI: 10.1016/0092-8674(95)90339-9] [Citation(s) in RCA: 484] [Impact Index Per Article: 16.7] [Indexed: 01/26/2023]
Abstract
Mutations in the PKD1 gene are the most common cause of autosomal dominant polycystic kidney disease (ADPKD). Other PKD1-like loci on chromosome 16 are approximately 97% identical to PKD1. To determine the authentic PKD1 sequence, we obtained the genomic sequence of the PKD1 locus and assembled a PKD1 transcript from the sequence of 46 exons. The 14.5 kb PKD1 transcript encodes a 4304 amino acid protein that has a novel domain architecture. The amino-terminal half of the protein consists of a mosaic of previously described domains, including leucine-rich repeats flanked by characteristic cysteine-rich structures, LDL-A and C-type lectin domains, and 14 units of a novel 80 amino acid domain. The presence of these domains suggests that the PKD1 protein is involved in adhesive protein-protein and protein-carbohydrate interactions in the extracellular compartment. We propose a hypothesis that links the predicted properties of the protein with the diverse phenotypic features of ADPKD.
|
31
|
Affiliation(s)
- P Bork
- EMBL, Heidelberg, Germany
|
32
|
Abstract
Regular polypeptide conformations include secondary structural motifs such as α-helices and β-strands. The occurrence of some regular conformation is usually deduced from a local analysis of dihedral angles. However, the value of a dihedral angle in itself does not provide any information on the conformation's "shape." This drawback can be circumvented with global, rather than local, macromolecular shape descriptors. Recently, fractal exponents have been proposed as a source of such descriptors. Yet, this approach does not fully capture all essential shape features, since protein backbones are not fractal. In this work, we deal instead with a more "natural" characterization of the polymer's global shape that uses both the chain's geometry and "topology." For the geometry, we study the behaviour of molecular size and anisometry. For the chain's folding features, we study the self-entanglements in a polymer fold. We compute these descriptors for all relevant secondary structural motifs. By using self-entanglements and molecular geometry, we provide a view of secondary structure that is both conceptually appealing and also more discriminating than previous ones in the literature. Keywords: molecular shape analysis, protein secondary structure, self-entanglements.
|
33
|
Abstract
As the protein sequence and structure databases expand rapidly a better understanding of the relationships between proteins is required. A classification is considered that extends the sequence-based superfamilies to include proteins with similar function and three-dimensional structures but no sequence similarity. So far there are only nine protein folds known to recur in proteins having neither sequence nor functional similarity. These folds dominate the structure database, representing more than 30 per cent of all determined structures. This observation has implications for protein-fold recognition.
Affiliation(s)
- C A Orengo
- Biochemistry and Molecular Biology Department, University College London, UK
|
34
|
Wach A, Brachat A, Pöhlmann R, Philippsen P. New heterologous modules for classical or PCR-based gene disruptions in Saccharomyces cerevisiae. Yeast 1994; 10:1793-808. [PMID: 7747518 DOI: 10.1002/yea.320101310] [Citation(s) in RCA: 2170] [Impact Index Per Article: 72.3] [Indexed: 01/26/2023]
Abstract
We have constructed and tested a dominant resistance module, for selection of S. cerevisiae transformants, which consists entirely of heterologous DNA. This kanMX module contains the known kanr open reading-frame of the E. coli transposon Tn903 fused to transcriptional and translational control sequences of the TEF gene of the filamentous fungus Ashbya gossypii. This hybrid module permits efficient selection of transformants resistant against geneticin (G418). We also constructed a lacZMT reporter module in which the open reading-frame of the E. coli lacZ gene (lacking the first 9 codons) is fused at its 3' end to the S. cerevisiae ADH1 terminator. KanMX and the lacZMT module, or both modules together, were cloned in the center of a new multiple cloning sequence comprising 18 unique restriction sites flanked by Not I sites. Using the double module for constructions of in-frame substitutions of genes, only one transformation experiment is necessary to test the activity of the promoter and to search for phenotypes due to inactivation of this gene. To allow for repeated use of the G418 selection some kanMX modules are flanked by 470 bp direct repeats, promoting in vivo excision with frequencies of 10(-3) to 10(-4). The 1.4 kb kanMX module was also shown to be very useful for PCR-based gene disruptions. In an experiment in which a gene disruption was done with DNA molecules carrying PCR-added terminal sequences of only 35 bases of homology to each target site, all twelve tested geneticin-resistant colonies carried the correctly integrated kanMX module.
Affiliation(s)
- A Wach
- Institut für Angewandte Mikrobiologie, Universität Basel, Switzerland
|
35
|
Abstract
Bioinformatics involves both the automatic processing of large amounts of existing data and the creation of new types of information resource. Both will be required if the data are to be transformed into information and used to help in the discovery of drugs.
|
36
|
Abstract
Using computer methods for database search and multiple alignment, statistically significant sequence similarities were identified between several nitrilases with distinct substrate specificity, cyanide hydratases, aliphatic amidases, beta-alanine synthase, and a few other proteins with unknown molecular function. All these proteins appear to be involved in the reduction of organic nitrogen compounds and ammonia production. Sequence conservation over the entire length, as well as the similarity in the reactions catalyzed by the known enzymes in this family, points to a common catalytic mechanism. The new family of enzymes is characterized by several conserved motifs, one of which contains an invariant cysteine that is part of the catalytic site in nitrilases. Another highly conserved motif includes an invariant glutamic acid that might also be involved in catalysis.
Affiliation(s)
- P Bork
- European Molecular Biology Laboratory, Heidelberg, Germany
|