201
Madera M, Vogel C, Kummerfeld SK, Chothia C, Gough J. The SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Res 2004; 32:D235-9. [PMID: 14681402 PMCID: PMC308851 DOI: 10.1093/nar/gkh117] [Citation(s) in RCA: 179] [Impact Index Per Article: 8.5] [Indexed: 11/12/2022]
Abstract
The SUPERFAMILY database provides structural assignments to protein sequences and a framework for analysis of the results. At the core of the database is a library of profile Hidden Markov Models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent an entire superfamily. We have applied the library to predicted proteins from all completely sequenced genomes (currently 154), the Swiss-Prot and TrEMBL databases and other sequence collections. Close to 60% of all proteins have at least one match, and one half of all residues are covered by assignments. All models and full results are available for download and online browsing at http://supfam.org. Users can study the distribution of their superfamily of interest across all completely sequenced genomes, investigate with which other superfamilies it combines and retrieve proteins in which it occurs. Alternatively, concentrating on a particular genome as a whole, it is possible first, to find out its superfamily composition, and secondly, to compare it with that of other genomes to detect superfamilies that are over- or under-represented. In addition, the webserver provides the following standard services: sequence search; keyword search for genomes, superfamilies and sequence identifiers; and multiple alignment of genomic, PDB and custom sequences.
Affiliation(s)
- Martin Madera
- MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK.
202
Qian B, Soyer OS, Neubig RR, Goldstein RA. Depicting a protein's two faces: GPCR classification by phylogenetic tree-based HMMs. FEBS Lett 2003; 554:95-9. [PMID: 14596921 DOI: 10.1016/s0014-5793(03)01112-8] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.0] [Indexed: 11/29/2022]
Abstract
Related proteins with similar biological functions generally share common features, allowing us to extract the common sequence features. These common features enable us to build statistical models that can be used to classify proteins, to predict new members, and to study the sequence-function relationship of this protein function group. Although evolution underlies the basis of multiple sequence analysis methods, most methods ignore phylogenetic relationships and the evolutionary process in building these statistical models. Previously we have shown that a phylogenetic tree-based profile hidden Markov model (T-HMM) is superior in generating a profile for a group of similar proteins. In this study we used the method to generate common features of G protein-coupled receptors (GPCRs). The profile generated by T-HMM gives high accuracy in GPCR function classification, both by ligand and by coupled G protein.
Affiliation(s)
- Bin Qian
- Biophysics Research Division, University of Michigan, Ann Arbor, MI 48105, USA
203
Sitbon E, Pietrokovski S. New types of conserved sequence domains in DNA-binding regions of homing endonucleases. Trends Biochem Sci 2003; 28:473-7. [PMID: 13678957 DOI: 10.1016/s0968-0004(03)00170-1] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.5] [Indexed: 10/27/2022]
Abstract
We have identified four new types of short conserved sequence domains in homing endonucleases and related proteins. These domains are modular, appearing in various combinations. One domain includes a motif known by structure as a novel sequence-specific DNA-binding helix. Sequence similarity shows two other domains to be new types of helix-turn-helix DNA-binding domains. We term the new domains nuclease-associated modular DNA-binding domains (NUMODs).
Affiliation(s)
- Einat Sitbon
- Molecular Genetics Department, Weizmann Institute of Science, PO Box 26, Rehovot 76100, Israel
204
Liao L, Noble WS. Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships. J Comput Biol 2003; 10:857-68. [PMID: 14980014 DOI: 10.1089/106652703322756113] [Citation(s) in RCA: 151] [Impact Index Per Article: 6.9] [Indexed: 11/12/2022]
Abstract
One key element in understanding the molecular machinery of the cell is to understand the structure and function of each protein encoded in the genome. A very successful means of inferring the structure or function of a previously unannotated protein is via sequence similarity with one or more proteins whose structure or function is already known. Toward this end, we propose a means of representing proteins using pairwise sequence similarity scores. This representation, combined with a discriminative classification algorithm known as the support vector machine (SVM), provides a powerful means of detecting subtle structural and evolutionary relationships among proteins. The algorithm, called SVM-pairwise, when tested on its ability to recognize previously unseen families from the SCOP database, yields significantly better performance than SVM-Fisher, profile HMMs, and PSI-BLAST.
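The representation described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which alignment score is used, so a toy shared 2-mer count stands in for it here, and the names `similarity`, `pairwise_features`, and the example sequences are hypothetical. The point is only that each protein becomes a fixed-length vector of its scores against a reference set.

```python
from typing import List


def similarity(a: str, b: str, k: int = 2) -> float:
    """Toy stand-in for a pairwise alignment score (assumption):
    the number of distinct k-mers the two sequences share."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return float(len(kmers_a & kmers_b))


def pairwise_features(seq: str, reference_set: List[str]) -> List[float]:
    """Represent a protein as its similarity scores against every
    sequence in a fixed reference set."""
    return [similarity(seq, ref) for ref in reference_set]


# Hypothetical reference sequences and query, for illustration only.
reference_set = ["MKVLAAGIV", "GAVLKMMKV", "PPTWYYHHC"]
features = pairwise_features("MKVLAAGLV", reference_set)
print(features)
```

Feature vectors built this way have one dimension per reference sequence and could then be handed to any SVM implementation for training and classification.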
Affiliation(s)
- Li Liao
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA
205
Marti‐Renom MA, Madhusudhan M, Eswar N, Pieper U, Shen M, Sali A, Fiser A, Mirkovic N, John B, Stuart A. Modeling Protein Structure from its Sequence. Curr Protoc Bioinformatics 2003. [DOI: 10.1002/0471250953.bi0501s03] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Affiliation(s)
- Marc A. Marti‐Renom
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry and The California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, California
- M.S. Madhusudhan
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry and The California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, California
- Narayanan Eswar
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry and The California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, California
- Ursula Pieper
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry and The California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, California
- Min‐yi Shen
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry and The California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, California
- Andrej Sali
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry and The California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, California
- Andras Fiser
- Department of Biochemistry and Seaver Foundation Center for Bioinformatics, Albert Einstein College of Medicine, Bronx, New York
- Nebojsa Mirkovic
- Laboratory of Molecular Biophysics, The Rockefeller University, New York, New York
- Bino John
- Laboratory of Molecular Biophysics, The Rockefeller University, New York, New York
- Ashley Stuart
- Laboratory of Molecular Biophysics, The Rockefeller University, New York, New York
206
Wlodawer A, Durell SR, Li M, Oyama H, Oda K, Dunn BM. A model of tripeptidyl-peptidase I (CLN2), a ubiquitous and highly conserved member of the sedolisin family of serine-carboxyl peptidases. BMC Struct Biol 2003; 3:8. [PMID: 14609438 PMCID: PMC280685 DOI: 10.1186/1472-6807-3-8] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.5] [Received: 07/23/2003] [Accepted: 11/11/2003] [Indexed: 11/10/2022]
Abstract
Background: Tripeptidyl-peptidase I, also known as CLN2, is a member of the family of sedolisins (serine-carboxyl peptidases). In humans, defects in expression of this enzyme lead to a fatal neurodegenerative disease, classical late-infantile neuronal ceroid lipofuscinosis. Similar enzymes have been found in the genomic sequences of several species, but neither systematic analyses of their distribution nor modeling of their structures have been previously attempted. Results: We have analyzed the presence of orthologs of human CLN2 in the genomic sequences of a number of eukaryotic species. Enzymes with sequences sharing over 80% identity have been found in the genomes of macaque, mouse, rat, dog, and cow. Closely related, although clearly distinct, enzymes are present in fish (fugu and zebra), as well as in frogs (Xenopus tropicalis). A three-dimensional model of human CLN2 was built based mainly on the homology with Pseudomonas sp. 101 sedolisin. Conclusion: CLN2 is very highly conserved and widely distributed among higher organisms and may play an important role in their life cycles. The model presented here indicates a very open and accessible active site that is almost completely conserved among all known CLN2 enzymes. This result is somehow surprising for a tripeptidase where the presence of a more constrained binding pocket was anticipated. This structural model should be useful in the search for the physiological substrates of these enzymes and in the design of more specific inhibitors of CLN2.
Affiliation(s)
- Alexander Wlodawer
- Protein Structure Section, Macromolecular Crystallography Laboratory, National Cancer Institute at Frederick, Frederick, MD 21702, USA
- Stewart R Durell
- Laboratory of Experimental and Computational Biology, National Cancer Institute, Bethesda, MD 20892, USA
- Mi Li
- Protein Structure Section, Macromolecular Crystallography Laboratory, National Cancer Institute at Frederick, Frederick, MD 21702, USA
- Basic Research Program, SAIC-Frederick, Inc., National Cancer Institute at Frederick, Frederick, MD 21702, USA
- Hiroshi Oyama
- Department of Applied Biology, Faculty of Textile Science, Kyoto Institute of Technology, Sakyo-ku, Kyoto 606-8585, Japan
- Kohei Oda
- Department of Applied Biology, Faculty of Textile Science, Kyoto Institute of Technology, Sakyo-ku, Kyoto 606-8585, Japan
- Ben M Dunn
- Department of Biochemistry and Molecular Biology, University of Florida, Gainesville, Florida 32610, USA
207
Abstract
Telomere maintenance and end protection are essential for the survival and proliferation of eukaryotic cells, leading to the prediction that components of this system would be highly conserved. In practice, however, evidence for homology among these factors has been elusive, and, in the case of the known end-protection proteins, evolutionary relationships have been postulated largely on the basis of protein structural and functional similarity alone. Here we report support from sequence profile analyses for a significant and specific evolutionary relationship among OB-fold telomeric end-protection factors.
208
Sandhya S, Kishore S, Sowdhamini R, Srinivasan N. Effective detection of remote homologues by searching in sequence dataset of a protein domain fold. FEBS Lett 2003; 552:225-30. [PMID: 14527691 DOI: 10.1016/s0014-5793(03)00929-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 10/27/2022]
Abstract
Profile matching methods are commonly used in searches in protein sequence databases to detect evolutionary relationships. We describe here a sensitive protocol, which detects remote similarities by searching in a specialized database of sequences belonging to a fold. We have assessed this protocol by exploring the relationships we detect among sequences known to belong to specific folds. We find that searches within sequences adopting a fold are more effective in detecting remote similarities and evolutionary connections than searches in a database of all sequences. We also discuss the implications of using this strategy to link sequence and structure space.
Affiliation(s)
- S Sandhya
- Molecular Biophysics Unit, Indian Institute of Science, 560 012 Bangalore, India
209
Grigoriev IV, Choi IG. Target selection for structural genomics: a single genome approach. OMICS 2003; 6:349-62. [PMID: 12626094 DOI: 10.1089/153623102321112773] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022]
Abstract
We describe our strategy for selecting targets for protein structure determination in context of structural genomics of a single genome. In the course of target selection, we have studied two of the smallest microbial genomes, Mycoplasma genitalium and Mycoplasma pneumoniae. To our surprise, we found that only 71 Mycoplasma genes or their orthologues can be considered as easy targets for high-throughput structural studies--far fewer than expected. We discuss the methods and criteria used for target selection and the reasons explaining rarity of easy targets. First, despite the common opinion that protein folds can be predicted for only 30-50% of genes, the number of "truly unknown" structures is less than one-third. Second, due to the different codon usage, two thirds of Mycoplasma proteins cannot be directly expressed in E. coli in high-throughput manner and require substitution by their homologues from other organisms. Third, membrane or large multi-domain proteins are difficult targets because of solubility and size issues and often require identification and structure determination of protein domains. Finally, we propose different approaches to address the difficult targets.
Affiliation(s)
- Igor V Grigoriev
- Department of Chemistry and E.O. Lawrence Berkeley National Laboratory, University of California, Berkeley, CA, USA.
210
Abstract
It is often desired to identify further homologs of a family of biological sequences from the ever-growing sequence databases. Profile hidden Markov models excel at capturing the common statistical features of a group of biological sequences. With these common features, we can search the biological database and find new homologous sequences. Most general profile hidden Markov model methods, however, treat the evolutionary relationships between the sequences in a homologous group in an ad-hoc manner. We hereby introduce a method to incorporate phylogenetic information directly into hidden Markov models, and demonstrate that the resulting model performs better than most of the current multiple sequence-based methods for finding distant homologs.
Affiliation(s)
- Bin Qian
- Biophysics Research Division, University of Michigan, Ann Arbor, Michigan, USA
211
Abstract
A better description of the immune system can be afforded if the latest developments in bioinformatics are applied to integrate sequence with structure and function. Clear guidelines for the upgrade of the bioinformatic capability of the immunogenetics laboratory are discussed in the light of more powerful methods to detect homology, combined approaches to predict the three dimensional properties of a protein and a robust strategy to represent the biological role of a gene.
Affiliation(s)
- Bernard de Bono
- MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK.
212
Parenicová L, de Folter S, Kieffer M, Horner DS, Favalli C, Busscher J, Cook HE, Ingram RM, Kater MM, Davies B, Angenent GC, Colombo L. Molecular and phylogenetic analyses of the complete MADS-box transcription factor family in Arabidopsis: new openings to the MADS world. Plant Cell 2003; 15:1538-51. [PMID: 12837945 PMCID: PMC165399 DOI: 10.1105/tpc.011544] [Citation(s) in RCA: 605] [Impact Index Per Article: 27.5] [Received: 02/26/2003] [Accepted: 04/21/2003] [Indexed: 05/18/2023]
Abstract
MADS-box transcription factors are key regulators of several plant development processes. Analysis of the complete Arabidopsis genome sequence revealed 107 genes encoding MADS-box proteins, of which 84% are of unknown function. Here, we provide a complete overview of this family, describing the gene structure, gene expression, genome localization, protein motif organization, and phylogenetic relationship of each member. We have divided this transcription factor family into five groups (named MIKC, Malpha, Mbeta, Mgamma, and Mdelta) based on the phylogenetic relationships of the conserved MADS-box domain. This study provides a solid base for functional genomics studies into this important family of plant regulatory genes, including the poorly characterized group of M-type MADS-box proteins. MADS-box genes also constitute an excellent system with which to study the evolution of complex gene families in higher plants.
Affiliation(s)
- Lucie Parenicová
- Dipartimento di Biologia, Universitá degli Studi di Milano, 20133 Milan, Italy
213
Irving JA, Spithill TW, Pike RN, Whisstock JC, Smooker PM. The evolution of enzyme specificity in Fasciola spp. J Mol Evol 2003; 57:1-15. [PMID: 12962301 DOI: 10.1007/s00239-002-2434-x] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.4] [Indexed: 10/27/2022]
Abstract
Fasciola spp., commonly known as liver fluke, are significant trematode parasites of livestock and humans. They secrete several cathepsin L-like cysteine proteases, some of which differ in enzymatic properties and timing of expression in the parasite's life cycle. A detailed sequence and evolutionary analysis is presented, based on 18 cathepsin L-like enzymes isolated from Fasciola spp. (including a novel clone identified in this study). The enzymes form a monophyletic group which has experienced several gene duplication events over the last approximately 135 million years, giving rise to the present-day enzymatic repertoire of the parasite. This timing of these duplications appears to correlate with important points in the evolution of the mammalian hosts. Furthermore, the dates suggest that Fasciola hepatica and Fasciola gigantica diverged around 19 million years ago. A novel analysis, based on the pattern of amino acid diversity, was used to identify sites in the enzyme that are predicted to be subject to positive adaptive evolution. Many of these sites occur within the active site cleft of the enzymes, and hence would be expected to lead to differences in substrate specificity. Using homology modeling, with reference to previously obtained biochemical data, we are able to predict S2 subsite specificity for these enzymes: specifically those that can accommodate bulky hydrophobic residues in the P2 position and those that cannot. A number of other positions subject to evolutionary pressure and potentially significant for enzyme function are also identified, including sites anticipated to diminish cystatin binding affinity.
Affiliation(s)
- James A Irving
- Department of Biochemistry and Molecular Biology, Monash University, Victoria 3800, Australia
214
Abstract
Protein translations of over 100 complete genomes are now available. About half of these sequences can be provided with structural annotation, thereby enabling some profound insights into protein and pathway evolution. Whereas the major domain structure families are common to all kingdoms of life, these are combined in different ways in multidomain proteins to give various domain architectures that are specific to kingdoms or individual genomes, and contribute to the diverse phenotypes observed. These data argue for more targets in structural genomics initiatives and particularly for the selection of different domain architectures to gain better insights into protein functions.
Affiliation(s)
- David Lee
- Department of Biochemistry and Molecular Biology, University College, Gower Street, WC1E 6BT, London, UK.
215
Karchin R, Cline M, Mandel-Gutfreund Y, Karplus K. Hidden Markov models that use predicted local structure for fold recognition: alphabets of backbone geometry. Proteins 2003; 51:504-14. [PMID: 12784210 DOI: 10.1002/prot.10369] [Citation(s) in RCA: 137] [Impact Index Per Article: 6.2] [Indexed: 11/08/2022]
Abstract
An important problem in computational biology is predicting the structure of the large number of putative proteins discovered by genome sequencing projects. Fold-recognition methods attempt to solve the problem by relating the target proteins to known structures, searching for template proteins homologous to the target. Remote homologs that may have significant structural similarity are often not detectable by sequence similarities alone. To address this, we incorporated predicted local structure, a generalization of secondary structure, into two-track profile hidden Markov models (HMMs). We did not rely on a simple helix-strand-coil definition of secondary structure, but experimented with a variety of local structure descriptions, following a principled protocol to establish which descriptions are most useful for improving fold recognition and alignment quality. On a test set of 1298 nonhomologous proteins, HMMs incorporating a 3-letter STRIDE alphabet improved fold recognition accuracy by 15% over amino-acid-only HMMs and 23% over PSI-BLAST, measured by ROC-65 numbers. We compared two-track HMMs to amino-acid-only HMMs on a difficult alignment test set of 200 protein pairs (structurally similar with 3-24% sequence identity). HMMs with a 6-letter STRIDE secondary track improved alignment quality by 62%, relative to DALI structural alignments, while HMMs with an STR track (an expanded DSSP alphabet that subdivides strands into six states) improved by 40% relative to CE.
Affiliation(s)
- Rachel Karchin
- Center for Biomolecular Science and Engineering, Baskin School of Engineering, University of California, Santa Cruz 95064, USA.
216
Abstract
Domains are considered as the basic units of protein folding, evolution, and function. Decomposing each protein into modular domains is thus a basic prerequisite for accurate functional classification of biological molecules. Here, we present ADDA, an automatic algorithm for domain decomposition and clustering of all protein domain families. We use alignments derived from an all-on-all sequence comparison to define domains within protein sequences based on a global maximum likelihood model. In all, 90% of domain boundaries are predicted within 10% of domain size when compared with the manual domain definitions given in the SCOP database. A representative database of 249,264 protein sequences were decomposed into 450,462 domains. These domains were clustered on the basis of sequence similarities into 33,879 domain families containing at least two members with less than 40% sequence identity. Validation against family definitions in the manually curated databases SCOP and PFAM indicates almost perfect unification of various large domain families while contamination by unrelated sequences remains at a low level. The global survey of protein-domain space by ADDA confirms that most large and universal domain families are already described in PFAM and/or SMART. However, a survey of the complete set of mobile modules leads to the identification of 1479 new interesting domain families which shuffle around in multi-domain proteins. The data are publicly available at ftp://ftp.ebi.ac.uk/pub/contrib/heger/adda.
Affiliation(s)
- Andreas Heger
- EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
217
Lolkema JS, Slotboom DJ. Classification of 29 families of secondary transport proteins into a single structural class using hydropathy profile analysis. J Mol Biol 2003; 327:901-9. [PMID: 12662917 DOI: 10.1016/s0022-2836(03)00214-6] [Citation(s) in RCA: 39] [Impact Index Per Article: 1.8] [Indexed: 11/22/2022]
Abstract
A classification scheme for membrane proteins is proposed that clusters families of proteins into structural classes based on hydropathy profile analysis. The averaged hydropathy profiles of protein families are taken as fingerprints of the 3D structure of the proteins and, therefore, are able to detect more distant evolutionary relationships than amino acid sequences. A procedure was developed in which hydropathy profile analysis is used initially as a filter in a BLAST search of the NCBI protein database. The strength of the procedure is demonstrated by the classification of 29 families of secondary transporters into a single structural class, termed ST[3]. An exhaustive search of the database revealed that the 29 families contain 568 unique sequences. The proteins are predominantly from prokaryotic origin and most of the characterized transporters in ST[3] transport organic and inorganic anions and a smaller number are Na(+)/H(+) antiporters. All modes of energy coupling (symport, antiport, uniport) are found in structural class ST[3]. The relevance of the classification for structure/function prediction of uncharacterised transporters in the class is discussed.
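As a rough illustration of what an averaged family hydropathy profile is, the following Python sketch computes a sliding-window profile per sequence and averages the profiles position by position. It is a sketch under stated assumptions, not the procedure of the paper: the abstract names neither the hydropathy scale nor the window size, so the Kyte-Doolittle scale and a window of 3 are assumed here, the sequences are assumed to be pre-aligned with equal length and no gaps, and the function names are hypothetical.

```python
from typing import List

# Kyte-Doolittle hydropathy scale (assumed; the abstract does not name a scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq: str, window: int = 3) -> List[float]:
    """Mean hydropathy over each sliding window along one sequence."""
    values = [KD[residue] for residue in seq]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def family_profile(seqs: List[str], window: int = 3) -> List[float]:
    """Position-wise average of the members' profiles.
    Assumes equal-length, gap-free (pre-aligned) sequences."""
    profiles = [hydropathy_profile(s, window) for s in seqs]
    return [sum(column) / len(column) for column in zip(*profiles)]


# Hypothetical mini-family, for illustration only.
print(family_profile(["MIVLAGK", "MLVIAGR", "MIVIGGK"]))
```

Two families could then be compared by measuring the distance between their averaged profiles; handling of gaps and length differences is left out of this sketch.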
Affiliation(s)
- Juke S Lolkema
- Molecular Microbiology, Biomolecular Sciences and Biotechnology Institute, University of Groningen, Kerklaan 30, 9751NN, Haren, The Netherlands.
218
Van Walle I, Lasters I, Wyns L. Consistency matrices: quantified structure alignments for sets of related proteins. Proteins 2003; 51:1-9. [PMID: 12596259 DOI: 10.1002/prot.10293] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/08/2022]
Abstract
Comparing two remotely similar structures is a difficult problem: more often than not, resulting structure alignments will show ambiguities and a unique answer usually does not even exist. In addition, alignments in general have a limited information content because every aligned residue is considered equally important. To solve these issues to a certain extent, one can take the perspective of a whole group of similar structures and then evaluate common structural features. Here, we describe a consistency approach that, although not actually performing a multiple structure alignment, does produce the information that one would conceivably want from such an experiment: the key structural features of the group, e.g., a fold, which in this case are projected onto either a pair of proteins or a single protein. Both representations are useful for a number of applications, ranging from the detection of (partially) wrong structure alignments to protein structure classification and fold recognition. To demonstrate some of these applications, the procedure was applied to 195 SCOP folds containing a total of 1802 domains sharing very low sequence similarity.
Affiliation(s)
- Ivo Van Walle
- Department of Ultrastructure, Vrije Universiteit Brussel, Sint-Genesius Rode, Belgium.
219
Gurung R, Tan A, Ooms LM, McGrath MJ, Huysmans RD, Munday AD, Prescott M, Whisstock JC, Mitchell CA. Identification of a novel domain in two mammalian inositol-polyphosphate 5-phosphatases that mediates membrane ruffle localization. The inositol 5-phosphatase skip localizes to the endoplasmic reticulum and translocates to membrane ruffles following epidermal growth factor stimulation. J Biol Chem 2003; 278:11376-85. [PMID: 12536145 DOI: 10.1074/jbc.m209991200] [Citation(s) in RCA: 81] [Impact Index Per Article: 3.7] [Indexed: 11/06/2022]
Abstract
SKIP (skeletal muscle and kidney enriched inositol phosphatase) is a recently identified phosphatidylinositol 3,4,5-trisphosphate- and phosphatidylinositol 4,5-bisphosphate-specific 5-phosphatase. In this study, we investigated the intracellular localization of SKIP. Indirect immunofluorescence and subcellular fractionation showed that, in serum-starved cells, both endogenous and recombinant SKIP colocalized with markers of the endoplasmic reticulum (ER). Following epidermal growth factor (EGF) stimulation, SKIP transiently translocated to plasma membrane ruffles and colocalized with submembranous actin. Data base searching demonstrated a novel 128-amino acid domain in the C terminus of SKIP, designated SKICH for SKIP carboxyl homology, which is also found in the 107-kDa 5-phosphatase PIPP and in members of the TRAF6-binding protein family. Recombinant SKIP lacking the SKICH domain localized to the ER, but did not translocate to membrane ruffles following EGF stimulation. The SKIP SKICH domain showed perinuclear localization and mediated EGF-stimulated plasma membrane ruffle localization. The SKICH domain of the 5-phosphatase PIPP also mediated plasma membrane ruffle localization. Mutational analysis identified the core sequence within the SKICH domain that mediated constitutive membrane association and C-terminal sequences unique to SKIP that contributed to ER localization. Collectively, these studies demonstrate a novel membrane-targeting domain that serves to recruit SKIP and PIPP to membrane ruffles.
Affiliation(s)
- Rajendra Gurung
- Department of Biochemistry and Molecular Biology, Monash University, Victoria 3800, Australia
220
Gille C, Goede A, Schlöetelburg C, Preissner R, Kloetzel PM, Göbel UB, Frömmel C. A comprehensive view on proteasomal sequences: implications for the evolution of the proteasome. J Mol Biol 2003; 326:1437-48. [PMID: 12595256 DOI: 10.1016/s0022-2836(02)01470-5] [Citation(s) in RCA: 73] [Impact Index Per Article: 3.3] [Indexed: 11/28/2022]
Abstract
Proteasomes are large multimeric self-compartmentizing proteases, which play a crucial role in the clearance of misfolded proteins, breakdown of regulatory proteins, processing of proteins by specific partial proteolysis, cell cycle control as well as preparation of peptides for immune presentation. Two main types can be distinguished by their different tertiary structure: the 20S proteasome and the proteasome-like heat shock protein encoded by heat shock locus V, hslV. Usually, each biological kingdom is characterized by its specific type of proteasome. The 20S proteasomes occur in eukarya and archaea whereas hslV protease is prevalent in bacteria. To verify this rule we applied a genome-wide sequence search to identify proteasomal sequences in data of finished and yet unfinished genome projects. We found several exceptions to this paradigm: (1) Protista: in addition to the 20S proteasome, Leishmania, Trypanosoma and Plasmodium contained hslV, which may have been acquired from an alpha-proteobacterial progenitor of mitochondria. (2) Bacteria: for Magnetospirillum magnetotacticum and Enterococcus faecium we found that each contained two distinct hslVs due to gene duplication or horizontal transfer. Including unassembled data into the analyses we confirmed that a number of bacterial genomes do not contain any proteasomal sequence due to gene loss. (3) High G+C Gram-positives: we confirmed that high G+C Gram-positives possess 20S proteasomes rather than hslV proteases. The core of the 20S proteasome consists of two distinct main types of homologous monomers, alpha and beta, which differentiated into seven subtypes by further gene duplications. By looking at the genome of the intracellular pathogen Encephalitozoon cuniculi we were able to show that differentiation of beta-type subunits into different subtypes occurred earlier than that of alpha-subunits. 
Additionally, our search strategy had an important methodological consequence: a comprehensive sequence search for a particular protein should also include the raw sequence data when possible because proteins might be missed in the completed assembled genome. The structure-based multiple proteasomal alignment of 433 sequences from 143 organisms can be downloaded from the URL dagger and will be updated regularly.
Affiliation(s)
- Christoph Gille
- Institute of Biochemistry, Medical Faculty Charité, Humboldt-University, D-10117, Berlin, Germany.
|
221
|
Abstract
In the past decade, bioinformatics has become an integral part of research and development in the biomedical sciences. Bioinformatics now has an essential role both in deciphering genomic, transcriptomic and proteomic data generated by high-throughput experimental technologies and in organizing information gathered from traditional biology. Sequence-based methods of analyzing individual genes or proteins have been elaborated and expanded, and methods have been developed for analyzing large numbers of genes or proteins simultaneously, such as in the identification of clusters of related genes and networks of interacting proteins. With the complete genome sequences for an increasing number of organisms at hand, bioinformatics is beginning to provide both conceptual bases and practical methods for detecting systemic functional behaviors of the cell and the organism.
Affiliation(s)
- Minoru Kanehisa
- Bioinformatics Center, Kyoto University, Uji, Kyoto 611-0011, Japan.
|
222
|
Swalla BM, Gumport RI, Gardner JF. Conservation of structure and function among tyrosine recombinases: homology-based modeling of the lambda integrase core-binding domain. Nucleic Acids Res 2003; 31:805-18. [PMID: 12560475 PMCID: PMC149183 DOI: 10.1093/nar/gkg142] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.0] [Indexed: 11/14/2022] Open
Abstract
Tyrosine recombinases participate in diverse biological processes by catalyzing recombination between specific DNA sites. Although a conserved protein fold has been described for the catalytic (CAT) domains of five recombinases, structural relationships between their core-binding (CB) domains remain unclear. Despite differences in the specificity and affinity of core-type DNA recognition, a conserved binding mechanism is suggested by the shared two-domain motif in crystal structure models of the recombinases Cre, XerD and Flp. We have found additional evidence for conservation of the CB domain fold. Comparison of XerD and Cre crystal structures showed that their CB domains are closely related; the three central alpha-helices of these domains are superposable to within 1.44 Å. A structure-based multiple sequence alignment containing 25 diverse CB domain sequences provided evidence for widespread conservation of both structural and functional elements in this fold. Based upon the Cre and XerD crystal structures, we employed homology modeling to construct a three-dimensional structure for the lambda integrase CB domain. The model provides a conceptual framework within which many previously identified, functionally important amino acid residues were investigated. In addition, the model predicts new residues that may participate in core-type DNA binding or dimerization, thereby providing hypotheses for future genetic and biochemical experiments.
|
223
|
Edwards YJK, Cottage A. Bioinformatics methods to predict protein structure and function. A practical approach. Mol Biotechnol 2003; 23:139-66. [PMID: 12632698 DOI: 10.1385/mb:23:2:139] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.0] [Indexed: 11/11/2022]
Abstract
Protein structure prediction by using bioinformatics can involve sequence similarity searches, multiple sequence alignments, identification and characterization of domains, secondary structure prediction, solvent accessibility prediction, automatic protein fold recognition, constructing three-dimensional models to atomic detail, and model validation. Not all protein structure prediction projects involve the use of all these techniques. A central part of a typical protein structure prediction is the identification of a suitable structural target from which to extrapolate three-dimensional information for a query sequence. The way in which this is done defines three types of projects. The first involves the use of standard and well-understood techniques. If a structural template remains elusive, a second approach using nontrivial methods is required. If a target fold cannot be reliably identified because inconsistent results have been obtained from nontrivial data analyses, the project falls into the third type of project and will be virtually impossible to complete with any degree of reliability. In this article, a set of protocols to predict protein structure from sequence is presented and distinctions among the three types of project are given. These methods, if used appropriately, can provide valuable indicators of protein structure and function.
Affiliation(s)
- Yvonne J K Edwards
- Research Division, UK Human Genome Mapping Project Resource Center, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SB, England, UK.
|
224
|
Panchenko AR. Finding weak similarities between proteins by sequence profile comparison. Nucleic Acids Res 2003; 31:683-9. [PMID: 12527777 PMCID: PMC140518 DOI: 10.1093/nar/gkg154] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.0] [Indexed: 11/14/2022] Open
Abstract
To improve the recognition of weak similarities between proteins a method of aligning two sequence profiles is proposed. It is shown that exploring the sequence space in the vicinity of the sequence with unknown properties significantly improves the performance of sequence alignment methods. Consistent with the previous observations the recognition sensitivity and alignment accuracy obtained by a profile-profile alignment method can be as much as 30% higher compared to the sequence-profile alignment method. It is demonstrated that the choice of score function and the diversity of the test profile are very important factors for achieving the maximum performance of the method, whereas the optimum range of these parameters depends on the level of similarity to be recognized.
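The idea of aligning two sequence profiles, rather than a sequence against a profile, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the abstract does not state which score function was used, so `column_score` uses a plain dot product of residue frequencies, and `align_profiles` is an ordinary global dynamic-programming alignment with a hypothetical linear gap penalty.

```python
from typing import Dict, List

# One profile position: residue letter -> observed frequency (assumed representation).
Column = Dict[str, float]


def column_score(a: Column, b: Column) -> float:
    """Dot product of two frequency columns (illustrative score function only)."""
    return sum(freq * b.get(res, 0.0) for res, freq in a.items())


def align_profiles(p: List[Column], q: List[Column], gap: float = 0.2) -> float:
    """Global alignment score of two profiles with a linear gap penalty."""
    n, m = len(p), len(q)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = -gap * i
    for j in range(1, m + 1):
        dp[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + column_score(p[i - 1], q[j - 1]),  # align columns
                dp[i - 1][j] - gap,  # gap in q
                dp[i][j - 1] - gap,  # gap in p
            )
    return dp[n][m]
```

Because the abstract stresses that the choice of score function strongly affects performance, `column_score` is the part one would expect to swap out first.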
Affiliation(s)
- Anna R Panchenko
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, Room 8N805, 8600 Rockville Pike, Bethesda, MD 20894, USA.
|
225
|
Sen S. Statistical analysis of pair-wise compatibility of spatially nearest neighbor and adjacent residues in alpha-helix and beta-strands: application to a minimal model for secondary structure prediction. Biophys Chem 2003; 103:35-49. [PMID: 12504253 DOI: 10.1016/s0301-4622(02)00230-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 11/18/2022]
Abstract
Secondary structural elements like alpha-helix and beta-strands possess distinctly different structural features and thus the relative positioning of the nearest neighbor residues, and also the sequence-wise adjacent residues is important in determining the structural preference. In the present work we have statistically examined the pair-wise compatibility pattern of physically nearest neighbors and separately the adjacent residue pairs along the sequence in between the nearest neighbor partners in alpha-helices and beta-strands. It has been demonstrated that the patterns and hence, the physical basis of the compatibility of adjacent residue pairs and the spatially nearest neighbors are significantly different in most cases. The influence of tertiary contacts on the pair-wise compatibility is shown to be significant for beta-strands while it is small for alpha-helices. Based on the compatibility of physically nearest neighbors and the sequence-wise adjacent residue pairs, a minimal model has been constructed to predict the alpha-helices, beta-strands and coils of a protein from its sequence. Application of this method to 100 sequences shows that it has a predictive capability comparable to that of other more sophisticated statistical methods.
Affiliation(s)
- Srikanta Sen
- Human Genetics and Genomics Division, Indian Institute of Chemical Biology, 4, Raja S C Mullick Road, Jadavpur, Calcutta 700032, India.
|
226
|
Cazalis R, Aussenac T, Rhazi L, Marin A, Gibrat JF. Homology modeling and molecular dynamics simulations of the N-terminal domain of wheat high molecular weight glutenin subunit 10. Protein Sci 2003; 12:34-43. [PMID: 12493826 PMCID: PMC2312395 DOI: 10.1110/ps.0229803] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.6] [Indexed: 10/27/2022]
Abstract
High molecular weight glutenin subunits (HMW-GS) are of a particular interest because of their biomechanical properties, which are important in many food systems such as breadmaking. Using fold-recognition techniques, we identified a fold compatible with the N-terminal domain of HMW-GS Dy10. This fold corresponds to the one adopted by proteins belonging to the cereal inhibitor family. Starting from three known protein structures of this family as templates, we built three models for the N-terminal domain of HMW-GS Dy10. We analyzed these models, and we propose a number of hypotheses regarding the N-terminal domain properties that can be tested experimentally. In particular, we discuss two possible ways of interaction between the N-terminal domains of the y-type HMW glutenin subunits. The first way consists in the creation of interchain disulfide bridges. According to our models, we propose two plausible scenarios: (1) the existence of an intrachain disulfide bridge between cysteines 22 and 44, leaving the three other cysteines free of engaging in intermolecular bonds; and (2) the creation of two intrachain disulfide bridges (involving cysteines 22-44 and cysteines 10-55), leaving a single cysteine (45) for creating an intermolecular disulfide bridge. We discuss these scenarios in relation to contradictory experimental results. The second way, although less likely, is nevertheless worth considering. There might exist a possibility for the N-terminal domain of Dy10, Nt-Dy10, to create oligomers, because homologous cereal inhibitor proteins are known to exist as monomers, homodimers, and heterooligomers. We also discuss, in relation to the function of the cereal inhibitor proteins, the possibility that this N-terminal domain has retained similar inhibitory functions.
Affiliation(s)
- Roland Cazalis
- Laboratoire d'Agrophysiologie, UMR 1054 INRA, ESA Purpan, 31076 Toulouse cedex 3, France
|
227
|
Affiliation(s)
- András Fiser
- Department of Biochemistry and Seaver Foundation Center for Bioinformatics, Albert Einstein College of Medicine, Bronx, New York 10461, USA
|
228
|
Dickens NJ, Ponting CP. THoR: a tool for domain discovery and curation of multiple alignments. Genome Biol 2003; 4:R52. [PMID: 12914660 PMCID: PMC193644 DOI: 10.1186/gb-2003-4-8-r52] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Received: 05/20/2003] [Revised: 06/17/2003] [Accepted: 06/25/2003] [Indexed: 11/21/2022] Open
Abstract
We describe a tool, THoR, that automatically creates and curates multiple sequence alignments representing protein domains. This exploits both PSI-BLAST and HMMER algorithms and provides an accurate and comprehensive alignment for any domain family. The entire process is designed for use via a web-browser, with simple links and cross-references to relevant information, to assist the assessment of biological significance. THoR has been benchmarked for accuracy using the SMART and pufferfish genome databases.
Affiliation(s)
- Nicholas J Dickens
- MRC Functional Genetics Unit, University of Oxford, Department of Human Anatomy and Genetics, South Parks Road, Oxford OX1 3QX, UK.
|
229
|
Krebs WG, Tsai J, Alexandrov V, Junker J, Jansen R, Gerstein M. Tools and Databases to Analyze Protein Flexibility; Approaches to Mapping Implied Features onto Sequences. Methods Enzymol 2003; 374:544-84. [PMID: 14696388 DOI: 10.1016/s0076-6879(03)74023-3] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.5] [Indexed: 01/25/2023]
Affiliation(s)
- W G Krebs
- San Diego Supercomputer Center, University of California San Diego, La Jolla, California 92093, USA
|
230
|
Abstract
To assess the reliability of fold assignments to protein sequences, we developed a fold recognition method called FROST (Fold Recognition-Oriented Search Tool) based on a series of filters and a database specifically designed as a benchmark for this new method under realistic conditions. This benchmark database consists of proteins for which there exists, at least, another protein with an extensively similar 3D structure in a database of representative 3D structures (i.e., more than 65% of the residues in both proteins can be structurally aligned). Because the testing of our method must be carried out under conditions similar to those of real fold recognition experiments, no protein pair with sequence similarity detectable using standard sequence comparison methods such as FASTA is included in the benchmark database. While using FROST, we achieved a coverage of 60% for a rate of error of 1%. To obtain a baseline for our method, we used PSI-BLAST and 3D-PSSM. Under the same conditions, for a 1% error rate, coverages for PSI-BLAST and 3D-PSSM were 33 and 56%, respectively.
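The "coverage at a given error rate" figures quoted above can be made concrete with a small sketch. The definitions used here are assumptions, since the abstract does not spell them out: error rate is taken as the fraction of accepted assignments that are wrong, and coverage as the fraction of benchmark queries that receive a correct accepted assignment. The function name `coverage_at_error` and the `(score, is_correct)` input format are hypothetical.

```python
from typing import Iterable, Tuple


def coverage_at_error(
    hits: Iterable[Tuple[float, bool]], n_queries: int, max_error: float = 0.01
) -> float:
    """Best coverage reachable by a score cutoff while staying within max_error.

    hits: (score, is_correct) for the top assignment of each query,
          where a higher score means a more confident assignment.
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    best = 0.0
    correct = wrong = 0
    for _score, is_correct in ranked:
        if is_correct:
            correct += 1
        else:
            wrong += 1
        # Accept everything down to this rank; check the resulting error rate.
        if wrong / (correct + wrong) <= max_error:
            best = max(best, correct / n_queries)
    return best
```

Under these assumed definitions, a statement such as "60% coverage for a 1% error rate" corresponds to `coverage_at_error(hits, n_queries, 0.01)` returning 0.6.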
Affiliation(s)
- Antoine Marin
- Mathématique, Informatique et Génome, Centre de Recherche de Versailles, INRA, Route de St Cyr, 78026 Versailles, Cedex, France
|
231
|
Nair R, Rost B. Sequence conserved for subcellular localization. Protein Sci 2002; 11:2836-47. [PMID: 12441382 PMCID: PMC2373743 DOI: 10.1110/ps.0207402] [Citation(s) in RCA: 114] [Impact Index Per Article: 5.0] [Received: 05/11/2002] [Revised: 09/05/2002] [Accepted: 09/10/2002] [Indexed: 10/27/2022]
Abstract
The more proteins diverged in sequence, the more difficult it becomes for bioinformatics to infer similarities of protein function and structure from sequence. The precise thresholds used in automated genome annotations depend on the particular aspect of protein function transferred by homology. Here, we presented the first large-scale analysis of the relation between sequence similarity and identity in subcellular localization. Three results stood out: (1) The subcellular compartment is generally more conserved than what might have been expected given that short sequence motifs like nuclear localization signals can alter the native compartment; (2) the sequence conservation of localization is similar between different compartments; and (3) it is similar to the conservation of structure and enzymatic activity. In particular, we found the transition between the regions of conserved and nonconserved localization to be very sharp, although the thresholds for conservation were less well defined than for structure and enzymatic activity. We found that a simple measure for sequence similarity accounting for pairwise sequence identity and alignment length, the HSSP distance, distinguished accurately between protein pairs of identical and different localizations. In fact, BLAST expectation values outperformed the HSSP distance only for alignments in the subtwilight zone. We succeeded in slightly improving the accuracy of inferring localization through homology by fine tuning the thresholds. Finally, we applied our results to the entire SWISS-PROT database and five entirely sequenced eukaryotes.
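The HSSP distance mentioned above combines pairwise sequence identity with alignment length. The abstract does not give the formula; the sketch below uses the length-dependent threshold curve as commonly quoted from Rost (1999), which is an assumption here and should be checked against that paper before any real use. The function names are hypothetical.

```python
import math


def hssp_threshold(length: int) -> float:
    """Percent identity on the HSSP curve for a given alignment length.

    Constants are recalled from Rost (1999), not taken from this abstract.
    """
    if length <= 11:
        return 100.0
    if length <= 450:
        return 480.0 * length ** (-0.32 * (1.0 + math.exp(-length / 1000.0)))
    return 19.5


def hssp_distance(percent_identity: float, length: int) -> float:
    """Distance above (positive) or below (negative) the HSSP curve."""
    return percent_identity - hssp_threshold(length)
```

A positive distance means the pair lies above the curve, the region where inferring a shared property by homology is more reliable; the thresholds the authors tuned for localization are not reproduced here.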
Affiliation(s)
- Rajesh Nair
- Columbia University Bioinformatics Center (CUBIC), Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA
|
232
|
Udaka K, Mamitsuka H, Nakaseko Y, Abe N. Empirical evaluation of a dynamic experiment design method for prediction of MHC class I-binding peptides. J Immunol 2002; 169:5744-53. [PMID: 12421954 DOI: 10.4049/jimmunol.169.10.5744] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.1] [Indexed: 11/19/2022]
Abstract
The ability to predict MHC-binding peptides remains limited despite ever expanding demands for specific immunotherapy against cancers, infectious diseases, and autoimmune disorders. Previous analyses revealed position-specific preference of amino acids but failed to detect sequence patterns. Efforts to use computational analysis to identify sequence patterns have been hampered by the insufficiency of the number/quality of the peptide binding data. We propose here a dynamic experiment design to search for sequence patterns that are common to the MHC class I-binding peptides. The method is based on a committee-based framework of query learning using hidden Markov models as its component algorithm. It enables a comprehensive search of a large variety (20^9) of peptides with a small number of experiments. The learning was conducted in seven rounds of feedback loops, in which our computational method was used to determine the next set of peptides to be analyzed based on the results of the earlier iterations. After these training cycles, the algorithm enabled a real number prediction of MHC binding peptides with an accuracy surpassing that of the hitherto best performing positional scanning method.
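To give a feel for the size of the nine-residue peptide space and for how a committee can choose the next experiments, here is a minimal sketch. It is not the authors' method: the committee members are arbitrary scoring callables rather than hidden Markov models, and ranking candidates by the variance of the committee's predictions is an assumed, textbook form of query by committee. `select_queries` is a hypothetical name.

```python
from typing import Callable, List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Number of distinct 9-residue peptides over 20 amino acids: 20**9.
SEARCH_SPACE = len(AMINO_ACIDS) ** 9  # 512,000,000,000


def select_queries(
    candidates: Sequence[str],
    committee: Sequence[Callable[[str], float]],
    k: int,
) -> List[str]:
    """Pick the k candidates on which committee members disagree the most."""

    def disagreement(peptide: str) -> float:
        preds = [member(peptide) for member in committee]
        mean = sum(preds) / len(preds)
        return sum((x - mean) ** 2 for x in preds) / len(preds)  # variance

    return sorted(candidates, key=disagreement, reverse=True)[:k]
```

In a feedback loop like the one described, the selected peptides would be measured experimentally, the committee retrained on the new data, and the selection repeated.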
Affiliation(s)
- Keiko Udaka
- Department of Biophysics, Kyoto University, Kyoto 606-8502, Japan.
|
233
|
Mateos A, Dopazo J, Jansen R, Tu Y, Gerstein M, Stolovitzky G. Systematic learning of gene functional classes from DNA array expression data by using multilayer perceptrons. Genome Res 2002; 12:1703-15. [PMID: 12421757 PMCID: PMC187551 DOI: 10.1101/gr.192502] [Citation(s) in RCA: 61] [Impact Index Per Article: 2.7] [Indexed: 11/25/2022]
Abstract
Recent advances in microarray technology have opened new ways for functional annotation of previously uncharacterised genes on a genomic scale. This has been demonstrated by unsupervised clustering of co-expressed genes and, more importantly, by supervised learning algorithms. Using prior knowledge, these algorithms can assign functional annotations based on more complex expression signatures found in existing functional classes. Previously, support vector machines (SVMs) and other machine-learning methods have been applied to a limited number of functional classes for this purpose. Here we present, for the first time, the comprehensive application of supervised neural networks (SNNs) for functional annotation. Our study is novel in that we report systematic results for ~100 classes in the Munich Information Center for Protein Sequences (MIPS) functional catalog. We found that only ~10% of these are learnable (based on the rate of false negatives). A closer analysis reveals that false positives (and negatives) in a machine-learning context are not necessarily "false" in a biological sense. We show that the high degree of interconnections among functional classes confounds the signatures that ought to be learned for a unique class. We term this the "Borges effect" and introduce two new numerical indices for its quantification. Our analysis indicates that classification systems with a lower Borges effect are better suitable for machine learning. Furthermore, we introduce a learning procedure for combining false positives with the original class. We show that in a few iterations this process converges to a gene set that is learnable with considerably low rates of false positives and negatives and contains genes that are biologically related to the original class, allowing for a coarse reconstruction of the interactions between associated biological pathways. We exemplify this methodology using the well-studied tricarboxylic acid cycle.
Affiliation(s)
- Alvaro Mateos
- Bioinformatics Unit, Centro Nacional de Investigaciones Oncologicas (CNIO), 28039, Madrid, Spain
|
234
|
Abstract
This paper reports an analysis of the encoded proteins (the proteome) of the genomes of human, fly, worm, yeast, and representatives of bacteria and archaea in terms of the three-dimensional structures of their globular domains together with a general sequence-based study. We show that 39% of the human proteome can be assigned to known structures. We estimate that for 77% of the proteome, there is some functional annotation, but only 26% of the proteome can be assigned to standard sequence motifs that characterize function. Of the human protein sequences, 13% are transmembrane proteins, but only 3% of the residues in the proteome form membrane-spanning regions. There are substantial differences in the composition of globular domains of transmembrane proteins between the proteomes we have analyzed. Commonly occurring structural superfamilies are identified within the proteome. The frequencies of these superfamilies enable us to estimate that 98% of the human proteome evolved by domain duplication, with four of the 10 most duplicated superfamilies specific for multicellular organisms. The zinc-finger superfamily is massively duplicated in human compared to fly and worm, and occurrence of domains in repeats is more common in metazoa than in single cellular organisms. Structural superfamilies over- and underrepresented in human disease genes have been identified. Data and results can be downloaded and analyzed via web-based applications at http://www.sbg.bio.ic.ac.uk.
Affiliation(s)
- Arne Müller
- Biomolecular Modelling Laboratory, Cancer Research UK, London, United Kingdom
|
235
|
Harton JA, Linhoff MW, Zhang J, Ting JPY. Cutting edge: CATERPILLER: a large family of mammalian genes containing CARD, pyrin, nucleotide-binding, and leucine-rich repeat domains. J Immunol 2002; 169:4088-93. [PMID: 12370334 DOI: 10.4049/jimmunol.169.8.4088] [Citation(s) in RCA: 238] [Impact Index Per Article: 10.3] [Indexed: 12/20/2022]
Abstract
Large mammalian proteins containing a nucleotide-binding domain (NBD) and C-terminal leucine-rich repeats (LRR) similar in structure to plant disease resistance proteins have been suggested as critical in innate immunity. Our interest in CIITA, a NBD/LRR protein, and recent reports linking mutations in two other NBD/LRR proteins to inflammatory disorders have prompted us to perform a search for other members. Twenty-two known and novel NBD/LRR genes are spread across eight human chromosomes, with multigene clusters occurring on 11, 16, and 19. Most of these are telomeric. Their N termini vary, but most have a pyrin domain. The genomic organization demonstrates a high degree of conservation of the NBD- and LRR-encoding exons. Except for CIITA, all the predicted NBD/LRR proteins are likely ATP-binding proteins. Some have broad tissue expression, whereas others are restricted to myeloid cells. The implications of these data on origins, expression, and function of these genes are discussed.
Affiliation(s)
- Jonathan A Harton
- Department of Microbiology and Immunology, Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC 27599, USA
|
236
|
Schultz J, Pils B. Prediction of structure and functional residues for O-GlcNAcase, a divergent homologue of acetyltransferases. FEBS Lett 2002; 529:179-82. [PMID: 12372596 DOI: 10.1016/s0014-5793(02)03322-7] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.6] [Indexed: 11/23/2022]
Abstract
N-Acetyl-beta-D-glucosaminidase (O-GlcNAcase) is a key enzyme in the posttranslational modification of intracellular proteins by O-linked N-acetylglucosamine (O-GlcNAc). Here, we show that this protein contains two catalytic domains, one homologous to bacterial hyaluronidases and one belonging to the GCN5-related family of acetyltransferases (GNATs). Using sequence and structural information, we predict that the GNAT homologous region contains the O-GlcNAcase activity. Thus, O-GlcNAcase is the first member of the GNAT family not involved in transfer of acetyl groups, adding a new mode of evolution to this large protein family. Comparison with solved structures of different GNATs led to a reliable structure prediction and mapping of residues involved in binding of the GlcNAc-modified proteins and catalysis.
Affiliation(s)
- Jörg Schultz
- Computational Molecular Biology Department, Max-Planck-Institute for Molecular Genetics, Ihnestr. 73, 14195 Berlin, Germany.
|
237
|
Madera M, Gough J. A comparison of profile hidden Markov model procedures for remote homology detection. Nucleic Acids Res 2002; 30:4321-8. [PMID: 12364612 PMCID: PMC140544 DOI: 10.1093/nar/gkf544] [Citation(s) in RCA: 108] [Impact Index Per Article: 4.7] [Indexed: 11/13/2022] Open
Abstract
Profile hidden Markov models (HMMs) are amongst the most successful procedures for detecting remote homology between proteins. There are two popular profile HMM programs, HMMER and SAM. Little is known about their performance relative to each other and to the recently improved version of PSI-BLAST. Here we compare the two programs to each other and to non-HMM methods, to determine their relative performance and the features that are important for their success. The quality of the multiple sequence alignments used to build models was the most important factor affecting the overall performance of profile HMMs. The SAM T99 procedure is needed to produce high quality alignments automatically, and the lack of an equivalent component in HMMER makes it less complete as a package. Using the default options and parameters as would be expected of an inexpert user, it was found that from identical alignments SAM consistently produces better models than HMMER and that the relative performance of the model-scoring components varies. On average, HMMER was found to be between one and three times faster than SAM when searching databases larger than 2000 sequences, SAM being faster on smaller ones. Both methods were shown to have effective low complexity and repeat sequence masking using their null models, and the accuracy of their E-values was comparable. It was found that the SAM T99 iterative database search procedure performs better than the most recent version of PSI-BLAST, but that scoring of PSI-BLAST profiles is more than 30 times faster than scoring of SAM models.
Affiliation(s)
- Martin Madera
- MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK.
|
238
|
George RA, Heringa J. Protein domain identification and improved sequence similarity searching using PSI-BLAST. Proteins 2002; 48:672-81. [PMID: 12211035 DOI: 10.1002/prot.10175] [Citation(s) in RCA: 40] [Impact Index Per Article: 1.7] [Indexed: 11/05/2022]
Abstract
Protein sequences containing more than one structural domain are problematic when used in homology searches where they can either stop an iterative database search prematurely or cause an explosion of a search to common domains. We describe a method, DOMAINATION, that infers domains and their boundaries in a query sequence from local gapped alignments generated using PSI-BLAST. Through a new technique to recognize domain insertions and permutations, DOMAINATION submits delineated domains as successive database queries in further iterative steps. Assessed over a set of 452 multidomain proteins, the method predicts structural domain boundaries with an overall accuracy of 50% and improves finding distant homologies by 14% compared with PSI-BLAST. DOMAINATION is available as a web based tool at http://mathbio.nimr.mrc.ac.uk, and the source code is available from the authors upon request.
Affiliation(s)
- Richard A George
- Division of Mathematical Biology, National Institute for Medical Research, The Ridgeway, Mill Hill, United Kingdom
|
239
|
Luo RY, Feng ZP, Liu JK. Prediction of protein structural class by amino acid and polypeptide composition. Eur J Biochem 2002; 269:4219-25. [PMID: 12199700 DOI: 10.1046/j.1432-1033.2002.03115.x] [Citation(s) in RCA: 110] [Impact Index Per Article: 4.8] [Indexed: 11/20/2022]
Abstract
A new approach of predicting structural classes of protein domain sequences is presented in this paper. Besides the amino acid composition, the composition of several dipeptides, tripeptides, tetrapeptides, pentapeptides and hexapeptides are taken into account based on the stepwise discriminant analysis. The result of jackknife test shows that this new approach can lead to higher predictive sensitivity and specificity for reduced sequence similarity datasets. Considering the dataset PDB40-B constructed by Brenner and colleagues, 75.2% protein domain sequences are correctly assigned in the jackknife test for the four structural classes: all-alpha, all-beta, alpha/beta and alpha + beta, which is improved by 19.4% in jackknife test and 25.5% in resubstitution test, in contrast with the component-coupled algorithm using amino acid composition alone (AAC approach) for the same dataset. In the cross-validation test with dataset PDB40-J constructed by Park and colleagues, more than 80% predictive accuracy is obtained. Furthermore, for the dataset constructed by Chou and Maggiora, the accuracy of 100% and 99.7% can be easily achieved, respectively, in the resubstitution test and in the jackknife test merely taking the composition of dipeptides into account. Therefore, this new method provides an effective tool to extract valuable information from protein sequences, which can be used for the systematic analysis of small or medium size protein sequences. The computer programs used in this paper are available on request.
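The input features described above, amino acid composition and the composition of short peptides, can be computed as in the following sketch. This covers only feature extraction; the stepwise discriminant analysis used to select and weight features is not shown. The function name `kmer_composition` and the fixed 20-letter alphabet ordering are assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(seq: str, k: int) -> List[float]:
    """Frequency of every possible k-peptide, in lexicographic alphabet order.

    k=1 gives the 20-dimensional amino acid composition,
    k=2 the 400-dimensional dipeptide composition, and so on.
    Windows containing non-standard letters count toward the total
    but have no slot in the output vector.
    """
    seq = seq.upper()
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(max(total, 0)))
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    return [counts[m] / total if total > 0 else 0.0 for m in kmers]
```

Note that the vector length grows as 20^k, so for tripeptides and longer one would normally keep only a selected subset of features, which matches the abstract's reference to "several" peptides of each length.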
Affiliation(s)
- Rui-yan Luo
- Department of Mathematics, Tianjin University, Tianjin 300 072, China
240
Abstract
The importance of a residue for maintaining the structure and function of a protein can usually be inferred from how conserved it appears in a multiple sequence alignment of that protein and its homologues. A reliable metric for quantifying residue conservation is desirable. Over the last two decades many such scores have been proposed, but none has emerged as a generally accepted standard. This work surveys the range of scores that biologists, biochemists, and, more recently, bioinformatics workers have developed, and reviews the intrinsic problems associated with developing and evaluating such a score. A general formula is proposed that may be used to compare the properties of different particular conservation scores or as a measure of conservation in its own right.
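To make the idea of a per-column conservation score concrete, here is a minimal sketch of one simple entropy-based score. It is only an example of the family of scores such a survey covers, not the general formula proposed in the paper; the function names, the gap handling, and the normalisation by a 20-letter alphabet are all assumptions made for illustration.

```python
import math
from collections import Counter


def column_conservation(column, alphabet_size=20, gap_chars="-."):
    """Score one alignment column from 0 (variable) to 1 (fully conserved).

    Uses 1 - H / log2(alphabet_size), where H is the Shannon entropy of the
    residue frequencies. Gaps are ignored; an all-gap column scores 0.
    """
    residues = [c.upper() for c in column if c not in gap_chars]
    if not residues:
        return 0.0
    counts = Counter(residues)
    n = len(residues)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 1.0 - entropy / math.log2(alphabet_size)


def alignment_conservation(aligned_sequences):
    """Score every column of an alignment given as equal-length strings."""
    return [column_conservation(col) for col in zip(*aligned_sequences)]
```

A score of this kind ignores the chemical similarity of residues and the redundancy of the aligned sequences, which are among the issues a more complete score would need to address.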
Affiliation(s)
- William S J Valdar
- Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, London, United Kingdom.
241
Li W, Jaroszewski L, Godzik A. Sequence clustering strategies improve remote homology recognitions while reducing search times. Protein Eng Des Sel 2002; 15:643-9. [PMID: 12364578 DOI: 10.1093/protein/15.8.643] [Citation(s) in RCA: 44] [Impact Index Per Article: 1.9] [Indexed: 11/14/2022] Open
Abstract
Sequence databases are rapidly growing, thereby increasing the coverage of protein sequence space, but this coverage is uneven because most sequencing efforts have concentrated on a small number of organisms. The resulting granularity of sequence space creates many problems for profile-based sequence comparison programs. In this paper, we suggest several strategies that address these problems, and at the same time speed up the searches for homologous proteins and improve the ability of profile methods to recognize distant homologies. One of our strategies combines database clustering, which removes highly redundant sequence, and a two-step PSI-BLAST (PDB-BLAST), which separates sequence spaces of profile composition and space of homology searching. The combination of these strategies improves distant homology recognitions by more than 100%, while using only 10% of the CPU time of the standard PSI-BLAST search. Another method, intermediate profile searches, allows for the exploration of additional search directions that are normally dominated by large protein sub-families within very diverse families. All methods are evaluated with a large fold-recognition benchmark.
Affiliation(s)
- Weizhong Li
- The Burnham Institute, La Jolla, CA 92037, USA
242
Ponting CP, Russell RR. The natural history of protein domains. Annual Review of Biophysics and Biomolecular Structure 2002; 31:45-71. [PMID: 11988462 DOI: 10.1146/annurev.biophys.31.082901.134314] [Citation(s) in RCA: 199] [Impact Index Per Article: 8.7] [Indexed: 11/09/2022]
Abstract
Genome sequencing and structural genomics projects are providing new insights into the evolutionary history of protein domains. As methods for sequence and structure comparison improve, more distantly related domains are shown to be homologous. Thus there is a need for domain families to be classified within a hierarchy similar to Linnaeus' Systema Naturae, the classification of species. With such a hierarchy in mind, we discuss the evolution of domains, their combination into proteins, and evidence as to the likely origin of protein domains. We also discuss when and how analysis of domains can be used to understand details of protein function. Unconventional features of domain evolution such as intragenomic competition, domain insertion, horizontal gene transfer, and convergent evolution are seen as analogs of organismal evolutionary events. These parallels illustrate how the concept of domains can be applied to provide insights into evolutionary biology.
Affiliation(s)
- Chris P Ponting
- Department of Human Anatomy and Genetics, University of Oxford, MRC Functional Genetics Unit, South Parks Road, Oxford OX1 3QX, UK.
243
Mougous JD, Green RE, Williams SJ, Brenner SE, Bertozzi CR. Sulfotransferases and sulfatases in mycobacteria. Chemistry & Biology 2002; 9:767-76. [PMID: 12144918 DOI: 10.1016/s1074-5521(02)00175-8] [Citation(s) in RCA: 92] [Impact Index Per Article: 4.0] [Indexed: 11/20/2022]
Abstract
Analysis of the genomes of M. tuberculosis, M. leprae, M. smegmatis, and M. avium has revealed a large family of genes homologous to known sulfotransferases. Despite reports detailing a suite of sulfated glycolipids in many mycobacteria, a corresponding family of sulfotransferase genes remains uncharacterized. Here, a sequence-based analysis of newly discovered mycobacterial sulfotransferase genes, named stf1-stf10, is presented. Interestingly, two sulfotransferase genes are highly similar to mammalian sulfotransferases, increasing the list of mycobacterial eukaryotic-like protein families. The sulfotransferases join an equally complex family of mycobacterial sulfatases: a large family of sulfatase genes has been found in all of the mycobacterial genomes examined. As sulfated molecules are common mediators of cell-cell interactions, the sulfotransferases and sulfatases may be involved in regulating host-pathogen interactions.
Affiliation(s)
- Joseph D Mougous
- Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA
244
Abstract
Fold recognition predicts protein three-dimensional structure by establishing relationships between a protein sequence and known protein structures. Most methods explicitly use information derived from the secondary and tertiary structure of the templates. Here we show that rigorous application of a sequence search method (PSI-BLAST) with no reference to secondary or tertiary structure information is able to perform as well as traditional fold recognition methods. Since the method, SENSER, does not require knowledge of the three-dimensional structure, it can be used to infer relationships that are not tractable by methods dependent on structural templates.
Affiliation(s)
- Kristin K Koretke
- Microbial Bioinformatics Group, GlaxoSmithKline, Collegeville, Pennsylvania 19426-0989, USA.
245
Vitale L, Casadei R, Canaider S, Lenzi L, Strippoli P, D'Addabbo P, Giannone S, Carinci P, Zannotti M. Cysteine and tyrosine-rich 1 (CYYR1), a novel unpredicted gene on human chromosome 21 (21q21.2), encodes a cysteine and tyrosine-rich protein and defines a new family of highly conserved vertebrate-specific genes. Gene 2002; 290:141-51. [PMID: 12062809 DOI: 10.1016/s0378-1119(02)00550-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022]
Abstract
A novel human gene has been identified by in-depth bioinformatics analysis of chromosome 21 segment 40/105 (21q21.1), with no coding region predicted in any previous analysis. Brain-derived DNA complementary to RNA (cDNA) sequencing predicts a 154-amino acid product with no similarity to any known protein. The gene has been named cysteine and tyrosine-rich protein 1 gene (symbol cysteine and tyrosine-rich 1, CYYR1). The CYYR1 messenger RNA was found by Northern blot analysis in a broad range of tissues (two transcripts of 3.4 and 2.2 kb). The gene consists of four exons and spans about 107 kb, including a very large intron of 85.8 kb. Analysis of expressed sequence tags shows high CYYR1 expression in cells belonging to the amine precursor uptake and decarboxylation system. We also cloned the cDNA of the murine ortholog Cyyr1, which was mapped by a radiation hybrid panel on chromosome 16 within the region corresponding to that containing the respective human homolog on chromosome 21. Sequence and phylogenetic analysis led to identification of several genes encoding CYYR1 homologous proteins. The most prominent feature identified in the protein family is a central, unique cysteine and tyrosine-rich domain, which is strongly conserved from lower vertebrates (fishes) to humans but is absent in bacteria and invertebrates.
MESH Headings
- Amino Acid Sequence
- Animals
- Blotting, Northern
- Chromosomes, Human, Pair 21/genetics
- DNA, Complementary/chemistry
- DNA, Complementary/genetics
- Databases, Nucleic Acid
- Evolution, Molecular
- Expressed Sequence Tags
- Female
- Gene Expression
- Humans
- Membrane Proteins
- Mice
- Molecular Sequence Data
- Phylogeny
- Proteins/genetics
- RNA, Messenger/genetics
- RNA, Messenger/metabolism
- Radiation Hybrid Mapping
- Sequence Alignment
- Sequence Analysis, DNA
- Sequence Homology, Amino Acid
- Vertebrates/genetics
Affiliation(s)
- Lorenza Vitale
- Istituto di Istologia ed Embriologia Generale, Università di Bologna, Bologna-Centro di Ricerca in Genetica Molecolare Fondazione CARISBO, Bologna, Via Belmeloro, Bologna, Italy
246
Hegyi H, Lin J, Greenbaum D, Gerstein M. Structural genomics analysis: characteristics of atypical, common, and horizontally transferred folds. Proteins 2002; 47:126-41. [PMID: 11933060 DOI: 10.1002/prot.10078] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.4] [Indexed: 11/08/2022]
Abstract
We conducted a structural genomics analysis of the folds and structural superfamilies in the first 20 completely sequenced genomes by focusing on the patterns of fold usage and trying to identify structural characteristics of typical and atypical folds. We assigned folds to sequences using PSI-BLAST, run with a systematic protocol to reduce the amount of computational overhead. On average, folds could be assigned to about a fourth of the ORFs in the genomes and about a fifth of the amino acids in the proteomes. More than 80% of all the folds in the SCOP structural classification were identified in one of the 20 organisms, with worm and E. coli having the largest number of distinct folds. Folds are particularly effective at comprehensively measuring levels of gene duplication, because they group together even very remote homologues. Using folds, we find the average level of duplication varies depending on the complexity of the organism, ranging from 2.4 in M. genitalium to 32 for the worm, values significantly higher than those observed based purely on sequence similarity. We rank the common folds in the 20 organisms, finding that the top three are the P-loop NTP hydrolase, the ferredoxin fold, and the TIM-barrel, and discuss in detail the many factors that affect and bias these rankings. We also identify atypical folds that are "unique" to one of the organisms in our study and compare the characteristics of these folds with the most common ones. We find that common folds tend to be more multifunctional and associated with more regular, "symmetrical" structures than the unique ones. In addition, many of the unique folds are associated with proteins involved in cell defense (e.g., toxins). We analyze specific patterns of fold occurrence in the genomes by associating some of them with instances of horizontal transfer and others with gene loss. In particular, we find three possible examples of transfer between archaea and bacteria and six between eukarya and bacteria. We make available our detailed results at http://genecensus.org/20.
Affiliation(s)
- Hedi Hegyi
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, USA
247
Abstract
The level of sequence similarity that implies similarity in protein structure is well established. Recently, many groups proposed thresholds for similarity in sequence implying similarity in enzymatic function. All previous results suggest the strong conservation of enzymatic function above levels of 50% pairwise sequence identity. Here, I argue that all groups substantially overestimated the conservation of enzyme function because their data sets were either too biased, or too small. An unbiased analysis suggested that less than 30% of the pair fragments above 50% sequence identity have entirely identical EC numbers. Another surprising finding was that even BLAST E-values below 10^-50 did not suffice to automatically transfer enzyme function without errors. As expected, most misclassifications originated from similarities in relatively short regions and/or from transferring annotations for different domains. Both problems cannot be corrected easily by adjusting the thresholds for automatic transfer of genome annotations. A score relating sequence identity to alignment length (distance from HSSP-threshold) outperformed statistical BLAST scores for high sequence similarity. In particular, the distance score allowed error-free transfer of enzyme function for the 10% most similar enzyme pairs. The results illustrated how difficult it is to assess the conservation of protein function and to guarantee error-free genome annotations, in general: sets with millions of pair comparisons might not suffice to arrive at statistically significant conclusions. In practice, the revised detailed estimates for the sequence conservation of enzyme function may provide important benchmarks for everyday sequence analysis and for more cautious automatic genome annotations.
Affiliation(s)
- Burkhard Rost
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA.
248
Karwath A, King RD. Homology induction: the use of machine learning to improve sequence similarity searches. BMC Bioinformatics 2002; 3:11. [PMID: 11972320 PMCID: PMC107726 DOI: 10.1186/1471-2105-3-11] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.7] [Received: 11/27/2001] [Accepted: 04/23/2002] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: The inference of homology between proteins is a key problem in molecular biology. The current best approaches only identify approximately 50% of homologies (with a false positive rate set at 1/1000). RESULTS: We present Homology Induction (HI), a new approach to inferring homology. HI uses machine learning to bootstrap from standard sequence similarity search methods. First a standard method is run, then HI learns rules which are true for sequences of high similarity to the target (assumed homologues) and not true for general sequences; these rules are then used to discriminate sequences in the twilight zone. To learn the rules HI describes the sequences in a novel way based on a bioinformatic knowledge base, and the machine learning method of inductive logic programming. To evaluate HI we used the PDB40D benchmark which lists sequences of known homology but low sequence similarity. We compared the HI methodology with PSI-BLAST alone and found HI performed significantly better. In addition, Receiver Operating Characteristic (ROC) curve analysis showed that these improvements were robust for all reasonable error costs. The predictive homology rules learnt by HI can be interpreted biologically to provide insight into conserved features of homologous protein families. CONCLUSIONS: HI is a new technique for the detection of remote protein homology, a central bioinformatic problem. HI with PSI-BLAST is shown to outperform PSI-BLAST for all error costs. It is expected that similar improvements would be obtained using HI with any sequence similarity method.
Affiliation(s)
- Andreas Karwath
- Department of Computer Sciences, University of Wales, Aberystwyth, SY23 3DB, UK
- Ross D King
- Department of Computer Sciences, University of Wales, Aberystwyth, SY23 3DB, UK
249
Buchan DWA, Shepherd AJ, Lee D, Pearl FMG, Rison SCG, Thornton JM, Orengo CA. Gene3D: structural assignment for whole genes and genomes using the CATH domain structure database. Genome Res 2002; 12:503-14. [PMID: 11875040 PMCID: PMC155287 DOI: 10.1101/gr.213802] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.1] [Indexed: 11/25/2022]
Abstract
We present a novel web-based resource, Gene3D, of precalculated structural assignments to gene sequences and whole genomes. This resource assigns structural domains from the CATH database to whole genes and links these to their curated functional and structural annotations within the CATH domain structure database, the functional Dictionary of Homologous Superfamilies (DHS) and PDBsum. Currently Gene3D provides annotation for 36 complete genomes (two eukaryotes, six archaea, and 28 bacteria). On average, between 30% and 40% of the genes of a given genome can be structurally annotated. Matches to structural domains are found using the profile-based method (PSI-BLAST), and a novel protocol, DRange, is used to resolve conflicts in matches involving different homologous superfamilies.
Affiliation(s)
- Daniel W A Buchan
- Biomolecular Structure and Modelling Group, Department of Biochemistry and Molecular Biology, University College London, London, WC1E 6BT, United Kingdom
250
Hedman M, Deloof H, Von Heijne G, Elofsson A. Improved detection of homologous membrane proteins by inclusion of information from topology predictions. Protein Sci 2002; 11:652-8. [PMID: 11847287 PMCID: PMC2373465 DOI: 10.1110/ps.39402] [Citation(s) in RCA: 20] [Impact Index Per Article: 0.9] [Indexed: 10/17/2022]
Abstract
A total of 20%-25% of the proteins in a typical genome are helical membrane proteins. The transmembrane regions of these proteins have markedly different properties when compared with globular proteins. This presents a problem when homology search algorithms optimized for globular proteins are applied to membrane proteins. Here we present modifications of the standard Smith-Waterman and profile search algorithms that significantly improve the detection of related membrane proteins. The improvement is based on the inclusion of information about predicted transmembrane segments in the alignment algorithm. This is done by simply increasing the alignment score if two residues predicted to belong to transmembrane segments are aligned with each other. Benchmarking over a test set of G-protein-coupled receptor sequences shows that the number of false positives is significantly reduced in this way, both when closely related and distantly related proteins are searched for.
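The modification described in this abstract, raising the alignment score when two residues predicted to be transmembrane are aligned, can be sketched on top of a plain Smith-Waterman score as follows. This is an illustrative toy rather than the authors' implementation: the function name `sw_score`, the identity-based match and mismatch scores, the linear gap penalty, and the bonus value are all assumptions, and the topology predictions are supplied simply as lists of booleans.

```python
def sw_score(seq_a, seq_b, tm_a=None, tm_b=None,
             match=2, mismatch=-1, gap=-2, tm_bonus=1):
    """Best local alignment score with an optional transmembrane bonus.

    tm_a / tm_b are per-residue booleans (True = predicted transmembrane).
    When both aligned residues are predicted transmembrane, tm_bonus is
    added to the substitution score for that pair.
    """
    tm_a = tm_a or [False] * len(seq_a)
    tm_b = tm_b or [False] * len(seq_b)
    prev = [0] * (len(seq_b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(seq_a) + 1):
        curr = [0] * (len(seq_b) + 1)
        for j in range(1, len(seq_b) + 1):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if tm_a[i - 1] and tm_b[j - 1]:
                s += tm_bonus
            # Local alignment: scores never drop below zero.
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

For example, `sw_score("LLVV", "LLVV")` scores the four identities alone, while passing `[True] * 4` for both `tm_a` and `tm_b` adds the bonus at each aligned position. A realistic version would use a substitution matrix and affine gap penalties in place of the fixed values assumed here.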
Affiliation(s)
- Maria Hedman
- Stockholm Bioinformatics Center, SCFAB, Stockholm University, SE-10691, Stockholm, Sweden