301
|
Lunn JE, Ashton AR, Hatch MD, Heldt HW. Purification, molecular cloning, and sequence analysis of sucrose-6F-phosphate phosphohydrolase from plants. Proc Natl Acad Sci U S A 2000; 97:12914-9. [PMID: 11050182 PMCID: PMC18864 DOI: 10.1073/pnas.230430197] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.0] [Indexed: 11/18/2022]
Abstract
Sucrose-6(F)-phosphate phosphohydrolase (SPP; EC ) catalyzes the final step in the pathway of sucrose biosynthesis and is the only enzyme of photosynthetic carbon assimilation for which the gene has not been identified. The enzyme was purified to homogeneity from rice (Oryza sativa L.) leaves and partially sequenced. The rice leaf enzyme is a dimer with a native molecular mass of 100 kDa and a subunit molecular mass of 50 kDa. The enzyme is highly specific for sucrose 6(F)-phosphate with a K(m) of 65 microM and a specific activity of 1250 micromol min(-1) mg(-1) protein. The activity is dependent on Mg(2+) with a remarkably low K(a) of 8-9 microM and is weakly inhibited by sucrose. Three peptides from cleavage of the purified rice SPP with endoproteinase Lys-C showed similarity to the deduced amino acid sequences of three predicted open reading frames (ORF) in the Arabidopsis thaliana genome and one in the genome of the cyanobacterium Synechocystis sp. PCC6803, as well as cDNA clones from Arabidopsis, maize, and other species in the GenBank database of expressed sequence tags. The putative maize SPP cDNA clone contained an ORF encoding a 420-amino acid polypeptide. Heterologous expression in Escherichia coli showed that this cDNA clone encoded a functional SPP enzyme. The 260-amino acid N-terminal catalytic domain of the maize SPP is homologous to the C-terminal region of sucrose-phosphate synthase. A PSI-BLAST search of the GenBank database indicated that the maize SPP is a member of the haloacid dehalogenase hydrolase/phosphatase superfamily.
Affiliation(s)
- J E Lunn
- Commonwealth Scientific and Industrial Research Organization Plant Industry, GPO Box 1600, Canberra, ACT 2601, Australia.
|
302
|
Abstract
The threading approach to protein fold recognition attempts to evaluate how well a query sequence fits into an already-solved fold. 3D-1D threaders rely on matching 1-dimensional strings of 3-dimensional information predicted from the query sequence with corresponding features of the target structure. In many cases this is combined with a sequence comparison. The combination of sequence and structure information has been shown to improve the accuracy of fold recognition, relative to the exclusive use of sequence or structure. In this paper, we review progress made since the introduction of threading methods a decade ago, highlighting recent advances. We focus on two emerging methods that are unconventional 3D-1D threaders: proximity correlation matrices and parallel cascade identification.
Affiliation(s)
- R David
- Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
|
303
|
Friedberg I, Kaplan T, Margalit H. Evaluation of PSI-BLAST alignment accuracy in comparison to structural alignments. Protein Sci 2000; 9:2278-84. [PMID: 11152139 PMCID: PMC2144484 DOI: 10.1110/ps.9.11.2278] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.2] [Indexed: 10/21/2022]
Abstract
The PSI-BLAST algorithm has been acknowledged as one of the most powerful tools for detecting remote evolutionary relationships by sequence considerations only. This has been demonstrated by its ability to recognize remote structural homologues and by the greatest coverage it enables in annotation of a complete genome. Although recognizing the correct fold of a sequence is of major importance, the accuracy of the alignment is crucial for the success of modeling one sequence by the structure of its remote homologue. Here we assess the accuracy of PSI-BLAST alignments on a stringent database of 123 structurally similar, sequence-dissimilar pairs of proteins, by comparing them to the alignments defined on a structural basis. Each protein sequence is compared to a nonredundant database of the protein sequences by PSI-BLAST. Whenever a pair member detects its pair-mate, the positions that are aligned both in the sequential and structural alignments are determined, and the alignment sensitivity is expressed as the percentage of these positions out of the structural alignment. Fifty-two sequences detected their pair-mates (for 16 pairs the success was bi-directional when either pair member was used as a query). The average percentage of correctly aligned residues per structural alignment was 43.5+/-2.2%. Other properties of the alignments were also examined, such as the sensitivity vs. specificity and the change in these parameters over consecutive iterations. Notably, there is an improvement in alignment sensitivity over consecutive iterations, reaching an average of 50.9+/-2.5% within the five iterations tested in the current study.
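The alignment sensitivity measure described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each alignment is represented as a collection of (query position, target position) pairs. The function names are hypothetical, and the specificity definition used here (shared pairs as a percentage of the sequence alignment) is an assumption, since the abstract does not define it.

```python
def alignment_sensitivity(sequence_pairs, structural_pairs):
    """Percentage of structurally aligned position pairs that the
    sequence alignment reproduces (denominator: structural alignment)."""
    structural = set(structural_pairs)
    if not structural:
        raise ValueError("structural alignment is empty")
    shared = structural & set(sequence_pairs)
    return 100.0 * len(shared) / len(structural)


def alignment_specificity(sequence_pairs, structural_pairs):
    """Assumed definition: percentage of sequence-aligned position pairs
    that also occur in the structural alignment."""
    sequence = set(sequence_pairs)
    if not sequence:
        raise ValueError("sequence alignment is empty")
    shared = sequence & set(structural_pairs)
    return 100.0 * len(shared) / len(sequence)


# Toy data: pairs are (query residue index, target residue index).
structural = [(1, 1), (2, 2), (3, 3), (4, 5)]
sequence = [(1, 1), (2, 2), (3, 4), (4, 5), (5, 6)]

print(alignment_sensitivity(sequence, structural))  # 75.0
print(alignment_specificity(sequence, structural))  # 60.0
```

Representing alignments as sets of position pairs keeps the comparison independent of how gaps are written in either alignment.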
Affiliation(s)
- I Friedberg
- Department of Molecular Genetics and Biotechnology, The Hebrew University, Hadassah Medical School, Jerusalem, Israel
|
304
|
Remm M, Sonnhammer E. Classification of transmembrane protein families in the Caenorhabditis elegans genome and identification of human orthologs. Genome Res 2000; 10:1679-89. [PMID: 11076853 PMCID: PMC310950 DOI: 10.1101/gr.gr-1491r] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.0] [Indexed: 11/24/2022]
Abstract
The complete genome sequence of the nematode Caenorhabditis elegans provides an excellent basis for studying the distribution and evolution of protein families in higher eukaryotes. Three fundamental questions are as follows: How many paralog clusters exist in one species, how many of these are shared with other species, and how many proteins can be assigned a functional counterpart in other species? We have addressed these questions in a detailed study of predicted membrane proteins in C. elegans and their mammalian homologs. All worm proteins predicted to contain at least two transmembrane segments were clustered on the basis of sequence similarity. This resulted in 189 groups with two or more sequences, containing, in total, 2647 worm proteins. Hidden Markov models (HMMs) were created for each family, and were used to retrieve mammalian homologs from the SWISSPROT, TREMBL, and VTS databases. About one-half of these clusters had mammalian homologs. Putative worm-mammalian orthologs were extracted by use of nine different phylogenetic methods and BLAST. Eight clusters initially thought to be worm-specific were assigned mammalian homologs after searching EST and genomic sequences. A compilation of 174 orthology assignments made with high confidence is presented.
Affiliation(s)
- M Remm
- Center for Genomics Research, Karolinska Institute, Stockholm, 17177 Sweden
|
305
|
Abstract
We present a protein fold-recognition method that uses a comprehensive statistical interpretation of structural Hidden Markov Models (HMMs). The structure/fold recognition is done by summing the probabilities of all sequence-to-structure alignments. The optimal alignment can be defined as the most probable, but suboptimal alignments may have comparable probabilities. These suboptimal alignments can be interpreted as optimal alignments to the "other" structures from the ensemble or optimal alignments under minor fluctuations in the scoring function. Summing probabilities for all alignments gives a complete estimate of sequence-model compatibility. In the case of HMMs that produce a sequence, this reflects the fact that due to our indifference to exactly how the HMM produced the sequence, we should sum over all possibilities. We have built a set of structural HMMs for 188 protein structures and have compared two methods for identifying the structure compatible with a sequence: by the optimal alignment probability and by the total probability. Fold recognition by total probability was 40% more accurate than fold recognition by the optimal alignment probability. Proteins 2000;40:451-462.
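The contrast between scoring by the optimal alignment and scoring by the total probability can be illustrated with a generic HMM. The sketch below is not the structural HMM of the abstract: the two-state model, its state names, and all probabilities are hypothetical toy values. It only shows that the forward recursion sums over every path while the Viterbi recursion keeps the single best path.

```python
def total_probability(obs, start, trans, emit):
    """Forward algorithm: probability of obs summed over all state paths."""
    states = list(start)
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


def best_path_probability(obs, start, trans, emit):
    """Viterbi algorithm: probability of the single most probable path."""
    states = list(start)
    delta = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        delta = {
            s: emit[s][symbol] * max(delta[r] * trans[r][s] for r in states)
            for s in states
        }
    return max(delta.values())


# Hypothetical two-state model emitting symbols "a" and "b".
start = {"H": 0.5, "C": 0.5}
trans = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit = {"H": {"a": 0.9, "b": 0.1}, "C": {"a": 0.2, "b": 0.8}}

print(total_probability("ab", start, trans, emit))      # about 0.1915
print(best_path_probability("ab", start, trans, emit))  # about 0.108
```

The total probability is never smaller than the best-path probability, and the gap between them grows when many suboptimal paths have comparable probabilities, which is the situation the abstract describes.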
Affiliation(s)
- J R Bienkowska
- BioMolecular Engineering Research Center, College of Engineering, Boston University, Boston, Massachusetts 02215, USA.
|
306
|
Abstract
The effect of training a neural network secondary structure prediction algorithm with different types of multiple sequence alignment profiles derived from the same sequences, is shown to provide a range of accuracy from 70.5% to 76.4%. The best accuracy of 76.4% (standard deviation 8.4%), is 3.1% (Q(3)) and 4.4% (SOV2) better than the PHD algorithm run on the same set of 406 sequence non-redundant proteins that were not used to train either method. Residues predicted by the new method with a confidence value of 5 or greater, have an average Q(3) accuracy of 84%, and cover 68% of the residues. Relative solvent accessibility based on a two state model, for 25, 5, and 0% accessibility are predicted at 76.2, 79.8, and 86.6% accuracy respectively. The source of the improvements obtained from training with different representations of the same alignment data are described in detail. The new Jnet prediction method resulting from this study is available in the Jpred secondary structure prediction server, and as a stand-alone computer program from: http://barton.ebi.ac.uk/. Proteins 2000;40:502-511.
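The Q(3) figures quoted in this abstract are per-residue three-state accuracies. A minimal sketch of how such a score, and the accuracy and coverage at a confidence cutoff, could be computed is given below. It assumes predictions and observations are equal-length strings over the states H, E, and C, and that per-residue confidence values are integers. The function names and toy data are hypothetical and are not taken from Jnet.

```python
def q3_accuracy(predicted, observed):
    """Percentage of residues whose three-state label (H, E, C) is correct."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def q3_at_confidence(predicted, observed, confidence, threshold=5):
    """Return (accuracy, coverage) in percent for residues whose
    confidence value is at least `threshold`."""
    kept = [
        (p, o)
        for p, o, c in zip(predicted, observed, confidence)
        if c >= threshold
    ]
    coverage = 100.0 * len(kept) / len(observed)
    if not kept:
        return None, coverage
    accuracy = 100.0 * sum(p == o for p, o in kept) / len(kept)
    return accuracy, coverage


observed = "HHHEEECCCC"
predicted = "HHHEECCCCH"
confidence = [9, 9, 8, 7, 5, 2, 6, 7, 8, 1]

print(q3_accuracy(predicted, observed))                   # 80.0
print(q3_at_confidence(predicted, observed, confidence))  # (100.0, 80.0)
```

In the toy data the two wrong predictions also carry the lowest confidence values, so filtering at the cutoff raises accuracy at the cost of coverage, the same trade-off the abstract reports (84% accuracy over 68% of residues).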
Affiliation(s)
- J A Cuff
- Laboratory of Molecular Biophysics, Oxford, United Kingdom
|
307
|
Abstract
Several recent publications illustrated advantages of using sequence profiles in recognizing distant homologies between proteins. At the same time, the practical usefulness of distant homology recognition depends not only on the sensitivity of the algorithm, but also on the quality of the alignment between a prediction target and the template from the database of known proteins. Here, we study this question for several supersensitive protein algorithms that were previously compared in their recognition sensitivity (Rychlewski et al., 2000). A database of protein pairs with similar structures, but low sequence similarity is used to rate the alignments obtained with several different methods, which included sequence-sequence, sequence-profile, and profile-profile alignment methods. We show that incorporation of evolutionary information encoded in sequence profiles into alignment calculation methods significantly increases the alignment accuracy, bringing them closer to the alignments obtained from structure comparison. In general, alignment quality is correlated with recognition and alignment score significance. For every alignment method, alignments with statistically significant scores correlate with both correct structural templates and good quality alignments. At the same time, average alignment lengths differ in various methods, making the comparison between them difficult. For instance, the alignments obtained by FFAS, the profile-profile alignment algorithm developed in our group, are always longer than the alignments obtained with the PSI-BLAST algorithms. To address this problem, we develop methods to truncate or extend alignments to cover a specified percentage of protein lengths. In most cases, the elongation of the alignment by profile-profile methods is reasonable, adding fragments of similar structure. The examples of erroneous alignment are examined and it is shown that they can be identified based on the model quality.
Affiliation(s)
- L Jaroszewski
- The Burnham Institute, La Jolla, California 92037, USA
|
308
|
Bateman A, Birney E. Searching databases to find protein domain organization. ADVANCES IN PROTEIN CHEMISTRY 2000; 54:137-57. [PMID: 10829227 DOI: 10.1016/s0065-3233(00)54005-4] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.6] [Indexed: 01/20/2023]
Affiliation(s)
- A Bateman
- Sanger Centre, Hinxton, United Kingdom
|
309
|
Koonin EV, Wolf YI, Aravind L. Protein fold recognition using sequence profiles and its application in structural genomics. ADVANCES IN PROTEIN CHEMISTRY 2000; 54:245-75. [PMID: 10829230 DOI: 10.1016/s0065-3233(00)54008-x] [Citation(s) in RCA: 69] [Impact Index Per Article: 2.8] [Indexed: 12/04/2022]
Affiliation(s)
- E V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
|
310
|
Ponting CP, Schultz J, Copley RR, Andrade MA, Bork P. Evolution of domain families. ADVANCES IN PROTEIN CHEMISTRY 2000; 54:185-244. [PMID: 10829229 DOI: 10.1016/s0065-3233(00)54007-8] [Citation(s) in RCA: 52] [Impact Index Per Article: 2.1] [Indexed: 01/08/2023]
Affiliation(s)
- C P Ponting
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
|
311
|
Abstract
A simple general approximation for the distribution of gapped local alignment scores is presented, suitable for assessing significance of comparisons between two protein sequences or a sequence and a profile. The approximation takes account of the scoring scheme (i.e. gap penalty and substitution matrix or profile), sequence composition and length. Use of this formula means it is unnecessary to fit an extreme-value distribution to simulations or to the results of databank searches. The method is based on the theoretical ideas introduced by R. Mott and R. Tribe in 1999. Extensive simulation studies show that score-thresholds produced by the method are accurate to within +/-5%, 95% of the time. We also investigate factors which affect the accuracy of alignment statistics, and show that any method based on asymptotic theory is limited because asymptotic behaviour is not strictly achieved for many real protein sequences, due to extreme composition effects. Consequently, it may not be practicable to find a general formula that is significantly more accurate until the sub-asymptotic behaviour of alignments is better understood.
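For context, the extreme-value form commonly used for local alignment scores, P(S >= x) = 1 - exp(-K m n exp(-lambda x)), can be sketched as below, together with its inversion into a score threshold. This is the standard Karlin-Altschul-style expression, not the approximation proposed in the abstract, whose formula is not given here. The function names are hypothetical, and the values of lambda and K are illustrative placeholders rather than fitted parameters.

```python
import math


def evd_p_value(score, lam, k, m, n):
    """P(S >= score) under an extreme-value distribution for sequences
    of lengths m and n, with scale parameter lam and constant k."""
    expected = k * m * n * math.exp(-lam * score)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-expected)


def score_threshold(p_value, lam, k, m, n):
    """Smallest score whose P-value is at most p_value (inverse of the above)."""
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must lie strictly between 0 and 1")
    expected = -math.log1p(-p_value)
    return math.log(k * m * n / expected) / lam


# Illustrative placeholder parameters (assumptions, not fitted values).
LAM, K = 0.267, 0.041
threshold = score_threshold(0.01, LAM, K, 300, 300)
print(round(threshold, 2))
print(evd_p_value(threshold, LAM, K, 300, 300))
```

A score threshold that is "accurate to within +/-5%" in the abstract's sense would correspond to the value returned by a function like `score_threshold` differing from the empirically observed threshold by at most that margin.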
Affiliation(s)
- R Mott
- Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford, OX3 7BN, UK.
|
312
|
Abstract
Sequence alignment programs such as BLAST and PSI-BLAST are used routinely in pairwise, profile-based, or intermediate-sequence-search (ISS) methods to detect remote homologies for the purposes of fold assignment and comparative modeling. Yet, the sequence alignment quality of these methods at low sequence identity is not known. We have used the CE structure alignment program (Shindyalov and Bourne, Prot Eng 1998;11:739) to derive sequence alignments for all superfamily and family-level related proteins in the SCOP domain database. CE aligns structures and their sequences based on distances within each protein, rather than on interprotein distances. We compared BLAST, PSI-BLAST, CLUSTALW, and ISS alignments with the CE structural alignments. We found that global alignments with CLUSTALW were very poor at low sequence identity (<25%), as judged by the CE alignments. We used PSI-BLAST to search the nonredundant sequence database (nr) with every sequence in SCOP using up to four iterations. The resulting matrix was used to search a database of SCOP sequences. PSI-BLAST is only slightly better than BLAST in alignment accuracy on a per-residue basis, but PSI-BLAST matrix alignments are much longer than BLAST's, and so align correctly a larger fraction of the total number of aligned residues in the structure alignments. Any two SCOP sequences in the same superfamily that shared a hit or hits in the nr PSI-BLAST searches were identified as linked by the shared intermediate sequence. We examined the quality of the longest SCOP-query/ SCOP-hit alignment via an intermediate sequence, and found that ISS produced longer alignments than PSI-BLAST searches alone, of nearly comparable per-residue quality. At 10-15% sequence identity, BLAST correctly aligns 28%, PSI-BLAST 40%, and ISS 46% of residues according to the structure alignments. We also compared CE structure alignments with FSSP structure alignments generated by the DALI program. 
In contrast to the sequence methods, CE and structure alignments from the FSSP database identically align 75% of residue pairs at the 10-15% level of sequence identity, indicating that there is substantial room for improvement in these sequence alignment methods. BLAST produced alignments for 8% of the 10,665 nonimmunoglobulin SCOP superfamily sequence pairs (nearly all <25% sequence identity), PSI-BLAST matched 17% and the double-PSI-BLAST ISS method aligned 38% with E-values <10.0. The results indicate that intermediate sequences may be useful not only in fold assignment but also in achieving more complete sequence alignments for comparative modeling.
Affiliation(s)
- J M Sauder
- Institute for Cancer Research, Fox Chase Cancer Center, Philadelphia, PA 19111, USA
|
313
|
Kelley LA, MacCallum RM, Sternberg MJ. Enhanced genome annotation using structural profiles in the program 3D-PSSM. J Mol Biol 2000; 299:499-520. [PMID: 10860755 DOI: 10.1006/jmbi.2000.3741] [Citation(s) in RCA: 1119] [Impact Index Per Article: 44.8] [Indexed: 11/22/2022]
Abstract
A method (three-dimensional position-specific scoring matrix, 3D-PSSM) to recognise remote protein sequence homologues is described. The method combines the power of multiple sequence profiles with knowledge of protein structure to provide enhanced recognition and thus functional assignment of newly sequenced genomes. The method uses structural alignments of homologous proteins of similar three-dimensional structure in the structural classification of proteins (SCOP) database to obtain a structural equivalence of residues. These equivalences are used to extend multiply aligned sequences obtained by standard sequence searches. The resulting large superfamily-based multiple alignment is converted into a PSSM. Combined with secondary structure matching and solvation potentials, 3D-PSSM can recognise structural and functional relationships beyond state-of-the-art sequence methods. In a cross-validated benchmark on 136 homologous relationships unambiguously undetectable by position-specific iterated basic local alignment search tool (PSI-Blast), 3D-PSSM can confidently assign 18 %. The method was applied to the remaining unassigned regions of the Mycoplasma genitalium genome and an additional 13 regions were assigned with 95 % confidence. 3D-PSSM is available to the community as a web server: http://www.bmm.icnet.uk/servers/3dpssm
Affiliation(s)
- L A Kelley
- Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, 44 Lincoln's Inn Fields, London, WC2A 3PX, England
|
314
|
Selzer PM, Brutsche S, Wiesner P, Schmid P, Müllner H. Target-based drug discovery for the development of novel antiinfectives. Int J Med Microbiol 2000; 290:191-201. [PMID: 11045924 DOI: 10.1016/s1438-4221(00)80090-9] [Citation(s) in RCA: 22] [Impact Index Per Article: 0.9] [Indexed: 10/16/2022]
Abstract
In the 20th century and especially during the last 50 years, antiinfectives have been increasingly used to control and prevent infectious diseases. Unfortunately the resistance of microorganisms to these pharmaceuticals has increased as well. At the same time the discovery process for novel antiinfectives, the so-called "conventional" screening approach, involves testing natural products or derivatives of known compounds in in vitro cultures. By now it is obvious that this screening approach did not meet the expectations to generate a sufficient number of novel drug candidates. Consequently, studies for selective antiinfectives with new modes of action, which are able to break resistance, are highly desirable for human and animal health. The enormous advance in sequencing technologies--leading to a constantly growing number of known microbial genomes--together with the rapid development of computer power and bioinformatic software tools, now makes it possible to identify genes and gene products that are essential to the pathogenic organisms and are therefore considered to be novel targets for the development of new antiinfectives. When these potential targets have been validated by sophisticated laboratory methods, large diverse compound libraries can be tested in in vitro assays using high-throughput screening. This approach will most likely generate an increasing number of novel lead structures that will be specifically optimized by modern combinatorial chemistry and subsequently lead to new antiinfective candidates strengthening the armoury of weapons available to fight infectious diseases in humans and animals.
Affiliation(s)
- P M Selzer
- Intervet International GmbH, Department of Research Pharmaceuticals, Frankfurt am Main, Germany
|
315
|
Domingues FS, Lackner P, Andreeva A, Sippl MJ. Structure-based evaluation of sequence comparison and fold recognition alignment accuracy. J Mol Biol 2000; 297:1003-13. [PMID: 10736233 DOI: 10.1006/jmbi.2000.3615] [Citation(s) in RCA: 66] [Impact Index Per Article: 2.6] [Indexed: 11/22/2022]
Abstract
The biological role, biochemical function, and structure of uncharacterized protein sequences are often inferred from their similarity to known proteins. A constant goal is to increase the reliability, sensitivity, and accuracy of alignment techniques to enable the detection of increasingly distant relationships. Development, tuning, and testing of these methods benefit from appropriate benchmarks for the assessment of alignment accuracy. Here, we describe a benchmark protocol to estimate sequence-to-sequence and sequence-to-structure alignment accuracy. The protocol consists of structurally related pairs of proteins and procedures to evaluate alignment accuracy over the whole set. The set of protein pairs covers all the currently known fold types. The benchmark is challenging in the sense that it consists of proteins lacking clear sequence similarity. Correct target alignments are derived from the three-dimensional structures of these pairs by rigid body superposition. An evaluation engine computes the accuracy of alignments obtained from a particular algorithm in terms of alignment shifts with respect to the structure-derived alignments. Using this benchmark we estimate that the best results can be obtained from a combination of amino acid residue substitution matrices and knowledge-based potentials.
Affiliation(s)
- F S Domingues
- Center for Applied Molecular Engineering, Institute for Chemistry and Biochemistry, University of Salzburg, Jakob Haringer Strasse 3, Salzburg, A-5020, Austria
|
316
|
Abstract
The sequencing of the human genome and numerous pathogen genomes has resulted in an explosion of potential drug targets. These targets represent both an unprecedented opportunity and a technological challenge for the pharmaceutical industry. A new strategy is required to initiate small-molecule drug discovery with sets of incompletely characterized, disease-associated proteins. One such strategy is the early application of combinatorial chemistry and other technologies to the discovery of bioactive small-molecule ligands that act on candidate drug targets. Therapeutically active ligands serve to concurrently validate a target and provide lead structures for downstream drug development, thereby accelerating the drug discovery process.
Affiliation(s)
- GR Lenz
- NeoGenesis, 840 Memorial Drive, Cambridge, MA 02139, USA
|
317
|
Abstract
Bioinformatics has, out of necessity, become a key aspect of drug discovery in the genomic revolution, contributing to both target discovery and target validation. The author describes the role that bioinformatics has played and will continue to play in response to the waves of genome-wide data sources that have become available to the industry, including expressed sequence tags, microbial genome sequences, model organism sequences, polymorphisms, gene expression data and proteomics. However, these knowledge sources must be intelligently integrated.
|
318
|
Wilson CA, Kreychman J, Gerstein M. Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J Mol Biol 2000; 297:233-49. [PMID: 10704319 DOI: 10.1006/jmbi.2000.3550] [Citation(s) in RCA: 241] [Impact Index Per Article: 9.6] [Indexed: 11/22/2022]
Abstract
Measuring in a quantitative, statistical sense the degree to which structural and functional information can be "transferred" between pairs of related protein sequences at various levels of similarity is an essential prerequisite for robust genome annotation. To this end, we performed pairwise sequence, structure and function comparisons on approximately 30,000 pairs of protein domains with known structure and function. Our domain pairs, which are constructed according to the SCOP fold classification, range in similarity from just sharing a fold, to being nearly identical. Our results show that traditional scores for sequence and structure similarity have the same basic exponential relationship as observed previously, with structural divergence, measured in RMS, being exponentially related to sequence divergence, measured in percent identity. However, as the scale of our survey is much larger than any previous investigations, our results have greater statistical weight and precision. We have been able to express the relationship of sequence and structure similarity using more "modern scores," such as Smith-Waterman alignment scores and probabilistic P-values for both sequence and structure comparison. These modern scores address some of the problems with traditional scores, such as determining a conserved core and correcting for length dependency; they enable us to phrase the sequence-structure relationship in more precise and accurate terms. We found that the basic exponential sequence-structure relationship is very general: the same essential relationship is found in the different secondary-structure classes and is evident in all the scoring schemes. To relate function to sequence and structure we assigned various levels of functional similarity to the domain pairs, based on a simple functional classification scheme. 
This scheme was constructed by combining and augmenting annotations in the enzyme and fly functional classifications and comparing subsets of these to the Escherichia coli and yeast classifications. We found sigmoidal relationships between similarity in function and sequence, with clear thresholds for different levels of functional conservation. For pairs of domains that share the same fold, precise function appears to be conserved down to approximately 40 % sequence identity, whereas broad functional class is conserved to approximately 25 %. Interestingly, percent identity is more effective at quantifying functional conservation than the more modern scores (e.g. P-values). Results of all the pairwise comparisons and our combined functional classification scheme for protein structures can be accessed from a web database at http://bioinfo.mbb.yale.edu/align Copyright 2000 Academic Press.
Affiliation(s)
- C A Wilson
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
|
319
|
Strippoli P, Lenzi L, Petrini M, Carinci P, Zannotti M. A new gene family including DSCR1 (Down Syndrome Candidate Region 1) and ZAKI-4: characterization from yeast to human and identification of DSCR1-like 2, a novel human member (DSCR1L2). Genomics 2000; 64:252-63. [PMID: 10756093 DOI: 10.1006/geno.2000.6127] [Citation(s) in RCA: 59] [Impact Index Per Article: 2.4] [Indexed: 11/22/2022]
Abstract
A new gene family has been identified on the basis of in-depth bioinformatics analysis of the Down syndrome candidate region 1 (DSCR1) gene, located on 21q22.1. We have determined the complete coding sequences of similar genes in Saccharomyces cerevisiae and Caenorhabditis elegans, as well as that of a novel human gene, named DSCR1L2 (DSCR1-like 2). Peripheral blood leukocyte cDNA sequencing predicts as its product a 241-amino-acid protein highly similar to products of the human genes DSCR1 and ZAKI-4 (HGMW-approved symbol DSCR1L1). The highest level of expression of DSCR1L2 mRNA was found by Northern blot analysis in heart and skeletal muscles, liver, kidney, and peripheral blood leukocytes (three transcripts of 3.2, 5.2, and 7.5 kb). The gene consists of four exons and spans about 22 kb on chromosome 1 (1p33-p35.3) (Human Chromosome 1, Sanger Centre). Exon/intron organization is highly conserved between DSCR1 and DSCR1L2. Two alternative DSCR1L2 mRNA splicing forms have been recognized, with one lacking 10 amino acids in the middle of the protein. Analysis of expressed sequence tags (ESTs) shows DSCR1L2 expression in fetal tissues (heart, liver, and spleen) and in adenocarcinomas. ESTs related to the murine DSCR1L2 orthologue are found in the 2-cell stage mouse embryo, in developing brain stem and spinal cord, and in thymus and T cells. The most prominent feature identified in the protein family is a central short, unique serine-proline motif (including an ISPPXSPP box), which is strongly conserved from yeast to human but is absent in bacteria. Moreover, homology with the RNA-binding domain was weakly but consistently detected in a stretch of 80 amino acids at the amino-terminus by fine sequence analysis based on tools utilizing both hidden Markov models and BLAST. The identification of this new gene family should allow a better understanding of the functions of the genes belonging to it.
Affiliation(s)
- P Strippoli
- Istituto di Istologia ed Embriologia Generale, Università di Bologna, Bologna, Italy
|
320
|
Abstract
The predicted proteins of the genome of Caenorhabditis elegans were analysed by various sequence comparison methods to identify the repertoire of proteins that are members of the immunoglobulin superfamily (IgSF). The IgSF is one of the largest families of protein domain in this genome and likely to be one of the major families in other multicellular eukaryotes too. This is because members of the superfamily are involved in a variety of functions including cell-cell recognition, cell-surface receptors, muscle structure and, in higher organisms, the immune system. Sixty-four proteins with 488 I set IgSF domains were identified largely by using Hidden Markov models. The domain architectures of the protein products of these 64 genes are described. Twenty-one of these had been characterised previously. We show that another 25 are related to proteins of known function. The C. elegans IgSF proteins can be classified into five broad categories: muscle proteins, protein kinases and phosphatases, three categories of proteins involved in the development of the nervous system, leucine-rich repeat containing proteins and proteins without homologues of known function, of which there are 18. The 19 proteins involved in nervous system development that are not kinases or phosphatases are homologues of neuroglian, axonin, NCAM, wrapper, klingon, ICCR and nephrin or belong to the recently identified zig gene family. Out of the set of 64 genes, 22 are on the X chromosome. This study should be seen as an initial description of the IgSF repertoire in C. elegans, because the current gene definitions may contain a number of errors, especially in the case of long sequences, and there may be IgSF genes that have not yet been detected. However, the proteins described here do provide an overview of the bulk of the repertoire of immunoglobulin superfamily members in C. elegans, a framework for refinement and extension of the repertoire as gene and protein definitions improve, and the basis for investigations of their function and for comparisons with the repertoires of other organisms.
Affiliation(s)
- S A Teichmann
- MRC Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 2QH, UK.
|
321
|
|
322
|
Bray JE, Todd AE, Pearl FM, Thornton JM, Orengo CA. The CATH Dictionary of Homologous Superfamilies (DHS): a consensus approach for identifying distant structural homologues. Protein Eng 2000; 13:153-65. [PMID: 10775657 DOI: 10.1093/protein/13.3.153] [Citation(s) in RCA: 41] [Impact Index Per Article: 1.6] [Indexed: 11/12/2022]
Abstract
A consensus approach has been developed for identifying distant structural homologues. This is based on the CATH Dictionary of Homologous Superfamilies (DHS), a database of validated multiple structural alignments annotated with consensus functional information for evolutionary protein superfamilies (URL: http://www.biochem.ucl.ac.uk/bsm/dhs). Multiple structural alignments have been generated for 362 well-populated superfamilies in the CATH structural domain database and annotated with secondary structure, physicochemical properties, functional sequence patterns and protein-ligand interaction data. Consensus functional information for each superfamily includes descriptions and keywords extracted from SWISS-PROT and the ENZYME database. The Dictionary provides a powerful resource to validate, examine and visualize key structural and functional features of each homologous superfamily. The value of the DHS, for assessing functional variability and identifying distant evolutionary relationships, is illustrated using the pyridoxal-5'-phosphate (PLP) binding aspartate aminotransferase superfamily. The DHS also provides a tool for examining sequence-structure relationships for proteins within each fold group.
Affiliation(s)
- J E Bray
- Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK.
|
323
|
Rychlewski L, Jaroszewski L, Li W, Godzik A. Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci 2000; 9:232-41. [PMID: 10716175 PMCID: PMC2144550 DOI: 10.1110/ps.9.2.232] [Citation(s) in RCA: 363] [Impact Index Per Article: 14.5] [Indexed: 10/19/2022]
Abstract
Distant homologies between proteins are often discovered only after three-dimensional structures of both proteins are solved. The sequence divergence for such proteins can be so large that simple comparison of their sequences fails to identify any similarity. A new generation of sensitive alignment tools uses averaged sequences of entire homologous families (profiles) to detect such homologies. Several algorithms, including the newest generation of BLAST algorithms and BASIC, an algorithm used in our group to assign fold predictions for proteins from several genomes, are compared to each other on a large set of structurally similar proteins with little sequence similarity. Proteins in the benchmark are classified according to the level of their similarity, which allows us to demonstrate that most of the improvement of the new algorithms is achieved for proteins with strong functional similarities, with almost no progress in recognizing distant fold similarities. It is also shown that details of profile calculation strongly influence its sensitivity in recognizing distant homologies. The most important choice is how to include information from diverging members of the family, avoiding generating false predictions, while accounting for entire sequence divergence within a family. PSI-BLAST takes a conservative approach, deriving a profile from core members of the family, providing a solid improvement with almost no false predictions. BASIC strives for better sensitivity by increasing the weight of divergent family members, paying the price in lower reliability. A new FFAS algorithm introduced here uses a new procedure for profile generation that takes into account all the relations within the family and matches BASIC sensitivity with PSI-BLAST-like reliability.
Affiliation(s)
- L Rychlewski
- San Diego Supercomputer Center, La Jolla, California 92093, USA
|
324
|
Jaakkola T, Diekhans M, Haussler D. A discriminative framework for detecting remote protein homologies. J Comput Biol 2000; 7:95-114. [PMID: 10890390 DOI: 10.1089/10665270050081405] [Citation(s) in RCA: 148] [Impact Index Per Article: 5.9] [Indexed: 11/12/2022] Open
Abstract
A new method for detecting remote protein homologies is introduced and shown to perform well in classifying protein domains by SCOP superfamily. The method is a variant of support vector machines using a new kernel function. The kernel function is derived from a generative statistical model for a protein family, in this case a hidden Markov model. This general approach of combining generative models like HMMs with discriminative methods such as support vector machines may have applications in other areas of biosequence analysis as well.
Affiliation(s)
- T Jaakkola
- MIT Artificial Intelligence Laboratory, Cambridge, MA 02139, USA
|
325
|
Abstract
Proteins might have considerable structural similarities even when no evolutionary relationship of their sequences can be detected. This property is often referred to as the proteins sharing only a "fold". Of course, there are also sequences of common origin in each fold, called a "superfamily", and in them groups of sequences with clear similarities, designated "family". Developing algorithms to reliably identify proteins related at any level is one of the most important challenges in the fast growing field of bioinformatics today. However, it is not at all certain that a method proficient at finding sequence similarities performs well at the other levels, or vice versa. Here, we have compared the performance of various search methods on these different levels of similarity. As expected, we show that it becomes much harder to detect proteins as their sequences diverge. For family related sequences the best method gets 75% of the top hits correct. When the sequences differ but the proteins belong to the same superfamily this drops to 29%, and in the case of proteins with only fold similarity it is as low as 15%. We have made a more complete analysis of the performance of different algorithms than earlier studies, also including threading methods in the comparison. Using this method a more detailed picture emerges, showing multiple sequence information to improve detection on the two closer levels of relationship. We have also compared the different methods of including this information in prediction algorithms. For lower specificities, the best scheme to use is a linking method connecting proteins through an intermediate hit. For higher specificities, better performance is obtained by PSI-BLAST and some procedures using hidden Markov models. We also show that a threading method, THREADER, performs significantly better than any other method at fold recognition.
Affiliation(s)
- E Lindahl
- Royal Institute of Technology, Stockholm, SE-100 44, Sweden
|
326
|
Brenner SE, Koehl P, Levitt M. The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res 2000; 28:254-6. [PMID: 10592239 PMCID: PMC102434 DOI: 10.1093/nar/28.1.254] [Citation(s) in RCA: 328] [Impact Index Per Article: 13.1] [Received: 09/01/1999] [Revised: 10/13/1999] [Accepted: 10/13/1999] [Indexed: 11/12/2022] Open
Abstract
The ASTRAL compendium provides several databases and tools to aid in the analysis of protein structures, particularly through the use of their sequences. The SPACI scores included in the system summarize the overall characteristics of a protein structure. A structural alignments database indicates residue equivalencies in superimposed protein domain structures. The PDB sequence-map files provide a linkage between the amino acid sequence of the molecule studied (SEQRES records in a database entry) and the sequence of the atoms experimentally observed in the structure (ATOM records). These maps are combined with information in the SCOP database to provide sequences of protein domains. Selected subsets of the domain database, with varying degrees of similarity measured in several different ways, are also available. ASTRAL may be accessed at http://astral.stanford.edu/
Affiliation(s)
- S E Brenner
- Department of Structural Biology, Stanford University, Fairchild Building D-109, Stanford, CA 94305-5126, USA.
|
327
|
|
328
|
Lo Conte L, Ailey B, Hubbard TJ, Brenner SE, Murzin AG, Chothia C. SCOP: a structural classification of proteins database. Nucleic Acids Res 2000; 28:257-9. [PMID: 10592240 PMCID: PMC102479 DOI: 10.1093/nar/28.1.257] [Citation(s) in RCA: 415] [Impact Index Per Article: 16.6] [Indexed: 11/12/2022] Open
Abstract
The Structural Classification of Proteins (SCOP) database provides a detailed and comprehensive description of the relationships of known protein structures. The classification is on hierarchical levels: the first two levels, family and superfamily, describe near and distant evolutionary relationships; the third, fold, describes geometrical relationships. The distinction between evolutionary relationships and those that arise from the physics and chemistry of proteins is a feature that is unique to this database so far. The sequences of proteins in SCOP provide the basis of the ASTRAL sequence libraries that can be used as a source of data to calibrate sequence search algorithms and for the generation of statistics on, or selections of, protein structures. Links can be made from SCOP to PDB-ISL: a library containing sequences homologous to proteins of known structure. Sequences of proteins of unknown structure can be matched to distantly related proteins of known structure by using pairwise sequence comparison methods to find homologues in PDB-ISL. The database and its associated files are freely accessible from a number of WWW sites mirrored from URL http://scop.mrc-lmb.cam.ac.uk/scop/
Affiliation(s)
- L Lo Conte
- MRC Laboratory of Molecular Biology, Centre for Protein Engineering, Hills Road, Cambridge CB2 2QH, UK.
|
329
|
|
330
|
Pearl FM, Lee D, Bray JE, Sillitoe I, Todd AE, Harrison AP, Thornton JM, Orengo CA. Assigning genomic sequences to CATH. Nucleic Acids Res 2000; 28:277-82. [PMID: 10592246 PMCID: PMC102424 DOI: 10.1093/nar/28.1.277] [Citation(s) in RCA: 125] [Impact Index Per Article: 5.0] [Received: 10/04/1999] [Accepted: 10/06/1999] [Indexed: 11/12/2022] Open
Abstract
We report the latest release (version 1.6) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 18 577 domains into evolutionary families and structural groupings. We have identified 1028 homologous superfamilies in which the proteins have both structural, and sequence or functional similarity. These can be further clustered into 672 fold groups and 35 distinct architectures. Recent developments of the database include the generation of 3D templates for recognising structural relatives in each fold group, which has led to significant improvements in the speed and accuracy of updating the database and also means that less manual validation is required. We also report the establishment of the CATH-PFDB (Protein Family Database), which associates 1D sequences with the 3D homologous superfamilies. Sequences showing identifiable homology to entries in CATH have been extracted from GenBank using PSI-BLAST. A CATH-PSIBLAST server has been established, which allows you to scan a new sequence against the database. The CATH Dictionary of Homologous Superfamilies (DHS), which contains validated multiple structural alignments annotated with consensus functional information for evolutionary protein superfamilies, has been updated to include annotations associated with sequence relatives identified in GenBank. The DHS is a powerful tool for considering the variation of functional properties within a given CATH superfamily and in deciding what functional properties may be reliably inherited by a newly identified relative.
Affiliation(s)
- F M Pearl
- Department of Biochemistry, University College London, University of London, Gower Street, London WC1E 6BT, UK.
|
331
|
|
332
|
Grigoriev IV, Kim SH. Detection of protein fold similarity based on correlation of amino acid properties. Proc Natl Acad Sci U S A 1999; 96:14318-23. [PMID: 10588703 PMCID: PMC24434 DOI: 10.1073/pnas.96.25.14318] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.0] [Indexed: 11/18/2022] Open
Abstract
An increasing number of proteins with weak sequence similarity have been found to assume similar three-dimensional fold and often have similar or related biochemical or biophysical functions. We propose a method for detecting the fold similarity between two proteins with low sequence similarity based on their amino acid properties alone. The method, the proximity correlation matrix (PCM) method, is built on the observation that the physical properties of neighboring amino acid residues in sequence at structurally equivalent positions of two proteins of similar fold are often correlated even when amino acid sequences are different. The hydrophobicity is shown to be the most strongly correlated property for all protein fold classes. The PCM method was tested on 420 proteins belonging to 64 different known folds, each having at least three proteins with little sequence similarity. The method was able to detect fold similarities for 40% of the 420 sequences. Compared with sequence comparison and several fold-recognition methods, the method demonstrates good performance in detecting fold similarities among the proteins with low sequence identity. Applied to the complete genome of Methanococcus jannaschii, the method recognized the folds for 22 hypothetical proteins.
Affiliation(s)
- I V Grigoriev
- Department of Chemistry and E. O. Lawrence Berkeley National Laboratory, University of California, Berkeley, CA 94720, USA
|
333
|
Abstract
The current state of the art in modeling protein structure has been assessed, based on the results of the CASP (Critical Assessment of protein Structure Prediction) experiments. In comparative modeling, improvements have been made in sequence alignment, sidechain orientation and loop building. Refinement of the models remains a serious challenge. Improved sequence profile methods have had a large impact in fold recognition. Although there has been some progress in alignment quality, this factor still limits model usefulness. In ab initio structure prediction, there has been notable progress in building approximately correct structures of 40-60 residue-long protein fragments. There is still a long way to go before the general ab initio prediction problem is solved. Overall, the field is maturing into a practical technology, able to deliver useful models for a large number of sequences.
Affiliation(s)
- J Moult
- Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, Rockville, MD 20850, USA.
|
334
|
Abstract
The recognition of remote protein homologies is a major aspect of the structural and functional annotation of newly determined genomes. Here we benchmark the coverage and error rate of genome annotation using the widely used homology-searching program PSI-BLAST (position-specific iterated basic local alignment search tool). This study evaluates the one-to-many success rate for recognition, as often there are several homologues in the database and only one needs to be identified for annotating the sequence. In contrast, previous benchmarks considered one-to-one recognition in which a single query was required to find a particular target. The benchmark constructs a model genome from the full sequences of the structural classification of protein (SCOP) database and searches against a target library of remote homologous domains (<20% identity). The structural benchmark provides a reliable list of correct and false homology assignments. PSI-BLAST successfully annotated 40% of the domains in the model genome that had at least one homologue in the target library. This coverage is more than three times that if one-to-one recognition is evaluated (11% coverage of domains). Although a structural benchmark was used, the results equally apply to just sequence homology searches. Accordingly, structural and sequence assignments were made to the sequences of Mycoplasma genitalium and Mycobacterium tuberculosis (see http://www.bmm.icnet.uk). The extent of missed assignments and of new superfamilies can be estimated for these genomes for both structural and functional annotations.
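The difference between the one-to-many and one-to-one success rates in the abstract above can be illustrated with a small sketch. It is not the authors' benchmark code: the function name `coverage`, the dictionary layout, and the toy domain identifiers are all hypothetical, and the one-to-one rate is assumed here to be the fraction of individual query-target pairs recovered.

```python
def coverage(true_homologues, detected):
    """Return (one_to_many, one_to_one) coverage.

    true_homologues: dict mapping each query to the set of its true remote
                     homologues in the target library (hypothetical layout).
    detected:        dict mapping each query to the set of targets reported
                     by the search program.
    """
    # Only queries with at least one homologue in the library are scored.
    queries = [q for q, targets in true_homologues.items() if targets]
    if not queries:
        raise ValueError("no query has a homologue in the target library")

    # One-to-many: a query counts as annotated if ANY true homologue is found.
    annotated = sum(
        1 for q in queries if true_homologues[q] & detected.get(q, set())
    )
    one_to_many = annotated / len(queries)

    # One-to-one (assumed definition): every query-target pair is scored.
    pairs = [(q, t) for q in queries for t in true_homologues[q]]
    found = sum(1 for q, t in pairs if t in detected.get(q, set()))
    one_to_one = found / len(pairs)

    return one_to_many, one_to_one


# Toy data: three queries, six true query-target pairs, one pair detected.
truth = {"d1": {"a", "b", "c"}, "d2": {"x"}, "d3": {"y", "z"}}
hits = {"d1": {"a"}, "d3": {"q"}}
print(coverage(truth, hits))
```

In the toy data one query out of three is annotated but only one pair out of six is recovered, which mirrors why the one-to-many figure is the higher of the two.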
Affiliation(s)
- A Müller
- Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, 44 Lincoln's Inn Fields, London, WC2A 3PX, England
|
335
|
Xu Y, Xu D, Crawford OH, Larimer F, Uberbacher E, Unseren MA, Zhang G. Protein threading by PROSPECT: a prediction experiment in CASP3. Protein Eng 1999; 12:899-907. [PMID: 10585495 DOI: 10.1093/protein/12.11.899] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.5] [Indexed: 11/14/2022]
Abstract
We present an analysis of the protein fold recognition experiment using PROSPECT in The Third Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP3). PROSPECT is a computer program we have recently developed for finding an optimal alignment between a protein sequence and a protein structural fold. Two unique features of PROSPECT are (a) that it guarantees to find the globally optimal sequence-structure alignment and does so in an efficient manner, when the alignment-scoring function consists of three additive terms: (i) a singleton fitness term, (ii) a pairwise contact preference term between residues that are spatially close (≤15 Å between their beta-carbons) and (iii) an alignment gap penalty; and (b) that it guarantees to find the globally-optimal alignment under various constraints on the unknown protein specified by the user. In the CASP3 experiment, PROSPECT correctly identified the most similar folds for 11 targets and predicted closely-similar folds for five other targets among the 23 targets which can be classified into the category of fold-recognition problems and also had their experimentally-determined structures available. Among the 11 correctly identified folds, PROSPECT obtained good sequence-structure alignments for nine of them. On three of the five ab initio prediction problems, PROSPECT successfully located partial structures from our template library, which align accurately with the corresponding targets.
Affiliation(s)
- Y Xu
- Computational Protein Structure Group, Computational Biosciences Section, Life Sciences Division and Center for Engineering Science Advanced Research, Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830
|
336
|
Koonin EV, Aravind L, Hofmann K, Tschopp J, Dixit VM. Apoptosis. Searching for FLASH domains. Nature 1999; 401:662; discussion 662-3. [PMID: 10537104 DOI: 10.1038/44317] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.7] [Indexed: 11/09/2022]
Affiliation(s)
- E V Koonin
- NCBI, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
|
337
|
Abstract
Elucidation of interrelationships among sequence, structure, function, and evolution (FESS relationships) of a family of genes or gene products is a central theme of modern molecular biology. Multiple sequence alignment has been proven to be a powerful tool for many fields of studies such as phylogenetic reconstruction, illumination of functionally important regions, and prediction of higher order structures of proteins and RNAs. However, it is far from trivial to automatically construct a multiple alignment from a set of related sequences. A variety of methods for solving this computationally difficult problem are reviewed. Several important applications of multiple alignment for elucidation of the FESS relationships are also discussed. For a long period, progressive methods have been the only practical means to solve a multiple alignment problem of appreciable size. This situation is now changing with the development of new techniques including several classes of iterative methods. Today's progress in multiple sequence alignment methods has been made by the multidisciplinary endeavors of mathematicians, computer scientists, and biologists in various fields including biophysicists in particular. The ideas also originate from various backgrounds: pure algorithmics, statistics, thermodynamics, and others. The outcomes are now enjoyed by researchers in many fields of biological sciences. In the near future, generalized multiple alignment may play a central role in studies of FESS relationships. The organized mixture of knowledge from multiple fields will ferment to develop fruitful results which would be hard to obtain within each area. I hope this review provides a useful information resource for future development of theory and practice in this rapidly expanding area of bioinformatics.
Affiliation(s)
- O Gotoh
- Saitama Cancer Center Research Institute, Japan
|
338
|
Geetha V, Di Francesco V, Garnier J, Munson PJ. Comparing protein sequence-based and predicted secondary structure-based methods for identification of remote homologs. Protein Eng 1999; 12:527-34. [PMID: 10436078 DOI: 10.1093/protein/12.7.527] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.3] [Indexed: 11/14/2022]
Abstract
We have compared a novel sequence-structure matching technique, FORESST, for detecting remote homologs to three existing sequence based methods, including local amino acid sequence similarity by BLASTP, hidden Markov models (HMMs) of sequences of protein families using SAM, HMMs based on sequence motifs identified using meta-MEME. FORESST compares predicted secondary structures to a library of structural families of proteins, using HMMs. Altogether 45 proteins from nine structural families in the database CATH were used in a cross-validated test of the fold assignment accuracy of each method. Local sequence similarity of a query sequence to a protein family is measured by the highest segment pair (HSP) score. Each of the HMM-based approaches (FORESST, MEME, amino acid sequence-based HMM) yielded log-odds score for the query sequence. In order to make a fair comparison among these methods, the scores for each method were converted to Z-scores in a uniform way by comparing the raw scores of a query protein with the corresponding scores for a set of unrelated proteins. Z-Scores were analyzed as a function of the maximum pairwise sequence identity (MPSID) of the query sequence to sequences used in training the model. For MPSID above 20%, the Z-scores increase linearly with MPSID for the sequence-based methods but remain roughly constant for FORESST. Below 15%, average Z-scores are close to zero for the sequence-based methods, whereas the FORESST method yielded average Z-scores of 1.8 and 1.1, using observed and predicted secondary structures, respectively. This demonstrates the advantage of the sequence-structure method for detecting remote homologs.
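The score normalisation described in the abstract above, in which each raw score is compared with the scores of a set of unrelated proteins, can be sketched as follows. The abstract does not give the exact formula, so the standard definition (raw score minus the background mean, divided by the background sample standard deviation) is an assumption, and the function name and numbers are hypothetical.

```python
from statistics import mean, stdev


def z_score(query_score, background_scores):
    """Convert a raw score to a Z-score against unrelated-protein scores.

    Assumed formula: (query - mean(background)) / stdev(background),
    using the sample standard deviation.
    """
    mu = mean(background_scores)
    sigma = stdev(background_scores)  # needs at least two background scores
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (query_score - mu) / sigma


# Hypothetical raw scores of one method against unrelated proteins.
background = [10.0, 12.0, 14.0, 16.0, 18.0]
print(z_score(20.0, background))
```

Because each method is normalised against its own background distribution, log-odds scores and segment-pair scores end up on a comparable scale, which is the purpose the abstract gives for the conversion.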
Affiliation(s)
- V Geetha
- ABS/MSCL/CIT, National Institutes of Health, Bethesda, MD 20892, USA
|
339
|
Ponting CP, Aravind L, Schultz J, Bork P, Koonin EV. Eukaryotic signalling domain homologues in archaea and bacteria. Ancient ancestry and horizontal gene transfer. J Mol Biol 1999; 289:729-45. [PMID: 10369758 DOI: 10.1006/jmbi.1999.2827] [Citation(s) in RCA: 245] [Impact Index Per Article: 9.4] [Indexed: 11/22/2022]
Abstract
Phyletic distributions of eukaryotic signalling domains were studied using recently developed sensitive methods for protein sequence analysis, with an emphasis on the detection and accurate enumeration of homologues in bacteria and archaea. A major difference was found between the distributions of enzyme families that are typically found in all three divisions of cellular life and non-enzymatic domain families that are usually eukaryote-specific. Previously undetected bacterial homologues were identified for plant pathogenesis-related proteins, Pad1, von Willebrand factor type A, src homology 3 and YWTD repeat-containing domains. Comparisons of the domain distributions in eukaryotes and prokaryotes enabled distinctions to be made between the domains originating prior to the last common ancestor of all known life forms and those apparently originating as consequences of horizontal gene transfer events. A number of transfers of signalling domains from eukaryotes to bacteria were confidently identified, in contrast to only a single case of apparent transfer from eukaryotes to archaea.
Affiliation(s)
- C P Ponting
- National Center for Biotechnology Information National Library of Medicine, National Institutes of Health, Bldg. 38A, Bethesda, MD, 20894, USA.
|
340
|
Abstract
New computational techniques have allowed protein folds to be assigned to all or parts of between a quarter (Caenorhabditis elegans) and a half (Mycoplasma genitalium) of the individual protein sequences in different genomes. These assignments give a new perspective on domain structures, gene duplications, protein families and protein folds in genome sequences.
Affiliation(s)
- S A Teichmann
- MRC Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 2QH, UK.
|
341
|
Sternberg MJ, Bates PA, Kelley LA, MacCallum RM. Progress in protein structure prediction: assessment of CASP3. Curr Opin Struct Biol 1999; 9:368-73. [PMID: 10361096 DOI: 10.1016/s0959-440x(99)80050-5] [Citation(s) in RCA: 81] [Impact Index Per Article: 3.1] [Indexed: 11/21/2022]
Abstract
The third comparative assessment of techniques of protein structure prediction (CASP3) was held during 1998. This is a blind trial in which structures are predicted prior to having knowledge of the coordinates, which are then revealed to enable the assessment. Three sections at the meeting evaluated different methodologies - comparative modelling, fold recognition and ab initio methods. For some, but not all of the target coordinates, high quality models were submitted in each of these sections. There have been improvements in prediction techniques since CASP2 in 1996, most notably for ab initio methods.
Affiliation(s)
- M J Sternberg
- Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, London, UK.
|
342
|
|
343
|
|
344
|
Fischer D, Barret C, Bryson K, Elofsson A, Godzik A, Jones D, Karplus KJ, Kelley LA, MacCallum RM, Pawłowski K, Rost B, Rychlewski L, Sternberg M. CAFASP-1: Critical assessment of fully automated structure prediction methods. Proteins 1999. [DOI: 10.1002/(sici)1097-0134(1999)37:3+<209::aid-prot27>3.0.co;2-y] [Citation(s) in RCA: 107] [Impact Index Per Article: 4.1] [Indexed: 11/10/2022]
|
345
|
Teichmann SA, Park J, Chothia C. Structural assignments to the Mycoplasma genitalium proteins show extensive gene duplications and domain rearrangements. Proc Natl Acad Sci U S A 1998; 95:14658-63. [PMID: 9843945 PMCID: PMC24505 DOI: 10.1073/pnas.95.25.14658] [Citation(s) in RCA: 112] [Impact Index Per Article: 4.1] [Indexed: 11/18/2022] Open
Abstract
The parasitic bacterium Mycoplasma genitalium has a small, reduced genome with close to a basic set of genes. As a first step toward determining the families of protein domains that form the products of these genes, we have used the multiple sequence programs PSI-BLAST and GEANFAMMER to match the sequences of the 467 gene products of M. genitalium to the sequences of the domains that form proteins of known structure [Protein Data Bank (PDB) sequences]. PDB sequences (274) match all of 106 M. genitalium sequences and some parts of another 85; thus, 41% of its total sequences are matched in all or part. The evolutionary relationships of the PDB domains that match M. genitalium are described in the structural classification of proteins (SCOP) database. Using this information, we show that the domains in the matched M. genitalium sequences come from 114 superfamilies and that 58% of them have arisen by gene duplication. This level of duplication is more than twice that found by using pairwise sequence comparisons. The PDB domain matches also describe the domain structure of the matched sequences: just over a quarter contain one domain and the rest have combinations of two or more domains.
Affiliation(s)
- S A Teichmann
- Medical Research Council Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 2QH, United Kingdom.
|