101
|
Gerstein M, Levitt M. A structural census of the current population of protein sequences. Proc Natl Acad Sci U S A 1997; 94:11911-6. [PMID: 9342336 PMCID: PMC23653 DOI: 10.1073/pnas.94.22.11911] [Citation(s) in RCA: 78] [Impact Index Per Article: 2.9] [Received: 05/14/1997] [Accepted: 08/13/1997] [Indexed: 02/05/2023]
Abstract
We examine the occurrence of the approximately 300 known protein folds in different groups of organisms. To do this, we characterize a large fraction of the currently known protein sequences (approximately 140,000) in structural terms, by matching them to known structures via sequence comparison (or by secondary-structure class prediction for those without structural homologues). Overall, we find that an appreciable fraction of the known folds are present in each of the major groups of organisms (e.g., bacteria and eukaryotes share 156 of 275 folds), and most of the common folds are associated with many families of nonhomologous sequences (i.e., >10 sequence families for each common fold). However, different groups of organisms have characteristically distinct distributions of folds. So, for instance, some of the most common folds in vertebrates, such as globins or zinc fingers, are rare or absent in bacteria. Many of these differences in fold usage are biologically reasonable, such as the folds of metabolic enzymes being common in bacteria and those associated with extracellular transport and communication being common in animals. They also have important implications for database-based methods for fold recognition, suggesting that an unknown sequence from a plant is more likely to have a certain fold (e.g., a TIM barrel) than an unknown sequence from an animal.
Affiliation(s)
- M Gerstein
- Molecular Biophysics and Biochemistry Department, P.O. Box 208114, Yale University, New Haven, CT 06520-8114, USA.
|
102
|
Drabløs F, Petersen SB. Identification of conserved residues in family of esterase and lipase sequences. Methods Enzymol 1997; 284:28-61. [PMID: 9379940 DOI: 10.1016/s0076-6879(97)84004-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.3] [Indexed: 02/05/2023]
|
103
|
Abstract
We propose a new characterization of protein structure based on the natural tetrahedral geometry of the beta-carbon and a new geometric measure of structural similarity, called visible volume. In our model, the side-chains are replaced by an ideal tetrahedron, the orientation of which is fixed with respect to the backbone and corresponds to the preferred rotamer directions. Visible volume is a measure of the non-occluded empty space surrounding each residue position after the side-chains have been removed. It is a robust, parameter-free, locally computed quantity that accounts for many of the spatial constraints that are of relevance to the corresponding position in the native structure. When computing visible volume, we ignore the nature of both the residue observed at each site and the ones surrounding it. We focus instead on the space that, together, these residues could occupy. By doing so, we are able to quantify a new kind of invariance beyond the apparent variations within protein families, namely, the conservation of the physical space available at structurally equivalent positions for side-chain packing. Corresponding positions in native structures are likely to be of interest in protein structure prediction, protein design, and homology modeling. Visible volume is related to the degree of exposure of a residue position and to the actual rotamers in native proteins. Here, we discuss the properties of this new measure, namely, its robustness with respect to both crystallographic uncertainties and naturally occurring variations in atomic coordinates, and the remarkable fact that it is essentially independent of the choice of the parameters used in calculating it. We also show how visible volume can be used to align protein structures, to identify structurally equivalent positions that are conserved in a family of proteins, and to single out positions in a protein that are likely to be of biological interest. These properties qualify visible volume as a powerful tool in a variety of applications, from the detailed analysis of protein structure to homology modeling, protein structural alignment, and the definition of better scoring functions for threading purposes.
Affiliation(s)
- L Lo Conte
- Computer Science Department, Boston University, MA 02215, USA
|
104
|
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997; 25:3389-402. [PMID: 9254694 PMCID: PMC146917 DOI: 10.1093/nar/25.17.3389] [Citation(s) in RCA: 51420] [Impact Index Per Article: 1904.4] [Indexed: 02/05/2023]
Abstract
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.
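The position-specific score matrix mentioned in this abstract can be illustrated with a small sketch. This is not the PSI-BLAST procedure itself, which the abstract does not spell out; it is a simplified log-odds construction that assumes an ungapped alignment, a fixed additive pseudocount, and no sequence weighting. The function names `build_pssm` and `score_segment`, the `pseudocount` parameter, and the toy four-letter alphabet are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_pssm(aligned, background, pseudocount=1.0):
    """Build a log-odds position-specific score matrix.

    aligned: list of equal-length, ungapped sequences (an assumption).
    background: dict mapping each residue to its background frequency.
    Returns one dict per alignment column: residue -> log2(freq / background).
    """
    alphabet = sorted(background)
    length = len(aligned[0])
    pssm = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned)
        # Additive pseudocounts keep unseen residues from scoring -infinity.
        total = len(aligned) + pseudocount * len(alphabet)
        column = {}
        for residue in alphabet:
            freq = (counts[residue] + pseudocount) / total
            column[residue] = math.log2(freq / background[residue])
        pssm.append(column)
    return pssm


def score_segment(pssm, segment):
    """Sum the column scores for a segment as long as the matrix."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


# Toy example with a hypothetical four-residue alphabet.
toy_background = {"A": 0.25, "G": 0.25, "K": 0.25, "L": 0.25}
toy_alignment = ["AKL", "AKL", "AGL"]
toy_pssm = build_pssm(toy_alignment, toy_background)
```

A segment that resembles the aligned sequences, such as `"AKL"`, scores higher under `score_segment` than an unrelated one, which is the property a profile-based database search relies on.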
Affiliation(s)
- S F Altschul
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
|
105
|
|
106
|
Abstract
Databases of multiple sequence alignments are a valuable aid to protein sequence classification and analysis. One of the main challenges when constructing such a database is to simultaneously satisfy the conflicting demands of completeness on the one hand and quality of alignment and domain definitions on the other. The latter properties are best dealt with by manual approaches, whereas completeness in practice is only amenable to automatic methods. Herein we present a database based on hidden Markov model profiles (HMMs), which combines high quality and completeness. Our database, Pfam, consists of parts A and B. Pfam-A is curated and contains well-characterized protein domain families with high quality alignments, which are maintained by using manually checked seed alignments and HMMs to find and align all members. Pfam-B contains sequence families that were generated automatically by applying the Domainer algorithm to cluster and align the remaining protein sequences after removal of Pfam-A domains. By using Pfam, a large number of previously unannotated proteins from the Caenorhabditis elegans genome project were classified. We have also identified many novel family memberships in known proteins, including new kazal, Fibronectin type III, and response regulator receiver domains. Pfam-A families have permanent accession numbers and form a library of HMMs available for searching and automatic annotation of new protein sequences.
Affiliation(s)
- E L Sonnhammer
- Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
|
107
|
Abstract
To investigate how the properties of individual amino acids result in proteins with particular structures and functions, we have examined the correlations between previously derived structure-dependent mutation rates and changes in various physical-chemical properties of the amino acids such as volume, charge, alpha-helical and beta-sheet propensity, and hydrophobicity. In most cases we found the delta G of transfer from octanol to water to be the best model for evolutionary constraints, in contrast to the much weaker correlation with the delta G of transfer from cyclohexane to water, a property found to be highly correlated to changes in stability in site-directed mutagenesis studies. This suggests that natural evolution may follow different rules than those suggested by results obtained in the laboratory. A high degree of conservation of a surface residue's relative hydrophobicity was also observed, a fact that cannot be explained by constraints on protein stability but that may reflect the consequences of the reverse-hydrophobic effect. Local propensity, especially alpha-helical propensity, is rather poorly conserved during evolution, indicating that non-local interactions dominate protein structure formation. We found that changes in volume were important in specific cases, most significantly in transitions among the hydrophobic residues in buried locations. To demonstrate how these techniques could be used to understand particular protein families, we derived and analyzed mutation matrices for the hypervariable and framework regions of antibody light chain V regions. We found surprisingly high conservation of hydrophobicity in the hypervariable region, possibly indicating an important role for hydrophobicity in antigen recognition.
Affiliation(s)
- J M Koshi
- Biophysics Research Division, University of Michigan, Ann Arbor 48109-1055, USA
|
108
|
Abstract
The last stage of protein folding, the "endgame," involves the ordering of amino acid side-chains into a well defined and closely packed configuration. We review a number of topics related to this process. We first describe how the observed packing in protein crystal structures is measured. Such measurements show that the protein interior is packed exceptionally tightly, more so than the protein surface or surrounding solvent and even more efficiently than crystals of simple organic molecules. In vitro protein folding experiments also show that the protein is close-packed in solution and that the tight packing and intercalation of side-chains is a final and essential step in the folding pathway. These experimental observations, in turn, suggest that a folded protein structure can be described as a kind of three-dimensional jigsaw puzzle and that predicting side-chain packing is possible in the sense of solving this puzzle. The major difficulty that must be overcome in predicting side-chain packing is a combinatorial "explosion" in the number of possible configurations. There has been much recent progress towards overcoming this problem, and we survey a variety of the approaches. These approaches differ principally in whether they use ab initio (physical) or more knowledge-based methods, how they divide up and search conformational space, and how they evaluate candidate configurations (using scoring functions). The accuracy of side-chain prediction depends crucially on the (assumed) positioning of the main-chain. Methods for predicting main-chain conformation are, in a sense, not as developed as those for side-chains. We conclude by surveying these methods. As with side-chain prediction, there are a great variety of approaches, which differ in how they divide up and search space and in how they score candidate conformations.
Affiliation(s)
- M Levitt
- Department of Structural Biology, Stanford University School of Medicine, California 94305, USA
|
109
|
Abstract
As the number of protein molecules with known, high-resolution structures increases, it becomes necessary to organize these structures for rapid retrieval, comparison, and analysis. The Protein Data Bank (PDB) currently contains nearly 5,000 entries and is growing exponentially. Most new structures are similar structurally to ones reported previously and can be grouped into families. As the number of members in each family increases, it becomes possible to summarize, statistically, the commonalities and differences within each family. We reported previously a method for finding the atoms in a family alignment that have low spatial variance and those that have higher spatial variance (i.e., the "core" atoms that have the same relative position in all family members and the "non-core" atoms that do not). The core structures we compute have biological significance and provide an excellent quantitative and visual summary of a multiple structural alignment. In order to extend their utility, we have constructed a library of protein family cores, accessible over the World Wide Web at http://www-smi.stanford.edu/projects/helix/LPFC/. This library is generated automatically with publicly available computer programs requiring only a set of multiple alignments as input. It contains quantitative analysis of the spatial variation of atoms within each protein family, the coordinates of the average core structures derived from the families, and display files (in bitmap and VRML formats). Here, we describe the resource and illustrate its applicability by comparing three multiple alignments of the globin family. These three alignments are found to be similar, but with some significant differences related to the diversity of family members and the specific method used for alignment.
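The split into low-variance "core" atoms and higher-variance "non-core" atoms described in this abstract can be sketched roughly as follows. The abstract does not give the authors' procedure or cutoff, so this is only an illustration under stated assumptions: the structures are already superposed, equivalent atoms share an index, and a fixed variance `threshold` separates the two groups. The names `positional_variance` and `core_atoms` are hypothetical.

```python
def positional_variance(structures):
    """Per-atom spatial variance across a family of superposed structures.

    structures: list of structures, each a list of (x, y, z) tuples, with
    equivalent atoms at the same index in every structure (an assumption).
    Returns, for each atom, the mean squared distance from its mean position.
    """
    n_structures = len(structures)
    n_atoms = len(structures[0])
    variances = []
    for j in range(n_atoms):
        coords = [structure[j] for structure in structures]
        mean = tuple(sum(c[k] for c in coords) / n_structures for k in range(3))
        squared = sum(
            sum((c[k] - mean[k]) ** 2 for k in range(3)) for c in coords
        )
        variances.append(squared / n_structures)
    return variances


def core_atoms(structures, threshold):
    """Indices of atoms whose variance does not exceed the threshold."""
    variances = positional_variance(structures)
    return [j for j, v in enumerate(variances) if v <= threshold]


# Toy family: atom 0 is fixed, atom 1 moves between the two structures.
toy_family = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
toy_core = core_atoms(toy_family, threshold=0.5)
```

In this toy family only the first atom keeps the same relative position in both members, so it is the only one reported as core at the chosen cutoff.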
Affiliation(s)
- R Schmidt
- Section on Medical Informatics, Stanford University, California 94305-5479, USA
|
110
|
|
111
|
Lamy JN, Green BN, Toulmond A, Wall JS, Weber RE, Vinogradov SN. Giant Hexagonal Bilayer Hemoglobins. Chem Rev 1996; 96:3113-3124. [PMID: 11848854 DOI: 10.1021/cr9600058] [Citation(s) in RCA: 76] [Impact Index Per Article: 2.7] [Indexed: 11/28/2022]
Affiliation(s)
- Jean N. Lamy
- Laboratoire des Protéines Complexes, CNRS URA 1334, Université de Tours, 37032 Tours, France, Micromass UK Limited, 3 Tudor Road, Altrincham, Cheshire WA14 5RZ, UK, Equipe d'Ecophysiologie, Station Biologique, UPMC-CNRS-INSU, BP 74, 29682 Roscoff, France, Biology Department, Brookhaven National Laboratory, Upton, New York 11973, Department of Zoophysiology, Institute of Biological Sciences, Aarhus University, 8000 Aarhus C, Denmark, and Department of Biochemistry and Molecular Biology, Wayne State University School of Medicine, Detroit, Michigan 48201
|
112
|
Abstract
Structurally similar but sequentially unrelated proteins have been discovered and rediscovered by many researchers, using a variety of structure comparison tools. For several pairs of such proteins, existing structural alignments obtained from the literature, as well as alignments prepared using several different similarity criteria, are compared with each other. It is shown that, in general, they differ from each other, with differences increasing with diminishing sequence similarity. Differences are particularly strong between alignments optimizing global similarity measures, such as RMS deviation between C alpha atoms, and alignments focusing on more local features, such as packing or interaction pattern similarity. Simply speaking, by putting emphasis on different aspects of structure, different structural alignments show the unquestionable similarity in a different way. With differences between various alignments extending to a point where they can differ at all positions, analysis of structural similarities leads to contradictory results reported by groups using different alignment techniques. The problem of uniqueness and stability of structural alignments is further studied with the help of visualization of the suboptimal alignments. It is shown that alignments are often degenerate and whole families of alignments can be generated with almost the same score as the "optimal alignment." However, for some similarity criteria, especially those based on side-chain positions, rather than C alpha positions, alignments in some areas of the protein are unique. This opens the question of how and if the structural alignments can be used as "standards of truth" for protein comparison.
Affiliation(s)
- A Godzik
- Department of Molecular Biology MB-1, Scripps Research Institute, La Jolla, California 92037, USA; www: http://www.scripps.edu/adam/mosaic/adam.html
|
113
|
Novotny J, Bajorath J. Computational biochemistry of antibodies and T-cell receptors. Adv Protein Chem 1996; 49:149-260. [PMID: 8908299 DOI: 10.1016/s0065-3233(08)60490-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.3] [Indexed: 02/07/2023]
Affiliation(s)
- J Novotny
- Department of Macromolecular Modeling, Bristol-Myers Squibb Research Institute, Princeton, New Jersey 08540, USA
|
114
|
Riis SK, Krogh A. Improving prediction of protein secondary structure using structured neural networks and multiple sequence alignments. J Comput Biol 1996; 3:163-83. [PMID: 8697234 DOI: 10.1089/cmb.1996.3.163] [Citation(s) in RCA: 103] [Impact Index Per Article: 3.7] [Indexed: 02/01/2023]
Abstract
The prediction of protein secondary structure by use of carefully structured neural networks and multiple sequence alignments has been investigated. Separate networks are used for predicting the three secondary structures alpha-helix, beta-strand, and coil. The networks are designed using a priori knowledge of amino acid properties with respect to the secondary structure and the characteristic periodicity in alpha-helices. Since these single-structure networks all have fewer than 600 adjustable weights, overfitting is avoided. To obtain a three-state prediction of alpha-helix, beta-strand, or coil, ensembles of single-structure networks are combined with another neural network. This method gives an overall prediction accuracy of 66.3% when using 7-fold cross-validation on a database of 126 nonhomologous globular proteins. Applying the method to multiple sequence alignments of homologous proteins increases the prediction accuracy significantly to 71.3% with corresponding Matthews correlation coefficients C alpha = 0.59, C beta = 0.52, and Cc = 0.50. More than 72% of the residues in the database are predicted with an accuracy of 80%. It is shown that the network outputs can be interpreted as estimated probabilities of correct prediction, and, therefore, these numbers indicate which residues are predicted with high confidence.
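For readers unfamiliar with the per-class Matthews correlation coefficients quoted in this abstract, the following sketch shows how such values are conventionally computed from observed and predicted three-state strings. It is not the authors' evaluation code. The state labels `H`, `E`, and `C` and the function names `matthews_cc` and `per_class_mcc` are assumptions made for illustration.

```python
import math


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion counts.

    Returns 0.0 when the denominator vanishes (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def per_class_mcc(observed, predicted, classes="HEC"):
    """Treat each state in turn as the positive class and compute its MCC.

    observed, predicted: equal-length strings of per-residue state labels.
    """
    result = {}
    pairs = list(zip(observed, predicted))
    for c in classes:
        tp = sum(1 for o, p in pairs if o == c and p == c)
        tn = sum(1 for o, p in pairs if o != c and p != c)
        fp = sum(1 for o, p in pairs if o != c and p == c)
        fn = sum(1 for o, p in pairs if o == c and p != c)
        result[c] = matthews_cc(tp, tn, fp, fn)
    return result


# Toy example: a perfect prediction gives a coefficient of 1 for every state.
perfect = per_class_mcc("HHHEEECCC", "HHHEEECCC")
```

Unlike overall accuracy, each coefficient ranges from -1 to 1 and accounts for both over- and under-prediction of its state, which is why the abstract reports one value per state alongside the percentage accuracy.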
Affiliation(s)
- S K Riis
- Electronics Institute, Technical University of Denmark, Lyngby, Denmark
|
115
|
Abstract
Protein blocks consist of multiply aligned sequence segments without gaps that represent the most highly conserved regions of protein families. A database of blocks has been constructed by successive application of the fully automated PROTOMAT system to lists of protein family members obtained from Prosite documentation. Currently, Blocks 8.0 based on protein families documented in Prosite 12 consists of 2884 blocks representing 770 families. Searches of the Blocks Database are carried out using protein or DNA sequence queries, and results are returned with measures of significance for both single and multiple block hits. The database has also proved useful for derivation of amino acid substitution matrices (the Blosum series) and other sets of parameters. WWW and E-mail servers provide access to the database and associated functions, including a block maker for sequences provided by the user.
Affiliation(s)
- J G Henikoff
- Fred Hutchinson Cancer Research Center, Seattle, Washington 98104, USA
|
116
|
Kapp OH, Moens L, Vanfleteren J, Trotman CN, Suzuki T, Vinogradov SN. Alignment of 700 globin sequences: extent of amino acid substitution and its correlation with variation in volume. Protein Sci 1995; 4:2179-90. [PMID: 8535255 PMCID: PMC2142974 DOI: 10.1002/pro.5560041024] [Citation(s) in RCA: 68] [Impact Index Per Article: 2.3] [Indexed: 01/31/2023]
Abstract
Seven-hundred globin sequences, including 146 nonvertebrate sequences, were aligned on the basis of conservation of secondary structure and the avoidance of gap penalties. Of the 182 positions needed to accommodate all the globin sequences, only 84 are common to all, including the absolutely conserved PheCD1 and HisF8. The mean number of amino acid substitutions per position ranges from 8 to 13 for all globins and 5 to 9 for internal positions. Although the total sequence volumes have a variation of approximately 2-3%, the variation in volume per position ranges from approximately 13% for the internal to approximately 21% for the surface positions. Plausible correlations exist between amino acid substitution and the variation in volume per position for the 84 common and the internal but not the surface positions. The amino acid substitution matrix derived from the 84 common positions was used to evaluate sequence similarity within the globins and between the globins and phycocyanins C and colicins A, via calculation of pairwise similarity scores. The scores for globin-globin comparisons over the 84 common positions overlap the globin-phycocyanin and globin-colicin scores, with the former being intermediate. For the subset of internal positions, overlap is minimal between the three groups of scores. These results imply a continuum of amino acid sequences able to assume the common three-on-three alpha-helical structure and suggest that the determinants of the latter include sites other than those inaccessible to solvent.
Affiliation(s)
- O H Kapp
- Department of Radiology, University of Chicago, Illinois 60637, USA
|
117
|
Abstract
The packing of a protein's constituent atoms and the attendant constraints placed upon them form the basis of many attempts to understand and predict protein structure, stability, folding and even function. Although the significance of packing is yet to be fully comprehended, recent experimental and theoretical investigations have increased our understanding through the description of mutational effects on structure and stability, determination of the limits of packing constraints for both protein folding and structure prediction, and delineation of packing guidelines on the basis of observed cavities in the native protein folds. These advances are allowing protein modellers, engineers and designers to tackle their problems from a more rational perspective.
Affiliation(s)
- S J Hubbard
- European Molecular Biology Laboratory, Heidelberg, Germany
|
118
|
Abstract
The folding of short alanine-based peptides with different numbers of lysine residues is simulated at constant temperature (274 K) using the rigid-element Monte Carlo method. The solvent-referenced potential has prevented the multiple-minima problem in helix folding. From various initial structures, the peptides with three lysine residues fold into helix-dominated conformations with the calculated average helicity in the range of 60-80%. The peptide with six lysine residues shows only 8-14% helicity. These results agree well with experimental observations. The intramolecular electrostatic interaction of the charged lysine side chains and their electrostatic hydration destabilize the helical conformations of the peptide with six lysine residues, whereas these effects on the peptides with three lysine residues are small. The simulations provide insight into the helix-folding mechanism, including the beta-bend intermediate in helix initiation, the (i, i + 3) hydrogen bonds, the asymmetrical helix propagation, and the asymmetrical helicities in the N- and C-terminal regions. These findings are consistent with previous studies.
Affiliation(s)
- S S Sung
- Research Institute, Cleveland Clinic Foundation, Ohio 44195, USA
|
119
|
Eddy SR, Mitchison G, Durbin R. Maximum discrimination hidden Markov models of sequence consensus. J Comput Biol 1995; 2:9-23. [PMID: 7497123 DOI: 10.1089/cmb.1995.2.9] [Citation(s) in RCA: 165] [Impact Index Per Article: 5.7] [Indexed: 01/25/2023]
Abstract
We introduce a maximum discrimination method for building hidden Markov models (HMMs) of protein or nucleic acid primary sequence consensus. The method compensates for biased representation in sequence data sets, superseding the need for sequence weighting methods. Maximum discrimination HMMs are more sensitive for detecting distant sequence homologs than various other HMM methods or BLAST when tested on globin and protein kinase catalytic domain sequences.
Affiliation(s)
- S R Eddy
- Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA
|
120
|
Abstract
An analysis of internal packing defects or "cavities" (both empty and water-containing) within protein structures has been undertaken and includes 3 cavity classes: within domains, between domains, and between protein subunits. We confirm several basic features common to all cavity types but also find a number of new characteristics, including those that distinguish the classes. The total cavity volume remains only a small fraction of the total protein volume and yet increases with protein size. Water-filled "cavities" possess a more polar surface and are typically larger. Their constituent waters are necessary to satisfy the local hydrogen bonding potential. Cavity-surrounding atoms are observed to be, on average, less flexible than their environments. Intersubunit and interdomain cavities are on average larger than the intradomain cavities, occupy a larger fraction of their resident surfaces, and are more frequently water-filled. We observe increased cavity volume at domain-domain interfaces involved with shear type domain motions. The significance of interfacial cavities upon subunit and domain shape complementarity and the protein docking problem, as well as in their structural and functional role in oligomeric proteins, will be discussed. The results concerning cavity size, polarity, solvation, general abundance, and residue type constituency should provide useful guidelines for protein modeling and design.
Affiliation(s)
- S J Hubbard
- European Molecular Biology Laboratory, Heidelberg, Germany
|
121
|
Abstract
Although the 'structure from sequence' prediction problem remains fundamentally unsolved, new and promising methods in one, two and three dimensions have reopened the field. Significantly improved one-dimensional prediction of secondary structure from multiple sequence alignments is now in routine use. In the two-dimensional approach, inter-residue contacts can be detected by analysis of correlated mutations, albeit with low accuracy. Finally, three-dimensional methods, in which pseudopotentials or information values are derived from the databases, are proving their value for distinguishing between correct and incorrect models.
Affiliation(s)
- B Rost
- European Molecular Biology Laboratory, Heidelberg, Germany
|