1
Madej T, Lanczycki CJ, Zhang D, Thiessen PA, Geer RC, Marchler-Bauer A, Bryant SH. MMDB and VAST+: tracking structural similarities between macromolecular complexes. Nucleic Acids Res 2013; 42:D297-303. [PMID: 24319143 PMCID: PMC3965051 DOI: 10.1093/nar/gkt1208] [Citation(s) in RCA: 213] [Impact Index Per Article: 19.4] [Indexed: 11/27/2022]
Abstract
The computational detection of similarities between protein 3D structures has become an indispensable tool for the detection of homologous relationships, the classification of protein families and functional inference. Consequently, numerous algorithms have been developed that facilitate structure comparison, including rapid searches against a steadily growing collection of protein structures. To this end, NCBI’s Molecular Modeling Database (MMDB), which is based on the Protein Data Bank (PDB), maintains a comprehensive and up-to-date archive of protein structure similarities computed with the Vector Alignment Search Tool (VAST). These similarities have been recorded on the level of single proteins and protein domains, comprising in excess of 1.5 billion pairwise alignments. Here we present VAST+, an extension to the existing VAST service, which summarizes and presents structural similarity on the level of biological assemblies or macromolecular complexes. VAST+ simplifies structure neighboring results and shows, for macromolecular complexes tracked in MMDB, lists of similar complexes ranked by the extent of similarity. VAST+ replaces the previous VAST service as the default presentation of structure neighboring data in NCBI’s Entrez query and retrieval system. MMDB and VAST+ can be accessed via http://www.ncbi.nlm.nih.gov/Structure.
Affiliation(s)
- Thomas Madej
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38 A, Room 8N805, 8600 Rockville Pike, Bethesda, MD 20894, USA
2
Abstract
The Sixth Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP6) held in December 2004 focused on the prediction of the structures of 90 protein domains from 64 targets. Thirty-eight of these were classified as "fold recognition," defined as being similar in fold to proteins of known structure at the time of submission of the predictions. Only the "first" predictions and those longer than 20 amino acids for each domain were assessed, resulting in 4527 predictions from 165 groups. The assessment was accomplished by the use of six structure alignment programs and three scoring measures based on these alignments. The use of a variety of measures resulted in scoring insensitive to the peculiarities of any one alignment method. The top-ranked methods in the prediction of structures that were clearly homologous to proteins in the Protein Data Bank primarily used servers and other programs based on achieving a consensus of many remote homology detection and fold recognition methods. The top-ranked methods in prediction of structures less clearly related or unrelated to proteins of known structures used fragment building methods in addition to the fold recognition meta methods.
Affiliation(s)
- Guoli Wang
- Institute for Cancer Research, Fox Chase Cancer Center, Philadelphia, Pennsylvania 19111, USA
3
Panchenko AR, Wolf YI, Panchenko LA, Madej T. Evolutionary plasticity of protein families: coupling between sequence and structure variation. Proteins 2006; 61:535-44. [PMID: 16184609 PMCID: PMC1941674 DOI: 10.1002/prot.20644] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.2] [Indexed: 11/07/2022]
Abstract
In this work we examine how protein structural changes are coupled with sequence variation in the course of evolution of a family of homologs. The sequence-structure correlation analysis performed on 81 homologous protein families shows that the majority of them exhibit statistically significant linear correlation between the measures of sequence and structural similarity. We observed, however, that there are cases where structural variability cannot be mainly explained by sequence variation, such as protein families with a number of disulfide bonds. To understand whether structures from different families and/or folds evolve in the same manner, we compared the degrees of structural change per unit of sequence change ("the evolutionary plasticity of structure") between those families with a significant linear correlation. Using rigorous statistical procedures we find that, with a few exceptions, evolutionary plasticity does not show a statistically significant difference between protein families. Similar sequence-structure analysis performed for protein loop regions shows that evolutionary plasticity of loop regions is greater than for the protein core.
Affiliation(s)
- Anna R Panchenko
- Computational Biology Branch, National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland 20894, USA.
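The sequence-structure correlation described in the Panchenko et al. abstract above can be illustrated with an ordinary least-squares fit. This is only a minimal sketch: the abstract does not name the exact similarity measures or the statistical procedure used, so the function `linear_fit`, the variable names, and the sample values below are hypothetical. The fitted slope stands in for "structural change per unit of sequence change".

```python
from math import sqrt


def linear_fit(x, y):
    """Return (slope, intercept, pearson_r) for a least-squares fit of y on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("both samples must have non-zero variance")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    pearson_r = sxy / sqrt(sxx * syy)
    return slope, intercept, pearson_r


# Hypothetical pairwise values for one family (not data from the paper):
# a sequence-divergence measure and a structural-divergence measure.
seq_divergence = [0.1, 0.3, 0.5, 0.7, 0.9]
struct_divergence = [0.5, 0.9, 1.3, 1.7, 2.1]

slope, intercept, r = linear_fit(seq_divergence, struct_divergence)
print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.2f}")
```

Comparing slopes between families would then require a significance test on the fitted coefficients, which this sketch does not attempt.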
4
Skolnick J, Kihara D, Zhang Y. Development and large scale benchmark testing of the PROSPECTOR_3 threading algorithm. Proteins 2004; 56:502-18. [PMID: 15229883 DOI: 10.1002/prot.20106] [Citation(s) in RCA: 118] [Impact Index Per Article: 5.9] [Indexed: 11/10/2022]
Abstract
This article describes the PROSPECTOR_3 threading algorithm, which combines various scoring functions designed to match structurally related target/template pairs. Each variant described was found to have a Z-score above which most identified templates have good structural (threading) alignments, Z(struct) (Z(good)). 'Easy' targets with accurate threading alignments are identified as single templates with Z > Z(good) or two templates, each with Z > Z(struct), having a good consensus structure in mutually aligned regions. 'Medium' targets have a pair of templates lacking a consensus structure, or a single template for which Z(struct) < Z < Z(good). PROSPECTOR_3 was applied to a comprehensive Protein Data Bank (PDB) benchmark composed of 1491 single domain proteins, 41-200 residues long and no more than 30% identical to any threading template. Of the proteins, 878 were found to be easy targets, with 761 having a root mean square deviation (RMSD) from native of less than 6.5 Å. The average contact prediction accuracy was 46%, and on average 17.6 residue continuous fragments were predicted with RMSD values of 2.0 Å. There were 606 medium targets identified, 87% (31%) of which had good structural (threading) alignments. On average, 9.1 residue, continuous fragments with RMSD of 2.5 Å were predicted. Combining easy and medium sets, 63% (91%) of the targets had good threading (structural) alignments compared to native; the average target/template sequence identity was 22%. Only nine targets lacked matched templates. Moreover, PROSPECTOR_3 consistently outperforms PSI-BLAST. Similar results were predicted for open reading frames (ORFs) ≤200 residues in the M. genitalium, E. coli and S. cerevisiae genomes. Thus, progress has been made in identification of weakly homologous/analogous proteins, with very high alignment coverage, both in a comprehensive PDB benchmark as well as in genomes.
Affiliation(s)
- Jeffrey Skolnick
- Center of Excellence in Bioinformatics, University at Buffalo, 901 Washington St., Suite 300, Buffalo, NY 14203, USA.
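The easy/medium classification rules stated in the PROSPECTOR_3 abstract above can be sketched as a small decision function. This is an illustration under assumptions, not the published implementation: the function name `classify_target`, the boolean `has_consensus` flag, the "unassigned" label, and the threshold values in the usage lines are all hypothetical, and it assumes Z(good) is greater than Z(struct).

```python
def classify_target(z_scores, z_struct, z_good, has_consensus=False):
    """Classify a target from the Z-scores of its candidate templates.

    z_scores      -- Z-scores of the templates found for one target
    z_struct      -- threshold above which structural alignments are mostly good
    z_good        -- higher threshold above which threading alignments are mostly good
    has_consensus -- whether the templates above z_struct share a good consensus
                     structure in mutually aligned regions (computed elsewhere)
    """
    hits = [z for z in z_scores if z > z_struct]
    # A single template above Z(good) is enough for an 'easy' assignment.
    if any(z > z_good for z in hits):
        return "easy"
    # Two or more templates above Z(struct): easy only with a consensus structure.
    if len(hits) >= 2:
        return "easy" if has_consensus else "medium"
    # One template with Z(struct) < Z <= Z(good); the abstract leaves Z == Z(good) open.
    if len(hits) == 1:
        return "medium"
    # No template passes Z(struct); this label is an assumption of the sketch.
    return "unassigned"


# Hypothetical thresholds and scores, for illustration only.
print(classify_target([12.0], z_struct=5.0, z_good=10.0))
print(classify_target([6.0, 7.0], z_struct=5.0, z_good=10.0, has_consensus=True))
print(classify_target([6.0, 7.0], z_struct=5.0, z_good=10.0))
```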
6
Abstract
A new potential energy function representing the conformational preferences of sequentially local regions of a protein backbone is presented. This potential is derived from secondary structure probabilities such as those produced by neural network-based prediction methods. The potential is applied to the problem of remote homolog identification, in combination with a distance-dependent inter-residue potential and position-based scoring matrices. This fold recognition jury is implemented in a Java application called JThread. These methods are benchmarked on several test sets, including one released entirely after development and parameterization of JThread. In benchmark tests to identify known folds structurally similar to (but not identical with) the native structure of a sequence, JThread performs significantly better than PSI-BLAST, with 10% more structures identified correctly as the most likely structural match in a fold library, and 20% more structures correctly narrowed down to a set of five possible candidates. JThread also improves the average sequence alignment accuracy significantly, from 53% to 62% of residues aligned correctly. Reliable fold assignments and alignments are identified, making the method useful for genome annotation. JThread is applied to predicted open reading frames (ORFs) from the genomes of Mycoplasma genitalium and Drosophila melanogaster, identifying 20 new structural annotations in the former and 801 in the latter.
Affiliation(s)
- John Marc Chandonia
- Department of Cellular and Molecular Pharmacology, University of California, San Francisco, CA 94143-2240, USA
7
Venclovas C, Zemla A, Fidelis K, Moult J. Comparison of performance in successive CASP experiments. Proteins 2002; Suppl 5:163-70. [PMID: 11835494 DOI: 10.1002/prot.10053] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.4] [Indexed: 11/11/2022]
Abstract
As the number of completed CASP (Critical Assessment of Protein Structure Prediction) experiments grows, so does the need for stable, standard methods for comparing performance in successive experiments. It is critical to develop methods for determining the areas in which there is progress and those that are static. We have added an analysis of the CASP4 results to that previously published for CASPs 1, 2, and 3. We again use a unified difficulty scale to permit comparison of performance as a function of target difficulty in the different CASPs. The scale is used to compare performance in aligning target sequences to a structural template. There was a clear improvement in alignment quality between CASP1 (1994) and CASP2 (1996). No change is apparent between CASP2 and CASP3 (1998). There is a small, barely detectable improvement between CASP3 and the latest experiment (CASP4, 2000). Alignment remains the major source of error in all models based on less than about 30% sequence identity. Comparison of performance in the new fold modeling regime is complicated by issues in devising an objective target difficulty scale. We have found limited numerical support for significant progress between CASP3 and CASP4 in this area. More subjectively, most observers are convinced that there has been substantial progress. Progress is dominated by a single group.
Affiliation(s)
- C Venclovas
- Biology and Biotechnology Research Program, Lawrence Livermore National Laboratory, Livermore, California, USA
8
Marchler-Bauer A, Panchenko AR, Ariel N, Bryant SH. Comparison of sequence and structure alignments for protein domains. Proteins 2002; 48:439-46. [PMID: 12112669 DOI: 10.1002/prot.10163] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.5] [Indexed: 11/12/2022]
Abstract
Profile search methods based on protein domain alignments have proven to be useful tools in comparative sequence analysis. Domain alignments used by currently available search methods have been computed by sequence comparison. With the growth of the protein structure database, however, alignments of many domain pairs have also been computed by structure comparison. Here, we examine the extent to which information from these two sources agrees. We measure agreement with respect to identification of homologous regions in each protein, that is, with respect to the location of domain boundaries. We also measure agreement with respect to identification of homologous residue sites by comparing alignments and assessing the accuracy of the molecular models they predict. We find that domain alignments in publicly available collections based on sequence and structure comparison are largely consistent. However, the homologous regions identified by sequence comparison are often shorter than those identified by 3D structure comparison. In addition, when overall sequence similarity is low, alignments from sequence comparison produce less accurate molecular models, suggesting that they less accurately identify homologous sites. These observations suggest that structure comparison results might be used to improve the overall accuracy of domain alignment collections and the performance of profile search methods based on them.
Affiliation(s)
- Aron Marchler-Bauer
- Computational Biology Branch, National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland 20894, USA
9
Bieńkowska JR, Rogers RG, Smith TF. Performance of threading scoring functions designed using new optimization method. J Comput Biol 2001; 6:299-311. [PMID: 10582568 DOI: 10.1089/106652799318283] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022]
Abstract
We present a new procedure for optimization of a threading scoring function. A scoring function is usually formulated in terms of the structural environment states that describe the protein fold model. We propose a method for the optimal selection of those structural environment states that naturally follows from the probabilistic description of the threading problem and is done prior to threading experiments. We demonstrate the selection of the optimal structural environment states for the solvent exposure of the amino acid position, and present the results of threading experiments performed using scoring functions designed with and without the optimization of the structural environment states. These results confirm that the optimal scoring function predicts the sequence-to-structure alignments most accurately. Threading experiments performed with 15 optimally designed scoring functions show that the correlation coefficient between the information content of the amino acid distribution that determines the scoring function and the accuracy of the optimal sequence-to-structure alignment is 0.94.
Affiliation(s)
- J R Bieńkowska
- BioMolecular Engineering Research Center, College of Engineering, Boston University, Massachusetts 02215, USA.
10
Friedberg I, Kaplan T, Margalit H. Evaluation of PSI-BLAST alignment accuracy in comparison to structural alignments. Protein Sci 2000; 9:2278-84. [PMID: 11152139 PMCID: PMC2144484 DOI: 10.1110/ps.9.11.2278] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.3] [Indexed: 10/21/2022]
Abstract
The PSI-BLAST algorithm has been acknowledged as one of the most powerful tools for detecting remote evolutionary relationships by sequence considerations only. This has been demonstrated by its ability to recognize remote structural homologues and by the greatest coverage it enables in annotation of a complete genome. Although recognizing the correct fold of a sequence is of major importance, the accuracy of the alignment is crucial for the success of modeling one sequence by the structure of its remote homologue. Here we assess the accuracy of PSI-BLAST alignments on a stringent database of 123 structurally similar, sequence-dissimilar pairs of proteins, by comparing them to the alignments defined on a structural basis. Each protein sequence is compared to a nonredundant database of the protein sequences by PSI-BLAST. Whenever a pair member detects its pair-mate, the positions that are aligned both in the sequential and structural alignments are determined, and the alignment sensitivity is expressed as the percentage of these positions out of the structural alignment. Fifty-two sequences detected their pair-mates (for 16 pairs the success was bi-directional when either pair member was used as a query). The average percentage of correctly aligned residues per structural alignment was 43.5 ± 2.2%. Other properties of the alignments were also examined, such as the sensitivity vs. specificity and the change in these parameters over consecutive iterations. Notably, there is an improvement in alignment sensitivity over consecutive iterations, reaching an average of 50.9 ± 2.5% within the five iterations tested in the current study.
Affiliation(s)
- I Friedberg
- Department of Molecular Genetics and Biotechnology, The Hebrew University, Hadassah Medical School, Jerusalem, Israel
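The alignment sensitivity defined in the Friedberg et al. abstract above (positions aligned identically in both the sequence and the structural alignment, as a percentage of the structural alignment) can be sketched as follows. The representation of an alignment as a collection of `(query_position, target_position)` pairs and the function name `alignment_sensitivity` are assumptions made for illustration; the example pairs are invented.

```python
def alignment_sensitivity(sequence_alignment, structural_alignment):
    """Percentage of structurally aligned residue pairs that the sequence
    alignment reproduces exactly.

    Both arguments are iterables of (query_position, target_position) pairs.
    """
    structural = set(structural_alignment)
    if not structural:
        raise ValueError("structural alignment must contain at least one pair")
    # Pairs aligned the same way by both methods.
    shared = structural & set(sequence_alignment)
    return 100.0 * len(shared) / len(structural)


# Invented example: 3 of the 4 structurally aligned pairs are reproduced.
structural = [(1, 1), (2, 2), (3, 3), (4, 5)]
sequence = [(1, 1), (2, 2), (3, 4), (4, 5), (5, 6)]
print(alignment_sensitivity(sequence, structural))  # 75.0
```

Dividing the same shared count by the size of the sequence alignment instead would give a precision-like quantity; whether that matches the paper's "specificity" is not stated in the abstract.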
11
Jung J, Lee B. Use of residue pairs in protein sequence-sequence and sequence-structure alignments. Protein Sci 2000; 9:1576-88. [PMID: 10975579 PMCID: PMC2144723 DOI: 10.1110/ps.9.8.1576] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.4] [Indexed: 10/21/2022]
Abstract
Two new sets of scoring matrices are introduced: H2 for the protein sequence comparison and T2 for the protein sequence-structure correlation. Each element of H2 or T2 measures the frequency with which a pair of amino acid types in one protein, k-residues apart in the sequence, is aligned with another pair of residues, of given amino acid types (for H2) or in given structural states (for T2), in other structurally homologous proteins. There are four types, corresponding to the k-values of 1 to 4, for both H2 and T2. These matrices were set up using a large number of structurally homologous protein pairs, with little sequence homology between the pair, that were recently generated using the structure comparison program SHEBA. The two scoring matrices were incorporated into the main body of the sequence alignment program SSEARCH in the FASTA package and tested in a fold recognition setting in which a set of 107 test sequences were aligned to each of a panel of 3,539 domains that represent all known protein structures. Six procedures were tested; the straight Smith-Waterman (SW) and FASTA procedures, which used the Blosum62 single residue type substitution matrix; BLAST and PSI-BLAST procedures, which also used the Blosum62 matrix; PASH, which used Blosum62 and H2 matrices; and PASSC, which used Blosum62, H2, and T2 matrices. All procedures gave similar results when the probe and target sequences had greater than 30% sequence identity. However, when the sequence identity was below 30%, a similar structure could be found for more sequences using PASSC than using any other procedure. PASH and PSI-BLAST gave the next best results.
Affiliation(s)
- J Jung
- Laboratory of Molecular Biology, Division of Basic Sciences, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
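The residue-pair counting behind an H2-style matrix, as described in the Jung and Lee abstract above, can be sketched by tallying how often a pair of residues k positions apart in one protein is aligned with a pair in the other. This is a simplified illustration only: the function `pair_counts` is hypothetical, separation is measured in alignment columns, any pair touching a gap is skipped, and the conversion of raw counts into matrix scores is not shown because the abstract does not describe it.

```python
from collections import Counter


def pair_counts(aligned_a, aligned_b, k):
    """Count aligned residue-pair co-occurrences at separation k.

    aligned_a, aligned_b -- equal-length aligned sequences, '-' marks a gap
    k                    -- separation between the two residues of a pair (>= 1)
    Returns a Counter keyed by ((a_i, a_{i+k}), (b_i, b_{i+k})).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if k < 1:
        raise ValueError("k must be at least 1")
    counts = Counter()
    for i in range(len(aligned_a) - k):
        residues = (aligned_a[i], aligned_a[i + k], aligned_b[i], aligned_b[i + k])
        # Simplification: ignore any pair that involves a gap column.
        if "-" in residues:
            continue
        counts[((residues[0], residues[1]), (residues[2], residues[3]))] += 1
    return counts


# Invented toy alignment; the abstract uses k = 1 to 4.
counts = pair_counts("ACDA", "ACEA", k=1)
for key, value in sorted(counts.items()):
    print(key, value)
```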
12
Wolf YI, Grishin NV, Koonin EV. Estimating the number of protein folds and families from complete genome data. J Mol Biol 2000; 299:897-905. [PMID: 10843846 DOI: 10.1006/jmbi.2000.3786] [Citation(s) in RCA: 137] [Impact Index Per Article: 5.7] [Indexed: 11/22/2022]
Abstract
Using the data on proteins encoded in complete genomes, combined with a rigorous theory of the sampling process, we estimate the total number of protein folds and families, as well as the number of folds and families in each genome. The total number of folds in globular, water-soluble proteins is estimated at about 1000, with structural information currently available for about one-third of the number. The sequenced genomes of unicellular organisms encode from approximately 25%, for the minimal genomes of the Mycoplasmas, to 70-80% for larger genomes, such as Escherichia coli and yeast, of the total number of folds. The number of protein families with significant sequence conservation was estimated to be between 4000 and 7000, with structures available for about 20% of these.
Affiliation(s)
- Y I Wolf
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
13
Panchenko AR, Marchler-Bauer A, Bryant SH. Combination of threading potentials and sequence profiles improves fold recognition. J Mol Biol 2000; 296:1319-31. [PMID: 10698636 DOI: 10.1006/jmbi.2000.3541] [Citation(s) in RCA: 102] [Impact Index Per Article: 4.3] [Indexed: 12/13/2022]
Abstract
Using a benchmark set of structurally similar proteins, we conduct a series of threading experiments intended to identify a scoring function with an optimal combination of contact-potential and sequence-profile terms. The benchmark set is selected to include many medium-difficulty fold recognition targets, where sequence similarity is undetectable by BLAST but structural similarity is extensive. The contact potential is based on the log-odds of non-local contacts involving different amino acid pairs, in native as opposed to randomly compacted structures. The sequence profile term is that used in PSI-BLAST. We find that combination of these terms significantly improves the success rate of fold recognition over use of either term alone, with respect to both recognition sensitivity and the accuracy of threading models. Improvement is greatest for targets between 10% and 20% sequence identity and 60% to 80% superimposable residues, where the number of models crossing critical accuracy and significance thresholds more than doubles. We suggest that these improvements account for the successful performance of the combined scoring function at CASP3. We discuss possible explanations as to why sequence-profile and contact-potential terms appear complementary.
Affiliation(s)
- A R Panchenko
- National Center for Biotechnology Information, National Institutes of Health, Building 38A, Room 8N805, Bethesda, MD 20894, USA
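The idea of combining a contact-potential term with a sequence-profile term, as described in the Panchenko et al. abstract above, can be sketched as a weighted sum used to rank candidate templates. The abstract does not give the functional form or the weights, so the linear combination, the `weight` parameter, the "higher is better" convention, and the template names and scores below are all assumptions for illustration.

```python
def combined_score(contact_term, profile_term, weight=0.5):
    """Weighted sum of two score terms; weight is the share given to the contact term."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie between 0 and 1")
    return weight * contact_term + (1.0 - weight) * profile_term


def rank_templates(candidates, weight=0.5):
    """Return template names sorted from best to worst combined score.

    candidates -- dict mapping template name to (contact_term, profile_term),
                  with both terms on a scale where higher means better
    """
    return sorted(
        candidates,
        key=lambda name: combined_score(*candidates[name], weight=weight),
        reverse=True,
    )


# Invented scores for three hypothetical templates.
candidates = {
    "template_a": (2.0, 8.0),
    "template_b": (6.0, 5.0),
    "template_c": (9.0, 0.0),
}
print(rank_templates(candidates, weight=0.5))
print(rank_templates(candidates, weight=1.0))
```

In this toy data, `template_b` ranks first only when both terms contribute, which mirrors the abstract's point that the two terms can be complementary.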
14
Mendes J, Baptista AM, Carrondo MA, Soares CM. Improved modeling of side-chains in proteins with rotamer-based methods: a flexible rotamer model. Proteins 1999; 37:530-43. [PMID: 10651269 DOI: 10.1002/(sici)1097-0134(19991201)37:4<530::aid-prot4>3.0.co;2-h] [Citation(s) in RCA: 63] [Impact Index Per Article: 2.5] [Indexed: 11/10/2022]
Abstract
Side-chain modeling has a widespread application in many current methods for protein tertiary structure determination, prediction, and design. Of the existing side-chain modeling methods, rotamer-based methods are the fastest and most efficient. Classically, a rotamer is conceived as a single, rigid conformation of an amino acid side chain. Here, we present a flexible rotamer model in which a rotamer is a continuous ensemble of conformations that cluster around the classic rigid rotamer. We have developed a thermodynamically based method for calculating effective energies for the flexible rotamer. These energies have a one-to-one correspondence with the potential energies of the rigid rotamer. Therefore, the flexible rotamer model is completely general and may be used with any rotamer-based method in place of the rigid rotamer model. We have compared the performance of the flexible and rigid rotamer models with one side-chain modeling method in particular (the self-consistent mean field theory method) on a set of 20 high quality crystallographic protein structures. For the flexible rotamer model, we obtained average predictions of 85.8% for chi1, 76.5% for chi1+2 and 1.34 Å for root-mean-square deviation (RMSD); the corresponding values for core residues were 93.0%, 87.7% and 0.70 Å, respectively. These values represent improvements of 7.3% for chi1, 8.1% for chi1+2 and 0.23 Å for RMSD over the predictions obtained with the rigid rotamer model under otherwise identical conditions; the corresponding improvements for core residues were 6.9%, 10.5% and 0.43 Å, respectively. We found that the predictions obtained with the flexible rotamer model were also significantly better than those obtained for the same set of proteins with another state-of-the-art side-chain placement method in the literature, especially for core residues. The flexible rotamer model represents a considerable improvement over the classic rigid rotamer model. It can, therefore, be used with considerable advantage in all rotamer-based methods commonly applied to protein tertiary structure determination, prediction, and design and also in predictions of free energies in mutational studies.
Affiliation(s)
- J Mendes
- Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Oeiras, Portugal
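The two kinds of accuracy figures quoted in the Mendes et al. abstract above (percentage of correctly predicted chi angles and side-chain RMSD) can be sketched as follows. The abstract does not state the angular tolerance that counts as "correct", so the 40-degree default is an assumption, as are the function names and the sample values; the RMSD here is computed on coordinates that are already in the same frame, with no superposition step.

```python
from math import sqrt


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return sqrt(total / len(coords_a))


def angle_difference(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def chi_accuracy(predicted, reference, tolerance=40.0):
    """Percentage of dihedral angles predicted within `tolerance` degrees of the reference."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty angle lists of equal length")
    hits = sum(
        1 for p, r in zip(predicted, reference) if angle_difference(p, r) <= tolerance
    )
    return 100.0 * hits / len(reference)


# Invented values for illustration.
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)]))
print(chi_accuracy([60.0, -175.0, 180.0, 65.0], [70.0, 175.0, -60.0, -60.0]))
```

The wrap-around handling in `angle_difference` matters for angles near ±180 degrees, where a naive subtraction would report a large error for nearly identical conformations.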
16
Abstract
BACKGROUND: A principal goal of structure prediction is the elucidation of function. We have studied the ability of computed models to preserve the microenvironments of functional sites. In particular, 653 model structures of a calcium-binding protein (generated using an ab initio folding protocol) were analyzed, and the degree to which calcium-binding sites were recognizable was assessed. RESULTS: While some model structures preserve the calcium-binding microenvironments, many others, including some with low root mean square deviations (rmsds) from the crystal structure of the native protein, do not. There is a very weak correlation between the overall rmsd of a structure and the preservation of calcium-binding sites. Only when the quality of the model structure is high (rmsd less than 2 Å for atoms in the 7 Å local neighborhood around calcium) does the modeling of the binding sites become reliable. CONCLUSIONS: Protein structure prediction methods need to be assessed in terms of their preservation of functional sites. High-resolution structures are necessary for identifying binding sites such as calcium-binding sites.
Affiliation(s)
- L Wei
- Stanford Medical Informatics, Stanford University School of Medicine, CA 94305-5479, USA
17
Application of Reduced Models to Protein Structure Prediction. 1999. [DOI: 10.1016/s1380-7323(99)80086-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]