1
Kann MG, Thiessen PA, Panchenko AR, Schäffer AA, Altschul SF, Bryant SH. A structure-based method for protein sequence alignment. Bioinformatics 2004; 21:1451-6. [PMID: 15613392 DOI: 10.1093/bioinformatics/bti233] [Citation(s) in RCA: 18] [Impact Index Per Article: 0.9] [Indexed: 11/13/2022]
Abstract
MOTIVATION: With the continuing rapid growth of protein sequence data, protein sequence comparison methods have become the most widely used tools of bioinformatics. Among these methods are those that use position-specific scoring matrices (PSSMs) to describe protein families. PSSMs can capture information about conserved patterns within families, which can be used to increase the sensitivity of searches for related sequences. Certain types of structural information, however, are not generally captured by PSSM search methods. Here we introduce a program, Structure-based ALignment TOol (SALTO), that aligns protein query sequences to PSSMs using rules for placing and scoring gaps that are consistent with the conserved regions of domain alignments from NCBI's Conserved Domain Database.
RESULTS: In most cases, the alignment scores obtained using the local alignment version follow an extreme value distribution. SALTO's performance in finding related sequences and producing accurate alignments is similar to or better than that of IMPALA; one advantage of SALTO is that it imposes an explicit gapping model on each protein family.
AVAILABILITY: A stand-alone version of the program that can generate global or local alignments is available by ftp distribution (ftp://ftp.ncbi.nih.gov/pub/SALTO/), and has been incorporated into the Cn3D structure/alignment viewer.
CONTACT: bryant@ncbi.nlm.nih.gov.
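The extreme value behavior reported for local alignment scores can be illustrated with the standard Karlin-Altschul form, in which the expected number of chance hits is K·m·n·exp(−λS). The following is a minimal sketch, not SALTO's own code; the function name and the values of `lam` and `K` are assumptions chosen only for illustration.

```python
import math

def evd_p_value(score, m, n, lam, K):
    """P(S >= score) under a Gumbel (extreme value) model of local alignment scores.

    m, n  : lengths of the two compared objects (e.g. query and profile)
    lam, K: statistical parameters; the values used below are hypothetical
    """
    expected_hits = K * m * n * math.exp(-lam * score)  # the E-value
    return -math.expm1(-expected_hits)                  # 1 - exp(-E), numerically stable

# Hypothetical parameters, for illustration only
p_low = evd_p_value(score=30, m=250, n=300, lam=0.27, K=0.04)
p_high = evd_p_value(score=80, m=250, n=300, lam=0.27, K=0.04)
```

Higher scores give smaller P-values, which is what makes such a fit useful for judging the significance of a database hit.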
Affiliation(s)
- Maricel G Kann
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD 20894, USA
2
Reinhardt A, Eisenberg D. DPANN: Improved sequence to structure alignments following fold recognition. Proteins 2004; 56:528-38. [PMID: 15229885 DOI: 10.1002/prot.20144] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/05/2022]
Abstract
In fold recognition (FR), a protein sequence of unknown structure is assigned to the closest known three-dimensional (3D) fold. Although FR programs can often identify, among all possible folds, the one a sequence adopts, they frequently fail to align the sequence to the equivalent residue positions in that fold. Such failures frustrate the next step in structure prediction, protein model building. Hence it is desirable to improve the quality of the alignments between the sequence and the identified structure. We have used artificial neural networks (ANN) to derive a substitution matrix to create alignments between a protein sequence and a protein structure through dynamic programming (DPANN: Dynamic Programming meets Artificial Neural Networks). The matrix is based on the amino acid type and the secondary structure state of each residue. In a database of protein pairs that have the same fold but lack sequence similarity, DPANN aligns over 30% of all sequences to the paired structure in a way that closely resembles the structural superposition of the pair. In over half of these cases the DPANN alignment is close to the structural superposition, although the initial alignment from the fold recognition step is not. Conversely, the alignment created during fold recognition outperforms DPANN in only 10% of all cases. Thus application of DPANN after fold recognition leads to substantial improvements in alignment accuracy, which in turn provides more useful templates for the modeling of protein structures. In the artificial case of using actual instead of predicted secondary structures for the probe protein, over 50% of the alignments are successful.
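The dynamic programming step described above can be sketched as a global alignment whose match score depends on the residue type and secondary structure states. This is an illustrative sketch only: the linear gap penalty, the exact inputs to the scoring function, and `toy_score` (a stand-in for the ANN-derived matrix) are assumptions, not details taken from the paper.

```python
def align_score(seq, ss_pred, template_ss, score, gap=-2.0):
    """Needleman-Wunsch global alignment score of a sequence against a template.

    seq         : query amino acid sequence
    ss_pred     : predicted secondary structure state per query residue
    template_ss : secondary structure state per template position
    score       : callable (aa, ss_query, ss_template) -> float, standing in
                  for a learned substitution matrix (hypothetical interface)
    gap         : linear gap penalty (an assumption)
    """
    n, m = len(seq), len(template_ss)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + score(seq[i - 1], ss_pred[i - 1], template_ss[j - 1])
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]

def toy_score(aa, ss_q, ss_t):
    """Placeholder scoring: reward matching secondary structure states only."""
    return 1.0 if ss_q == ss_t else -1.0

full = align_score("ACDE", "HHEE", "HHEE", toy_score)
short = align_score("ACD", "HHE", "HHEE", toy_score)
```

Replacing `toy_score` with a table lookup keyed on residue type and structural state would be the natural next step.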
3
Flohil JA, Vriend G, Berendsen HJC. Completion and refinement of 3-D homology models with restricted molecular dynamics: application to targets 47, 58, and 111 in the CASP modeling competition and posterior analysis. Proteins 2002; 48:593-604. [PMID: 12211026 DOI: 10.1002/prot.10105] [Citation(s) in RCA: 40] [Impact Index Per Article: 1.8] [Indexed: 11/10/2022]
Abstract
A method is presented to refine models built by homology by the use of restricted molecular dynamics (MD) techniques. The basic idea behind this method is the use of structure validation software to determine for each residue the likelihood that it is modeled correctly. This information is used to determine constraints and restraints in an MD simulation including explicit solvent molecules, which is used for model refinement. The procedure is based on the idea that residues that the validation software identifies as correctly positioned should be strongly constrained or restrained in the MD simulations, whereas residues that are likely to be positioned wrongly should move freely. Two different protocols are compared: one (applied to CASP3 target T58) using full structural constraints with separate optimization of each short fragment and the other (applied to T47) allowing some freedom using harmonic restraining potentials, with automatic optimization of the whole molecule. Structures along the MD trajectory that scored best in structural checks were selected for the construction of models that appeared to be successful in the CASP3 competition. Model refinement with MD in general leads to a model that is less like the experimental structure (Levitt et al. Nature Struct Biol 1999;6:108-111). In fact, the refined T47 model was slightly improved compared to the starting model, whereas the changes to the T58 model did not lead to further improvement. After the X-ray structures of the modeled proteins became known, the procedure was evaluated for two targets (T47 and the CASP4 target T111) by comparing a long simulation in water with the experimental target structures. It was found that structural improvements could be obtained on a nanosecond time scale by allowing appropriate freedom in the simulation. Structural checks applied to fast fluctuations do not appear to be informative for the correctness of the structure. However, both a simple hydrogen bond count and a simple compactness measure, if averaged over times of typically 300 ps, correlate well with structural correctness, and we suggest that criteria based on these properties may be used in computational folding strategies.
Affiliation(s)
- J A Flohil
- Groningen Biomolecular Sciences and Biotechnology Institute (GBB), Department of Biophysical Chemistry, University of Groningen, Groningen, The Netherlands
4
Bieńkowska JR, Rogers RG, Smith TF. Performance of threading scoring functions designed using new optimization method. J Comput Biol 2001; 6:299-311. [PMID: 10582568 DOI: 10.1089/106652799318283] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022]
Abstract
We present a new procedure for optimization of a threading scoring function. A scoring function is usually formulated in terms of the structural environment states that describe the protein fold model. We propose a method for the optimal selection of those structural environment states that naturally follows from the probabilistic description of the threading problem and is done prior to threading experiments. We demonstrate the selection of the optimal structural environment states for the solvent exposure of the amino acid position, and present the results of threading experiments performed using scoring functions designed with and without the optimization of the structural environment states. These results confirm that the optimal scoring function predicts the sequence-to-structure alignments most accurately. Threading experiments performed with 15 optimally designed scoring functions show that the correlation coefficient between the information content of the amino acid distribution that determines the scoring function and the accuracy of the optimal sequence-to-structure alignment is 0.94.
Affiliation(s)
- J R Bieńkowska
- BioMolecular Engineering Research Center, College of Engineering, Boston University, Massachusetts 02215, USA.
5
Jung J, Lee B. Use of residue pairs in protein sequence-sequence and sequence-structure alignments. Protein Sci 2000; 9:1576-88. [PMID: 10975579 PMCID: PMC2144723 DOI: 10.1110/ps.9.8.1576] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.4] [Indexed: 10/21/2022]
Abstract
Two new sets of scoring matrices are introduced: H2 for protein sequence comparison and T2 for protein sequence-structure correlation. Each element of H2 or T2 measures the frequency with which a pair of amino acid types in one protein, k residues apart in the sequence, is aligned with another pair of residues, of given amino acid types (for H2) or in given structural states (for T2), in other structurally homologous proteins. There are four types, corresponding to the k-values of 1 to 4, for both H2 and T2. These matrices were set up using a large number of structurally homologous protein pairs, with little sequence homology between the pair, that were recently generated using the structure comparison program SHEBA. The two scoring matrices were incorporated into the main body of the sequence alignment program SSEARCH in the FASTA package and tested in a fold recognition setting in which a set of 107 test sequences was aligned to each of a panel of 3,539 domains that represent all known protein structures. Six procedures were tested: the straight Smith-Waterman (SW) and FASTA procedures, which used the Blosum62 single residue type substitution matrix; BLAST and PSI-BLAST procedures, which also used the Blosum62 matrix; PASH, which used Blosum62 and H2 matrices; and PASSC, which used Blosum62, H2, and T2 matrices. All procedures gave similar results when the probe and target sequences had greater than 30% sequence identity. However, when the sequence identity was below 30%, a similar structure could be found for more sequences using PASSC than using any other procedure. PASH and PSI-BLAST gave the next best results.
Affiliation(s)
- J Jung
- Laboratory of Molecular Biology, Division of Basic Sciences, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
6
Panchenko AR, Marchler-Bauer A, Bryant SH. Combination of threading potentials and sequence profiles improves fold recognition. J Mol Biol 2000; 296:1319-31. [PMID: 10698636 DOI: 10.1006/jmbi.2000.3541] [Citation(s) in RCA: 102] [Impact Index Per Article: 4.3] [Indexed: 12/13/2022]
Abstract
Using a benchmark set of structurally similar proteins, we conduct a series of threading experiments intended to identify a scoring function with an optimal combination of contact-potential and sequence-profile terms. The benchmark set is selected to include many medium-difficulty fold recognition targets, where sequence similarity is undetectable by BLAST but structural similarity is extensive. The contact potential is based on the log-odds of non-local contacts involving different amino acid pairs, in native as opposed to randomly compacted structures. The sequence profile term is that used in PSI-BLAST. We find that combination of these terms significantly improves the success rate of fold recognition over use of either term alone, with respect to both recognition sensitivity and the accuracy of threading models. Improvement is greatest for targets between 10% and 20% sequence identity and 60% to 80% superimposable residues, where the number of models crossing critical accuracy and significance thresholds more than doubles. We suggest that these improvements account for the successful performance of the combined scoring function at CASP3. We discuss possible explanations as to why sequence-profile and contact-potential terms appear complementary.
Affiliation(s)
- A R Panchenko
- National Center for Biotechnology Information, National Institutes of Health, Building 38A, Room 8N805, Bethesda, MD 20894, USA
8
Takano K, Ota M, Ogasahara K, Yamagata Y, Nishikawa K, Yutani K. Experimental verification of the 'stability profile of mutant protein' (SPMP) data using mutant human lysozymes. Protein Engineering 1999; 12:663-72. [PMID: 10469827 DOI: 10.1093/protein/12.8.663] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.3] [Indexed: 11/13/2022]
Abstract
The stability profile of mutant protein (SPMP) (Ota,M., Kanaya,S. and Nishikawa,K., 1995, J. Mol. Biol., 248, 733-738) estimates the changes in conformational stability due to single amino acid substitutions using a pseudo-energy potential developed for evaluating structure-sequence compatibility in the structure prediction method, the 3D-1D compatibility evaluation. Nine mutant human lysozymes expected to significantly increase in stability from SPMP were constructed, in order to experimentally verify the reliability of SPMP. The thermodynamic parameters for denaturation and crystal structures of these mutant proteins were determined. One mutant protein was stabilized as expected, compared with the wild-type protein. However, the others were not stabilized even though the structural changes were subtle, indicating that SPMP overestimates the increase in stability or underestimates negative effects due to substitution. The stability changes in the other mutant human lysozymes previously reported were also analyzed by SPMP. The correlation of the stability changes between the experiment and prediction depended on the types of substitution: there were some correlations for proline mutants and cavity-creating mutants, but no correlation for mutants related to side-chain hydrogen bonds. The present results may indicate some additional factors that should be considered in the calculation of SPMP, suggesting that SPMP can be refined further.
Affiliation(s)
- K Takano
- Institute for Protein Research, Osaka University, Yamadaoka, Suita, Osaka 565-0871, Japan
9
Abstract
We present the recursive dynamic programming (RDP) method for the threading approach to three-dimensional protein structure prediction. RDP is based on the divide-and-conquer paradigm and maps the protein sequence whose backbone structure is to be found (the protein target) onto the known backbone structure of a model protein (the protein template) in a stepwise fashion, a technique that is similar to computing local alignments but uses different cost functions. We begin by mapping parts of the target onto the template that show statistically significant similarity with the template sequence. After mapping, the template structure is modified in order to account for the mapped target residues. Then significant similarities between the as yet unmapped parts of the target and the modified template are searched for, and the resulting segments of the target are mapped onto the template. This recursive process of identifying segments in the target to be mapped onto the template and modifying the template is continued until no significant similarities between the remaining parts of target and template are found. Those parts which are left unmapped by the procedure are interpreted as gaps. The RDP method is robust in the sense that different local alignment methods can be used, several alternatives of mapping parts of the target onto the template can be handled and compared in the process, and the cost functions can be dynamically adapted to biological needs. Our computer experiments show that the RDP procedure is efficient and effective. We can thread a typical protein sequence against a database of 887 template domains in about 12 hours even on a low-cost workstation (SUN Ultra 5). In statistical evaluations on databases of known protein structures, RDP significantly outperforms competing methods. RDP has been especially valuable in providing accurate alignments for modeling active sites of proteins. RDP is part of the ToPLign system (GMD Toolbox for protein alignment) and can be accessed via the WWW independently or in concert with other ToPLign tools at http://cartan.gmd.de/ToPLign.html.
Affiliation(s)
- R Thiele
- German National Research Center for Information Technology (GMD), Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, D-53754, Germany
10
Abstract
BACKGROUND: A principal goal of structure prediction is the elucidation of function. We have studied the ability of computed models to preserve the microenvironments of functional sites. In particular, 653 model structures of a calcium-binding protein (generated using an ab initio folding protocol) were analyzed, and the degree to which calcium-binding sites were recognizable was assessed.
RESULTS: While some model structures preserve the calcium-binding microenvironments, many others, including some with low root mean square deviations (rmsds) from the crystal structure of the native protein, do not. There is a very weak correlation between the overall rmsd of a structure and the preservation of calcium-binding sites. Only when the quality of the model structure is high (rmsd less than 2 Å for atoms in the 7 Å local neighborhood around calcium) does the modeling of the binding sites become reliable.
CONCLUSIONS: Protein structure prediction methods need to be assessed in terms of their preservation of functional sites. High-resolution structures are necessary for identifying binding sites such as calcium-binding sites.
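The local quality criterion above (rmsd for atoms within 7 Å of the calcium) can be illustrated with a short sketch. It assumes that model and native coordinates are already superimposed, listed in the same atom order, and that the neighborhood is selected from the native coordinates; these choices and the function name are assumptions, not the authors' protocol.

```python
import math

def local_rmsd(model, native, center, radius=7.0):
    """RMSD over atoms whose native position lies within `radius` Å of `center`.

    model, native : equal-length lists of (x, y, z) tuples in the same atom
                    order, assumed to be superimposed beforehand
    center        : (x, y, z) of the site of interest, e.g. a calcium ion
    """
    squared = [
        sum((a - b) ** 2 for a, b in zip(m_atom, n_atom))
        for m_atom, n_atom in zip(model, native)
        if math.dist(n_atom, center) <= radius
    ]
    if not squared:
        raise ValueError("no atoms within the given radius")
    return math.sqrt(sum(squared) / len(squared))

# Toy coordinates: the third atom lies outside the 7 Å neighborhood
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
model = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (10.0, 0.0, 5.0)]
site_rmsd = local_rmsd(model, native, center=(0.0, 0.0, 0.0))
```

In the toy data the distant atom deviates strongly but does not affect the local value, which mirrors the abstract's point that global and local accuracy can diverge.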
Affiliation(s)
- L Wei
- Stanford Medical Informatics, Stanford University School of Medicine, CA 94305-5479, USA
11
Lazaridis T, Karplus M. Discrimination of the native from misfolded protein models with an energy function including implicit solvation. J Mol Biol 1999; 288:477-87. [PMID: 10329155 DOI: 10.1006/jmbi.1999.2685] [Citation(s) in RCA: 226] [Impact Index Per Article: 9.0] [Indexed: 11/22/2022]
Abstract
An essential requirement for theoretical protein structure prediction is an energy function that can discriminate the native from non-native protein conformations. To date most of the energy functions used for this purpose have been extracted from a statistical analysis of the protein structure database, without explicit reference to the physical interactions responsible for protein stability. The use of the statistical functions has been supported by the widespread belief that they are superior for such discrimination to physics-based energy functions. An effective energy function that combines the CHARMM vacuum potential with a Gaussian model for the solvation free energy is tested for its ability to discriminate the native structure of a protein from misfolded conformations; the results are compared with those obtained with the vacuum CHARMM potential. The test is performed on several sets of misfolded structures prepared by others, including sets of about 650 good decoys for six proteins, as well as on misfolded structures of chymotrypsin inhibitor 2. The vacuum CHARMM potential is successful in most cases when energy-minimized conformations are considered, but fails when applied to structures relaxed by molecular dynamics. With the effective energy function the native state is always more stable than grossly misfolded conformations, both in energy-minimized and molecular dynamics-relaxed structures. The present results suggest that molecular mechanics (physics-based) energy functions, complemented by a simple model for the solvation free energy, should be tested for use in the inverse folding problem, and support their use in studies of the effective energy surface of proteins in solution. Moreover, the study suggests that the belief in the superiority of statistical functions for these purposes may be ill founded.
Affiliation(s)
- T Lazaridis
- Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford St, Cambridge, MA, 02138, USA
12
Abstract
Methods for protein structure (3D)-sequence (1D) compatibility evaluation (threading) have been developed during the past decade. The protocol in which a sequence can recognize its compatible structure in the structural library (i.e., the fold recognition or the forward-folding search) is available for the structure prediction of new proteins. However, the reverse protocol, in which a structure recognizes its homologous sequences among a sequence database, named the inverse-folding search, is a more difficult application. In this study, we have investigated the feasibility of the latter approach. A structural library, composed of about 400 well-resolved structures with mutually dissimilar sequences, was prepared, and 163 of them had remote homologs in the library. We examined whether they could correctly seek their homologs by both forward- and inverse-folding searches. The results showed that the inverse-folding protocol is more effective than the forward-folding protocol, once the reference states of the compatibility functions are appropriately adjusted. This adjustment only slightly affects the ability of the forward-folding search. We noticed that the scoring, in which a given sequence is re-mounted onto a structure according to the 3D-1D alignment determined by the dynamic programming method, is only effective in the forward-folding protocol and not in the inverse-folding protocol. Namely, the inverse-folding search works significantly better with the score given by the 3D-1D alignment per se, rather than that obtained by the re-mounting. The implications of these results are discussed.
Affiliation(s)
- M Ota
- National Institute of Genetics, Mishima, Shizuoka, Japan.
13
Ayers DJ, Gooley PR, Widmer-Cooper A, Torda AE. Enhanced protein fold recognition using secondary structure information from NMR. Protein Sci 1999; 8:1127-33. [PMID: 10338023 PMCID: PMC2144327 DOI: 10.1110/ps.8.5.1127] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.5] [Indexed: 10/21/2022]
Abstract
NMR offers the possibility of accurate secondary structure for proteins that would be too large for structure determination. In the absence of an X-ray crystal structure, this information should be useful as an adjunct to protein fold recognition methods based on low resolution force fields. The value of this information has been tested by adding varying amounts of artificial secondary structure data and threading a sequence through a library of candidate folds. Using a literature test set, the threading method alone has only a one-third chance of producing a correct answer among the top ten guesses. With realistic secondary structure information, one can expect a 60-80% chance of finding a homologous structure. The method has then been applied to examples with published estimates of secondary structure. This implementation is completely independent of sequence homology, and sequences are optimally aligned to candidate structures with gaps and insertions allowed. Unlike work using predicted secondary structure, we test the effect of differing amounts of relatively reliable data.
Affiliation(s)
- D J Ayers
- Research School of Chemistry, Australian National University, Canberra ACT
14
Le Novère N, Corringer PJ, Changeux JP. Improved secondary structure predictions for a nicotinic receptor subunit: incorporation of solvent accessibility and experimental data into a two-dimensional representation. Biophys J 1999; 76:2329-45. [PMID: 10233052 PMCID: PMC1300207 DOI: 10.1016/s0006-3495(99)77390-x] [Citation(s) in RCA: 87] [Impact Index Per Article: 3.5] [Indexed: 11/21/2022]
Abstract
A refined prediction of the nicotinic acetylcholine receptor (nAChR) subunits' secondary structure was computed with third-generation algorithms. The four selected programs, PHD, Predator, DSC, and NNSSP, based on different prediction approaches, were applied to each sequence of an alignment of nAChR and 5-HT3 receptor subunits, as well as a larger alignment with related subunit sequences from glycine and GABA receptors. A consensus prediction was computed for the nAChR subunits through a "winner takes all" method. By integrating the probabilities obtained with PHD, DSC, and NNSSP, this prediction was filtered in order to eliminate the singletons and to more precisely establish the structure limits (only 4% of the residues were modified). The final consensus secondary structure includes nine alpha-helices (24.2% of the residues, with an average length of 13.9 residues) and 17 beta-strands (22.5% of the residues, with an average length of 6.6 residues). The large extracellular domain is predicted to be mainly composed of beta-strands, with only two helices at the amino-terminal end. The transmembrane segments are predicted to be in a mixed alpha/beta topology (with a predominance of alpha-helices), with no known equivalent in the current protein database. The cytoplasmic domain is predicted to consist of two well-conserved amphipathic helices joined together by an unfolded stretch of variable length and sequence. In general, the segments predicted to occur in a periodic structure correspond to the more conserved regions, as defined by an analysis of sequence conservation per position performed on 152 superfamily members. The solvent accessibility of each residue was predicted from the multiple alignments with PHDacc. Each segment with more than three exposed residues was assumed to be external to the core protein. Overall, these data constitute an envelope of structural constraints. In a subsequent step, experimental data relative to the extracellular portion of the complete receptor were incorporated into the model. This led to a proposed two-dimensional representation of the secondary structure in which the peptide chain of the extracellular domain winds alternately between the two interfaces of the subunit. Although this representation is not a tertiary structure and does not lead to predictions of specific beta-beta interaction, it should provide a basic framework for further mutagenesis investigations and for fold recognition (threading) searches.
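The "winner takes all" consensus and the elimination of singletons can be illustrated roughly as follows. This is a simplified sketch: the three-state alphabet (H, E, C), the tie-breaking rule, and the neighbor-based singleton filter are assumptions. The paper's filter integrates the probabilities from PHD, DSC, and NNSSP, which this sketch does not attempt to reproduce.

```python
from collections import Counter

def consensus(predictions, tie="C"):
    """Per-residue majority vote over equal-length state strings (H, E, C).

    Positions where the top two states tie are assigned `tie` (an assumption).
    """
    result = []
    for column in zip(*predictions):
        counts = Counter(column).most_common()
        top_state, top_count = counts[0]
        tied = len(counts) > 1 and counts[1][1] == top_count
        result.append(tie if tied else top_state)
    return "".join(result)

def remove_singletons(states, fill="C"):
    """Replace isolated one-residue H or E assignments with `fill`."""
    out = list(states)
    for i, s in enumerate(states):
        if s in "HE":
            left = states[i - 1] if i > 0 else None
            right = states[i + 1] if i + 1 < len(states) else None
            if left != s and right != s:
                out[i] = fill
    return "".join(out)

# Four hypothetical predictor outputs for a five-residue stretch
preds = ["HHHCE", "HHCCE", "HHHCC", "CHHEE"]
voted = consensus(preds)
filtered = remove_singletons(voted)
```

Here the lone strand residue at the end survives the vote but is removed by the singleton filter.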
Affiliation(s)
- N Le Novère
- Centre National de la Recherche Scientifique URA D1284 Neurobiologie Moléculaire, Institut Pasteur, 75015 Paris, France.
15
de la Cruz X, Thornton JM. Factors limiting the performance of prediction-based fold recognition methods. Protein Sci 1999; 8:750-9. [PMID: 10211821 PMCID: PMC2144320 DOI: 10.1110/ps.8.4.750] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.6] [Indexed: 10/19/2022]
Abstract
In the past few years, a new generation of fold recognition methods has been developed, in which the classical sequence information is combined with information obtained from secondary structure and, sometimes, accessibility predictions. The results are promising, indicating that this approach may compete with potential-based methods (Rost B et al., 1997, J Mol Biol 270:471-480). Here we present a systematic study of the different factors contributing to the performance of these methods, in particular when applied to the problem of fold recognition of remote homologues. Our results indicate that secondary structure and accessibility prediction methods have reached an accuracy level where they are not the major factor limiting the accuracy of fold recognition. The pattern degeneracy problem is confirmed as the major source of error of these methods. On the basis of these results, we study three different options to overcome these limitations: normalization schemes, mapping of the coil state into the different zones of the Ramachandran plot, and post-threading graphical analysis.
Affiliation(s)
- X de la Cruz
- Department of Biochemistry and Molecular Biology, University College, London, United Kingdom