1
Lu J, Xu G, Zhang S, Lu B. An effective sequence-alignment-free superpositioning of pairwise or multiple structures with missing data. Algorithms Mol Biol 2016; 11:18. [PMID: 27330544 PMCID: PMC4915111 DOI: 10.1186/s13015-016-0079-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/22/2015] [Accepted: 05/18/2016] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Superpositioning is an important problem in structural biology. Determining an optimal superposition requires a one-to-one correspondence between the atoms of two protein structures. In practice, however, some atoms are missing from the original structures. Current superposition implementations handle the missing data crudely, by discarding such atoms from the structures. RESULTS: In this paper, we propose an effective method for superpositioning pairwise and multiple structures without sequence alignment. It is a two-stage procedure consisting of data reduction and data registration. CONCLUSIONS: Numerical experiments demonstrated that our method is effective and efficient. The code package for the protein structure superposition method, which addresses cases with missing data, is implemented in MATLAB and is freely available from http://sourceforge.net/projects/pssm123/files/?source=navbar.
Affiliation(s)
- Jianbo Lu
- Human Genetics Resource Center, National Research Institute for Family Planning, Beijing, 100081 China
- Graduate School of Peking Union Medical College, Beijing, 100730 China
- Guoliang Xu
- National Center for Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190 China
- Shihua Zhang
- National Center for Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190 China
- Benzhuo Lu
- National Center for Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190 China
2
Ma J, Wang S. Algorithms, Applications, and Challenges of Protein Structure Alignment. Advances in Protein Chemistry and Structural Biology 2014; 94:121-75. [DOI: 10.1016/b978-0-12-800168-4.00005-6] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Indexed: 12/29/2022]
3
Léonard S, Joseph AP, Srinivasan N, Gelly JC, de Brevern AG. mulPBA: an efficient multiple protein structure alignment method based on a structural alphabet. J Biomol Struct Dyn 2013; 32:661-8. [PMID: 23659291 DOI: 10.1080/07391102.2013.787026] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Indexed: 10/26/2022]
Abstract
The increasing number of available protein structures requires efficient tools for multiple structure comparison. Indeed, multiple structural alignments are essential for the analysis of function, evolution and architecture of protein structures. For this purpose, we proposed a new web server called multiple Protein Block Alignment (mulPBA). This server implements a method based on a structural alphabet to describe the backbone conformation of a protein chain in terms of dihedral angles. This 'sequence-like' representation enables the use of powerful sequence alignment methods for primary structure comparison, followed by an iterative refinement of the structural superposition. This approach yields alignments superior to most of the rigid-body alignment methods and highly comparable with the flexible structure comparison approaches. We implement this method in a web server designed to perform multiple structure superimpositions from a set of structures given by the user. Outputs are given as both a sequence alignment and superposed 3D structures, visualized directly by static images generated by PyMol or through a Jmol applet allowing dynamic interaction. Multiple global quality measures are given. Relatedness between structures is indicated by a distance dendrogram. Superimposed structures in PDB format can also be downloaded, and the results are quickly obtained. The mulPBA server can be accessed at www.dsimb.inserm.fr/dsimb_tools/mulpba/.
Affiliation(s)
- Sylvain Léonard
- INSERM UMR-S 665, DSIMB, 6, rue Alexandre Cabanel, F-75739, Paris, France
4
Joseph AP, Srinivasan N, de Brevern AG. Progressive structure-based alignment of homologous proteins: Adopting sequence comparison strategies. Biochimie 2012; 94:2025-34. [PMID: 22676903 DOI: 10.1016/j.biochi.2012.05.028] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Received: 12/22/2011] [Accepted: 05/21/2012] [Indexed: 12/30/2022]
Abstract
Comparison of multiple protein structures has a broad range of applications in the analysis of protein structure, function and evolution. Multiple structure alignment tools (MSTAs) are necessary to obtain a simultaneous comparison of a family of related folds. In this study, we have developed a method for multiple structure comparison largely based on sequence alignment techniques. A widely used Structural Alphabet named Protein Blocks (PBs) was used to transform the information on 3D protein backbone conformation into a 1D sequence string. A progressive alignment strategy similar to CLUSTALW was adopted for multiple PB sequence alignment (mulPBA). Highly similar stretches identified by the pairwise alignments are given higher weights during the alignment. The residue equivalences from PB-based alignments are used to obtain a three-dimensional fit of the structures, followed by an iterative refinement of the structural superposition. Systematic comparisons using benchmark datasets of MSTAs underline that the alignment quality is better than that of MULTIPROT, MUSTANG and the alignments in HOMSTRAD in more than 85% of the cases. Comparison with other rigid-body and flexible MSTAs also indicates that mulPBA alignments are superior to most of the rigid-body MSTAs and highly comparable to the flexible alignment methods.
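As an illustration of the idea of aligning structures through their 1D Protein Block strings, the following is a minimal sketch of a pairwise global (Needleman-Wunsch) alignment of two PB strings. It is not the mulPBA implementation: the function name `align_pb`, the flat match/mismatch/gap scores and the example strings are all assumptions made for illustration, and the published method uses its own scoring and weighting scheme, which the abstract does not specify.

```python
def align_pb(a, b, match=2, mismatch=-1, gap=-2):
    """Globally align two Protein Block strings (hypothetical flat scoring).

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,    # gap in b
                score[i][j - 1] + gap,    # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Two made-up PB strings (PB letters range from 'a' to 'p').
total, top, bottom = align_pb("mmmklm", "mmmlm")
print(total)
print(top)
print(bottom)
```

In a progressive scheme, pairwise alignments of this kind would be combined along a guide tree; the aligned positions then supply the residue equivalences used for the 3D fit.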
Affiliation(s)
- Agnel Praveen Joseph
- INSERM, UMR-S 665, Dynamique des Structures et Interactions des Macromolécules Biologiques, 6, rue Alexandre Cabanel, 75739 Paris Cedex 15, France
5
Shealy P, Valafar H. Multiple structure alignment with msTALI. BMC Bioinformatics 2012; 13:105. [PMID: 22607234 PMCID: PMC3473313 DOI: 10.1186/1471-2105-13-105] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Received: 08/13/2011] [Accepted: 04/18/2012] [Indexed: 11/10/2022]
Abstract
Background: Multiple structure alignments have received increasing attention in recent years as an alternative to multiple sequence alignments. Although multiple structure alignment algorithms can potentially be applied to a number of problems, they have primarily been used for protein core identification. A method that is capable of solving a variety of problems using structure comparison is still absent. Here we introduce a program msTALI for aligning multiple protein structures. Our algorithm uses several informative features to guide its alignments: torsion angles, backbone Cα atom positions, secondary structure, residue type, surface accessibility, and properties of nearby atoms. The algorithm allows the user to weight the types of information used to generate the alignment, which expands its utility to a wide variety of problems. Results: msTALI exhibits competitive results on 824 families from the Homstrad and SABmark databases when compared to Matt and Mustang. We also demonstrate success at building a database of protein cores using 341 randomly selected CATH domains and highlight the contribution of msTALI compared to the CATH classifications. Finally, we present an example applying msTALI to the problem of detecting hinges in a protein undergoing rigid-body motion. Conclusions: msTALI is an effective algorithm for multiple structure alignment. In addition to its performance on standard comparison databases, it utilizes clear, informative features, allowing further customization for domain-specific applications. The C++ source code for msTALI is available for Linux on the web at http://ifestos.cse.sc.edu/mstali.
Affiliation(s)
- Paul Shealy
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208, USA
6
Movaghar AF, Launay G, Schbath S, Gibrat JF, Rodolphe F. Statistical significance of threading scores. J Comput Biol 2011; 19:13-29. [PMID: 22149633 DOI: 10.1089/cmb.2011.0236] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022]
Abstract
We present a general method for assessing threading score significance. The threading score of a protein sequence, threaded onto a given structure, should be compared with the threading score distribution of a random amino-acid sequence of the same length, threaded onto the same structure; small p-values indicate significantly high scores. We claim that, due to general protein contact map properties, this reference distribution is a Weibull extreme value distribution whose parameters depend on the threading method, the structure, the length of the query and the random sequence simulation model used. These parameters can be estimated off-line with simulated sequence samples, for different sequence lengths. They can further be interpolated at the exact length of a query, enabling the quick computation of the p-value.
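The workflow described in this abstract (tabulate distribution parameters off-line per sequence length, interpolate at the query length, then read off a p-value) can be sketched as follows. This is only a sketch under stated assumptions: it uses the standard Weibull survival function exp(-((x - loc)/scale)^shape), linear interpolation, and a made-up parameter table; the authors' exact parameterization, interpolation scheme and fitted values are not given in the abstract, and the names `weibull_sf`, `interpolate_params` and `threading_p_value` are hypothetical.

```python
import math


def weibull_sf(x, shape, scale, loc=0.0):
    """Upper-tail probability P(X > x) of a Weibull distribution (assumed form)."""
    if x <= loc:
        return 1.0
    return math.exp(-(((x - loc) / scale) ** shape))


def interpolate_params(length, table):
    """Linearly interpolate (shape, scale) between tabulated sequence lengths.

    `table` maps a sequence length to a (shape, scale) pair estimated
    off-line; lengths outside the table are clamped to the nearest entry.
    """
    lengths = sorted(table)
    if length <= lengths[0]:
        return table[lengths[0]]
    if length >= lengths[-1]:
        return table[lengths[-1]]
    for lo, hi in zip(lengths, lengths[1:]):
        if lo <= length <= hi:
            t = (length - lo) / (hi - lo)
            return tuple(p + t * (q - p) for p, q in zip(table[lo], table[hi]))


def threading_p_value(score, length, table):
    """p-value of a threading score, assuming higher scores are better."""
    shape, scale = interpolate_params(length, table)
    return weibull_sf(score, shape, scale)


# Hypothetical off-line estimates: length -> (shape, scale). Not from the paper.
params = {100: (2.0, 10.0), 200: (3.0, 20.0)}
print(threading_p_value(15.0, 150, params))
```

Because the expensive simulation step happens once per structure and method, the per-query cost reduces to a table lookup and one exponential.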
7
McDonnell AV, Menke M, Palmer N, King J, Cowen L, Berger B. Fold recognition and accurate sequence-structure alignment of sequences directing beta-sheet proteins. Proteins 2006; 63:976-85. [PMID: 16547930 DOI: 10.1002/prot.20942] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Indexed: 11/11/2022]
Abstract
The ability to predict structure from sequence is particularly important for toxins, virulence factors, allergens, cytokines, and other proteins of public health importance. Many such functions are represented in the parallel beta-helix and beta-trefoil families. A method using pairwise beta-strand interaction probabilities coupled with evolutionary information represented by sequence profiles is developed to tackle these problems for the beta-helix and beta-trefoil folds. The algorithm BetaWrapPro employs a "wrapping" component that may capture folding processes with an initiation stage followed by processive interaction of the sequence with the already-formed motifs. BetaWrapPro outperforms all previous motif recognition programs for these folds, recognizing the beta-helix with 100% sensitivity and 99.7% specificity and the beta-trefoil with 100% sensitivity and 92.5% specificity, in cross-validation on a database of all nonredundant known positive and negative examples of these fold classes in the PDB. It additionally aligns 88% of residues for the beta-helices and 86% for the beta-trefoils accurately (within four residues of the exact position) to the structural template, which is then used with the side-chain packing program SCWRL to produce 3D structure predictions. One striking result has been the prediction of an unexpected parallel beta-helix structure for a pollen allergen, and its recent confirmation through solution of its structure. A Web server running BetaWrapPro is available and outputs putative PDB-style coordinates for sequences predicted to form the target folds.
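For readers unfamiliar with the two figures of merit quoted above, a short sketch of how sensitivity and specificity are computed from confusion-matrix counts may help. The counts in the example are hypothetical, chosen only so that the results resemble the reported percentages; the abstract does not give the underlying counts.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real positives (e.g. true beta-helices) that are recognized."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of real negatives that are correctly rejected."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts, not taken from the paper.
print(sensitivity(true_pos=12, false_neg=0))
print(specificity(true_neg=997, false_pos=3))
```

A sensitivity of 100% therefore means no known member of the fold was missed, while the specificity reflects how many non-members were wrongly assigned to it.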
Affiliation(s)
- Andrew V McDonnell
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
8
Lubec G, Afjehi-Sadat L, Yang JW, John JPP. Searching for hypothetical proteins: theory and practice based upon original data and literature. Prog Neurobiol 2005; 77:90-127. [PMID: 16271823 DOI: 10.1016/j.pneurobio.2005.10.001] [Citation(s) in RCA: 133] [Impact Index Per Article: 7.0] [Received: 05/23/2005] [Revised: 09/18/2005] [Accepted: 10/02/2005] [Indexed: 12/29/2022]
Abstract
A large part of mammalian proteomes is represented by hypothetical proteins (HP), i.e. proteins predicted from nucleic acid sequences only and protein sequences with unknown function. Databases are far from being complete and errors are expected. The legion of HP is awaiting experiments to show their existence at the protein level, and subsequent bioinformatic handling to assign the proteins a tentative function is mandatory. Two-dimensional gel electrophoresis with subsequent mass spectrometric identification of protein spots is an appropriate tool to search for HP in high-throughput mode. Spots are identified by MS or MS/MS measurements (MALDI-TOF, MALDI-TOF-TOF) and subsequent software such as Mascot or ProFound. In many cases proteins can thus be unambiguously identified and characterised; if this is not the case, de novo sequencing or Q-TOF analysis is warranted. If the protein is not identified, the sequence is sent to databases for BLAST searches to determine identities/similarities or homologies to known proteins. If no significant identity to known structures is observed, the protein sequence is examined for the presence of functional domains (databases PROSITE, PRINTS, InterPro, ProDom, Pfam and SMART) and subjected to searches for motifs (ELM); finally, protein-protein interaction databases (InterWeaver, STRING) are consulted or predictions from conformations are performed. We here provide information about hypothetical proteins in terms of protein chemical analysis, independent of antibody availability and specificity, and bioinformatic handling, to contribute to the extension/completion of protein databases, and include original work on HP in the brain to illustrate the processes of HP identification and functional assignment.
Affiliation(s)
- Gert Lubec
- Department of Pediatrics, Division of Basic Sciences, Medical University of Vienna, Waehringer Guertel 18-20, A-1090, Vienna, Austria.
9
Abstract
Sequence alignment profiles have been shown to be very powerful in creating accurate sequence alignments. Profiles are often used to search a sequence database with a local alignment algorithm. More accurate and longer alignments have been obtained with profile-to-profile comparison. There are several steps that must be performed in creating profile-profile alignments, and each involves choices in parameters and algorithms. These steps include (1) what sequences to include in a multiple alignment used to build each profile, (2) how to weight similar sequences in the multiple alignment and how to determine amino acid frequencies from the weighted alignment, (3) how to score a column from one profile aligned to a column of the other profile, (4) how to score gaps in the profile-profile alignment, and (5) how to include structural information. Large-scale benchmarks consisting of pairs of homologous proteins with structurally determined sequence alignments are necessary for evaluating the efficacy of each scoring scheme. With such a benchmark, we have investigated the properties of profile-profile alignments and found that (1) with optimized gap penalties, most column-column scoring functions behave similarly to one another in alignment accuracy; (2) some functions, however, have much higher search sensitivity and specificity; (3) position-specific weighting schemes in determining amino acid counts in columns of multiple sequence alignments are better than sequence-specific schemes; (4) removing positions in the profile with gaps in the query sequence results in better alignments; and (5) adding predicted and known secondary structure information improves alignments.
Affiliation(s)
- Guoli Wang
- Institute for Cancer Research, Fox Chase Cancer Center, 333 Cottman Avenue, Philadelphia, PA 19111, USA
10
11
Abstract
Here we present a simplified form of threading that uses only a 20 × 20 two-body residue-based potential and a restricted number of gaps. Despite its simplicity and transparency, the Monte Carlo-based threading algorithm performs very well in a rigorous test of fold recognition. The results suggest that by simplifying and constraining the decoy space, one can achieve better fold recognition. Fold recognition results are compared with and supplemented by a PSI-BLAST search. The statistical significance of threading results is rigorously evaluated from statistics of extremes by comparison with optimal alignments of a large set of randomly shuffled sequences. The statistical theory, based on the Random Energy Model, yields a cumulative statistical parameter, epsilon, that attests to the likelihood of correct fold recognition. A large epsilon indicates a significant energy gap between the optimal alignment and decoy alignments and, consequently, a high probability that the fold is correctly recognized. For a particular number of gaps, the epsilon parameter reaches its maximal value, and the fold is recognized. As the number of gaps further increases, the likelihood of correct fold recognition drops off. This is because the decoy space is small when gaps are restricted to a small number, but the native alignment is still well approximated, whereas an unrestricted increase in the number of gaps leads to rapid growth of the number of decoys and their statistical dominance over the correct alignment. It is shown that the best results are obtained when a combination of one-, two-, and three-gap threading is used. To this end, use of the epsilon parameter is crucial for rigorous comparison of results across the different decoy spaces corresponding to different numbers of gaps.
Affiliation(s)
- William Chen
- Department of Biophysics, Harvard University, Boston, Massachusetts, USA
12
Wangikar PP, Tendulkar AV, Ramya S, Mali DN, Sarawagi S. Functional sites in protein families uncovered via an objective and automated graph theoretic approach. J Mol Biol 2003; 326:955-78. [PMID: 12581652 DOI: 10.1016/s0022-2836(02)01384-0] [Citation(s) in RCA: 89] [Impact Index Per Article: 4.2] [Indexed: 11/24/2022]
Abstract
We report a method for detection of recurring side-chain patterns (DRESPAT) using an unbiased and automated graph theoretic approach. We first list all structural patterns as sub-graphs where the protein is represented as a graph. The patterns from proteins are compared pair-wise to detect patterns common to a protein pair based on content and geometry criteria. The recurring pattern is then detected using an automated search algorithm from the all-against-all pair-wise comparison data of proteins. Intra-protein pattern comparison data are used to enable detection of patterns recurring within a protein. A method has been proposed for empirical calculation of the statistical significance of a recurring pattern. The method was tested on 17 protein sets of varying size, composed of non-redundant representatives from SCOP superfamilies. Recurring patterns in serine proteases, cysteine proteases, lipases, cupredoxin, ferredoxin, ferritin, cytochrome c, aspartoyl proteases, peroxidases, phospholipase A2, endonuclease, SH3 domain, EF-hand and lectins show additional residues conserved in the vicinity of the known functional sites. On the basis of the recurring patterns in ferritin, EF-hand and lectins, we could separate proteins or domains that are structurally similar yet different in metal ion-binding characteristics. In addition, novel recurring patterns were observed in glutathione-S-transferase, phospholipase A2 and ferredoxin with potential structural/functional roles. The results are discussed in relation to the known functional sites in each family. Between 2000 and 50,000 patterns were enumerated from each protein, with between ten and 500 patterns detected as common to an evolutionarily related protein pair. Our results show that unbiased extraction of functional site patterns is not feasible from an evolutionarily related protein pair but is feasible from protein sets comprising five or more proteins. The DRESPAT method does not require a user-defined pattern, size or location of the pattern and therefore has the potential to uncover new functional sites in protein families.
Affiliation(s)
- Pramod P Wangikar
- Department of Chemical Engineering, Indian Institute of Technology, Bombay, Powai Mumbai 400 076, India.
13
Klepeis JL, Floudas CA. Prediction of beta-sheet topology and disulfide bridges in polypeptides. J Comput Chem 2003; 24:191-208. [PMID: 12497599 DOI: 10.1002/jcc.10167] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.7] [Indexed: 11/10/2022]
Abstract
An ab initio method has been developed to predict beta architectures in polypeptides. The approach predicts the topology of beta-sheets and disulfide bridges through a novel superstructure-based mathematical framework originally established for chemical process synthesis problems. Two types of superstructure are introduced, both of which emanate from the principle that hydrophobic interactions drive the formation of a beta-structure. The mathematical formulation of the problem results in a set of integer linear programming (ILP) problems that can be solved to global optimality to identify the optimal beta-configuration. These ILP models can also predict a rank-ordered list of the best, second-best, third-best, etc., topologies of beta-sheets and disulfide bridges. The approach is shown to perform very well for several benchmark polypeptide systems, as well as for polypeptides exhibiting challenging nonsequential beta-sheet topologies (56 to 187 amino acids).
Affiliation(s)
- J L Klepeis
- Department of Chemical Engineering, Princeton University, Princeton, New Jersey 08544-5263, USA
14
Saunders CT, Baker D. Evaluation of structural and evolutionary contributions to deleterious mutation prediction. J Mol Biol 2002; 322:891-901. [PMID: 12270722 DOI: 10.1016/s0022-2836(02)00813-6] [Citation(s) in RCA: 156] [Impact Index Per Article: 7.1] [Indexed: 10/27/2022]
Abstract
Methods for automated prediction of deleterious protein mutations have utilized both structural and evolutionary information but the relative contribution of these two factors remains unclear. To address this, we have used a variety of structural and evolutionary features to create simple deleterious mutation models that have been tested on both experimental mutagenesis and human allele data. We find that the most accurate predictions are obtained using a solvent-accessibility term, the C(beta) density, and a score derived from homologous sequences, SIFT. A classification tree using these two features has a cross-validated prediction error of 20.5% on an experimental mutagenesis test set when the prior probability for deleterious and neutral cases is equal, whereas this prediction error is 28.8% and 22.2% using either the C(beta) density or SIFT alone. The improvement imparted by structure increases when fewer homologs are available: when restricted to three homologs the prediction error improves from 26.9% using SIFT alone to 22.4% using SIFT and the C(beta) density, or 24.8% using SIFT and a noisy C(beta) density term approximating the inaccuracy of ab initio structures modeled by the Rosetta method. We conclude that methods for deleterious mutation prediction should include structural information when fewer than five to ten homologs are available, and that ab initio predicted structures may soon be useful in such cases when high-resolution structures are unavailable.
15
Lu L, Lu H, Skolnick J. MULTIPROSPECTOR: an algorithm for the prediction of protein-protein interactions by multimeric threading. Proteins 2002; 49:350-64. [PMID: 12360525 DOI: 10.1002/prot.10222] [Citation(s) in RCA: 196] [Impact Index Per Article: 8.9] [Indexed: 11/06/2022]
Abstract
In this postgenomic era, the ability to identify protein-protein interactions on a genomic scale is very important to assist in the assignment of physiological function. Because of the increasing number of solved structures involving protein complexes, the time is ripe to extend threading to the prediction of quaternary structure. In this spirit, a multimeric threading approach has been developed. The approach is comprised of two phases. In the first phase, traditional threading on a single chain is applied to generate a set of potential structures for the query sequences. In particular, we use our recently developed threading algorithm, PROSPECTOR. Then, for those proteins whose template structures are part of a known complex, we rethread on both partners in the complex and now include a protein-protein interfacial energy. To perform this analysis, a database of multimeric protein structures has been constructed, the necessary interfacial pairwise potentials have been derived, and a set of empirical indicators to identify true multimers based on the threading Z-score and the magnitude of the interfacial energy have been established. The algorithm has been tested on a benchmark set comprised of 40 homodimers, 15 heterodimers, and 69 monomers that were scanned against a protein library of 2478 structures that comprise a representative set of structures in the Protein Data Bank. Of these, the method correctly recognized and assigned 36 homodimers, 15 heterodimers, and 65 monomers. This protocol was applied to identify partners and assign quaternary structures of proteins found in the yeast database of interacting proteins. Our multimeric threading algorithm correctly predicts 144 interacting proteins, compared to the 56 (26) cases assigned by PSI-BLAST using a (less) permissive E-value of 1 (0.01). Next, all possible pairs of yeast proteins have been examined. Predictions (n = 2865) of protein-protein interactions are made; 1138 of these 2865 interactions have counterparts in the Database of Interacting Proteins. In contrast, PSI-BLAST made 1781 predictions, and 1215 have counterparts in DIP. An estimation of the false-negative rate for yeast-predicted interactions has also been provided. Thus, a promising approach to help assist in the assignment of protein-protein interactions on a genomic scale has been developed.
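The counts quoted in this abstract can be turned into rates with a few lines of arithmetic. The sketch below uses only numbers stated above; the variable names are illustrative, and treating the benchmark classes as a pooled "overall" recognition rate is an assumption of this example rather than a figure reported by the authors.

```python
# (correctly assigned, total) per benchmark class, as stated in the abstract.
benchmark = {
    "homodimers": (36, 40),
    "heterodimers": (15, 15),
    "monomers": (65, 69),
}

# Recognition rate for each class.
per_class = {name: hit / total for name, (hit, total) in benchmark.items()}

# Pooled rate over all 124 benchmark proteins (an illustrative summary).
overall = sum(hit for hit, _ in benchmark.values()) / sum(
    total for _, total in benchmark.values()
)

# Fraction of predicted yeast interactions with a counterpart in DIP.
dip_fraction_threading = 1138 / 2865
dip_fraction_psiblast = 1215 / 1781

for name, rate in per_class.items():
    print(f"{name}: {rate:.1%}")
print(f"overall: {overall:.1%}")
print(f"threading predictions found in DIP: {dip_fraction_threading:.1%}")
print(f"PSI-BLAST predictions found in DIP: {dip_fraction_psiblast:.1%}")
```

Read this way, threading makes more predictions (2865 versus 1781) but a smaller share of them have DIP counterparts, which is consistent with a method that reaches beyond interactions detectable from sequence similarity alone.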
Affiliation(s)
- Long Lu
- Laboratory of Computational Genomics, Donald Danforth Plant Science Center, St. Louis, Missouri 63132, USA
16
Abstract
We applied a new protocol based on PSI-Blast to predict the structures of fold recognition targets during CASP4. The protocol used a back-validation step to infer biologically significant connections between sequences with PSI-Blast E-values up to 10. If connections were found to proteins of known structure, alignments were generated by using HMMer. The protocol was implemented in a fully automated version (SBauto) and in a version that allowed manual intervention (SBfold). We found that the automated version made 17 predictions for target domains, of which 8 identified the correct fold with an average alignment accuracy of 24% for alignable residues and 43% for equivalent secondary structure elements. The manual version improved predictions somewhat, with 10 of 15 predictions identifying the correct fold with alignment accuracies of 33% for alignable residues and 64% for equivalent secondary structure elements. We describe successes and failures of our approach and discuss future developments of fold recognition.
Affiliation(s)
- K K Koretke
- Protein Bioinformatics Group, GlaxoSmithKline, Collegeville, Pennsylvania 19426-0989, USA.
17
An Y, Friesner RA. A novel fold recognition method using composite predicted secondary structures. Proteins 2002; 48:352-66. [PMID: 12112702 DOI: 10.1002/prot.10145] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.5] [Indexed: 11/08/2022]
Abstract
In this work, we introduce a new method for fold recognition using composite secondary structures assembled from different secondary structure prediction servers for a given target sequence. An automatic, complete, and robust way of finding all possible combinations of predicted secondary structure segments (SSS) for the target sequence and clustering them into a few flexible clusters, each containing patterns with the same number of SSS, is developed. This program then takes two steps in choosing plausible homologues: (i) an SSS-based alignment excludes impossible templates whose SSS patterns are very different from any of those of the target; (ii) a residue-based alignment selects good structural templates based on sequence similarity and secondary structure similarity between the target and only those templates left after the first stage. The secondary structure of each residue in the target is selected from one of the predictions to find the best match with the template. Truncation is applied to a target where different predictions vary. In most cases, a target is also divided into N-terminal and C-terminal fragments, each of which is used as a separate subsequence. Our program was tested on the fold recognition targets from CASP3 with known PDB codes and some available targets from CASP4. The results are compared with a structural homologue list for each target produced by the CE program (Shindyalov and Bourne, Protein Eng 1998;11:739-747). The program successfully locates homologues with high Z-score and low root-mean-square deviation within the top 30-50 predictions in the overwhelming majority of cases.
Affiliation(s)
- Yuling An
- Department of Chemistry and Center for Biomolecular Simulation, Columbia University, New York, New York 10027, USA
18
Abstract
Fold recognition predicts protein three-dimensional structure by establishing relationships between a protein sequence and known protein structures. Most methods explicitly use information derived from the secondary and tertiary structure of the templates. Here we show that rigorous application of a sequence search method (PSI-BLAST) with no reference to secondary or tertiary structure information is able to perform as well as traditional fold recognition methods. Since the method, SENSER, does not require knowledge of the three-dimensional structure, it can be used to infer relationships that are not tractable by methods dependent on structural templates.
Affiliation(s)
- Kristin K Koretke
- Microbial Bioinformatics Group, GlaxoSmithKline, Collegeville, Pennsylvania 19426-0989, USA.
|
19
|
de la Cruz X, Sillitoe I, Orengo C. Use of structure comparison methods for the refinement of protein structure predictions. I. Identifying the structural family of a protein from low-resolution models. Proteins 2002; 46:72-84. [PMID: 11746704 DOI: 10.1002/prot.10002] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 11/08/2022]
Abstract
Predicting the three-dimensional structure of proteins is still one of the most challenging problems in molecular biology. Despite its difficulty, several investigators have started to produce consistently low-resolution predictions for small proteins. However, in most of these cases, the prediction accuracy is still too low to make them useful. In the present article, we address the problem of obtaining better-quality predictions, starting from low-resolution models. To this end, we have devised a new procedure that uses these models, together with structure comparison methods, to identify the structural family of the target protein. This would allow the predictions to be refined, in a second step not described in the present work, using conserved features of the identified family. In our approach, the structure database is investigated using predictions, at different accuracy levels, for a given protein. As query structures, we used both low-resolution versions of the native structures and different sets of low-accuracy predictions. In general, we found that for predictions with a resolution of ≥5-7 Å, structure comparison methods were able to identify the fold of a protein in the top positions.
Affiliation(s)
- Xavier de la Cruz
- Departamento de Bioquímica y Biología Molecular, Facultad de Químicas, Universidad de Barcelona, Barcelona, Spain.
|
20
|
Abstract
The mapping of the human genome was completed earlier this year and efforts are underway to understand the role of gene products (i.e. proteins) in biological pathways and human disease and to exploit their functional roles to derive protein therapeutics and protein-based drugs. A key component to the next revolution in the 'post-genomic' era will be the increasingly widespread use of protein structure in rational experimental design. Improvements in quality, availability and utility of large-scale three- and four-dimensional protein structural information are enabling a revolution in rational design, having particular impact on drug discovery and optimization. New computational methodologies now yield modeled structures that are, in many cases, quantitatively comparable with crystal structures, at a fraction of the cost.
Affiliation(s)
- E T. Maggio
- Structural Bioinformatics, San Diego, CA 92127, USA. Tel: +1 858 675 2400; fax: +1 858 618 1040
|
21
|
Kolinski A, Betancourt MR, Kihara D, Rotkiewicz P, Skolnick J. Generalized comparative modeling (GENECOMP): a combination of sequence comparison, threading, and lattice modeling for protein structure prediction and refinement. Proteins 2001; 44:133-49. [PMID: 11391776 DOI: 10.1002/prot.1080] [Citation(s) in RCA: 76] [Impact Index Per Article: 3.3] [Indexed: 11/08/2022]
Abstract
An improved generalized comparative modeling method, GENECOMP, for the refinement of threading models is developed and validated on the Fischer database of 68 probe-template pairs, a standard benchmark used to evaluate threading approaches. The basic idea is to perform ab initio folding using a lattice protein model, SICHO, near the template provided by the new threading algorithm PROSPECTOR. PROSPECTOR also provides predicted contacts and secondary structure for the template-aligned regions, and possibly for the unaligned regions by garnering additional information from other top-scoring threaded structures. Since the lowest-energy structure generated by the simulations is not necessarily the best structure, we employed two structure-selection protocols: distance geometry and clustering. In general, clustering is found to generate somewhat better quality structures in 38 of 68 cases. When applied to the Fischer database, the protocol does no harm and in a significant number of cases improves upon the initial threading model, sometimes dramatically. The procedure is readily automated and can be implemented on a genomic scale.
Affiliation(s)
- A Kolinski
- Laboratory of Computational Genomics, Donald Danforth Plant Science Center, St. Louis, Missouri 63141, USA
|
22
|
Abstract
The mapping of the human genome was completed earlier this year and efforts are underway to understand the role of gene products (i.e. proteins) in biological pathways and human disease and to exploit their functional roles to derive protein therapeutics and protein-based drugs. A key component to the next revolution in the 'post-genomic' era will be the increasingly widespread use of protein structure in rational experimental design. Improvements in quality, availability and utility of large-scale 3D and 4D protein structural information are enabling a revolution in rational design, having particular impact on drug discovery and optimization. New computational methodologies now yield modeled structures that are, in many cases, quantitatively comparable with crystal structures, at a fraction of the cost.
Affiliation(s)
- E T Maggio
- Structural Bioinformatics Inc., 92127, San Diego, CA, USA.
|
23
|
Kim W, Wilbur WJ. Amino acid residue environments and predictions of residue type. Comput Chem 2001; 25:411-22. [PMID: 11459355 DOI: 10.1016/s0097-8485(01)00075-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/20/2022]
Abstract
The determination of a protein's structure from the knowledge of its linear chain is one of the important problems that remains as a bottleneck in interpreting the rapidly increasing repository of genetic sequence data. One approach to this problem that has shown promise and given a measure of success is threading. In this approach, contact energies between different amino acids are first determined by statistical methods applied to known structures. These contact energies are then applied to a sequence whose structure is to be determined by threading it through various known structures and determining the total threading energy for each candidate structure. The structure that yields the lowest total energy is then considered the leading candidate among all the structures tested. Additional information is often needed in order to support the results of threading studies, as it is well known in the field that the contact potentials used are not sufficiently sensitive to allow definitive conclusions. Here, we investigate the hypothesis that the environment of an amino acid residue, realized as all those residues not local to it on the chain but sufficiently close spatially, can supply information predictive of the type of that residue that is not adequately reflected in the individual contact energies. We present evidence that confirms this hypothesis and suggests a high-order cooperativity between the residues that surround a given residue and how they interact with it. We suggest a possible application to threading.
Affiliation(s)
- W Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
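The threading procedure summarized in the abstract of Kim and Wilbur above (sum contact energies over each candidate structure, then keep the lowest total) can be sketched as follows. This is a minimal illustration, not the authors' method: the two-letter alphabet, the `CONTACT_ENERGY` values, the template contact lists, and all function names are hypothetical, and the threading is ungapped.

```python
from typing import Dict, List, Tuple

# Hypothetical contact energies for a two-letter alphabet
# (H = hydrophobic, P = polar). Values are illustrative only and are
# not taken from any published statistical potential.
CONTACT_ENERGY: Dict[Tuple[str, str], float] = {
    ("H", "H"): -1.0,
    ("H", "P"): 0.2,
    ("P", "P"): 0.0,
}


def pair_energy(a: str, b: str) -> float:
    """Look up a contact energy, treating the table as symmetric."""
    return CONTACT_ENERGY.get((a, b), CONTACT_ENERGY.get((b, a), 0.0))


def threading_energy(sequence: str, contacts: List[Tuple[int, int]]) -> float:
    """Total energy of placing `sequence` on a template without gaps.

    `contacts` lists the template position pairs that are spatially close.
    """
    return sum(pair_energy(sequence[i], sequence[j]) for i, j in contacts)


def best_template(sequence: str,
                  templates: Dict[str, List[Tuple[int, int]]]) -> str:
    """Return the template name with the lowest total threading energy."""
    return min(templates, key=lambda name: threading_energy(sequence, templates[name]))


# Two hypothetical five-residue templates described only by their contacts.
TEMPLATES = {
    "fold_A": [(0, 3), (1, 4)],
    "fold_B": [(0, 4), (1, 3)],
}

if __name__ == "__main__":
    for name, contacts in TEMPLATES.items():
        print(name, threading_energy("HPPHP", contacts))
    print("leading candidate:", best_template("HPPHP", TEMPLATES))
```

A pairwise sum of this kind treats each contact independently, which is the limitation the abstract points to when it argues that the wider residue environment carries additional information.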
|
24
|
Abstract
Methods predicting protein secondary structure improved substantially in the 1990s through the use of evolutionary information taken from the divergence of proteins in the same structural family. Recently, the evolutionary information resulting from improved searches and larger databases has again boosted prediction accuracy by more than four percentage points to its current height of around 76% of all residues predicted correctly in one of the three states, helix, strand, and other. The past year also brought successful new concepts to the field. These new methods may be particularly interesting in light of the improvements achieved through simple combining of existing methods. Divergent evolutionary profiles contain enough information not only to substantially improve prediction accuracy, but also to correctly predict long stretches of identical residues observed in alternative secondary structure states depending on nonlocal conditions. An example is a method automatically identifying structural switches and thus finding a remarkable connection between predicted secondary structure and aspects of function. Secondary structure predictions are increasingly becoming the workhorse for numerous methods aimed at predicting protein structure and function. Is the recent increase in accuracy significant enough to make predictions even more useful? Because the recent improvement yields a better prediction of segments, and in particular of beta strands, I believe the answer is affirmative. What is the limit of prediction accuracy? We shall see.
Affiliation(s)
- B Rost
- CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, New York 10032, USA
|
25
|
Chasman D, Adams RM. Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. J Mol Biol 2001; 307:683-706. [PMID: 11254390 DOI: 10.1006/jmbi.2001.4510] [Citation(s) in RCA: 298] [Impact Index Per Article: 13.0] [Indexed: 01/04/2023]
Abstract
We have developed a formalism and a computational method for analyzing the potential functional consequences of non-synonymous single nucleotide polymorphisms. Our approach uses a structural model and phylogenetic information to derive a selection of structure and sequence-based features serving as indicators of an amino acid polymorphism's effect on function. The feature values can be integrated into a probabilistic assessment of whether an amino acid polymorphism will affect the function or stability of a target protein. The method has been validated with data sets of unbiased mutations in the lac repressor and lysozyme. Applying our methodology to recent surveys of genetic variation in the coding regions of clinically important genes, we estimate that approximately 26-32% of the natural non-synonymous single nucleotide polymorphisms have effects on function. This estimate suggests that a typical person will have about 6240-12,800 heterozygous loci that encode proteins with functional variation due to natural amino acid polymorphism.
Affiliation(s)
- D Chasman
- Variagenics, 60 Hampshire Street, Cambridge, MA 02144, USA.
|
26
|
|
27
|
Abstract
A homology-based structure prediction method ideally gives both a correct fold assignment and an accurate query-template alignment. In this article we show that the combination of two existing methods, PSI-BLAST and threading, leads to significant enhancement in the success rate of fold recognition. The combined approach, termed COBLATH, also yields much higher alignment accuracy than found in previous studies. It consists of two-way searches both by PSI-BLAST and by threading. In the PSI-BLAST portion, a query is used to search for hits in a library of potential templates and, conversely, each potential template is used to search for hits in a library of queries. In the threading portion, the scoring function is the sum of a sequence profile and a 6x6 substitution matrix between predicted query and known template secondary structure and solvent exposure. "Two-way" in threading means that the query's sequence profile is used to match the sequences of all potential templates and the sequence profiles of all potential templates are used to match the query's sequence. When tested on a set of 533 nonhomologous proteins, COBLATH was able to assign folds for 390 (73%). Among these 390 queries, 265 (68%) had root-mean-square deviations (RMSDs) of less than 8 Å between predicted and actual structures. Such a high success rate and accuracy make COBLATH an ideal tool for structural genomics.
Affiliation(s)
- Y Shan
- Department of Physics, Drexel University, Philadelphia, Pennsylvania 19104, USA
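The figures quoted in the COBLATH abstract above rest on two simple calculations: a root-mean-square deviation between matched coordinates and a success percentage. The sketch below shows both under stated assumptions: the coordinate sets are taken to be already superposed and matched one-to-one (no optimal superposition is performed), and the function names `rmsd` and `percent` are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Point], coords_b: Sequence[Point]) -> float:
    """RMSD between two equally long, already superposed coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def percent(part: int, whole: int) -> int:
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    actual = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]
    print("RMSD:", rmsd(predicted, actual))
    # Counts reported in the abstract: 390 of 533 folds assigned,
    # 265 of those 390 with RMSD below 8 Å.
    print(percent(390, 533), percent(265, 390))
```

The second percentage is taken over the 390 assigned queries rather than all 533 proteins, which matches how the abstract states it.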
|
28
|
Abstract
Several recent publications illustrated advantages of using sequence profiles in recognizing distant homologies between proteins. At the same time, the practical usefulness of distant homology recognition depends not only on the sensitivity of the algorithm, but also on the quality of the alignment between a prediction target and the template from the database of known proteins. Here, we study this question for several supersensitive protein algorithms that were previously compared in their recognition sensitivity (Rychlewski et al., 2000). A database of protein pairs with similar structures but low sequence similarity is used to rate the alignments obtained with several different methods, which included sequence-sequence, sequence-profile, and profile-profile alignment methods. We show that incorporation of evolutionary information encoded in sequence profiles into alignment calculation methods significantly increases the alignment accuracy, bringing them closer to the alignments obtained from structure comparison. In general, alignment quality is correlated with recognition and alignment score significance. For every alignment method, alignments with statistically significant scores correlate with both correct structural templates and good quality alignments. At the same time, average alignment lengths differ in various methods, making the comparison between them difficult. For instance, the alignments obtained by FFAS, the profile-profile alignment algorithm developed in our group, are always longer than the alignments obtained with the PSI-BLAST algorithms. To address this problem, we develop methods to truncate or extend alignments to cover a specified percentage of protein lengths. In most cases, the elongation of the alignment by profile-profile methods is reasonable, adding fragments of similar structure. The examples of erroneous alignment are examined and it is shown that they can be identified based on the model quality.
Affiliation(s)
- L Jaroszewski
- The Burnham Institute, La Jolla, California 92037, USA
|
29
|
Kelley LA, MacCallum RM, Sternberg MJ. Enhanced genome annotation using structural profiles in the program 3D-PSSM. J Mol Biol 2000; 299:499-520. [PMID: 10860755 DOI: 10.1006/jmbi.2000.3741] [Citation(s) in RCA: 1198] [Impact Index Per Article: 49.9] [Indexed: 11/22/2022]
Abstract
A method (three-dimensional position-specific scoring matrix, 3D-PSSM) to recognise remote protein sequence homologues is described. The method combines the power of multiple sequence profiles with knowledge of protein structure to provide enhanced recognition and thus functional assignment of newly sequenced genomes. The method uses structural alignments of homologous proteins of similar three-dimensional structure in the structural classification of proteins (SCOP) database to obtain a structural equivalence of residues. These equivalences are used to extend multiply aligned sequences obtained by standard sequence searches. The resulting large superfamily-based multiple alignment is converted into a PSSM. Combined with secondary structure matching and solvation potentials, 3D-PSSM can recognise structural and functional relationships beyond state-of-the-art sequence methods. In a cross-validated benchmark on 136 homologous relationships unambiguously undetectable by position-specific iterated basic local alignment search tool (PSI-Blast), 3D-PSSM can confidently assign 18%. The method was applied to the remaining unassigned regions of the Mycoplasma genitalium genome and an additional 13 regions were assigned with 95% confidence. 3D-PSSM is available to the community as a web server: http://www.bmm.icnet.uk/servers/3dpssm
Affiliation(s)
- L A Kelley
- Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, 44 Lincoln's Inn Fields, London, WC2A 3PX, England
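The 3D-PSSM abstract above mentions converting a multiple alignment into a PSSM. As a rough illustration of that one step, the sketch below builds a generic log-odds PSSM from a gapped alignment. It is not the 3D-PSSM procedure: the uniform background frequencies, the additive pseudocount, the handling of gaps, and the names `build_pssm` and `score` are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, List

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(alignment: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Build a log-odds PSSM (one dict per column) from aligned sequences.

    Assumes a uniform background and ignores gap characters.
    """
    if not alignment or len({len(seq) for seq in alignment}) != 1:
        raise ValueError("alignment must be non-empty with equal-length rows")
    background = 1.0 / len(ALPHABET)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm


def score(pssm: List[Dict[str, float]], sequence: str) -> float:
    """Score an ungapped sequence against the PSSM, column by column."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))


if __name__ == "__main__":
    pssm = build_pssm(["ACD", "ACE", "AC-"])
    print(round(score(pssm, "ACD"), 3), round(score(pssm, "WWW"), 3))
```

Residues that recur in a column score above zero and unseen residues score below zero, so a sequence resembling the family outscores an unrelated one.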
|
30
|
Domingues FS, Lackner P, Andreeva A, Sippl MJ. Structure-based evaluation of sequence comparison and fold recognition alignment accuracy. J Mol Biol 2000; 297:1003-13. [PMID: 10736233 DOI: 10.1006/jmbi.2000.3615] [Citation(s) in RCA: 72] [Impact Index Per Article: 3.0] [Indexed: 11/22/2022]
Abstract
The biological role, biochemical function, and structure of uncharacterized protein sequences is often inferred from their similarity to known proteins. A constant goal is to increase the reliability, sensitivity, and accuracy of alignment techniques to enable the detection of increasingly distant relationships. Development, tuning, and testing of these methods benefit from appropriate benchmarks for the assessment of alignment accuracy. Here, we describe a benchmark protocol to estimate sequence-to-sequence and sequence-to-structure alignment accuracy. The protocol consists of structurally related pairs of proteins and procedures to evaluate alignment accuracy over the whole set. The set of protein pairs covers all the currently known fold types. The benchmark is challenging in the sense that it consists of proteins lacking clear sequence similarity. Correct target alignments are derived from the three-dimensional structures of these pairs by rigid body superposition. An evaluation engine computes the accuracy of alignments obtained from a particular algorithm in terms of alignment shifts with respect to the structure-derived alignments. Using this benchmark we estimate that the best results can be obtained from a combination of amino acid residue substitution matrices and knowledge-based potentials.
Affiliation(s)
- F S Domingues
- Center for Applied Molecular Engineering, Institute for Chemistry and Biochemistry, University of Salzburg, Jakob Haringer Strasse 3, Salzburg, A-5020, Austria
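The abstract of Domingues et al. above measures accuracy as alignment shifts relative to a structure-derived alignment. One simple way to express that idea is sketched below. The definition used here (per-residue offset between predicted and reference template positions, with a tolerance) is an assumption for illustration and may differ from the paper's evaluation engine; the names `alignment_shifts` and `shift_accuracy` are hypothetical.

```python
from typing import Dict


def alignment_shifts(reference: Dict[int, int], predicted: Dict[int, int]) -> Dict[int, int]:
    """Per-residue shift: predicted template position minus reference position.

    Both alignments map query residue index -> template residue index.
    Query residues left unaligned by the prediction are omitted.
    """
    return {q: predicted[q] - reference[q] for q in reference if q in predicted}


def shift_accuracy(reference: Dict[int, int], predicted: Dict[int, int],
                   tolerance: int = 0) -> float:
    """Fraction of reference-aligned residues placed within `tolerance` positions."""
    if not reference:
        raise ValueError("reference alignment must not be empty")
    shifts = alignment_shifts(reference, predicted)
    within = sum(1 for shift in shifts.values() if abs(shift) <= tolerance)
    return within / len(reference)


if __name__ == "__main__":
    reference = {0: 10, 1: 11, 2: 12, 3: 13}   # structure-derived alignment
    predicted = {0: 10, 1: 11, 2: 14, 3: 14}   # alignment under evaluation
    print(alignment_shifts(reference, predicted))
    print(shift_accuracy(reference, predicted),
          shift_accuracy(reference, predicted, tolerance=1))
```

Dividing by the size of the reference alignment means residues the prediction leaves unaligned count as errors, which penalizes short alignments.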
|
31
|
Panchenko AR, Marchler-Bauer A, Bryant SH. Combination of threading potentials and sequence profiles improves fold recognition. J Mol Biol 2000; 296:1319-31. [PMID: 10698636 DOI: 10.1006/jmbi.2000.3541] [Citation(s) in RCA: 102] [Impact Index Per Article: 4.3] [Indexed: 12/13/2022]
Abstract
Using a benchmark set of structurally similar proteins, we conduct a series of threading experiments intended to identify a scoring function with an optimal combination of contact-potential and sequence-profile terms. The benchmark set is selected to include many medium-difficulty fold recognition targets, where sequence similarity is undetectable by BLAST but structural similarity is extensive. The contact potential is based on the log-odds of non-local contacts involving different amino acid pairs, in native as opposed to randomly compacted structures. The sequence profile term is that used in PSI-BLAST. We find that combination of these terms significantly improves the success rate of fold recognition over use of either term alone, with respect to both recognition sensitivity and the accuracy of threading models. Improvement is greatest for targets between 10% and 20% sequence identity and 60% to 80% superimposable residues, where the number of models crossing critical accuracy and significance thresholds more than doubles. We suggest that these improvements account for the successful performance of the combined scoring function at CASP3. We discuss possible explanations as to why sequence-profile and contact-potential terms appear complementary.
Affiliation(s)
- A R Panchenko
- National Center for Biotechnology Information, National Institutes of Health, Building 38A, Room 8N805, Bethesda, MD 20894, USA
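The abstract of Panchenko et al. above describes a scoring function that combines a sequence-profile term with a contact-potential term. A linear mix is one simple way to picture such a combination, sketched below. The linear form, the `weight` parameter, the convention that higher scores are better, and the candidate values are all assumptions for illustration; the abstract does not specify how the terms are actually combined.

```python
from typing import Dict, List, Tuple


def combined_score(profile_score: float, contact_score: float, weight: float = 0.5) -> float:
    """Linear mix of the two terms; `weight` is the share given to the profile term."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * profile_score + (1.0 - weight) * contact_score


def rank_templates(candidates: Dict[str, Tuple[float, float]], weight: float = 0.5) -> List[str]:
    """Rank template names by combined score, best first (higher is better here)."""
    return sorted(
        candidates,
        key=lambda name: combined_score(*candidates[name], weight=weight),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical (profile_score, contact_score) pairs for three templates.
    candidates = {"t1": (5.0, 1.0), "t2": (3.0, 4.0), "t3": (1.0, 2.0)}
    for w in (0.0, 0.5, 1.0):
        print(w, rank_templates(candidates, weight=w))
```

Sweeping `weight` from 0 to 1 shows how the ranking changes between using either term alone and using a mix, which is the kind of comparison the abstract reports.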
|
32
|
Abstract
The current state of the art in modeling protein structure has been assessed, based on the results of the CASP (Critical Assessment of protein Structure Prediction) experiments. In comparative modeling, improvements have been made in sequence alignment, sidechain orientation and loop building. Refinement of the models remains a serious challenge. Improved sequence profile methods have had a large impact in fold recognition. Although there has been some progress in alignment quality, this factor still limits model usefulness. In ab initio structure prediction, there has been notable progress in building approximately correct structures of 40-60 residue-long protein fragments. There is still a long way to go before the general ab initio prediction problem is solved. Overall, the field is maturing into a practical technology, able to deliver useful models for a large number of sequences.
Affiliation(s)
- J Moult
- Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, Rockville, MD 20850, USA.
|
33
|
|
34
|
|