1
Sanfelice D, Sanz-Hernández M, de Simone A, Bullard B, Pastore A. Toward Understanding the Molecular Bases of Stretch Activation: A STRUCTURAL COMPARISON OF THE TWO TROPONIN C ISOFORMS OF LETHOCERUS. J Biol Chem 2016; 291:16090-9. [PMID: 27226601 PMCID: PMC4965559 DOI: 10.1074/jbc.m116.726646] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 03/10/2016] [Revised: 05/18/2016] [Indexed: 11/25/2022]
Abstract
Muscles are usually activated by calcium binding to the calcium sensory protein troponin-C, which is one of the three components of the troponin complex. However, in cardiac and insect flight muscle activation is also produced by mechanical stress. Little is known about the molecular bases of this calcium-independent activation. In Lethocerus, a giant water bug often used as a model system because of its large muscle fibers, there are two troponin-C isoforms, called F1 and F2, that have distinct roles in activating the muscle. It has been suggested that this can be explained either by differences in structural features or by differences in the interactions with other proteins. Here we have compared the structural and dynamic properties of the two proteins and shown how they differ. We have also mapped the interactions of the F2 isoform with peptides spanning the sequence of its natural partner, troponin-I. Our data have allowed us to build a model of the troponin complex and may eventually help in understanding the specialized function of the F1 and F2 isoforms and the molecular mechanism of stretch activation.
Affiliation(s)
- Domenico Sanfelice
- From the Department of Clinical and Basic Neurosciences, Wohl Institute, King's College, London SE5 3RT, United Kingdom
- Alfonso de Simone
- the Department of Life Sciences, Imperial College, London SW7 2AZ, United Kingdom
- Belinda Bullard
- the Department of Biology, University of York, York YO10 5DD, United Kingdom, and
- Annalisa Pastore
- From the Department of Clinical and Basic Neurosciences, Wohl Institute, King's College, London SE5 3RT, United Kingdom, the Department of Molecular Medicine, Universita' of Pavia, Pavia I27100, Italy
2
Esque J, Urbain A, Etchebest C, de Brevern AG. Sequence-structure relationship study in all-α transmembrane proteins using an unsupervised learning approach. Amino Acids 2015; 47:2303-22. [PMID: 26043903 DOI: 10.1007/s00726-015-2010-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 10/31/2014] [Accepted: 05/15/2015] [Indexed: 01/28/2023]
Abstract
Transmembrane proteins (TMPs) are major drug targets, but the knowledge of their precise topology structure remains highly limited compared with globular proteins. In spite of the difficulties in obtaining their structures, an important effort has been made these last years to increase their number from an experimental and computational point of view. In view of this emerging challenge, the development of computational methods to extract knowledge from these data is crucial for the better understanding of their functions and in improving the quality of structural models. Here, we revisit an efficient unsupervised learning procedure, called Hybrid Protein Model (HPM), which is applied to the analysis of transmembrane proteins belonging to the all-α structural class. HPM method is an original classification procedure that efficiently combines sequence and structure learning. The procedure was initially applied to the analysis of globular proteins. In the present case, HPM classifies a set of overlapping protein fragments, extracted from a non-redundant databank of TMP 3D structure. After fine-tuning of the learning parameters, the optimal classification results in 65 clusters. They represent at best similar relationships between sequence and local structure properties of TMPs. Interestingly, HPM distinguishes among the resulting clusters two helical regions with distinct hydrophobic patterns. This underlines the complexity of the topology of these proteins. The HPM classification enlightens unusual relationship between amino acids in TMP fragments, which can be useful to elaborate new amino acids substitution matrices. Finally, two challenging applications are described: the first one aims at annotating protein functions (channel or not), the second one intends to assess the quality of the structures (X-ray or models) via a new scoring function deduced from the HPM classification.
Affiliation(s)
- Jérémy Esque
- INSERM, U 1134, DSIMB, 75739, Paris, France; Univ. Paris Diderot, Sorbonne Paris Cité UMR-S 1134, 75739, Paris, France; Institut National de la Transfusion Sanguine (INTS), 75739, Paris, France; Laboratoire d'Excellence GR-Ex, 75739, Paris, France; Laboratoire d'Ingénierie des Fonctions Moléculaire (IFM), ISIS, UMR 7006, 67000, Strasbourg, France; Department of Integrative Structural Biology, INSERM U964, Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC), 67404, Illkirch, France; UMR7104, Centre National de la Recherche Scientifique (CNRS), 67404, Illkirch, France; Université de Strasbourg, 67404, Illkirch, France
- Aurélie Urbain
- Institut Jean-Pierre Bourgin, INRA, UMR 1318, 78026, Versailles, France
- Catherine Etchebest
- INSERM, U 1134, DSIMB, 75739, Paris, France; Univ. Paris Diderot, Sorbonne Paris Cité UMR-S 1134, 75739, Paris, France; Institut National de la Transfusion Sanguine (INTS), 75739, Paris, France; Laboratoire d'Excellence GR-Ex, 75739, Paris, France
- Alexandre G de Brevern
- INSERM, U 1134, DSIMB, 75739, Paris, France; Univ. Paris Diderot, Sorbonne Paris Cité UMR-S 1134, 75739, Paris, France; Institut National de la Transfusion Sanguine (INTS), 75739, Paris, France; Laboratoire d'Excellence GR-Ex, 75739, Paris, France
3
Enzymatic hydrolyzed feather peptide, a welcoming drug for multiple-antibiotic-resistant Staphylococcus aureus: structural analysis and characterization. Appl Biochem Biotechnol 2015; 175:3371-86. [PMID: 25649444 DOI: 10.1007/s12010-015-1509-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 08/18/2014] [Accepted: 01/21/2015] [Indexed: 10/24/2022]
Abstract
This study aimed to explore the bactericidal activity of a feather-degraded active peptide against multiple-antibiotic-resistant (MAR) Staphylococcus aureus. An antibacterial peptide (ABP) was isolated from the chicken feathers containing fermented media of Paenibacillus woosongensis TKB2, a keratinolytic soil isolate. It was purified by HPLC, and its mass was found to be 4666.87 Da using matrix-assisted laser desorption ionization time-of-flight (MALDI-TOF) spectroscopy. The minimum inhibitory concentration (MIC) and minimum bactericidal concentration (MBC) values of this peptide were 22.5 and 90 μg/ml, respectively. SEM study revealed the distorted cell wall of the test strain along with pore formation. The possible reason for bactericidal activity of the peptide is due to generation of reactive oxygen species (ROS), resulting in membrane damage and leakage of intracellular protein. Complete sequence of the peptide was predicted and retrieved from the sequence database of chicken feather keratin after in silico trypsin digestion using ExPASy tools. Further, net charge, hydrophobicity (77.7 %) and molecular modelling of the peptide were evaluated for better understanding of its mode of action. The hydrophobic region (17 to 27) of the peptide may facilitate for initial attachment on the bacterial membrane. The ABP exhibited no adverse effects on RBC membrane and HT-29 human cell line. This cytosafe peptide can be exploited as an effective therapeutic agent to combat Staphylococcal infections.
4
Filipic B, Nikolic K, Filipic S, Jovcic B, Agbaba D, Antic Stankovic J, Kojic M, Golic N. Identifying the CmbT substrates specificity by using a quantitative structure–activity relationship (QSAR) study. J Taiwan Inst Chem Eng 2014. [DOI: 10.1016/j.jtice.2013.09.033] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/26/2022]
5
Cao R, Wang Z, Cheng J. Designing and evaluating the MULTICOM protein local and global model quality prediction methods in the CASP10 experiment. BMC STRUCTURAL BIOLOGY 2014; 14:13. [PMID: 24731387 PMCID: PMC3996498 DOI: 10.1186/1472-6807-14-13] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.9] [Received: 10/18/2013] [Accepted: 04/01/2014] [Indexed: 11/10/2022]
Abstract
BACKGROUND Protein model quality assessment is an essential component of generating and using protein structural models. During the Tenth Critical Assessment of Techniques for Protein Structure Prediction (CASP10), we developed and tested four automated methods (MULTICOM-REFINE, MULTICOM-CLUSTER, MULTICOM-NOVEL, and MULTICOM-CONSTRUCT) that predicted both local and global quality of protein structural models. RESULTS MULTICOM-REFINE was a clustering approach that used the average pairwise structural similarity between models to measure the global quality and the average Euclidean distance between a model and several top ranked models to measure the local quality. MULTICOM-CLUSTER and MULTICOM-NOVEL were two new support vector machine-based methods of predicting both the local and global quality of a single protein model. MULTICOM-CONSTRUCT was a new weighted pairwise model comparison (clustering) method that used the weighted average similarity between models in a pool to measure the global model quality. Our experiments showed that the pairwise model assessment methods worked better when a large portion of models in the pool were of good quality, whereas single-model quality assessment methods performed better on some hard targets when only a small portion of models in the pool were of reasonable quality. CONCLUSIONS Since digging out a few good models from a large pool of low-quality models is a major challenge in protein structure prediction, single model quality assessment methods appear to be poised to make important contributions to protein structure modeling. The other interesting finding was that single-model quality assessment scores could be used to weight the models by the consensus pairwise model comparison method to improve its accuracy.
Affiliation(s)
- Jianlin Cheng
- Computer Science Department, University of Missouri, Columbia, Missouri 65211, USA.
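The pairwise (clustering) idea described in the abstract above, scoring each model by its average similarity to the other models in the pool, can be sketched roughly as follows. This is a minimal illustration and not the MULTICOM implementation: the function names and the toy similarity measure are hypothetical, and a real pipeline would use a structural similarity score such as GDT-TS computed on superimposed 3D models.

```python
from typing import Callable, List, Sequence

Model = Sequence[float]  # toy stand-in: one coordinate per residue (assumption)


def toy_similarity(a: Model, b: Model, tolerance: float = 0.5) -> float:
    """Hypothetical similarity: fraction of positions that agree within a tolerance."""
    if len(a) != len(b) or not a:
        raise ValueError("models must be non-empty and of equal length")
    close = sum(abs(x - y) <= tolerance for x, y in zip(a, b))
    return close / len(a)


def consensus_scores(
    models: Sequence[Model],
    similarity: Callable[[Model, Model], float],
) -> List[float]:
    """Global quality of each model = mean similarity to every other model in the pool."""
    n = len(models)
    if n < 2:
        raise ValueError("a consensus score needs at least two models")
    scores = []
    for i in range(n):
        total = sum(similarity(models[i], models[j]) for j in range(n) if j != i)
        scores.append(total / (n - 1))
    return scores


if __name__ == "__main__":
    pool = [
        [0.0, 1.0, 2.0, 3.0],
        [0.1, 1.1, 2.1, 3.1],
        [0.0, 1.0, 5.0, 9.0],  # outlier: disagrees on half of the positions
    ]
    print(consensus_scores(pool, toy_similarity))
```

In this toy pool the outlier receives the lowest score because it agrees least with the rest, which also illustrates why such methods depend on a large share of the pool being of good quality.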
6
Computational Approaches and Resources in Single Amino Acid Substitutions Analysis Toward Clinical Research. ADVANCES IN PROTEIN CHEMISTRY AND STRUCTURAL BIOLOGY 2014; 94:365-423. [DOI: 10.1016/b978-0-12-800168-4.00010-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Indexed: 02/08/2023]
7
Ray A, Lindahl E, Wallner B. Improved model quality assessment using ProQ2. BMC Bioinformatics 2012; 13:224. [PMID: 22963006 PMCID: PMC3584948 DOI: 10.1186/1471-2105-13-224] [Citation(s) in RCA: 150] [Impact Index Per Article: 12.5] [Received: 05/11/2012] [Accepted: 09/07/2012] [Indexed: 11/19/2022]
Abstract
Background Employing methods to assess the quality of modeled protein structures is now standard practice in bioinformatics. In a broad sense, the techniques can be divided into methods relying on consensus prediction on the one hand, and single-model methods on the other. Consensus methods frequently perform very well when there is a clear consensus, but this is not always the case. In particular, they frequently fail in selecting the best possible model in the hard cases (lacking consensus) or in the easy cases where models are very similar. In contrast, single-model methods do not suffer from these drawbacks and could potentially be applied on any protein of interest to assess quality or as a scoring function for sampling-based refinement. Results Here, we present a new single-model method, ProQ2, based on ideas from its predecessor, ProQ. ProQ2 is a model quality assessment algorithm that uses support vector machines to predict local as well as global quality of protein models. Improved performance is obtained by combining previously used features with updated structural and predicted features. The most important contribution can be attributed to the use of profile weighting of the residue specific features and the use of features averaged over the whole model even though the prediction is still local. Conclusions ProQ2 is significantly better than its predecessors at detecting high quality models, improving the sum of Z-scores for the selected first-ranked models by 20% and 32% compared to the second-best single-model method in CASP8 and CASP9, respectively. The absolute quality assessment of the models at both local and global level is also improved. The Pearson’s correlation between the correct and local predicted score is improved from 0.59 to 0.70 on CASP8 and from 0.62 to 0.68 on CASP9; for global score to the correct GDT_TS from 0.75 to 0.80 and from 0.77 to 0.80 again compared to the second-best single methods in CASP8 and CASP9, respectively.
ProQ2 is available at http://proq2.wallnerlab.org.
Affiliation(s)
- Arjun Ray
- Department of Theoretical Physics & Swedish eScience Research Center, Royal Institute of Technology, Stockholm, Sweden
8
Lopez G, Maietta P, Rodriguez JM, Valencia A, Tress ML. firestar--advances in the prediction of functionally important residues. Nucleic Acids Res 2011; 39:W235-41. [PMID: 21672959 PMCID: PMC3125799 DOI: 10.1093/nar/gkr437] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.2] [Indexed: 11/20/2022]
Abstract
firestar is a server for predicting catalytic and ligand-binding residues in protein sequences. Here, we present the important developments since the first release of firestar. Previous versions of the server required human interpretation of the results; the server is now fully automatized. firestar has been implemented as a web service and can now be run in high-throughput mode. Prediction coverage has been greatly improved with the extension of the FireDB database and the addition of alignments generated by HHsearch. Ligands in FireDB are now classified for biological relevance. Many of the changes have been motivated by the critical assessment of techniques for protein structure prediction (CASP) ligand-binding prediction experiment, which provided us with a framework to test the performance of firestar. URL: http://firedb.bioinfo.cnio.es/Php/FireStar.php.
Affiliation(s)
- Gonzalo Lopez
- Structural Computational Biology Group, Spanish National Cancer Research Centre (CNIO), c. Melchor Fernandez Almagro, 3, 28029 Madrid, Spain
9
Systematic assessment of accuracy of comparative model of proteins belonging to different structural fold classes. J Mol Model 2011; 17:2831-7. [PMID: 21301906 DOI: 10.1007/s00894-011-0976-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 12/08/2010] [Accepted: 01/17/2011] [Indexed: 10/18/2022]
Abstract
In the absence of experimental structures, comparative modeling continues to be the chosen method for retrieving structural information on target proteins. However, models lack the accuracy of experimental structures. Alignment error and structural divergence (between target and template) influence model accuracy the most. Here, we examine the potential additional impact of backbone geometry, as our previous studies have suggested that the structural class (all-α, αβ, all-β) of a protein may influence the accuracy of its model. In the twilight zone (sequence identity ≤ 30%) and at a similar level of target-template divergence, the accuracy of protein models does indeed follow the trend all-α > αβ > all-β. This is mainly because the alignment accuracy follows the same trend (all-α > αβ > all-β), with backbone geometry playing only a minor role. Differences in the diversity of sequences belonging to different structural classes lead to the observed accuracy differences, thus enabling the accuracy of alignments/models to be estimated a priori in a class-dependent manner. This study provides a systematic description of and quantifies the structural class-dependent effect in comparative modeling. The study also suggests that datasets for large-scale sequence/structure analyses should have equal representations of different structural classes to avoid class-dependent bias.
10
Abstract
Homology modeling is based on the observation that related protein sequences adopt similar three-dimensional structures. Hence, a homology model of a protein can be derived using related protein structure(s) as modeling template(s). A key step in this approach is the establishment of correspondence between residues of the protein to be modeled and those of modeling template(s). This step, often referred to as sequence-structure alignment, is one of the major determinants of the accuracy of a homology model. This chapter gives an overview of methods for deriving sequence-structure alignments and discusses recent methodological developments leading to improved performance. However, no method is perfect, so another focus of this chapter is how to find alignment regions that may contain errors and how to improve them. Finally, the chapter provides practical guidance on how to get the most out of the available tools to maximize the accuracy of sequence-structure alignments.
11
12
Benkert P, Tosatto SCE, Schwede T. Global and local model quality estimation at CASP8 using the scoring functions QMEAN and QMEANclust. Proteins 2010; 77 Suppl 9:173-80. [PMID: 19705484 DOI: 10.1002/prot.22532] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.1] [Indexed: 11/10/2022]
Abstract
Identifying the best candidate model among an ensemble of alternatives is crucial in protein structure prediction. For this purpose, scoring functions have been developed which either calculate a quality estimate on the basis of a single model or derive a score from the information contained in the ensemble of models generated for a given sequence (i.e., consensus methods). At CASP7, consensus methods have performed considerably better than scoring functions operating on single models. However, consensus methods tend to fail if the best models are far from the center of the dominant structural cluster. At CASP8, we investigated whether our hybrid method QMEANclust may overcome this limitation by combining the QMEAN composite scoring function operating on single models with consensus information. We participated with four different scoring functions in the quality assessment category. The QMEANclust consensus scoring function turned out to be a successful method both for the ranking of entire models but especially for the estimation of the per-residue model quality. In this article, we briefly describe the two scoring functions QMEAN and QMEANclust and discuss their performance in the context of what went right and wrong at CASP8. Both scoring functions are publicly available at http://swissmodel.expasy.org/qmean/.
Affiliation(s)
- Pascal Benkert
- Biozentrum, University of Basel, Basel 4056, Switzerland
13
Wang Z, Tegge AN, Cheng J. Evaluating the absolute quality of a single protein model using structural features and support vector machines. Proteins 2009; 75:638-47. [PMID: 19004001 DOI: 10.1002/prot.22275] [Citation(s) in RCA: 78] [Impact Index Per Article: 5.2] [Indexed: 11/12/2022]
Abstract
Knowing the quality of a protein structure model is important for its appropriate usage. We developed a model evaluation method to assess the absolute quality of a single protein model using only structural features with support vector machine regression. The method assigns an absolute quantitative score (i.e. GDT-TS) to a model by comparing its secondary structure, relative solvent accessibility, contact map, and beta sheet structure with their counterparts predicted from its primary sequence. We trained and tested the method on the CASP6 dataset using cross-validation. The correlation between predicted and true scores is 0.82. On the independent CASP7 dataset, the correlation averaged over 95 protein targets is 0.76; the average correlation for template-based and ab initio targets is 0.82 and 0.50, respectively. Furthermore, the predicted absolute quality scores can be used to rank models effectively. The average difference (or loss) between the scores of the top-ranked models and the best models is 5.70 on the CASP7 targets. This method performs favorably when compared with the other methods used on the same dataset. Moreover, the predicted absolute quality scores are comparable across models for different proteins. These features make the method a valuable tool for model quality assurance and ranking.
Affiliation(s)
- Zheng Wang
- Computer Science Department, Informatics Institute, University of Missouri, Columbia, MO 65211, USA
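The two evaluation measures quoted in the abstract above, the correlation between predicted and true scores and the loss between the top-ranked model and the best model, can be illustrated for a single target with a short sketch. The function names and the numbers are hypothetical and chosen only for illustration; the sketch assumes scores on a GDT-TS-like scale where higher is better, and the paper averages such per-target values over many targets.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def ranking_loss(predicted: Sequence[float], true: Sequence[float]) -> float:
    """True score of the best model minus true score of the model ranked first."""
    if len(predicted) != len(true) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    top_ranked = max(range(len(predicted)), key=lambda i: predicted[i])
    return max(true) - true[top_ranked]


if __name__ == "__main__":
    # Hypothetical scores for three models of one target.
    predicted = [60.0, 55.0, 40.0]
    true = [50.0, 58.0, 35.0]
    print("correlation:", pearson(predicted, true))
    print("loss:", ranking_loss(predicted, true))
```

A loss of zero means the predictor ranked the truly best model first; the correlation instead measures how well the whole ordering and scale of the predictions track the true scores.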
14
Gao X, Xu J, Li SC, Li M. Predicting local quality of a sequence-structure alignment. J Bioinform Comput Biol 2009; 7:789-810. [PMID: 19785046 DOI: 10.1142/s0219720009004345] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 01/16/2009] [Revised: 04/06/2009] [Accepted: 04/07/2009] [Indexed: 11/18/2022]
Abstract
Although protein structure prediction has made great progress in recent years, a protein model derived from automated prediction methods is subject to various errors. As methods for structure prediction develop, a continuing problem is how to evaluate the quality of a protein model, especially to identify some well-predicted regions of the model, so that the structural biology community can benefit from the automated structure prediction. It is also important to identify badly-predicted regions in a model so that some refinement measurements can be applied to it. We present two complementary techniques, FragQA and PosQA, to accurately predict local quality of a sequence-structure (i.e. sequence-template) alignment generated by comparative modeling (i.e. homology modeling and threading). FragQA and PosQA predict local quality from two different perspectives. Different from existing methods, FragQA directly predicts cRMSD between a continuously aligned fragment determined by an alignment and the corresponding fragment in the native structure, while PosQA predicts the quality of an individual aligned position. Both FragQA and PosQA use an SVM (Support Vector Machine) regression method to perform prediction using similar information extracted from a single given alignment. Experimental results demonstrate that FragQA performs well on predicting local fragment quality, and PosQA outperforms two top-notch methods, ProQres and ProQprof. Our results indicate that (1) local quality can be predicted well; (2) local sequence evolutionary information (i.e. sequence similarity) is the major factor in predicting local quality; and (3) structural information such as solvent accessibility and secondary structure helps to improve the prediction performance.
Affiliation(s)
- Xin Gao
- David R. Cheriton School of Computer Science, University of Waterloo, 200 University Avenue West, Waterloo, Ontario, N2L 3G1, Canada.
15
Benkert P, Schwede T, Tosatto SC. QMEANclust: estimation of protein model quality by combining a composite scoring function with structural density information. BMC STRUCTURAL BIOLOGY 2009; 9:35. [PMID: 19457232 PMCID: PMC2709111 DOI: 10.1186/1472-6807-9-35] [Citation(s) in RCA: 112] [Impact Index Per Article: 7.5] [Received: 10/21/2008] [Accepted: 05/20/2009] [Indexed: 11/10/2022]
Abstract
BACKGROUND The selection of the most accurate protein model from a set of alternatives is a crucial step in protein structure prediction both in template-based and ab initio approaches. Scoring functions have been developed which can either return a quality estimate for a single model or derive a score from the information contained in the ensemble of models for a given sequence. Local structural features occurring more frequently in the ensemble have a greater probability of being correct. Within the context of the CASP experiment, these so-called consensus methods have been shown to perform considerably better in selecting good candidate models, but tend to fail if the best models are far from the dominant structural cluster. In this paper we show that model selection can be improved if both approaches are combined by pre-filtering the models used during the calculation of the structural consensus. RESULTS Our recently published QMEAN composite scoring function has been improved by including an all-atom interaction potential term. The preliminary model ranking based on the new QMEAN score is used to select a subset of reliable models against which the structural consensus score is calculated. This scoring function called QMEANclust achieves a correlation coefficient of predicted quality score and GDT_TS of 0.9 averaged over the 98 CASP7 targets and performs significantly better in selecting good models from the ensemble of server models than any other groups participating in the quality estimation category of CASP7. Both scoring functions are also benchmarked on the MOULDER test set consisting of 20 target proteins each with 300 alternative models generated by MODELLER. QMEAN outperforms all other tested scoring functions operating on individual models, while the consensus method QMEANclust only works properly on decoy sets containing a certain fraction of near-native conformations.
We also present a local version of QMEAN for the per-residue estimation of model quality (QMEANlocal) and compare it to a new local consensus-based approach. CONCLUSION Improved model selection is obtained by using a composite scoring function operating on single models in order to enrich higher quality models which are subsequently used to calculate the structural consensus. The performance of consensus-based methods such as QMEANclust highly depends on the composition and quality of the model ensemble to be analysed. Therefore, performance estimates for consensus methods based on large meta-datasets (e.g. CASP) might overrate their applicability in more realistic modelling situations with smaller sets of models based on individual methods.
Affiliation(s)
- Pascal Benkert
- Swiss Institute of Bioinformatics, Biozentrum, University of Basel, Klingelbergstrasse 50/70, 4056 Basel, Switzerland.
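The pre-filtering step described in the abstract above, ranking models with a single-model score first and then computing the consensus only against the most reliable subset, might look roughly like the sketch below. This is not the QMEANclust implementation: the function names, the `keep` fraction, the toy similarity measure and the example scores are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence

Model = Sequence[float]  # toy stand-in: one coordinate per residue (assumption)


def fraction_close(a: Model, b: Model, tolerance: float = 0.5) -> float:
    """Hypothetical similarity: fraction of positions that agree within a tolerance."""
    if len(a) != len(b) or not a:
        raise ValueError("models must be non-empty and of equal length")
    return sum(abs(x - y) <= tolerance for x, y in zip(a, b)) / len(a)


def prefiltered_consensus(
    models: Sequence[Model],
    single_scores: Sequence[float],
    similarity: Callable[[Model, Model], float],
    keep: float = 0.5,
) -> List[float]:
    """Score every model by its mean similarity to the top-ranked reference subset."""
    n = len(models)
    if n < 2 or n != len(single_scores):
        raise ValueError("need at least two models and one single-model score each")
    # Keep at least two references so every model has a reference other than itself.
    k = min(n, max(2, math.ceil(keep * n)))
    ranked = sorted(range(n), key=lambda i: single_scores[i], reverse=True)
    reference = ranked[:k]
    scores = []
    for i in range(n):
        others = [j for j in reference if j != i]
        scores.append(sum(similarity(models[i], models[j]) for j in others) / len(others))
    return scores


if __name__ == "__main__":
    pool = [
        [0.0, 1.0, 2.0, 3.0],
        [0.1, 1.1, 2.1, 3.1],
        [0.0, 1.0, 5.0, 9.0],
        [0.0, 1.0, 5.1, 9.1],
    ]
    single_model_scores = [0.9, 0.8, 0.2, 0.3]  # hypothetical single-model estimates
    print(prefiltered_consensus(pool, single_model_scores, fraction_close))
```

In this toy pool two equally sized clusters exist, so an unfiltered all-against-all average would not separate them; restricting the reference set to the models favoured by the single-model score breaks the tie. The result is only as good as the single-model ranking used for the filter.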
16
Handl J, Knowles J, Lovell SC. Artefacts and biases affecting the evaluation of scoring functions on decoy sets for protein structure prediction. Bioinformatics 2009; 25:1271-9. [PMID: 19297350 PMCID: PMC2677743 DOI: 10.1093/bioinformatics/btp150] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.2] [Received: 10/29/2008] [Revised: 03/06/2009] [Accepted: 03/14/2009] [Indexed: 11/15/2022]
Abstract
MOTIVATION Decoy datasets, consisting of a solved protein structure and numerous alternative native-like structures, are in common use for the evaluation of scoring functions in protein structure prediction. Several pitfalls with the use of these datasets have been identified in the literature, as well as useful guidelines for generating more effective decoy datasets. We contribute to this ongoing discussion an empirical assessment of several decoy datasets commonly used in experimental studies. RESULTS We find that artefacts and sampling issues in the large majority of these data make it trivial to discriminate the native structure. This underlines that evaluation based on the rank/z-score of the native is a weak test of scoring function performance. Moreover, sampling biases present in the way decoy sets are generated or used can strongly affect other types of evaluation measures such as the correlation between score and root mean squared deviation (RMSD) to the native. We demonstrate how, depending on type of bias and evaluation context, sampling biases may lead to both over- or under-estimation of the quality of scoring terms, functions or methods. AVAILABILITY Links to the software and data used in this study are available at http://dbkgroup.org/handl/decoy_sets.
Affiliation(s)
- Julia Handl
- Faculty of Life Sciences, University of Manchester, Manchester, UK
17
Benkert P, Künzli M, Schwede T. QMEAN server for protein model quality estimation. Nucleic Acids Res 2009; 37:W510-4. [PMID: 19429685 DOI: 10.1093/nar/gkp322] [Citation(s) in RCA: 593] [Impact Index Per Article: 39.5] [Indexed: 11/12/2022]
Abstract
Model quality estimation is an essential component of protein structure prediction, since ultimately the accuracy of a model determines its usefulness for specific applications. Usually, in the course of protein structure prediction a set of alternative models is produced, from which subsequently the most accurate model has to be selected. The QMEAN server provides access to two scoring functions successfully tested at the eighth round of the community-wide blind test experiment CASP. The user can choose between the composite scoring function QMEAN, which derives a quality estimate on the basis of the geometrical analysis of single models, and the clustering-based scoring function QMEANclust which calculates a global and local quality estimate based on a weighted all-against-all comparison of the models from the ensemble provided by the user. The web server performs a ranking of the input models and highlights potentially problematic regions for each model. The QMEAN server is available at http://swissmodel.expasy.org/qmean.
18
Kelley LA, Sternberg MJE. Protein structure prediction on the Web: a case study using the Phyre server. Nat Protoc 2009; 4:363-71. [PMID: 19247286 DOI: 10.1038/nprot.2009.2] [Citation(s) in RCA: 3415] [Impact Index Per Article: 227.7] [Indexed: 11/09/2022]
Abstract
Determining the structure and function of a novel protein is a cornerstone of many aspects of modern biology. Over the past decades, a number of computational tools for structure prediction have been developed. It is critical that the biological community is aware of such tools and is able to interpret their results in an informed way. This protocol provides a guide to interpreting the output of structure prediction servers in general and one such tool in particular, the protein homology/analogy recognition engine (Phyre). New profile-profile matching algorithms have improved structure prediction considerably in recent years. Although the performance of Phyre is typical of many structure prediction systems using such algorithms, all these systems can reliably detect up to twice as many remote homologies as standard sequence-profile searching. Phyre is widely used by the biological community, with >150 submissions per day, and provides a simple interface to results. Phyre takes 30 min to predict the structure of a 250-residue protein.
Affiliation(s)
- Lawrence A Kelley
- Structural Bioinformatics Group, Division of Molecular Biosciences, Department of Life Sciences, Imperial College London, South Kensington Campus, London, UK.
19
Bordoli L, Kiefer F, Arnold K, Benkert P, Battey J, Schwede T. Protein structure homology modeling using SWISS-MODEL workspace. Nat Protoc 2009; 4:1-13. [PMID: 19131951 DOI: 10.1038/nprot.2008.197] [Citation(s) in RCA: 934] [Impact Index Per Article: 62.3] [Indexed: 11/09/2022]
Abstract
Homology modeling aims to build three-dimensional protein structure models using experimentally determined structures of related family members as templates. SWISS-MODEL workspace is an integrated Web-based modeling expert system. For a given target protein, a library of experimental protein structures is searched to identify suitable templates. On the basis of a sequence alignment between the target protein and the template structure, a three-dimensional model for the target protein is generated. Model quality assessment tools are used to estimate the reliability of the resulting models. Homology modeling is currently the most accurate computational method to generate reliable structural models and is routinely used in many biological applications. Typically, the computational effort for a modeling project is less than 2 h. However, this does not include the time required for visualization and interpretation of the model, which may vary depending on personal experience working with protein structures.
Affiliation(s)
- Lorenza Bordoli
- Biozentrum, University of Basel, Klingelbergstrasse 50-70, CH 4056 Basel, Switzerland
20
Chen H, Kihara D. Estimating quality of template-based protein models by alignment stability. Proteins 2008; 71:1255-74. [PMID: 18041762 DOI: 10.1002/prot.21819] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 12/26/2022]
Abstract
The error in protein tertiary structure prediction is unavoidable, but it is not explicitly shown in most of the current prediction algorithms. Estimated error of a predicted structure is crucial information for experimental biologists to use the prediction model for design and interpretation of experiments. Here, we propose a method to estimate errors in predicted structures based on the stability of the optimal target-template alignment when compared with a set of suboptimal alignments. The stability of the optimal alignment is quantified by an index named the SuboPtimal Alignment Diversity (SPAD). We implemented SPAD in a profile-based threading algorithm and investigated how well SPAD can indicate errors in threading models using a large benchmark dataset of 5232 alignments. SPAD shows a very good correlation not only to alignment shift errors but also structure-level errors, the root mean square deviation (RMSD) of predicted structure models to the native structures (i.e. global errors), and local errors at each residue position. We have further compared SPAD with seven other quality measures, six from sequence alignment-based measures and one atomic statistical potential, discrete optimized protein energy (DOPE), in terms of the correlation coefficient to the global and local structure-level errors. In terms of the correlation to the RMSD of structure models, when a target and a template are in the same SCOP family, the sequence identity showed a best correlation to the RMSD; in the superfamily level, SPAD was the best; and in the fold level, DOPE was best. However, in a head-to-head comparison, SPAD wins over the other measures. Next, SPAD is compared with three other measures of local errors. In this comparison, SPAD was best in all of the family, the superfamily and the fold levels. Using the discovered correlation, we have also predicted the global and local error of our predicted structures of CASP7 targets by the SPAD. 
Finally, we proposed a sausage representation of predicted tertiary structures, which intuitively indicates the predicted structure and the estimated error range of the structure simultaneously.
Affiliation(s)
- Hao Chen
- Department of Biological Sciences, College of Science, Purdue University, West Lafayette, Indiana 47907, USA
21
Measuring global credibility with application to local sequence alignment. PLoS Comput Biol 2008; 4:e1000077. [PMID: 18464927 PMCID: PMC2367447 DOI: 10.1371/journal.pcbi.1000077] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Received: 10/29/2007] [Accepted: 03/31/2008] [Indexed: 11/19/2022] Open
Abstract
Computational biology is replete with high-dimensional (high-D) discrete prediction and inference problems, including sequence alignment, RNA structure prediction, phylogenetic inference, motif finding, prediction of pathways, and model selection problems in statistical genetics. Even though prediction and inference in these settings are uncertain, little attention has been focused on the development of global measures of uncertainty. Regardless of the procedure employed to produce a prediction, when a procedure delivers a single answer, that answer is a point estimate selected from the solution ensemble, the set of all possible solutions. For high-D discrete space, these ensembles are immense, and thus there is considerable uncertainty. We recommend the use of Bayesian credibility limits to describe this uncertainty, where a (1−α)%, 0≤α≤1, credibility limit is the minimum Hamming distance radius of a hyper-sphere containing (1−α)% of the posterior distribution. Because sequence alignment is arguably the most extensively used procedure in computational biology, we employ it here to make these general concepts more concrete. The maximum similarity estimator (i.e., the alignment that maximizes the likelihood) and the centroid estimator (i.e., the alignment that minimizes the mean Hamming distance from the posterior weighted ensemble of alignments) are used to demonstrate the application of Bayesian credibility limits to alignment estimators. Application of Bayesian credibility limits to the alignment of 20 human/rodent orthologous sequence pairs and 125 orthologous sequence pairs from six Shewanella species shows that credibility limits of the alignments of promoter sequences of these species vary widely, and that centroid alignments dependably have tighter credibility limits than traditional maximum similarity alignments. 
Sequence alignment is the cornerstone capability used by a multitude of computational biology applications, such as phylogeny reconstruction and identification of common regulatory mechanisms. Sequence alignment methods typically seek a high-scoring alignment between a pair of sequences, and assign a statistical significance to this single alignment. However, because a single alignment of two (or more) sequences is a point estimate, it may not be representative of the entire set (ensemble) of possible alignments of those sequences; thus, there may be considerable uncertainty associated with any one alignment among an immense ensemble of possibilities. To address the uncertainty of a proposed alignment, we used a Bayesian probabilistic approach to assess an alignment's reliability in the context of the entire ensemble of possible alignments. Our approach performs a global assessment of the degree to which the members of the ensemble depart from a selected alignment, thereby determining a credibility limit. In an evaluation of the popular maximum similarity alignment and the centroid alignment (i.e., the alignment that is in the center of the posterior distribution of alignments), we find that the centroid yields tighter credibility limits (on average) than the maximum similarity alignment. Beyond the usual interest in putting error limits on point estimates, our findings of substantial variability in credibility limits of alignments argue for wider adoption of these limits, so the degree of error is delineated prior to the subsequent use of the alignments.
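The credibility limit defined in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the posterior is approximated by a small weighted sample of alignments, that each alignment is encoded as a tuple giving the partner index of every residue of the first sequence (-1 for a gap), and that the centroid is searched only among the sampled alignments. The encoding and the names `hamming`, `credibility_limit` and `centroid` are hypothetical choices made for this illustration.

```python
from typing import Sequence, Tuple

# Hypothetical encoding: entry i is the index in sequence B aligned to
# residue i of sequence A, or -1 if residue i is aligned to a gap.
Alignment = Tuple[int, ...]


def hamming(a: Alignment, b: Alignment) -> int:
    """Number of positions at which two alignments disagree."""
    if len(a) != len(b):
        raise ValueError("alignments must cover the same residues")
    return sum(x != y for x, y in zip(a, b))


def credibility_limit(estimate: Alignment,
                      ensemble: Sequence[Alignment],
                      weights: Sequence[float],
                      alpha: float) -> int:
    """Smallest Hamming radius around `estimate` that contains at least
    a (1 - alpha) fraction of the (sampled) posterior weight."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    total = sum(weights)
    by_distance = sorted((hamming(estimate, s), w)
                         for s, w in zip(ensemble, weights))
    mass = 0.0
    for distance, weight in by_distance:
        mass += weight
        # Small tolerance guards against floating-point round-off.
        if mass / total >= 1.0 - alpha - 1e-12:
            return distance
    return by_distance[-1][0]


def centroid(ensemble: Sequence[Alignment],
             weights: Sequence[float]) -> Alignment:
    """Sampled alignment with the smallest weighted mean Hamming distance
    to the ensemble (assumption: candidates are restricted to the sample)."""
    return min(ensemble,
               key=lambda c: sum(w * hamming(c, s)
                                 for s, w in zip(ensemble, weights)))


if __name__ == "__main__":
    # Toy posterior sample with unnormalised weights (made-up numbers).
    sample = [(0, 1, 2), (0, 1, -1), (0, 2, -1), (1, 2, -1)]
    weights = [4, 3, 2, 1]
    best = sample[0]                    # highest-weight alignment
    cen = centroid(sample, weights)     # (0, 1, -1)
    print(credibility_limit(best, sample, weights, alpha=0.1))  # 2
    print(credibility_limit(cen, sample, weights, alpha=0.1))   # 1
```

In this toy sample the centroid has a tighter 90% credibility limit (radius 1) than the highest-weight alignment (radius 2), which mirrors the qualitative finding reported in the abstract. The numbers are illustrative only.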
22
Lunter G, Rocco A, Mimouni N, Heger A, Caldeira A, Hein J. Uncertainty in homology inferences: assessing and improving genomic sequence alignment. Genome Res 2007; 18:298-309. [PMID: 18073381 DOI: 10.1101/gr.6725608] [Citation(s) in RCA: 90] [Impact Index Per Article: 5.3] [Indexed: 11/24/2022]
Abstract
Sequence alignment underpins all of comparative genomics, yet it remains an incompletely solved problem. In particular, the statistical uncertainty within inferred alignments is often disregarded, while parametric or phylogenetic inferences are considered meaningless without confidence estimates. Here, we report on a theoretical and simulation study of pairwise alignments of genomic DNA at human-mouse divergence. We find that >15% of aligned bases are incorrect in existing whole-genome alignments, and we identify three types of alignment error, each leading to systematic biases in all algorithms considered. Careful modeling of the evolutionary process improves alignment quality; however, these improvements are modest compared with the remaining alignment errors, even with exact knowledge of the evolutionary model, emphasizing the need for statistical approaches to account for uncertainty. We develop a new algorithm, Marginalized Posterior Decoding (MPD), which explicitly accounts for uncertainties, is less biased and more accurate than other algorithms we consider, and reduces the proportion of misaligned bases by a third compared with the best existing algorithm. To our knowledge, this is the first nonheuristic algorithm for DNA sequence alignment to show robust improvements over the classic Needleman-Wunsch algorithm. Despite this, considerable uncertainty remains even in the improved alignments. We conclude that a probabilistic treatment is essential, both to improve alignment quality and to quantify the remaining uncertainty. This is becoming increasingly relevant with the growing appreciation of the importance of noncoding DNA, whose study relies heavily on alignments. Alignment errors are inevitable, and should be considered when drawing conclusions from alignments. Software and alignments to assist researchers in doing this are provided at http://genserv.anat.ox.ac.uk/grape/.
Affiliation(s)
- Gerton Lunter
- MRC Functional Genetics Unit, Department of Physiology, Anatomy, and Genetics, University of Oxford, Oxford OX1 3QX, United Kingdom.
23
Lee M, Jeong CS, Kim D. Predicting and improving the protein sequence alignment quality by support vector regression. BMC Bioinformatics 2007; 8:471. [PMID: 18053160 PMCID: PMC2222655 DOI: 10.1186/1471-2105-8-471] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 04/25/2007] [Accepted: 12/03/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND For successful protein structure prediction by comparative modeling, in addition to identifying a good template protein with known structure, obtaining an accurate sequence alignment between a query protein and a template protein is critical. It has been known that the alignment accuracy can vary significantly depending on our choice of various alignment parameters such as gap opening penalty and gap extension penalty. Because the accuracy of sequence alignment is typically measured by comparing it with its corresponding structure alignment, there is no good way of evaluating alignment accuracy without knowing the structure of a query protein, which is obviously not available at the time of structure prediction. Moreover, there is no universal alignment parameter option that would always yield the optimal alignment. RESULTS In this work, we develop a method to predict the quality of the alignment between a query and a template. We train the support vector regression (SVR) models to predict the MaxSub scores as a measure of alignment quality. The alignment between a query protein and a template of length n is transformed into a (n + 1)-dimensional feature vector, then it is used as an input to predict the alignment quality by the trained SVR model. Performance of our work is evaluated by various measures including Pearson correlation coefficient between the observed and predicted MaxSub scores. Result shows high correlation coefficient of 0.945. For a pair of query and template, 48 alignments are generated by changing alignment options. Trained SVR models are then applied to predict the MaxSub scores of those and to select the best alignment option which is chosen specifically to the query-template pair. This adaptive selection procedure results in 7.4% improvement of MaxSub scores, compared to those when the single best parameter option is used for all query-template pairs. 
CONCLUSION The present work demonstrates that the alignment quality can be predicted with reasonable accuracy. Our method is useful not only for selecting the optimal alignment parameters for a chosen template based on predicted alignment quality, but also for filtering out problematic templates that are not suitable for structure prediction due to poor alignment accuracy. This is implemented as a part in FORECAST, the server for fold-recognition and is freely available on the web at http://pbil.kaist.ac.kr/forecast.
Affiliation(s)
- Minho Lee
- Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
- Chan-seok Jeong
- Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
- Dongsup Kim
- Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
24
López G, Valencia A, Tress ML. firestar--prediction of functionally important residues using structural templates and alignment reliability. Nucleic Acids Res 2007; 35:W573-7. [PMID: 17584799 PMCID: PMC1933227 DOI: 10.1093/nar/gkm297] [Citation(s) in RCA: 88] [Impact Index Per Article: 5.2] [Indexed: 11/28/2022] Open
Abstract
Here we present firestar, an expert system for predicting ligand-binding residues in protein structures. The server provides a method for extrapolating from the large inventory of functionally important residues organized in the FireDB database and adds information about the local conservation of potential-binding residues. The interface allows users to make queries by protein sequence or structure. The user can access pairwise and multiple alignments with structures that have relevant functionally important binding sites. The results are presented in a series of easy to read displays that allow users to compare binding residue conservation across homologous proteins. The binding site residues can also be viewed with molecular visualization tools. One feature of firestar is that it can be used to evaluate the biological relevance of small molecule ligands present in PDB structures. With the server it is easy to discern whether small molecule binding is conserved in homologous structures. We found this facility particularly useful during the recent assessment of CASP7 function prediction. Availability: http://firedb.bioinfo.cnio.es/Php/FireStar.php.
Affiliation(s)
- Gonzalo López
- Structural Biology and Biocomputing Program, Spanish National Cancer Research Centre (CNIO) Melchor Fernández Almagro, 3, E-28029, Madrid, Spain.
25
Abstract
MOTIVATION Protein sequence alignment plays a critical role in computational biology as it is an integral part in many analysis tasks designed to solve problems in comparative genomics, structure and function prediction, and homology modeling. METHODS We have developed novel sequence alignment algorithms that compute the alignment between a pair of sequences based on short fixed- or variable-length high-scoring subsequences. Our algorithms build the alignments by repeatedly selecting the highest scoring pairs of subsequences and using them to construct small portions of the final alignment. We utilize PSI-BLAST generated sequence profiles and employ a profile-to-profile scoring scheme derived from PICASSO. RESULTS We evaluated the performance of the computed alignments on two recently published benchmark datasets and compared them against the alignments computed by existing state-of-the-art dynamic programming-based profile-to-profile local and global sequence alignment algorithms. Our results show that the new algorithms achieve alignments that are comparable with or better than those achieved by existing algorithms. Moreover, our results also showed that these algorithms can be used to provide better information as to which of the aligned positions are more reliable--a critical piece of information for comparative modeling applications.
Affiliation(s)
- Huzefa Rangwala
- Department of Computer Science & Engineering, University of Minnesota Minneapolis, MN 55455, USA.
26
Lopez G, Valencia A, Tress M. FireDB--a database of functionally important residues from proteins of known structure. Nucleic Acids Res 2006; 35:D219-23. [PMID: 17132832 PMCID: PMC1716728 DOI: 10.1093/nar/gkl897] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.8] [Indexed: 11/26/2022] Open
Abstract
The FireDB database is a databank for functional information relating to proteins with known structures. It contains the most comprehensive and detailed repository of known functionally important residues, bringing together both ligand binding and catalytic residues in one site. The platform integrates biologically relevant data filtered from the close atomic contacts in Protein Data Bank crystal structures and reliably annotated catalytic residues from the Catalytic Site Atlas. The interface allows users to make queries by protein, ligand or keyword. Relevant biologically important residues are displayed in a simple and easy to read manner that allows users to assess binding site similarity across homologous proteins. Binding site residue variations can also be viewed with molecular visualization tools. The database is available at
Affiliation(s)
- Gonzalo Lopez
- Computational and Structural Biology Program, Spanish National Cancer Research Centre (CNIO) Melchor Fernández Almagro, 3, E-28029, Madrid, Spain.
27
Rai BK, Fiser A. Multiple mapping method: a novel approach to the sequence-to-structure alignment problem in comparative protein structure modeling. Proteins 2006; 63:644-61. [PMID: 16437570 DOI: 10.1002/prot.20835] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.3] [Indexed: 11/11/2022]
Abstract
A major bottleneck in comparative protein structure modeling is the quality of input alignment between the target sequence and the template structure. A number of alignment methods are available, but none of these techniques produce consistently good solutions for all cases. Alignments produced by alternative methods may be superior in certain segments but inferior in others when compared to each other; therefore, an accurate solution often requires an optimal combination of them. To address this problem, we have developed a new approach, Multiple Mapping Method (MMM). The algorithm first identifies the alternatively aligned regions from a set of input alignments. These alternatively aligned segments are scored using a composite scoring function, which determines their fitness within the structural environment of the template. The best scoring regions from a set of alternative segments are combined with the core part of the alignments to produce the final MMM alignment. The algorithm was tested on a dataset of 1400 protein pairs using 11 combinations of two to four alignment methods. In all cases MMM showed statistically significant improvement by reducing alignment errors in the range of 3 to 17%. MMM also compared favorably over two alignment meta-servers. The algorithm is computationally efficient; therefore, it is a suitable tool for genome scale modeling studies.
Affiliation(s)
- Brajesh K Rai
- Department of Biochemistry and Seaver Center for Bioinformatics, Albert Einstein College of Medicine, Bronx, New York 10461, USA
28
Dunbrack RL. Sequence comparison and protein structure prediction. Curr Opin Struct Biol 2006; 16:374-84. [PMID: 16713709 DOI: 10.1016/j.sbi.2006.05.006] [Citation(s) in RCA: 119] [Impact Index Per Article: 6.6] [Received: 02/23/2006] [Revised: 03/22/2006] [Accepted: 05/08/2006] [Indexed: 10/24/2022]
Abstract
Sequence comparison is a major step in the prediction of protein structure from existing templates in the Protein Data Bank. The identification of potentially remote homologues to be used as templates for modeling target sequences of unknown structure and their accurate alignment remain challenges, despite many years of study. The most recent advances have been in combining as many sources of information as possible--including amino acid variation in the form of profiles or hidden Markov models for both the target and template families, known and predicted secondary structures of the template and target, respectively, the combination of structure alignment for distant homologues and sequence alignment for close homologues to build better profiles, and the anchoring of certain regions of the alignment based on existing biological data. Newer technologies have been applied to the problem, including the use of support vector machines to tackle the fold classification problem for a target sequence and the alignment of hidden Markov models. Finally, using the consensus of many fold recognition methods, whether based on profile-profile alignments, threading or other approaches, continues to be one of the most successful strategies for both recognition and alignment of remote homologues. Although there is still room for improvement in identification and alignment methods, additional progress may come from model building and refinement methods that can compensate for large structural changes between remotely related targets and templates, as well as for regions of misalignment.
Affiliation(s)
- Roland L Dunbrack
- Institute for Cancer Research, Fox Chase Cancer Center, 333 Cottman Avenue, Philadelphia, PA 19111, USA.
29
Tress ML, Cozzetto D, Tramontano A, Valencia A. An analysis of the Sargasso Sea resource and the consequences for database composition. BMC Bioinformatics 2006; 7:213. [PMID: 16623953 PMCID: PMC1513258 DOI: 10.1186/1471-2105-7-213] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Received: 12/22/2005] [Accepted: 04/19/2006] [Indexed: 01/20/2023] Open
Abstract
Background The environmental sequencing of the Sargasso Sea has introduced a huge new resource of genomic information. Unlike the protein sequences held in the current searchable databases, the Sargasso Sea sequences originate from a single marine environment and have been sequenced from species that are not easily obtainable by laboratory cultivation. The resource also contains very many fragments of whole protein sequences, a side effect of the shotgun sequencing method. These sequences form a significant addendum to the current searchable databases but also present us with some intrinsic difficulties. While it is important to know whether it is possible to assign function to these sequences with the current methods and whether they will increase our capacity to explore sequence space, it is also interesting to know how current bioinformatics techniques will deal with the new sequences in the resource. Results The Sargasso Sea sequences seem to introduce a bias that decreases the potential of current methods to propose structure and function for new proteins. In particular the high proportion of sequence fragments in the resource seems to result in poor quality multiple alignments. Conclusion These observations suggest that the new sequences should be used with care, especially if the information is to be used in large scale analyses. On a positive note, the results may just spark improvements in computational and experimental methods to take into account the fragments generated by environmental sequencing techniques.
Affiliation(s)
- Michael L Tress
- Protein Design Group, CNB-CSIC, Calle Darwin, Cantoblanco 28049 Madrid, Spain
- Domenico Cozzetto
- Department of Biochemical Sciences, University "La Sapienza" Rome, Italy
- Anna Tramontano
- Department of Biochemical Sciences, University "La Sapienza" Rome, Italy
- Alfonso Valencia
- Protein Design Group, CNB-CSIC, Calle Darwin, Cantoblanco 28049 Madrid, Spain
30
Wallner B, Elofsson A. Identification of correct regions in protein models using structural, alignment, and consensus information. Protein Sci 2006; 15:900-13. [PMID: 16522791 PMCID: PMC2242478 DOI: 10.1110/ps.051799606] [Citation(s) in RCA: 122] [Impact Index Per Article: 6.8] [Indexed: 10/24/2022]
Abstract
In this study we present two methods to predict the local quality of a protein model: ProQres and ProQprof. ProQres is based on structural features that can be calculated from a model, while ProQprof uses alignment information and can only be used if the model is created from an alignment. In addition, we also propose a simple approach based on local consensus, Pcons-local. We show that all these methods perform better than state-of-the-art methodologies and that, when applicable, the consensus approach is by far the best approach to predict local structure quality. It was also found that ProQprof performed better than other methods for models based on distant relationships, while ProQres performed best for models based on closer relationship, i.e., a model has to be reasonably good to make a structural evaluation useful. Finally, we show that a combination of ProQprof and ProQres (ProQlocal) performed better than any other nonconsensus method for both high- and low-quality models. Additional information and Web servers are available at: http://www.sbc.su.se/~bjorn/ProQ/.
Affiliation(s)
- Björn Wallner
- Stockholm Bioinformatics Center, Stockholm University, SE-106 91 Stockholm, Sweden.
31
Kiel C, Serrano L. The ubiquitin domain superfold: structure-based sequence alignments and characterization of binding epitopes. J Mol Biol 2005; 355:821-44. [PMID: 16310215 DOI: 10.1016/j.jmb.2005.10.010] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.1] [Received: 07/25/2005] [Revised: 09/29/2005] [Accepted: 10/05/2005] [Indexed: 10/25/2022]
Abstract
Ubiquitin-like domains are present, apart from ubiquitin-like proteins themselves, in many multidomain proteins involved in different signal transduction processes. The sequence conservation for all ubiquitin superfold family members is rather poor, even between subfamily members, leading to mistakes in sequence alignments using conventional sequence alignment methods. However, a correct alignment is essential, especially for in silico methods that predict binding partners on the basis of sequence and structure. In this study, using 3D-structural information we have generated and manually corrected sequence alignments for proteins of the five ubiquitin superfold subfamilies. On the basis of this alignment, we suggest domains for which structural information will be useful to allow homology modelling. In addition, we have analysed the energetic and electrostatic properties of ubiquitin-like domains in complex with various functional binding proteins using the protein design algorithm FoldX. On the basis of an in silico alanine-scanning mutagenesis, we provide a detailed binding epitope mapping of the hotspots of the ubiquitin domain fold, involved in the interaction with different domains and proteins. Finally, we provide a consensus fingerprint sequence that identifies all sequences described to belong to the ubiquitin superfold family. It is possible that the method that we describe may be applied to other domain families sharing a similar fold but having low levels of sequence homology.
Affiliation(s)
- Christina Kiel
- European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany.
32
Ohlson T, Elofsson A. ProfNet, a method to derive profile-profile alignment scoring functions that improves the alignments of distantly related proteins. BMC Bioinformatics 2005; 6:253. [PMID: 16225676 PMCID: PMC1274300 DOI: 10.1186/1471-2105-6-253] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Received: 05/09/2005] [Accepted: 10/14/2005] [Indexed: 11/10/2022] Open
Abstract
Background Profile-profile methods have been used for some years now to detect and align homologous proteins. The best such methods use information from the background distribution of amino acids and substitution tables either when constructing the profiles or in the scoring. This makes the methods dependent on the quality and choice of substitution table as well as the construction of the profiles. Here, we introduce a novel method called ProfNet that is used to derive a profile-profile scoring function. The method optimizes the discrimination between scores of related and unrelated residues and it is fast and straightforward to use. This new method derives a scoring function that is mainly dependent on the actual alignment of residues from a training set, and it does not use any additional information about the background distribution. Results It is shown that ProfNet improves the discrimination of related and unrelated residues. Further it can be used to improve the alignment of distantly related proteins. Conclusion The best performance is obtained using superfamily related proteins in the training of ProfNet, and a classifier that is related to the distance between the structurally aligned residues. The main difference between the new scoring function and a traditional profile-profile scoring function is that conserved residues on average score higher with the new function.
Affiliation(s)
- Tomas Ohlson
- Stockholm Blolnformatlcs Center, Stockholm University, SE-106 91 Stockholm, Sweden
| | - Arne Elofsson
- Stockholm Blolnformatlcs Center, Stockholm University, SE-106 91 Stockholm, Sweden
|
33
|
Margelevičius M, Venclovas Č. PSI-BLAST-ISS: an intermediate sequence search tool for estimation of the position-specific alignment reliability. BMC Bioinformatics 2005; 6:185. [PMID: 16033659 PMCID: PMC1187875 DOI: 10.1186/1471-2105-6-185] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.5] [Received: 03/17/2005] [Accepted: 07/21/2005] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Protein sequence alignments have become indispensable for virtually any evolutionary, structural or functional study involving proteins. Modern sequence search and comparison methods combined with rapidly increasing sequence data often can reliably match even distantly related proteins that share little sequence similarity. However, even highly significant matches generally may have incorrectly aligned regions. Therefore when exact residue correspondence is used to transfer biological information from one aligned sequence to another, it is critical to know which alignment regions are reliable and which may contain alignment errors. RESULTS PSI-BLAST-ISS is a standalone Unix-based tool designed to delineate reliable regions of sequence alignments as well as to suggest potential variants in unreliable regions. The region-specific reliability is assessed by producing multiple sequence alignments in different sequence contexts followed by the analysis of the consistency of alignment variants. The PSI-BLAST-ISS output enables the user to simultaneously analyze alignment reliability between query and multiple homologous sequences. In addition, PSI-BLAST-ISS can be used to detect distantly related homologous proteins. The software is freely available at: http://www.ibt.lt/bioinformatics/iss. CONCLUSION PSI-BLAST-ISS is an effective reliability assessment tool that can be useful in applications such as comparative modelling or analysis of individual sequence regions. It favorably compares with the existing similar software both in the performance and functional features.
|
34
|
Tress M, de Juan D, Graña O, Gómez MJ, Gómez-Puertas P, González JM, López G, Valencia A. Scoring docking models with evolutionary information. Proteins 2005; 60:275-80. [DOI: 10.1002/prot.20570] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.0] [Indexed: 11/07/2022]
|
35
|
Marmey P, Rojas-Mendoza A, de Kochko A, Beachy RN, Fauquet CM. Characterization of the protease domain of Rice tungro bacilliform virus responsible for the processing of the capsid protein from the polyprotein. Virol J 2005; 2:33. [PMID: 15831103 PMCID: PMC1087892 DOI: 10.1186/1743-422x-2-33] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 03/29/2005] [Accepted: 04/14/2005] [Indexed: 11/21/2022] Open
Abstract
Background Rice tungro bacilliform virus (RTBV) is a pararetrovirus, and a member of the family Caulimoviridae in the genus Badnavirus. RTBV has a long open reading frame that encodes a large polyprotein (P3). Pararetroviruses show similarities with retroviruses in molecular organization and replication. P3 contains a putative movement protein (MP), the capsid protein (CP), the aspartate protease (PR) and the reverse transcriptase (RT) with a ribonuclease H activity. PR is a member of the cluster of retroviral proteases and serves to proteolytically process P3. Previous work established the N- and C-terminal amino acid sequences of CP and RT, processing of RT by PR, and estimated the molecular mass of PR by western blot assays. Results A molecular mass of a protein that was associated with virions was determined by in-line HPLC electrospray ionization mass spectral analysis. Comparison with retroviral proteases amino acid sequences allowed the characterization of a putative protease domain in this protein. Structural modelling revealed strong resemblance with retroviral proteases, with overall folds surrounding the active site being well conserved. Expression in E. coli of putative domain was affected by the presence or absence of the active site in the construct. Analysis of processing of CP by PR, using pulse chase labelling experiments, demonstrated that the 37 kDa capsid protein was dependent on the presence of the protease in the constructs. Conclusion The findings suggest the characterization of the RTBV protease domain. Sequence analysis, structural modelling, in vitro expression studies are evidence to consider the putative domain as being the protease domain. Analysis of expression of different peptides corresponding to various domains of P3 suggests a processing of CP by PR. This work clarifies the organization of the RTBV polyprotein, and its processing by the RTBV protease.
Affiliation(s)
- Philippe Marmey
- IRD, UMR «DGPC», B.P. 64501, 34394 Montpellier cedex 5, France
- Ana Rojas-Mendoza
- Protein Design Group, Centro Nacional de Biotecnologia, Campus Universidad Autonoma Cantoblanco, 28049 Madrid, Spain
- Roger N Beachy
- Donald Danforth Plant Science Center, 975 North Warson Road, St. Louis, MO 63132, USA
- Claude M Fauquet
- Donald Danforth Plant Science Center, 975 North Warson Road, St. Louis, MO 63132, USA
|
36
|
Han S, Lee BC, Yu ST, Jeong CS, Lee S, Kim D. Fold recognition by combining profile-profile alignment and support vector machine. Bioinformatics 2005; 21:2667-73. [PMID: 15769835 DOI: 10.1093/bioinformatics/bti384] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.5] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Currently, the most accurate fold-recognition method is to perform profile-profile alignments and estimate the statistical significance of those alignments by calculating a Z-score or E-value. Although this scheme is reliable in recognizing relatively close homologs related at the family level, it has difficulty finding remote homologs that are related at the superfamily or fold level. RESULTS In this paper, we present an alternative method to estimate the significance of the alignments. The alignment between a query protein and a template of length n in the fold library is transformed into a feature vector of length n + 1, which is then evaluated by a support vector machine (SVM). The SVM output is converted to a posterior probability that the query sequence is related to the template. The new method performs significantly better than PSI-BLAST and profile-profile alignment with the Z-score scheme. While PSI-BLAST and the Z-score scheme detect 16 and 20% of superfamily-related proteins, respectively, at 90% specificity, the new method detects 46% of these proteins, a more than 2-fold increase in sensitivity. More significantly, at the fold level, the new method detects 14% of remotely related proteins at 90% specificity, a remarkable result considering that the other methods detect almost none at the same level of specificity.
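The "sensitivity at 90% specificity" figures quoted in this abstract can be read as follows: choose a score threshold that rejects at least 90% of unrelated query-template pairs, then report the fraction of related pairs scoring above it. The sketch below illustrates that calculation only; it is not the authors' evaluation code, and the function name, the score lists, and the tie-handling rule (a pair counts as detected only if its score is strictly above the threshold) are all assumptions made for illustration.

```python
import math


def sensitivity_at_specificity(pos_scores, neg_scores, specificity=0.90):
    """Fraction of related pairs scoring strictly above a threshold that
    rejects at least `specificity` of the unrelated pairs.

    Hypothetical helper: higher scores are assumed to mean "more related".
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one score in each class")
    ranked = sorted(neg_scores)
    # Number of unrelated pairs that must fall at or below the threshold.
    # round() guards against floating-point noise in specificity * n.
    k = math.ceil(round(specificity * len(ranked), 9))
    threshold = ranked[k - 1] if k > 0 else float("-inf")
    detected = sum(1 for s in pos_scores if s > threshold)
    return detected / len(pos_scores)


# Toy data (made up): ten unrelated pairs, four related pairs.
unrelated = list(range(1, 11))   # scores 1..10
related = [10, 5, 12, 2]
print(sensitivity_at_specificity(related, unrelated))
```

With the toy scores, the threshold is set at the ninth-ranked unrelated score, so nine of ten unrelated pairs are rejected and two of the four related pairs are detected.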
Affiliation(s)
- Sangjo Han
- Department of Biosystems, Korea Advanced Institute of Science and Technology, Daejeon, 305-701, Korea
|
37
|
Thompson JD, Prigent V, Poch O. LEON: multiple aLignment Evaluation Of Neighbours. Nucleic Acids Res 2004; 32:1298-307. [PMID: 14982955 PMCID: PMC390283 DOI: 10.1093/nar/gkh294] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.1] [Received: 12/11/2003] [Revised: 01/16/2004] [Accepted: 01/29/2004] [Indexed: 11/13/2022] Open
Abstract
Sequence alignments are fundamental to a wide range of applications, including database searching, functional residue identification and structure prediction techniques. These applications predict or propagate structural/functional/evolutionary information based on a presumed homology between the aligned sequences. If the initial hypothesis of homology is wrong, no subsequent application, however sophisticated, can be expected to yield accurate results. Here we present a novel method, LEON, to predict homology between proteins based on a multiple alignment of complete sequences (MACS). In MACS, weak signals from distantly related proteins can be considered in the overall context of the family. Intermediate sequences and the combination of individual weak matches are used to increase the significance of low-scoring regions. Residue composition is also taken into account by incorporation of several existing methods for the detection of compositionally biased sequence segments. The accuracy and reliability of the predictions is demonstrated in large-scale comparisons with structural and sequence family databases, where the specificity was shown to be >99% and the sensitivity was estimated to be approximately 76%. LEON can thus be used to reliably identify the complex relationships between large multidomain proteins and should be useful for automatic high-throughput genome annotations, 2D/3D structure predictions, protein-protein interaction predictions etc.
Affiliation(s)
- Julie D Thompson
- Laboratoire de Biologie et Genomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire, CNRS/INSERM/ULP, BP 163, 67404 Illkirch Cedex, France
|