1
Tian P, Best RB. How Many Protein Sequences Fold to a Given Structure? A Coevolutionary Analysis. Biophys J 2017; 113:1719-1730. [PMID: 29045866 PMCID: PMC5647607 DOI: 10.1016/j.bpj.2017.08.039] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Received: 04/24/2017] [Revised: 08/03/2017] [Accepted: 08/08/2017] [Indexed: 12/23/2022] Open
Abstract
Quantifying the relationship between protein sequence and structure is key to understanding the protein universe. A fundamental measure of this relationship is the total number of amino acid sequences that can fold to a target protein structure, known as the "sequence capacity," which has been suggested as a proxy for how designable a given protein fold is. Although sequence capacity has been extensively studied using lattice models and theory, numerical estimates for real protein structures are currently lacking. In this work, we have quantitatively estimated the sequence capacity of 10 proteins with a variety of different structures using a statistical model based on residue-residue co-evolution to capture the variation of sequences from the same protein family. Remarkably, we find that even for the smallest protein folds, such as the WW domain, the number of foldable sequences is extremely large, exceeding the Avogadro constant. In agreement with earlier theoretical work, the calculated sequence capacity is positively correlated with the size of the protein, or better, the density of contacts. This allows the absolute sequence capacity of a given protein to be approximately predicted from its structure. On the other hand, the relative sequence capacity, i.e., normalized by the total number of possible sequences, is an extremely tiny number and is strongly anti-correlated with the protein length. Thus, although there may be more foldable sequences for larger proteins, it will be much harder to find them. Lastly, we have correlated the evolutionary age of proteins in the CATH database with their sequence capacity as predicted by our model. The results suggest a trade-off between the opposing requirements of high designability and the likelihood of a novel fold emerging by chance.
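The distinction between absolute and relative sequence capacity in this abstract can be made concrete with a small calculation. The sketch below assumes a 20-letter amino acid alphabet, so that a chain of length L has 20^L possible sequences, and works in log10 to avoid overflow. The function name and the example inputs (a capacity equal to the Avogadro constant and a hypothetical 35-residue chain) are illustrative assumptions, not values reported by the authors.

```python
import math

AVOGADRO = 6.02214076e23  # Avogadro constant (exact SI value)


def log10_relative_capacity(log10_capacity, length, alphabet_size=20):
    """Return log10 of (foldable sequences / all possible sequences)."""
    # All possible sequences of this length: alphabet_size ** length.
    log10_total = length * math.log10(alphabet_size)
    return log10_capacity - log10_total


# Illustrative input only: capacity set to the Avogadro constant and a
# hypothetical 35-residue chain. Neither number is taken from the paper.
example = log10_relative_capacity(math.log10(AVOGADRO), 35)
print(f"log10(relative capacity) = {example:.2f}")
```

With the absolute capacity held fixed, doubling the length lowers the relative capacity by many further orders of magnitude, which is the sense in which foldable sequences for larger proteins are harder to find by chance.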
Affiliation(s)
- Pengfei Tian
- Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland
- Robert B Best
- Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland.
2
Skolnick J, Zhou H. Why Is There a Glass Ceiling for Threading Based Protein Structure Prediction Methods? J Phys Chem B 2016; 121:3546-3554. [PMID: 27748116 DOI: 10.1021/acs.jpcb.6b09517] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Indexed: 12/29/2022]
Abstract
Despite their different implementations, comparison of the best threading approaches to the prediction of evolutionarily distant protein structures reveals that they tend to succeed or fail on the same protein targets. This is true despite the fact that the structural template library has good templates for all cases. Thus, a key question is why certain protein structures are threadable while others are not. Comparison with threading results on a set of artificial sequences selected for stability further argues that the failure of threading is due to the nature of the protein structures themselves. Using a new contact map based alignment algorithm, we demonstrate that certain folds are highly degenerate in that they can have very similar coarse grained fractions of native contacts aligned and yet differ significantly from the native structure. For threadable proteins, this is not the case. Thus, contemporary threading approaches appear to have reached a plateau, and new approaches to structure prediction are required.
Affiliation(s)
- Jeffrey Skolnick
- Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, 950 Atlantic Drive Northwest, Atlanta, Georgia 30318, United States
- Hongyi Zhou
- Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, 950 Atlantic Drive Northwest, Atlanta, Georgia 30318, United States
3
Li J, Fang H. A comparison of different functions for predicted protein model quality assessment. J Comput Aided Mol Des 2016; 30:553-8. [PMID: 27488386 DOI: 10.1007/s10822-016-9924-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2016] [Accepted: 07/08/2016] [Indexed: 11/30/2022]
Abstract
In protein structure prediction, a considerable number of models are usually produced by either the Template-Based Method (TBM) or the ab initio prediction. The purpose of this study is to find the critical parameter in assessing the quality of the predicted models. A non-redundant template library was developed and 138 target sequences were modeled. The target sequences were all distant from the proteins in the template library and were aligned with template library proteins on the basis of the transformation matrix. The quality of each model was first assessed with QMEAN and its six parameters, which are C_β interaction energy (C_beta), all-atom pairwise energy (PE), solvation energy (SE), torsion angle energy (TAE), secondary structure agreement (SSA), and solvent accessibility agreement (SAE). Finally, the alignment score (score) was also used to assess the quality of model. Hence, a total of eight parameters (i.e., QMEAN, C_beta, PE, SE, TAE, SSA, SAE, score) were independently used to assess the quality of each model. The results indicate that SSA is the best parameter to estimate the quality of the model.
Affiliation(s)
- Juan Li
- Department of Hematology, Nanjing Drum Tower Hospital, The Affiliated Hospital of Nanjing University Medical School, Nanjing, Jiangsu, 210008, People's Republic of China
- Huisheng Fang
- School of Life Science and Technology, China Pharmaceutical University, Nanjing, Jiangsu, 210009, People's Republic of China.
4
An empirical energy function for structural assessment of protein transmembrane domains. Biochimie 2015; 115:155-61. [DOI: 10.1016/j.biochi.2015.05.018] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 01/19/2015] [Accepted: 05/21/2015] [Indexed: 11/19/2022]
5
Chae MH, Krull F, Knapp EW. Optimized distance-dependent atom-pair-based potential DOOP for protein structure prediction. Proteins 2015; 83:881-90. [PMID: 25693513 DOI: 10.1002/prot.24782] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 10/21/2014] [Revised: 02/06/2015] [Accepted: 02/10/2015] [Indexed: 12/20/2022]
Abstract
The DOcking decoy-based Optimized Potential (DOOP) energy function for protein structure prediction is based on empirical distance-dependent atom-pair interactions. To optimize the atom-pair interactions, native protein structures are decomposed into polypeptide chain segments that correspond to structural motives involving complete secondary structure elements. They constitute near native ligand-receptor systems (or just pairs). Thus, a total of 8609 ligand-receptor systems were prepared from 954 selected proteins. For each of these hypothetical ligand-receptor systems, 1000 evenly sampled docking decoys with 0-10 Å interface root-mean-square-deviation (iRMSD) were generated with a method used before for protein-protein docking. A neural network-based optimization method was applied to derive the optimized energy parameters using these decoys so that the energy function mimics the funnel-like energy landscape for the interaction between these hypothetical ligand-receptor systems. Thus, our method hierarchically models the overall funnel-like energy landscape of native protein structures. The resulting energy function was tested on several commonly used decoy sets for native protein structure recognition and compared with other statistical potentials. In combination with a torsion potential term which describes the local conformational preference, the atom-pair-based potential outperforms other reported statistical energy functions in correct ranking of native protein structures for a variety of decoy sets. This is especially the case for the most challenging ROSETTA decoy set, although it does not take into account side chain orientation-dependence explicitly. The DOOP energy function for protein structure prediction, the underlying database of protein structures with hypothetical ligand-receptor systems and their decoys are freely available at http://agknapp.chemie.fu-berlin.de/doop/.
Affiliation(s)
- Myong-Ho Chae
- Department of Biology, University of Science, Unjong-District, Pyongyang, DPR Korea
6
7
Gniewek P, Kolinski A, Gront D. Optimization of profile-to-profile alignment parameters for one-dimensional threading. J Comput Biol 2012; 19:879-86. [PMID: 22731622 DOI: 10.1089/cmb.2011.0307] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 12/22/2022] Open
Abstract
The development of automatic approaches for the comparison of protein sequences has become increasingly important. Methods that compare profiles allow for the use of information about whole protein families, resulting in more sensitive and accurate detection of distantly related sequences. In this contribution, we describe a thorough optimization and tests of a profile-to-profile alignment method. A number of different scoring schemes has been implemented and compared on the basis of their ability to identify a template protein from the same SCOP family as a query. In addition to sequence profiles, secondary structure profiles were used to increase the rate of successful detection. Our results show that a properly tuned one-dimensional threading method can recognize a correct template from the same SCOP family nearly as well as structural alignment. Our benchmark set, which might be useful in other similar studies, as well as the fold-recognition software we developed may be downloaded (www.bioshell.pl/profile-alignments).
Affiliation(s)
- Pawel Gniewek
- Faculty of Chemistry, Warsaw University, Warsaw, Poland
8
Vishnepolsky B, Managadze G, Grigolava M, Pirtskhalava M. Evaluation performance of substitution matrices, based on contacts between residue terminal groups. J Biomol Struct Dyn 2012; 30:180-90. [PMID: 22702729 DOI: 10.1080/07391102.2012.677769] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/28/2022]
Abstract
Sequence alignment is a standard method for the estimation of the evolutionary, structural, and functional relationships among amino acid sequences. The quality of alignments depends on the used similarity matrix. Statistical contact potentials (CPs) contain information on contact propensities among residues in native protein structures. Substitution matrices (SMs) based on CPs are applicable for the comparison of distantly related sequences. Here, contact between amino acids was estimated on the basis of the evaluation of the distances between side-chain terminal groups (SCTGs), which are defined as the group of the side-chain heavy atoms with fixed distances between them. In this paper, two new types of CPs and similarity matrices have been constructed: one based on fixed cutoff distance obtained from geometric characteristics of the SCTGs (TGC1), while the other is distance-dependent potential (TGC2). These matrices are compared with other popular SMs. The performance of the matrices was evaluated by comparing sequence with structural alignments. The obtained results show that TGC2 has the best performance among contact-based matrices, but on the whole, contact-based matrices have slightly lower performance than other SMs except fold-level similarity.
Affiliation(s)
- Boris Vishnepolsky
- Life Science Research Centre, Laboratory of Bioinformatics, 14 Gotua St, Tbilisi, 0160, Georgia.
9
Gront D, Blaszczyk M, Wojciechowski P, Kolinski A. BioShell Threader: protein homology detection based on sequence profiles and secondary structure profiles. Nucleic Acids Res 2012; 40:W257-62. [PMID: 22693216 PMCID: PMC3394251 DOI: 10.1093/nar/gks555] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 11/21/2022] Open
Abstract
The BioShell package has recently been extended with a web server for protein homology detection based on profile-to-profile alignment (known as 1D threading). Its aim is to assign structural templates to each domain of the query. The server uses sequence profiles that describe observed sequence variability and secondary structure profiles providing expected probability for a certain secondary structure type at a given position in a protein. Three independent predictors are used to increase the rate of successful predictions. Careful evaluation shows that there is nearly 80% chance that the query sequence belongs to the same SCOP family as the top scoring template. The Bioshell Threader server is freely available at: http://www.bioshell.pl/threader/.
Affiliation(s)
- Dominik Gront
- University of Warsaw, Faculty of Chemistry, Pasteura 1, 02-093 Warsaw, Poland.
10
Zhao F, Xu J. A position-specific distance-dependent statistical potential for protein structure and functional study. Structure 2012; 20:1118-26. [PMID: 22608968 PMCID: PMC3372698 DOI: 10.1016/j.str.2012.04.003] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Received: 01/14/2012] [Revised: 04/09/2012] [Accepted: 04/10/2012] [Indexed: 10/28/2022]
Abstract
Although studied extensively, designing a highly accurate protein energy potential is still challenging. Many knowledge-based statistical potentials are derived from the inverse of the Boltzmann law and consist of two major components: the observed atomic interacting probability and the reference state. These potentials mainly distinguish themselves in the reference state and use a similar simple counting method to estimate the observed probability, which is usually assumed to correlate with only atom types. This article takes a rather different view on the observed probability and parameterizes it by the protein sequence profile context of the atoms and the radius of gyration, in addition to atom types. Experiments confirm that our position-specific statistical potential outperforms the currently popular ones in several decoy discrimination tests. Our results imply that, in addition to the reference state, the observed probability also makes energy potentials different, and that evolutionary information greatly boosts the performance of energy potentials.
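The two components named in this abstract, an observed probability and a reference state, combine in the generic inverse-Boltzmann form E(r) = -kT ln(P_obs(r) / P_ref(r)). The sketch below illustrates only that textbook form with the simple counting estimate over distance bins; it is not the position-specific potential of the paper. The function name, the pseudocount, and the toy counts are assumptions made for illustration.

```python
import math


def inverse_boltzmann(observed_counts, reference_counts, kT=1.0, pseudocount=1e-6):
    """Per-bin energies E = -kT * ln(P_obs / P_ref) from raw bin counts."""
    obs_total = sum(observed_counts)
    ref_total = sum(reference_counts)
    energies = []
    for n_obs, n_ref in zip(observed_counts, reference_counts):
        # Simple counting estimate; the pseudocount guards against log(0).
        p_obs = n_obs / obs_total + pseudocount
        p_ref = n_ref / ref_total + pseudocount
        energies.append(-kT * math.log(p_obs / p_ref))
    return energies


# Toy distance bins (hypothetical counts): contacts are enriched in the
# last bin relative to a flat reference, so that bin gets negative energy.
energies = inverse_boltzmann([10, 30, 60], [100, 100, 100])
print([round(e, 3) for e in energies])
```

Bins where a contact is seen more often than the reference state predicts come out favorable (negative), and depleted bins come out unfavorable (positive), which is why the choice of reference state matters so much in practice.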
Affiliation(s)
- Feng Zhao
- Toyota Technological Institute at Chicago, Chicago IL, USA 60637
- Jinbo Xu
- Toyota Technological Institute at Chicago, Chicago IL, USA 60637
11
Movaghar AF, Launay G, Schbath S, Gibrat JF, Rodolphe F. Statistical significance of threading scores. J Comput Biol 2011; 19:13-29. [PMID: 22149633 DOI: 10.1089/cmb.2011.0236] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022] Open
Abstract
We present a general method for assessing threading score significance. The threading score of a protein sequence, threaded onto a given structure, should be compared with the threading score distribution of a random amino-acid sequence, of the same length, threaded onto the same structure; small p-values point to significantly high scores. We claim that, due to general protein contact map properties, this reference distribution is a Weibull extreme value distribution whose parameters depend on the threading method, the structure, the length of the query and the random sequence simulation model used. These parameters can be estimated off-line with simulated sequence samples, for different sequence lengths. They can further be interpolated at the exact length of a query, enabling the quick computation of the p-value.
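The workflow described here (parameters fitted off-line at several lengths, interpolated at the query length, then turned into a p-value) might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the parameterization, so the code uses the standard two-parameter Weibull survival function exp(-(x/scale)^shape) as a stand-in, linear interpolation between tabulated lengths, and a made-up parameter table. All names and numbers are hypothetical.

```python
import math
from bisect import bisect_left


def interpolate_params(length, table):
    """Linearly interpolate (shape, scale) from rows of (length, shape, scale).

    The table must be sorted by length; queries outside it are clamped.
    """
    lengths = [row[0] for row in table]
    if length <= lengths[0]:
        return table[0][1], table[0][2]
    if length >= lengths[-1]:
        return table[-1][1], table[-1][2]
    i = bisect_left(lengths, length)
    (l0, k0, s0), (l1, k1, s1) = table[i - 1], table[i]
    t = (length - l0) / (l1 - l0)
    return k0 + t * (k1 - k0), s0 + t * (s1 - s0)


def weibull_p_value(score, shape, scale):
    """P(X >= score) for a standard Weibull (assumed form, see lead-in)."""
    if score <= 0:
        return 1.0
    return math.exp(-((score / scale) ** shape))


# Hypothetical off-line fits at two sequence lengths.
table = [(100, 2.0, 10.0), (200, 3.0, 20.0)]
shape, scale = interpolate_params(150, table)
p = weibull_p_value(15.0, shape, scale)
print(shape, scale, p)
```

Because the expensive simulation step happens once per structure and method, only the interpolation and a single exponential are evaluated per query, which is what makes the p-value quick to compute.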
12
Kuziemko A, Honig B, Petrey D. Using structure to explore the sequence alignment space of remote homologs. PLoS Comput Biol 2011; 7:e1002175. [PMID: 21998567 PMCID: PMC3188491 DOI: 10.1371/journal.pcbi.1002175] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 02/01/2011] [Accepted: 07/14/2011] [Indexed: 11/18/2022] Open
Abstract
Protein structure modeling by homology requires an accurate sequence alignment between the query protein and its structural template. However, sequence alignment methods based on dynamic programming (DP) are typically unable to generate accurate alignments for remote sequence homologs, thus limiting the applicability of modeling methods. A central problem is that the alignment that is “optimal” in terms of the DP score does not necessarily correspond to the alignment that produces the most accurate structural model. That is, the correct alignment based on structural superposition will generally have a lower score than the optimal alignment obtained from sequence. Variations of the DP algorithm have been developed that generate alternative alignments that are “suboptimal” in terms of the DP score, but these still encounter difficulties in detecting the correct structural alignment. We present here a new alternative sequence alignment method that relies heavily on the structure of the template. By initially aligning the query sequence to individual fragments in secondary structure elements and combining high-scoring fragments that pass basic tests for “modelability”, we can generate accurate alignments within a small ensemble. Our results suggest that the set of sequences that can currently be modeled by homology can be greatly extended.
It has been suggested that, for nearly every protein sequence, there is already a protein with a similar structure in current protein structure databases. However, with poor or undetectable sequence relationships, it is expected that accurate alignments and models cannot be generated. Here we show that this is not the case, and that whenever a structural relationship exists, there are usually local sequence relationships that can be used to generate an accurate alignment, no matter what the global sequence identity. However, this requires an alternative to the traditional dynamic programming algorithm and the consideration of a small ensemble of alignments. We present an algorithm, S4, and demonstrate that it is capable of generating accurate alignments in nearly all cases where a structural relationship exists between two proteins. Our results thus constitute an important advance in the full exploitation of the information in structural databases. That is, the expectation of an accurate alignment suggests that a meaningful model can be generated for nearly every sequence for which a suitable template exists.
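For readers unfamiliar with what "optimal in terms of the DP score" means, the sketch below computes the score of the best global alignment with the classic Needleman-Wunsch recurrence. It illustrates the traditional DP baseline the abstract contrasts itself with, not the S4 algorithm. The scoring values (match, mismatch, linear gap penalty) are arbitrary assumptions; real methods use substitution matrices and affine gaps.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of strings a and b."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACDE", "ADE"))
```

The recurrence returns one maximal score, and a traceback would yield one alignment achieving it. The abstract's point is that, for remote homologs, the structurally correct alignment often scores below this maximum, which motivates considering a small ensemble of alternatives.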
Affiliation(s)
- Andrew Kuziemko
- Howard Hughes Medical Institute, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York, United States of America
- Center for Computational Biology and Bioinformatics, Columbia University, New York, New York, United States of America
- Barry Honig
- Howard Hughes Medical Institute, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York, United States of America
- Center for Computational Biology and Bioinformatics, Columbia University, New York, New York, United States of America
- Donald Petrey
- Howard Hughes Medical Institute, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York, United States of America
- Center for Computational Biology and Bioinformatics, Columbia University, New York, New York, United States of America
13
Hu Y, Dong X, Wu A, Cao Y, Tian L, Jiang T. Incorporation of local structural preference potential improves fold recognition. PLoS One 2011; 6:e17215. [PMID: 21365008 PMCID: PMC3041821 DOI: 10.1371/journal.pone.0017215] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 11/01/2010] [Accepted: 01/25/2011] [Indexed: 11/19/2022] Open
Abstract
Fold recognition, or threading, is a popular protein structure modeling approach that uses known structure templates to build structures for proteins of unknown structure. The key to the success of fold recognition methods lies in the proper integration of sequence, physicochemical and structural information. Here we introduce another type of information, local structural preference potentials of 3-residue and 9-residue fragments, for fold recognition. By combining the two local structural preference potentials with the widely used sequence profile, secondary structure information and hydrophobic score, we have developed a new threading method called FR-t5 (fold recognition by use of 5 terms). In benchmark tests, we have found that the consideration of local structural preference potentials in FR-t5 not only greatly enhances the alignment accuracy and recognition sensitivity, but also significantly improves the quality of prediction models.
Affiliation(s)
- Yun Hu
- National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
- Graduate University of Chinese Academy of Sciences, Beijing, China
- Xiaoxi Dong
- National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
- Graduate University of Chinese Academy of Sciences, Beijing, China
- Aiping Wu
- National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
- Yang Cao
- National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
- Graduate University of Chinese Academy of Sciences, Beijing, China
- Liqing Tian
- National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
- Graduate University of Chinese Academy of Sciences, Beijing, China
- Taijiao Jiang
- National Laboratory of Biomacromolecules, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China
14
Lee SY, Skolnick J. TASSER_WT: a protein structure prediction algorithm with accurate predicted contact restraints for difficult protein targets. Biophys J 2011; 99:3066-75. [PMID: 21044605 DOI: 10.1016/j.bpj.2010.09.007] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 07/30/2010] [Revised: 08/29/2010] [Accepted: 09/07/2010] [Indexed: 12/29/2022] Open
Abstract
To improve the prediction accuracy in the regime where template alignment quality is poor, an updated version of TASSER_2.0, namely TASSER_WT, was developed. TASSER_WT incorporates more accurate contact restraints from a new method, COMBCON. COMBCON uses confidence-weighted contacts from PROSPECTOR_3.5, the latest version, PROSPECTOR_4, and a new local structural fragment-based threading algorithm, STITCH, implemented in two variants depending on expected fragment prediction accuracy. TASSER_WT is tested on 622 Hard proteins, the most difficult targets (incorrect alignments and/or templates and incorrect side-chain contact restraints) in a comprehensive benchmark of 2591 nonhomologous, single domain proteins ≤ 200 residues that cover the PDB at 35% pairwise sequence identity. For 454 of 622 Hard targets, COMBCON provides contact restraints with higher accuracy and number of contacts per residue. As contact coverage with confidence weight ≥ 3 (F(wt ≥ 3)(cov)) increases, the more improved are TASSER_WT models. When F(wt ≥ 3)(cov) > 1.0 and > 0.4, the average root mean-square deviation of TASSER_WT (TASSER_2.0) models is 4.11 Å (6.72 Å) and 5.03 Å (6.40 Å), respectively. Regarding a structure prediction as successful when a model has a TM-score to the native structure ≥ 0.4, when F(wt ≥ 3)(cov) > 1.0 and > 0.4, the success rate of TASSER_WT (TASSER_2.0) is 98.8% (76.2%) and 93.7% (81.1%), respectively.
Affiliation(s)
- Seung Yup Lee
- Center for Study of Systems Biology, Georgia Institute of Technology, Atlanta, Georgia, USA
15
Cetin H, Sasaki TN, Sasai M. The Fragment-based Consistency Score in Model Quality Assessment for De Novo Prediction of Protein Structures. CHEM-BIO INFORMATICS JOURNAL 2011. [DOI: 10.1273/cbij.11.63] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Affiliation(s)
- Hikmet Cetin
- Department of Computational Science and Engineering, Nagoya University
- Masaki Sasai
- Department of Computational Science and Engineering, Nagoya University
- School of Computational Sciences, Korea Institute for Advanced Study
- Okazaki Institute for Integrative Bioscience
16
Zhou Y, Duan Y, Yang Y, Faraggi E, Lei H. Trends in template/fragment-free protein structure prediction. Theor Chem Acc 2011; 128:3-16. [PMID: 21423322 PMCID: PMC3030773 DOI: 10.1007/s00214-010-0799-2] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Received: 05/17/2010] [Accepted: 08/15/2010] [Indexed: 12/13/2022]
Abstract
Predicting the structure of a protein from its amino acid sequence is a long-standing unsolved problem in computational biology. Its solution would be of both fundamental and practical importance as the gap between the number of known sequences and the number of experimentally solved structures widens rapidly. Currently, the most successful approaches are based on fragment/template reassembly. Lacking progress in template-free structure prediction calls for novel ideas and approaches. This article reviews trends in the development of physical and specific knowledge-based energy functions as well as sampling techniques for fragment-free structure prediction. Recent physical- and knowledge-based studies demonstrated that it is possible to sample and predict highly accurate protein structures without borrowing native fragments from known protein structures. These emerging approaches with fully flexible sampling have the potential to move the field forward.
Affiliation(s)
- Yaoqi Zhou
- School of Informatics, Indiana Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indiana University Purdue University, 719 Indiana Ave #319, Walker Plaza Building, Indianapolis, IN 46202 USA
- Yong Duan
- UC Davis Genome Center and Department of Applied Science, University of California, One Shields Avenue, Davis, CA USA
- College of Physics, Huazhong University of Science and Technology, 1037 Luoyu Road, 430074 Wuhan, China
- Yuedong Yang
- School of Informatics, Indiana Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indiana University Purdue University, 719 Indiana Ave #319, Walker Plaza Building, Indianapolis, IN 46202 USA
- Eshel Faraggi
- School of Informatics, Indiana Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indiana University Purdue University, 719 Indiana Ave #319, Walker Plaza Building, Indianapolis, IN 46202 USA
- Hongxing Lei
- UC Davis Genome Center and Department of Applied Science, University of California, One Shields Avenue, Davis, CA USA
- Beijing Institute of Genomics, Chinese Academy of Sciences, 100029 Beijing, China
17
Zhang J, Zhang Y. A novel side-chain orientation dependent potential derived from random-walk reference state for protein fold selection and structure prediction. PLoS One 2010; 5:e15386. [PMID: 21060880 PMCID: PMC2965178 DOI: 10.1371/journal.pone.0015386] [Citation(s) in RCA: 171] [Impact Index Per Article: 12.2] [Received: 08/05/2010] [Accepted: 09/01/2010] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND: An accurate potential function is essential to attack protein folding and structure prediction problems. The key to developing efficient knowledge-based potential functions is to design reference states that can appropriately counteract generic interactions. However, the reference states of many knowledge-based distance-dependent atomic potential functions were derived from non-interacting particles such as an ideal gas, which ignores the inherent sequence connectivity and entropic elasticity of proteins. METHODOLOGY: We developed a new pair-wise distance-dependent, atomic statistical potential function (RW), using an ideal random-walk chain as reference state, which was optimized on CASP models and then benchmarked on nine structural decoy sets. Second, we incorporated a new side-chain orientation-dependent energy term into RW (RWplus) and found that the side-chain packing orientation specificity can further improve the decoy recognition ability of the statistical potential. SIGNIFICANCE: RW and RWplus demonstrate a significantly better ability than the best performing pair-wise distance-dependent atomic potential functions in both native and near-native model selections. They have higher energy-RMSD and energy-TM-score correlations compared with other potentials of the same type in real-life structure assembly decoys. When benchmarked with a comprehensive list of publicly available potentials, RW and RWplus show comparable performance to the state-of-the-art scoring functions, including those combining terms from multiple resources. These data demonstrate the usefulness of a random-walk chain as a reference state that correctly accounts for the sequence connectivity and entropic elasticity of proteins. The potentials show potential usefulness in structure recognition and protein folding simulations. The RW and RWplus potentials, as well as the newly generated I-TASSER decoys, are freely available at http://zhanglab.ccmb.med.umich.edu/RW.
Affiliation(s)
- Jian Zhang
- Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
| | - Yang Zhang
- Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
|
18
|
Feng Y, Kloczkowski A, Jernigan RL. Potentials 'R' Us web-server for protein energy estimations with coarse-grained knowledge-based potentials. BMC Bioinformatics 2010; 11:92. [PMID: 20163737 PMCID: PMC3098114 DOI: 10.1186/1471-2105-11-92] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Received: 07/31/2009] [Accepted: 02/17/2010] [Indexed: 11/13/2022] Open
Abstract
Background Knowledge-based potentials have been widely used in the last 20 years for fold recognition, protein structure prediction from amino acid sequence, ligand binding, protein design, and many other purposes. However, these are generally not readily accessible online. Results Our new knowledge-based potential server makes many of these potentials available for easy use, automatically computing the energies of the protein structures or models supplied. Our web server for protein energy estimation uses four-body potentials, short-range potentials, and 23 different two-body potentials. Users can select potentials according to their needs and preferences. Files containing the coordinates of protein atoms in the PDB format can be uploaded as input. The results are returned to the user's email address. Conclusions Our Potentials 'R' Us server is an easily accessible, freely available tool with a web interface that collects all existing and future protein coarse-grained potentials and computes energies of multiple structural models.
Affiliation(s)
- Yaping Feng
- Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA 50011-0320, USA
|
19
|
Aloy P, Oliva B. Splitting statistical potentials into meaningful scoring functions: testing the prediction of near-native structures from decoy conformations. BMC Struct Biol 2009; 9:71. [PMID: 19917096 PMCID: PMC2783033 DOI: 10.1186/1472-6807-9-71] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 03/24/2009] [Accepted: 11/16/2009] [Indexed: 11/20/2022]
Abstract
Background Recent advances in high-throughput technologies have produced a vast number of protein sequences, while the number of high-resolution structures has seen only a limited increase. This has driven the development of many strategies to build protein structures from their sequences, generating a considerable number of alternative models. Selecting the model closest to the native conformation has thus become crucial for structure prediction. Several methods have been developed to score protein models by energies, knowledge-based potentials, and combinations of both. Results Here, we present and demonstrate a theory to split knowledge-based potentials into biologically meaningful scoring terms and to combine them into new scores to predict near-native structures. Our strategy circumvents the problem of defining the reference state. In this approach we give the proof for a simple and linear application that can be further improved by optimizing the combination of Zscores. Using the simplest composite score () we obtained predictions similar to those of state-of-the-art methods. In addition, our approach has the advantage of identifying the most relevant terms involved in the stability of the protein structure. Finally, we also use the composite Zscores to assess the conformation of models and to detect local errors. Conclusion We have introduced a method to split knowledge-based potentials and to solve the problem of defining a reference state. The new scores have detected near-native structures as accurately as state-of-the-art methods and have successfully identified wrongly modeled regions in many near-native conformations.
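The idea of combining per-term Zscores into one composite score can be illustrated with a minimal sketch. This is not the authors' score (the formula is elided in the abstract above); it assumes an unweighted sum of Zscores, and the term names and the function names `zscores`, `composite_zscore`, and `rank_models` are hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize one energy term across all models of a decoy set."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All models score identically on this term, so it carries no signal.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def composite_zscore(term_energies):
    """Sum the per-term Zscores of each model (assumption: equal weights).

    term_energies maps a term name to a list of energies, one per model.
    """
    per_term = [zscores(vals) for vals in term_energies.values()]
    return [sum(column) for column in zip(*per_term)]


def rank_models(term_energies):
    """Return model indices ordered from lowest to highest composite score."""
    composite = composite_zscore(term_energies)
    return sorted(range(len(composite)), key=lambda i: composite[i])


# Hypothetical energies for three models and two scoring terms.
example = {"pair": [-10.0, -5.0, 0.0], "solvation": [-3.0, -2.0, -1.0]}
print(rank_models(example))
```

Standardizing each term before summing puts terms with different scales on a common footing, which is one reason a Zscore combination can be used without fixing an absolute reference energy.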
Affiliation(s)
- Patrick Aloy
- Institut de Recerca Biomèdica and Barcelona Supercomputing Center, 10-12 08028 Barcelona, Catalonia, Spain.
|
20
|
Zhu J, Fan H, Periole X, Honig B, Mark AE. Refining homology models by combining replica-exchange molecular dynamics and statistical potentials. Proteins 2009; 72:1171-88. [PMID: 18338384 DOI: 10.1002/prot.22005] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.1] [Indexed: 12/25/2022]
Abstract
A protocol is presented for the global refinement of homology models of proteins. It combines the advantages of temperature-based replica-exchange molecular dynamics (REMD) for conformational sampling and the use of statistical potentials for model selection. The protocol was tested using 21 models. Of these, 14 were models of 10 small proteins for which high-resolution crystal structures were available; the remainder were targets of the recent CASPR exercise. It was found that REMD in combination with currently available force fields could sample near-native conformational states starting from high-quality homology models. Conformations in which the backbone RMSD of secondary structure elements (SSE-RMSD) was lower than the starting value by 0.5-1.0 Å were found for 15 out of the 21 cases (average 0.82 Å). Furthermore, when a simple scoring function consisting of two statistical potentials was used to rank the structures, one or more structures with an SSE-RMSD at least 0.2 Å lower than the starting value were found among the five best ranked structures in 11 out of the 21 cases. The average improvement in SSE-RMSD for the best models was 0.42 Å. However, none of the scoring functions tested identified the structures with the lowest SSE-RMSD as the best models, although all identified the native conformation as the one with the lowest energy. This suggests that, while the proposed protocol proved effective for the refinement of high-quality models of small proteins, scoring functions remain one of the major limiting factors in structure refinement. This and other aspects by which the methodology could be further improved are discussed.
Affiliation(s)
- Jiang Zhu
- Howard Hughes Medical Institute and Columbia University, Center for Computational Biology and Bioinformatics, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, USA
|
21
|
Vishnepolsky B, Managadze G, Pirtskhalava M. Comparison of the efficiency of evolutionary change-based and side chain orientation-based fold recognition potentials. Proteins 2008; 71:1863-78. [PMID: 18175309 DOI: 10.1002/prot.21871] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/12/2022]
Abstract
The present article describes SORDIS, a residue-level knowledge-based potential. SORDIS incorporates information on side-chain orientation relative to hydrophobic core centres, the distance of a residue from the globule centre, and secondary structure. SORDIS has been tested and compared with widely used evolutionary change-based substitution matrices (BLOSUM, PAM, GONNET, Johnson-Overington, BLAJ, HSDM, and STROMA) in fold recognition experiments within the zone of weak sequence similarity (<16%). The results show that the lower the amino acid similarity between homologous pairs, the higher the performance of SORDIS relative to potentials based on information about evolutionary changes. Therefore, we propose that the use of SORDIS in fold recognition can be useful.
Affiliation(s)
- Boris Vishnepolsky
- Institute of Molecular Biology and Biological Physics, Tbilisi 0160, Georgia
|
22
|
Pulim V, Bienkowska J, Berger B. LTHREADER: prediction of extracellular ligand-receptor interactions in cytokines using localized threading. Protein Sci 2007; 17:279-92. [PMID: 18096641 DOI: 10.1110/ps.073178108] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 10/22/2022]
Abstract
Identification of extracellular ligand-receptor interactions is important for drug design and the treatment of diseases. Difficulties in detecting these interactions using high-throughput experimental techniques motivate the development of computational prediction methods. We propose a novel threading algorithm, LTHREADER, which generates accurate local sequence-structure interface alignments and integrates various statistical scores and experimental binding data to predict interactions within ligand-receptor families. LTHREADER uses a profile of secondary structure and solvent accessibility predictions with residue contact maps to guide and constrain alignments. Using a decision tree classifier and low-throughput experimental data for training, it combines information inferred from statistical interaction potentials, energy functions, correlated mutations, and conserved residue pairs to predict interactions. We apply our method to cytokines, which play a central role in the development of many diseases including cancer and inflammatory and autoimmune disorders. We tested our approach on two representative families from different structural classes (all-alpha and all-beta proteins) of cytokines. In comparison with the state-of-the-art threader RAPTOR, LTHREADER generates on average 20% more accurate alignments of interacting residues. Furthermore, in cross-validation tests, LTHREADER correctly predicts experimentally confirmed interactions for a common binding mode within the 4-helical long-chain cytokine family with 75% sensitivity and 86% specificity with 40% gain in sensitivity compared to RAPTOR. For the TNF-like family our method achieves 70% sensitivity with 55% specificity with 70% gain in sensitivity. LTHREADER combines information from multiple complex templates when such data are available. When only one solved structure is available, a localized PSI-BLAST approach also outperforms standard threading methods with 25%-50% improvements in sensitivity.
Affiliation(s)
- Vinay Pulim
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139 USA
|
23
|
Liu S, Zhang C, Liang S, Zhou Y. Fold recognition by concurrent use of solvent accessibility and residue depth. Proteins 2007; 68:636-45. [PMID: 17510969 DOI: 10.1002/prot.21459] [Citation(s) in RCA: 78] [Impact Index Per Article: 4.6] [Indexed: 11/11/2022]
Abstract
Recognizing structural similarity without significant sequence identity (called fold recognition) is the key to bridging the gap between the number of known protein sequences and the number of structures solved. Previously, we developed a fold-recognition method called SP(3), which combines sequence-derived sequence profiles, secondary-structure profiles, and residue-depth-dependent, structure-derived sequence profiles. The use of residue-depth-dependent profiles makes SP(3) one of the best automatic predictors in CASP 6. Because residue depth (RD) and solvent accessible surface area (solvent accessibility) are complementary in describing the exposure of a residue to solvent, we test whether incorporating solvent-accessibility profiles into SP(3) could further increase the accuracy of fold recognition. The resulting method, called SP(4), was tested on the SALIGN benchmark for alignment accuracy and on Lindahl, LiveBench 8, and CASP7 blind prediction for fold recognition sensitivity and model-structure accuracy. For remote homologs, SP(4) is found to consistently improve over SP(3) in the accuracy of sequence alignment and predicted structural models as well as in the sensitivity of fold recognition. Our results suggest that RD and solvent accessibility can be used concurrently to improve the accuracy and sensitivity of fold recognition. The SP(4) server and its local usage package are available at http://sparks.informatics.iupui.edu/SP4.
Affiliation(s)
- Song Liu
- Howard Hughes Medical Institute Center for Single Molecule Biophysics, Department of Physiology and Biophysics, State University of New York at Buffalo, Buffalo, New York 14214, USA
|
24
|
Koike R, Kinoshita K, Kidera A. Probabilistic alignment detects remote homology in a pair of protein sequences without homologous sequence information. Proteins 2007; 66:655-63. [PMID: 17152080 DOI: 10.1002/prot.21240] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/12/2022]
Abstract
Dynamic programming (DP) and its heuristic algorithms are the most fundamental methods for similarity searches of amino acid sequences. Their detection power has been improved by including supplemental information, such as homologous sequences in the profile method. Here, we describe a method, probabilistic alignment (PA), that gives improved detection power but, like the original DP, uses only a pair of amino acid sequences. Receiver operating characteristic (ROC) analysis demonstrated that the PA method is far superior to BLAST, and that its sensitivity and selectivity approach those of PSI-BLAST. Particularly for orphan proteins having few homologues in the database, PA exhibits much better performance than PSI-BLAST. On the basis of this observation, we applied the PA method to a homology search of two orphan proteins, Latexin and Resuscitation-promoting factor domain. Their molecular functions have been described based on structural similarities, but sequence homologues have not been identified by PSI-BLAST. PA successfully detected sequence homologues for the two proteins and confirmed that the observed structural similarities are the result of an evolutionary relationship.
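For orientation, the baseline pairwise DP that the abstract above contrasts with PA can be sketched as a standard global (Needleman-Wunsch style) alignment score. This is not the PA method itself; the scoring values (match, mismatch, linear gap penalty) and the function name `global_alignment_score` are illustrative assumptions, and a real search would use a substitution matrix such as BLOSUM.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the DP table,
    so memory is O(len(b)) while time is O(len(a) * len(b)).
    """
    # Row 0: aligning an empty prefix of a against prefixes of b costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # Column 0: prefix of a against an empty prefix of b.
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = prev[j - 1] + substitution  # align a[i-1] with b[j-1]
            up = prev[j] + gap                     # gap in b
            left = curr[j - 1] + gap               # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))
```

The score returned is that of the single best alignment; probabilistic approaches instead weigh many alternative alignments, which is the distinction the abstract draws.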
Affiliation(s)
- Ryotaro Koike
- Global Scientific Information and Computing Center, Tokyo Institute of Technology, Ookayama, Tokyo 152-8550, Japan
|
25
|
Shen MY, Sali A. Statistical potential for assessment and prediction of protein structures. Protein Sci 2007; 15:2507-24. [PMID: 17075131 PMCID: PMC2242414 DOI: 10.1110/ps.062416606] [Citation(s) in RCA: 1778] [Impact Index Per Article: 104.6] [Indexed: 10/24/2022]
Abstract
Protein structures in the Protein Data Bank provide a wealth of data about the interactions that determine the native states of proteins. Using probability theory, we derive an atomic distance-dependent statistical potential from a sample of native structures that does not depend on any adjustable parameters (Discrete Optimized Protein Energy, or DOPE). DOPE is based on an improved reference state that corresponds to noninteracting atoms in a homogeneous sphere with the radius dependent on a sample native structure; it thus accounts for the finite and spherical shape of the native structures. The DOPE potential was extracted from a nonredundant set of 1472 crystallographic structures. We tested DOPE and five other scoring functions on the detection of the native state among six multiple target decoy sets, the correlation between the score and model error, and the identification of the most accurate non-native structure in the decoy set. For all decoy sets, DOPE is the best performing function in terms of all criteria, except for a tie in one criterion for one decoy set. To facilitate its use in various applications, such as model assessment, loop modeling, and fitting into cryo-electron microscopy mass density maps combined with comparative protein structure modeling, DOPE was incorporated into the modeling package MODELLER-8.
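A distance-dependent statistical potential of this general kind is commonly written as an inverse-Boltzmann relation, E(r) = -kT ln[P_obs(r) / P_ref(r)]. The sketch below shows only that generic relation for one atom-type pair; it does not implement DOPE's finite-sphere reference state, and the function name `statistical_potential`, the pseudocount, and the example counts are assumptions made for illustration.

```python
import math


def statistical_potential(observed, reference, kT=1.0, pseudocount=1e-6):
    """Per-bin energies from observed and reference distance histograms.

    observed[i] and reference[i] are pair counts in distance bin i.
    The reference histogram is supplied by the caller; choosing it is the
    modeling decision that distinguishes one potential from another.
    """
    if len(observed) != len(reference):
        raise ValueError("histograms must have the same number of bins")
    obs_total = sum(observed)
    ref_total = sum(reference)
    if obs_total <= 0 or ref_total <= 0:
        raise ValueError("histograms must contain at least one count")
    energies = []
    for obs, ref in zip(observed, reference):
        # The pseudocount keeps the logarithm finite for empty bins.
        p_obs = obs / obs_total + pseudocount
        p_ref = ref / ref_total + pseudocount
        energies.append(-kT * math.log(p_obs / p_ref))
    return energies


# Hypothetical counts in three distance bins.
print(statistical_potential([10, 30, 60], [20, 30, 50]))
```

Bins observed more often than the reference predicts get negative (favorable) energies, and bins observed less often get positive ones.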
Affiliation(s)
- Min-Yi Shen
- Department of Biopharmaceutical Sciences, Department of Pharmaceutical Chemistry, University of California at San Francisco, San Francisco, California 94158, USA.
|
26
|
Abstract
MOTIVATION Protein sequence alignment plays a critical role in computational biology as it is an integral part in many analysis tasks designed to solve problems in comparative genomics, structure and function prediction, and homology modeling. METHODS We have developed novel sequence alignment algorithms that compute the alignment between a pair of sequences based on short fixed- or variable-length high-scoring subsequences. Our algorithms build the alignments by repeatedly selecting the highest scoring pairs of subsequences and using them to construct small portions of the final alignment. We utilize PSI-BLAST generated sequence profiles and employ a profile-to-profile scoring scheme derived from PICASSO. RESULTS We evaluated the performance of the computed alignments on two recently published benchmark datasets and compared them against the alignments computed by existing state-of-the-art dynamic programming-based profile-to-profile local and global sequence alignment algorithms. Our results show that the new algorithms achieve alignments that are comparable with or better than those achieved by existing algorithms. Moreover, our results also showed that these algorithms can be used to provide better information as to which of the aligned positions are more reliable--a critical piece of information for comparative modeling applications.
Affiliation(s)
- Huzefa Rangwala
- Department of Computer Science & Engineering, University of Minnesota Minneapolis, MN 55455, USA.
|
27
|
Miyazawa S, Jernigan RL. How effective for fold recognition is a potential of mean force that includes relative orientations between contacting residues in proteins? J Chem Phys 2006; 122:024901. [PMID: 15638624 DOI: 10.1063/1.1824012] [Citation(s) in RCA: 51] [Impact Index Per Article: 2.8] [Indexed: 11/14/2022] Open
Abstract
We estimate the statistical distribution of relative orientations between contacting residues from a database of protein structures and evaluate the potential of mean force for relative orientations between contacting residues. Polar angles and Euler angles are used to specify two degrees of directional freedom and three degrees of rotational freedom for the orientation of one residue relative to another in contacting residues, respectively. A local coordinate system affixed to each residue based only on main chain atoms is defined for fold recognition. The number of contacting residue pairs in the database will severely limit the resolution of the statistical distribution of relative orientations, if it is estimated by dividing space into cells and counting samples observed in each cell. To overcome such problems and to evaluate the fully anisotropic distributions of relative orientations as a function of polar and Euler angles, we choose a method in which the observed distribution is represented as a sum of delta functions each of which represents the observed orientation of a contacting residue, and is evaluated as a series expansion of spherical harmonics functions. The sample size limits the frequencies of modes whose expansion coefficients can be reliably estimated. High frequency modes are statistically less reliable than low frequency modes. Each expansion coefficient is separately corrected for the sample size according to suggestions from a Bayesian statistical analysis. As a result, many expansion terms can be utilized to evaluate orientational distributions. Also, unlike other orientational potentials, the uniform distribution is used for a reference distribution in evaluating a potential of mean force for each type of contacting residue pair from its orientational distribution, so that residue-residue orientations can be fully evaluated. 
It is shown by using decoy sets that the discrimination power of the orientational potential in fold recognition increases by taking account of the Euler angle dependencies and becomes comparable to that of a simple contact potential, and that the total energy potential taken as a simple sum of contact, orientation, and (phi,psi) potentials performs well to identify the native folds.
Affiliation(s)
- Sanzo Miyazawa
- Faculty of Technology, Gunma University, Kiryu, Gunma 376-8515, Japan.
|
28
|
Jayaram B, Bhushan K, Shenoy SR, Narang P, Bose S, Agrawal P, Sahu D, Pandey V. Bhageerath: an energy based web enabled computer software suite for limiting the search space of tertiary structures of small globular proteins. Nucleic Acids Res 2006; 34:6195-204. [PMID: 17090600 PMCID: PMC1693886 DOI: 10.1093/nar/gkl789] [Citation(s) in RCA: 64] [Impact Index Per Article: 3.6] [Indexed: 11/29/2022] Open
Abstract
We describe here an energy based computer software suite for narrowing down the search space of tertiary structures of small globular proteins. The protocol comprises eight different computational modules that form an automated pipeline. It combines physics based potentials with biophysical filters to arrive at 10 plausible candidate structures starting from sequence and secondary structure information. The methodology has been validated here on 50 small globular proteins consisting of 2–3 helices and strands with known tertiary structures. For each of these proteins, a structure within 3–6 Å RMSD (root mean square deviation) of the native has been obtained in the 10 lowest energy structures. The protocol has been web enabled and is accessible at .
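The RMSD criterion quoted above can be made concrete with a minimal sketch. It assumes the two coordinate sets are already superimposed and listed in the same atom order; optimal superposition (for example, the Kabsch algorithm) is deliberately left out, and the function name `rmsd` is an illustrative choice.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length lists of (x, y, z).

    No superposition is performed: the caller is assumed to have aligned
    the structures beforehand.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical two-atom structures, the second shifted by 1 Å along x.
native = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
model = [(1.0, 0.0, 0.0), (2.5, 0.0, 0.0)]
print(rmsd(native, model))
```

Without a prior superposition, a rigid shift of the whole model contributes to the value, so this number is an upper bound on the superposition-minimized RMSD usually reported.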
Affiliation(s)
- B Jayaram
- Department of Chemistry and Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology Delhi Hauz Khas, New Delhi 110 016, India.
|
29
|
Chivian D, Baker D. Homology modeling using parametric alignment ensemble generation with consensus and energy-based model selection. Nucleic Acids Res 2006; 34:e112. [PMID: 16971460 PMCID: PMC1635247 DOI: 10.1093/nar/gkl480] [Citation(s) in RCA: 89] [Impact Index Per Article: 4.9] [Indexed: 11/12/2022] Open
Abstract
The accuracy of a homology model based on the structure of a distant relative or other topologically equivalent protein is primarily limited by the quality of the alignment. Here we describe a systematic approach for sequence-to-structure alignment, called ‘K*Sync’, in which alignments are generated by dynamic programming using a scoring function that combines information on many protein features, including a novel measure of how obligate a sequence region is to the protein fold. By systematically varying the weights on the different features that contribute to the alignment score, we generate very large ensembles of diverse alignments, each optimal under a particular constellation of weights. We investigate a variety of approaches to select the best models from the ensemble, including consensus of the alignments, a hydrophobic burial measure, low- and high-resolution energy functions, and combinations of these evaluation methods. The effect on model quality and selection resulting from loop modeling and backbone optimization is also studied. The performance of the method on a benchmark set is reported and shows the approach to be effective at both generating and selecting accurate alignments. The method serves as the foundation of the homology modeling module in the Robetta server.
Affiliation(s)
- Dylan Chivian
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- David Baker
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Howard Hughes Medical Institute, Seattle, WA, USA
- To whom correspondence should be addressed at Department of Biochemistry and HHMI, University of Washington, Box 357350, Seattle, WA 98195, USA. Tel: +1 206 543 1295; Fax: +1 206 685 1792;
|
30
|
Abstract
This study involves the development of a rapid comparative modeling tool for homologous sequences by extension of the TASSER methodology, developed for tertiary structure prediction. This comparative modeling procedure was validated on a representative benchmark set of proteins in the Protein Data Bank composed of 901 single domain proteins (41-200 residues) having sequence identities between 35-90% with respect to the template. Using a Monte Carlo search scheme with the length of runs optimized for weakly/nonhomologous proteins, TASSER often provides appreciable improvement in structure quality over the initial template. However, on average, this requires approximately 29 h of CPU time per sequence. Since homologous proteins are unlikely to require the extent of conformational search as weakly/nonhomologous proteins, TASSER's parameters were optimized to reduce the required CPU time to approximately 17 min, while retaining TASSER's ability to improve structure quality. Using this optimized TASSER (TASSER-Lite), we find an average improvement in the aligned region of approximately 10% in root mean-square deviation from native over the initial template. Comparison of TASSER-Lite with the widely used comparative modeling tool MODELLER showed that TASSER-Lite yields final models that are closer to the native. TASSER-Lite is provided on the web at (http://cssb.biology.gatech.edu/skolnick/webservice/tasserlite/index.html).
Affiliation(s)
- Shashi Bhushan Pandit
- Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, GA 30318, USA
|
31
|
Zhou H, Zhou Y. Fold recognition by combining sequence profiles derived from evolution and from depth-dependent structural alignment of fragments. Proteins 2006; 58:321-8. [PMID: 15523666 PMCID: PMC1408319 DOI: 10.1002/prot.20308] [Citation(s) in RCA: 195] [Impact Index Per Article: 10.8] [Indexed: 11/09/2022]
Abstract
Recognizing structural similarity without significant sequence identity has proved to be a challenging task. Sequence-based and structure-based methods as well as their combinations have been developed. Here, we propose a fold-recognition method that incorporates structural information without the need for sequence-to-structure threading. This is accomplished by generating sequence profiles from protein structural fragments. The structure-derived sequence profiles allow a simple integration with evolution-derived sequence profiles and secondary-structural information for an optimized alignment by efficient dynamic programming. The resulting method (called SP(3)) is found to make a statistically significant improvement in both sensitivity of fold recognition and accuracy of alignment over the method based on evolution-derived sequence profiles alone (SP) and the method based on evolution-derived sequence profile and secondary structure profile (SP(2)). SP(3) was tested on the SALIGN benchmark for alignment accuracy and on the Lindahl, PROSPECTOR 3.0, and LiveBench 8.0 benchmarks for remote-homology detection and model accuracy. SP(3) is found to be the most sensitive and accurate single-method server in all benchmarks tested where other methods are available for comparison (although its results are statistically indistinguishable from the next best in some cases, and the comparison is subject to the limitation of the time-dependent sequence and/or structural libraries used by different methods). In LiveBench 8.0, its accuracy rivals some of the consensus methods such as ShotGun-INBGU, Pmodeller3, Pcons4, and ROBETTA. The SP(3) fold-recognition server is available at http://theory.med.buffalo.edu.
Affiliation(s)
- Yaoqi Zhou
- *Correspondence to: Dr. Yaoqi Zhou, Howard Hughes Medical Institute, Center for Single Molecule Biophysics and Department of Physiology & Biophysics, State University of New York at Buffalo, 124 Sherman Hall, Buffalo, NY 14214. E-mail:
|
32
|
Abstract
Homology modeling plays a central role in determining protein structure in the structural genomics project. The importance of homology modeling has been steadily increasing because of the large gap that exists between the overwhelming number of available protein sequences and experimentally solved protein structures, and also, more importantly, because of the increasing reliability and accuracy of the method. In fact, a protein sequence with over 30% identity to a known structure can often be predicted with an accuracy equivalent to a low-resolution X-ray structure. The recent advances in homology modeling, especially in detecting distant homologues, aligning sequences with template structures, modeling of loops and side chains, as well as detecting errors in a model, have contributed to reliable prediction of protein structure, which was not possible even several years ago. The ongoing efforts in solving protein structures, which can be time-consuming and often difficult, will continue to spur the development of a host of new computational methods that can fill in the gap and further contribute to understanding the relationship between protein structure and function.
Affiliation(s)
- Zhexin Xiang
- Center for Molecular Modeling, Center for Information Technology, National Institutes of Health, Building 12A Room 2051, 12 South Drive, Bethesda, Maryland 20892-5624, USA.
|
33
|
Abstract
Two single-method servers, SPARKS 2 and SP3, participated in automatic-server predictions in CASP6. The overall results for all as well as detailed performance in comparative modeling targets are presented. It is shown that both SPARKS 2 and SP3 are able to recognize their corresponding best templates for all easy comparative modeling targets. The alignment accuracy, however, is not always the best among all the servers. Possible factors are discussed. SPARKS 2 and SP3 fold recognition servers, as well as their executables, are freely available for all academic users on http://theory.med.buffalo.edu.
Affiliation(s)
- Hongyi Zhou
- Howard Hughes Medical Institute Center for Single Molecule Biophysics, Department of Physiology and Biophysics, State University of New York, Buffalo, New York 14214, USA
|
34
|
Thompson JD, Koehl P, Ripp R, Poch O. BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins 2006; 61:127-36. [PMID: 16044462 DOI: 10.1002/prot.20527] [Citation(s) in RCA: 226] [Impact Index Per Article: 12.6] [Indexed: 11/07/2022]
Abstract
Multiple sequence alignment is one of the cornerstones of modern molecular biology. It is used to identify conserved motifs, to determine protein domains, in 2D/3D structure prediction by homology, and in evolutionary studies. Recently, high-throughput technologies such as genome sequencing and structural proteomics have led to an explosion in the amount of sequence and structure information available. In response, several new multiple alignment methods have been developed that improve both the efficiency and the quality of protein alignments. Consequently, the benchmarks used to evaluate and compare these methods must also evolve. We present here the latest release of the most widely used multiple alignment benchmark, BAliBASE, which provides high quality, manually refined, reference alignments based on 3D structural superpositions. Version 3.0 of BAliBASE includes new, more challenging test cases, representing the real problems encountered when aligning large sets of complex sequences. Using a novel, semiautomatic update protocol, the number of protein families in the benchmark has been increased and representative test cases are now available that cover most of the protein fold space. The total number of proteins in BAliBASE has also been significantly increased from 1444 to 6255 sequences. In addition, full-length sequences are now provided for all test cases, which represent difficult cases for both global and local alignment programs. Finally, the BAliBASE Web site (http://www-bio3d-igbmc.u-strasbg.fr/balibase) has been completely redesigned to provide a more user-friendly, interactive interface for the visualization of the BAliBASE reference alignments and the associated annotations.
Affiliation(s)
- Julie D Thompson
- Département de Biologie et Génomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire, (CNRS/INSERM/ULP), Illkirch Cedex, France.
|
35
|
Qiu J, Elber R. SSALN: an alignment algorithm using structure-dependent substitution matrices and gap penalties learned from structurally aligned protein pairs. Proteins 2006; 62:881-91. [PMID: 16385554 DOI: 10.1002/prot.20854] [Citation(s) in RCA: 68] [Impact Index Per Article: 3.8] [Indexed: 11/07/2022]
Abstract
In template-based modeling of protein structures, the generation of the alignment between the target and the template is a critical step that significantly affects the accuracy of the final model. This paper proposes an alignment algorithm SSALN that learns substitution matrices and position-specific gap penalties from a database of structurally aligned protein pairs. In addition to the amino acid sequence information, secondary structure and solvent accessibility information of a position are used to derive substitution scores and position-specific gap penalties. In a test set of CASP5 targets, SSALN outperforms sequence alignment methods such as a Smith-Waterman algorithm with BLOSUM50 and PSI_BLAST. SSALN also generates better alignments than PSI_BLAST in the CASP6 test set. LOOPP server prediction based on an SSALN alignment is ranked the best for target T0280_1 in CASP6. SSALN is also compared with several threading methods and sequence alignment methods on the ProSup benchmark. SSALN has the highest alignment accuracy among the methods compared. On the Fischer's benchmark, SSALN performs better than CLUSTALW and GenTHREADER, and generates more alignments with accuracy >50%, >60% or >70% than FUGUE, but fewer alignments with accuracy >80% than FUGUE. All the supplemental materials can be found at http://www.cs.cornell.edu/~jianq/research.htm.
Affiliation(s)
- Jian Qiu
- Department of Computer Science, Cornell University, Ithaca, New York 14853, USA
|
36
|
Cheng J, Baldi P. A machine learning information retrieval approach to protein fold recognition. Bioinformatics 2006; 22:1456-63. [PMID: 16547073 DOI: 10.1093/bioinformatics/btl102] [Citation(s) in RCA: 156] [Impact Index Per Article: 8.7] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Recognizing proteins that have similar tertiary structure is the key step of template-based protein structure prediction methods. Traditionally, a variety of alignment methods are used to identify similar folds, based on sequence similarity and sequence-structure compatibility. Although these methods are complementary, their integration has not been thoroughly exploited. Statistical machine learning methods provide tools for integrating multiple features, but so far these methods have been used primarily for protein and fold classification, rather than addressing the retrieval problem of fold recognition: finding a proper template for a given query protein. RESULTS: Here we present a two-stage machine learning, information retrieval approach to fold recognition. First, we use alignment methods to derive pairwise similarity features for query-template protein pairs. We also use global profile-profile alignments in combination with predicted secondary structure, relative solvent accessibility, contact map and beta-strand pairing to extract pairwise structural compatibility features. Second, we apply support vector machines to these features to predict the structural relevance (i.e. in the same fold or not) of the query-template pairs. For each query, the continuous relevance scores are used to rank the templates. The FOLDpro approach is modular, scalable and effective. Compared with 11 other fold recognition methods, FOLDpro yields the best results in almost all standard categories on a comprehensive benchmark dataset. Using predictions of the top-ranked template, the sensitivity is approximately 85, 56, and 27% at the family, superfamily and fold levels respectively. Using the 5 top-ranked templates, the sensitivity increases to 90, 70, and 48%.
Affiliation(s)
- Jianlin Cheng
- Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California Irvine, CA, USA
|
37
|
Inferring ideal amino acid interaction forms from statistical protein contact potentials. Proteins 2006; 59:49-57. [PMID: 15688450 DOI: 10.1002/prot.20380] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.1] [Indexed: 11/06/2022]
Abstract
We have analyzed 29 different published matrices of protein pairwise contact potentials (CPs) between amino acids derived from different sets of proteins, either crystallographic structures taken from the Protein Data Bank (PDB) or computer-generated decoys. Each of the CPs is similar to 1 of the 2 matrices derived in the work of Miyazawa and Jernigan (Proteins 1999;34:49-68). The CP matrices of the first class can be approximated with a correlation of order 0.9 by the formula e(ij) = h(i) + h(j), 1 ≤ i, j ≤ 20, where the residue-type dependent factor h is highly correlated with the frequency of occurrence of a given amino acid type inside proteins. Electrostatic interactions for the potentials of this class are almost negligible. In the potentials belonging to this class, the major contribution to the potentials is the one-body transfer energy of the amino acid from water to the protein environment. Potentials belonging to the second class can be approximated with a correlation of 0.9 by the formula e(ij) = c(0) - h(i)h(j) + q(i)q(j), where c(0) is a constant, h is highly correlated with the Kyte-Doolittle hydrophobicity scale, and a new, less dominant, residue-type dependent factor q is correlated (approximately 0.9) with amino acid isoelectric points pI. Including electrostatic interactions significantly improves the approximation for this class of potentials. While the high correlation between potentials of the first class and the hydrophobic transfer energies is well known, the fact that this approximation can work well also for the second class of potentials is a new finding. We interpret potentials of this class as representing energies of contact of amino acid pairs within an average protein environment.
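The two approximating forms quoted in this abstract can be written out as a minimal sketch. The per-residue factors below are random placeholders, not the fitted values from the paper, and the function names and the value of c0 are assumptions made for illustration only.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types

def additive_potential(h):
    """First-class form from the abstract: e(ij) = h(i) + h(j)."""
    return {(a, b): h[a] + h[b] for a in h for b in h}

def product_potential(h, q, c0):
    """Second-class form from the abstract: e(ij) = c0 - h(i)h(j) + q(i)q(j)."""
    return {(a, b): c0 - h[a] * h[b] + q[a] * q[b] for a in h for b in h}

# Placeholder factors (assumption): random numbers standing in for the
# hydrophobicity-like factor h and the charge-like factor q.
rng = random.Random(0)
h = {aa: rng.uniform(-1.0, 1.0) for aa in AMINO_ACIDS}
q = {aa: rng.uniform(-1.0, 1.0) for aa in AMINO_ACIDS}

e_first = additive_potential(h)
e_second = product_potential(h, q, c0=0.5)
```

Both forms yield a symmetric 20 x 20 matrix from only 20 (or 41) parameters, which is the sense in which the abstract describes the published potentials as reducible to one-body factors.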
|
38
|
Analysis of protein homology by assessing the (dis)similarity in protein loop regions. Proteins 2005; 57:539-47. [PMID: 15382231 DOI: 10.1002/prot.20237] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Indexed: 11/07/2022]
Abstract
Two proteins are considered to have a similar fold if sufficiently many of their secondary structure elements are positioned similarly in space and are connected in the same order. Such a common structural scaffold may arise due to either divergent or convergent evolution. The intervening unaligned regions ("loops") between the superimposable helices and strands can exhibit a wide range of similarity and may offer clues to the structural evolution of folds. One might argue that more closely related proteins differ less in their nonconserved loop regions than distantly related proteins and, at the same time, the degree of variability in the loop regions in structurally similar but unrelated proteins is higher than in homologs. Here we introduce a new measure for structural (dis)similarity in loop regions that is based on the concept of the Hausdorff metric. This measure is used to gauge protein relatedness and is tested on a benchmark of homologous and analogous protein structures. It has been shown that the new measure can distinguish homologous from analogous proteins with the same or higher accuracy than the conventional measures that are based on comparing proteins in structurally aligned regions. We argue that this result can be attributed to the higher sensitivity of the Hausdorff (dis)similarity measure in detecting particularly evident dissimilarities in structures and draw some conclusions about evolutionary relatedness of proteins in the most populated protein folds.
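For readers unfamiliar with the Hausdorff metric mentioned in this abstract, the following is a minimal sketch of the classical definition applied to two sets of 3D points. The abstract only says the loop measure is "based on the concept of" this metric, so the code illustrates the underlying idea rather than the authors' measure; the coordinates are made up.

```python
import math

def directed_hausdorff(A, B):
    """Largest distance from any point in A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """Classical (symmetric) Hausdorff distance between point sets A and B."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

# Hypothetical loop coordinates: loop_b has one extra atom far from loop_a.
loop_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
loop_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 3.0, 0.0)]

distance = hausdorff(loop_a, loop_b)  # 3.0, set by the single outlying point
```

Because the result is determined by the worst-matched point, one outlying atom dominates the distance, which is consistent with the abstract's remark that the measure is sensitive to particularly evident dissimilarities.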
|
39
|
Abstract
Here, we report a novel protein sequence descriptor-based remote homology identification method, able to infer fold relationships without the explicit knowledge of structure. In a first phase, we have individually benchmarked 13 different descriptor types in fold identification experiments in a highly diverse set of protein sequences. The relevant descriptors were related to the fold class membership by using simple similarity measures in the descriptor spaces, such as the cosine angle. Our results revealed that the three best-performing sets of descriptors were the sequence-alignment-based descriptor using PSI-BLAST e-values, the descriptors based on the alignment of secondary structural elements (SSEA), and the descriptors based on the occurrence of PROSITE functional motifs. In a second phase, the three top-performing descriptors were combined to obtain a final method with improved performance, which we named DescFold. Class membership was predicted by Support Vector Machine (SVM) learning. In comparison with the individual PSI-BLAST-based descriptor, the rate of remote homology identification increased from 33.7% to 46.3%. We found that the composite set of descriptors was able to identify the true remote homolog for nearly every sixth sequence at the 95% confidence level, or some 10% more than a single PSI-BLAST search. We have benchmarked the DescFold method against several other state-of-the-art fold recognition algorithms for the 172 LiveBench-8 targets, and we concluded that it was able to add value to the existing techniques by providing a confident hit for at least 10% of the sequences not identifiable by the previously known methods.
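The "cosine angle" similarity used in the first phase can be illustrated with a short sketch. The descriptor vectors, fold names, and the helper `nearest_fold` are hypothetical; the abstract does not describe how the descriptors are encoded, and the final DescFold method uses SVM learning rather than this nearest-neighbour lookup.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two descriptor vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero descriptors
    return dot / (norm_u * norm_v)

def nearest_fold(query, library):
    """Return the library entry whose descriptor is most similar to the query."""
    return max(library, key=lambda name: cosine_similarity(query, library[name]))

# Hypothetical three-component descriptors for two fold classes.
library = {
    "fold_alpha": [1.0, 0.0, 0.0],
    "fold_beta": [0.0, 1.0, 1.0],
}
best = nearest_fold([0.1, 0.9, 0.8], library)
```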
Affiliation(s)
- Ziding Zhang
- Nestlé Research Center, BioAnalytical Science, CH-1000 Lausanne 26, Switzerland. Ziding.
|
40
|
Skolnick J, Kihara D, Zhang Y. Development and large scale benchmark testing of the PROSPECTOR_3 threading algorithm. Proteins 2004; 56:502-18. [PMID: 15229883 DOI: 10.1002/prot.20106] [Citation(s) in RCA: 118] [Impact Index Per Article: 5.9] [Indexed: 11/10/2022]
Abstract
This article describes the PROSPECTOR_3 threading algorithm, which combines various scoring functions designed to match structurally related target/template pairs. Each variant described was found to have a Z-score above which most identified templates have good structural (threading) alignments, Z(struct) (Z(good)). 'Easy' targets with accurate threading alignments are identified as single templates with Z > Z(good) or two templates, each with Z > Z(struct), having a good consensus structure in mutually aligned regions. 'Medium' targets have a pair of templates lacking a consensus structure, or a single template for which Z(struct) < Z < Z(good). PROSPECTOR_3 was applied to a comprehensive Protein Data Bank (PDB) benchmark composed of 1491 single domain proteins, 41-200 residues long and no more than 30% identical to any threading template. Of the proteins, 878 were found to be easy targets, with 761 having a root mean square deviation (RMSD) from native of less than 6.5 Å. The average contact prediction accuracy was 46%, and on average 17.6 residue continuous fragments were predicted with RMSD values of 2.0 Å. There were 606 medium targets identified, 87% (31%) of which had good structural (threading) alignments. On average, 9.1 residue, continuous fragments with RMSD of 2.5 Å were predicted. Combining easy and medium sets, 63% (91%) of the targets had good threading (structural) alignments compared to native; the average target/template sequence identity was 22%. Only nine targets lacked matched templates. Moreover, PROSPECTOR_3 consistently outperforms PSIBLAST. Similar results were predicted for open reading frames (ORFs) ≤200 residues in the M. genitalium, E. coli and S. cerevisiae genomes. Thus, progress has been made in identification of weakly homologous/analogous proteins, with very high alignment coverage, both in a comprehensive PDB benchmark as well as in genomes.
Affiliation(s)
- Jeffrey Skolnick
- Center of Excellence in Bioinformatics, University at Buffalo, 901 Washington St., Suite 300, Buffalo, NY 14203, USA.
|
41
|
Abstract
Cytokines are an important class of proteins responsible for intercellular communication. The helical cytokines have a four-helix bundle fold, and they have remained largely intractable for sequence alignment methods due to their high evolutionary divergence. This paper presents a method that is specifically designed to recognize the helical cytokine fold in preprotein sequences such as full-length cDNA translations or transcripts predicted by gene finding methods. A protein fold is modeled by multiple sequence profiles, each representing a structurally conserved region. Nonstructural profiles are used to represent additional signals found in preprotein sequences. Profiles are connected by loop regions, each of a specified minimum and maximum length. A model for the helical cytokines is created by progressively improving a placement of four amphipathic helices onto training sequences. The sensitivity and specificity of the method are evaluated by a cross-validation procedure, demonstrating that cytokines with no intrafamily sequence similarity can be recognized. The method has been successfully used for the discovery of several new helical cytokines in the human genome.
|
42
|
Zhang C, Liu S, Zhou H, Zhou Y. An accurate, residue-level, pair potential of mean force for folding and binding based on the distance-scaled, ideal-gas reference state. Protein Sci 2004; 13:400-11. [PMID: 14739325 PMCID: PMC2286718 DOI: 10.1110/ps.03348304] [Citation(s) in RCA: 116] [Impact Index Per Article: 5.8] [Indexed: 10/26/2022]
Abstract
Structure prediction on a genomic scale requires a simplified energy function that can efficiently sample the conformational space of polypeptide chains. A good energy function at minimum should discriminate native structures against decoys. Here, we show that a recently developed, residue-specific, all-atom knowledge-based potential (167 atomic types) based on distance-scaled, finite ideal-gas reference state (DFIRE-all-atom) can be substantially simplified to 20 residue types located at side-chain center of mass (DFIRE-SCM) without a significant change in its capability of structure discrimination. Using 96 standard multiple decoy sets, we show that there is only a small reduction (from 80% to 78%) in success rate of ranking native structures as the top 1. The success rate is higher than two previously developed, all-atom distance-dependent statistical pair potentials. Applied to structure selections of 21 docking decoys without modification, the DFIRE-SCM potential is 29% more successful in recognizing native complex structures than an all-atom statistical potential trained by a database of dimeric interfaces. The potential also achieves 92% accuracy in distinguishing true dimeric interfaces from artificial crystal interfaces. In addition, the DFIRE potential with the Cα positions as the interaction centers recognizes 123 native structures out of a comprehensive 125-protein TOUCHSTONE decoy set in which each protein has 24,000 decoys with only Cα positions. Furthermore, the performance by DFIRE-SCM on newly established 25 monomeric and 31 docking Rosetta-decoy sets is comparable to (or better than in the case of monomeric decoy sets) that of a recently developed, all-atom Rosetta energy function enhanced with an orientation-dependent hydrogen bonding potential.
Affiliation(s)
- Chi Zhang
- Howard Hughes Medical Institute Center for Single Molecule Biophysics, SUNY Buffalo, 124 Sherman Hall, Buffalo, NY 14214, USA
|
43
|
Torda AE, Procter JB, Huber T. Wurst: a protein threading server with a structural scoring function, sequence profiles and optimized substitution matrices. Nucleic Acids Res 2004; 32:W532-5. [PMID: 15215443 PMCID: PMC441495 DOI: 10.1093/nar/gkh357] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.9] [Indexed: 11/12/2022] Open
Abstract
Wurst is a protein threading program with an emphasis on high quality sequence to structure alignments (http://www.zbh.uni-hamburg.de/wurst). Submitted sequences are aligned to each of about 3000 templates with a conventional dynamic programming algorithm, but using a score function with sophisticated structure and sequence terms. The structure terms are a log-odds probability of sequence to structure fragment compatibility, obtained from a Bayesian classification procedure. A simplex optimization was used to optimize the sequence-based terms for the goal of alignment and model quality and to balance the sequence and structural contributions against each other. Both sequence and structural terms operate with sequence profiles.
Affiliation(s)
- Andrew E Torda
- University of Hamburg, Zentrum für Bioinformatik, Bundesstrasse 43, D-20146 Hamburg, Germany
|
44
|
Zhou H, Zhou Y. Single-body residue-level knowledge-based energy score combined with sequence-profile and secondary structure information for fold recognition. Proteins 2004; 55:1005-13. [PMID: 15146497 DOI: 10.1002/prot.20007] [Citation(s) in RCA: 163] [Impact Index Per Article: 8.2] [Indexed: 11/11/2022]
Abstract
An elaborate knowledge-based energy function is designed for fold recognition. It is a residue-level single-body potential so that a highly efficient dynamic programming method can be used for alignment optimization. It contains a backbone torsion term, a buried surface term, and a contact-energy term. The energy score combined with sequence profile and secondary structure information leads to an algorithm called SPARKS (Sequence, secondary structure Profiles and Residue-level Knowledge-based energy Score) for fold recognition. Compared with the popular PSI-BLAST, SPARKS is 21% more accurate in sequence-sequence alignment in the ProSup benchmark and 10%, 25%, and 20% more sensitive in detecting the family, superfamily, and fold similarities in the Lindahl benchmark, respectively. Moreover, it is one of the best methods for sensitivity (the number of correctly recognized proteins), alignment accuracy (based on the MaxSub score), and specificity (the average number of correctly recognized proteins whose scores are higher than the first false positives) in LiveBench 7 among more than twenty servers of non-consensus methods. The simple algorithm used in SPARKS has the potential for further improvement. This highly efficient method can be used for fold recognition on genomic scales. A web server is established for academic users on http://theory.med.buffalo.edu.
Affiliation(s)
- Hongyi Zhou
- Howard Hughes Medical Institute Center for Single Molecule Biophysics, Department of Physiology & Biophysics, State University of New York at Buffalo, New York 14214, USA
|
45
|
Tang CL, Xie L, Koh IYY, Posy S, Alexov E, Honig B. On the Role of Structural Information in Remote Homology Detection and Sequence Alignment: New Methods Using Hybrid Sequence Profiles. J Mol Biol 2003; 334:1043-62. [PMID: 14643665 DOI: 10.1016/j.jmb.2003.10.025] [Citation(s) in RCA: 71] [Impact Index Per Article: 3.4] [Indexed: 11/19/2022]
Abstract
Structural alignments often reveal relationships between proteins that cannot be detected using sequence alignment alone. However, profile search methods based on structural alignments alone have not been found to be effective in finding remote homologs. Here, we explore the role of structural information in remote homolog detection and sequence alignment. To this end, we develop a series of hybrid multidimensional alignment profiles that combine sequence, secondary and tertiary structure information into hybrid profiles. Sequence-based profiles are profiles whose position-specific scoring matrix is derived from sequence alignment alone; structure-based profiles are those derived from multiple structure alignments. We compare pure sequence-based profiles to pure structure-based profiles, as well as to hybrid profiles that use combined sequence-and-structure-based profiles, where sequence-based profiles are used in loop/motif regions and structural information is used in core structural regions. All of the hybrid methods offer significant improvement over simple profile-to-profile alignment. We demonstrate that both sequence-based and structure-based profiles contribute to remote homology detection and alignment accuracy, and that each contains some unique information. We discuss the implications of these results for further improvements in amino acid sequence and structural analysis.
Affiliation(s)
- Christopher L Tang
- Department of Biochemistry and Molecular Biophysics, Howard Hughes Medical Institute, Columbia University, New York, NY 10032, USA
|
46
|
Marti‐Renom MA, Madhusudhan M, Eswar N, Pieper U, Shen M, Sali A, Fiser A, Mirkovic N, John B, Stuart A. Modeling Protein Structure from its Sequence. Curr Protoc Bioinformatics 2003. [DOI: 10.1002/0471250953.bi0501s03] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Affiliation(s)
- Marc A. Marti‐Renom
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry and The California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, California
- M.S. Madhusudhan
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry and The California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, California
- Narayanan Eswar
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry and The California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, California
- Ursula Pieper
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry and The California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, California
- Min‐yi Shen
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry and The California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, California
- Andrej Sali
- Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry and The California Institute for Quantitative Biomedical Research, University of California at San Francisco, San Francisco, California
- Andras Fiser
- Department of Biochemistry and Seaver Foundation Center for Bioinformatics, Albert Einstein College of Medicine, Bronx, New York
- Nebojsa Mirkovic
- Laboratory of Molecular Biophysics, The Rockefeller University, New York, New York
- Bino John
- Laboratory of Molecular Biophysics, The Rockefeller University, New York, New York
- Ashley Stuart
- Laboratory of Molecular Biophysics, The Rockefeller University, New York, New York
|
47
|
Pei J, Dokholyan NV, Shakhnovich EI, Grishin NV. Using protein design for homology detection and active site searches. Proc Natl Acad Sci U S A 2003; 100:11361-6. [PMID: 12975528 PMCID: PMC208762 DOI: 10.1073/pnas.2034878100] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.2] [Received: 03/24/2003] [Indexed: 11/18/2022] Open
Abstract
We describe a method of designing artificial sequences that resemble naturally occurring sequences in terms of their compatibility with a template structure and its functional constraints. The design procedure is a Monte Carlo simulation of amino acid substitution process. The selective fixation of substitutions is dictated by a simple scoring function derived from the template structure and a multiple alignment of its homologs. Designed sequences represent an enlargement of sequence space around native sequences. We show that the use of designed sequences improves the performance of profile-based homology detection. The difference in position-specific conservation between designed sequences and native sequences is helpful for prediction of functionally important residues. Our sequence selection criteria in evolutionary simulations introduce amino acid substitution rate variation among sites in a natural way, providing a better model to test phylogenetic methods.
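A Monte Carlo simulation of amino acid substitution, as described in this abstract, can be sketched roughly as follows. The scoring function here (a sum of log frequencies from a per-position profile), the Metropolis acceptance rule, the temperature parameter, and the toy profile are all assumptions chosen for illustration; the abstract only says that the score is derived from the template structure and a multiple alignment of its homologs.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def profile_score(seq, profile):
    """Hypothetical score: sum of log frequencies of each residue at its position.
    Residues absent from a column get a small floor frequency of 1e-3."""
    return sum(math.log(profile[i].get(aa, 1e-3)) for i, aa in enumerate(seq))

def design(seq, profile, steps=1000, temperature=1.0, seed=0):
    """Propose single-site substitutions and accept them with a Metropolis rule."""
    rng = random.Random(seed)
    current = list(seq)
    score = profile_score(current, profile)
    for _ in range(steps):
        pos = rng.randrange(len(current))
        old = current[pos]
        current[pos] = rng.choice(AMINO_ACIDS)
        new_score = profile_score(current, profile)
        delta = new_score - score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            score = new_score
        else:
            current[pos] = old  # reject: restore the previous residue
    return "".join(current), score

# Toy three-column profile (assumption): per-position residue frequencies.
profile = [{"A": 0.7, "G": 0.3}, {"L": 0.9, "I": 0.1}, {"K": 0.5, "R": 0.5}]
designed, score = design("GIK", profile, steps=200, temperature=0.5, seed=1)
```

Running the sampler many times with different seeds would give a set of artificial sequences around the native one, which is the sense in which the abstract speaks of enlarging sequence space for profile construction.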
Affiliation(s)
- Jimin Pei
- Department of Biochemistry and Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
|
48
|
Abstract
A new potential energy function representing the conformational preferences of sequentially local regions of a protein backbone is presented. This potential is derived from secondary structure probabilities such as those produced by neural network-based prediction methods. The potential is applied to the problem of remote homolog identification, in combination with a distance-dependent inter-residue potential and position-based scoring matrices. This fold recognition jury is implemented in a Java application called JThread. These methods are benchmarked on several test sets, including one released entirely after development and parameterization of JThread. In benchmark tests to identify known folds structurally similar to (but not identical with) the native structure of a sequence, JThread performs significantly better than PSI-BLAST, with 10% more structures identified correctly as the most likely structural match in a fold library, and 20% more structures correctly narrowed down to a set of five possible candidates. JThread also improves the average sequence alignment accuracy significantly, from 53% to 62% of residues aligned correctly. Reliable fold assignments and alignments are identified, making the method useful for genome annotation. JThread is applied to predicted open reading frames (ORFs) from the genomes of Mycoplasma genitalium and Drosophila melanogaster, identifying 20 new structural annotations in the former and 801 in the latter.
Affiliation(s)
- John Marc Chandonia
- Department of Cellular and Molecular Pharmacology, University of California, San Francisco, CA 94143-2240, USA
|
49
|
Zhang Z, Lindstam M, Unge J, Peterson C, Lu G. Potential for dramatic improvement in sequence alignment against structures of remote homologous proteins by extracting structural information from multiple structure alignment. J Mol Biol 2003; 332:127-42. [PMID: 12946352 DOI: 10.1016/s0022-2836(03)00858-1] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.8] [Indexed: 11/18/2022]
Abstract
A novel method has been developed for acquiring the correct alignment of a query sequence against remotely homologous proteins by extracting structural information from profiles of multiple structure alignment. A systematic search algorithm combined with a group of score functions based on sequence information and structural information has been introduced in this procedure. A limited number of top solutions (15,000) with high scores were selected as candidates for further examination. On a test-set comprising 301 proteins from 75 protein families with sequence identity less than 30%, the proportion of proteins with completely correct alignment as first candidate was improved to 39.8% by our method, whereas the typical performance of existing sequence-based alignment methods was only between 16.1% and 22.7%. Furthermore, multiple candidates for possible alignment were provided in our approach, which dramatically increased the possibility of finding correct alignment, such that completely correct alignments were found amongst the top-ranked 1000 candidates in 88.3% of the proteins. With the assistance of a sequence database, completely correct alignment solutions were achieved amongst the top 1000 candidates in 94.3% of the proteins. From such a limited number of candidates, it would become possible to identify more correct alignment using a more time-consuming but more powerful method with more detailed structural information, such as side-chain packing and energy minimization, etc. The results indicate that the novel alignment strategy could be helpful for extending the application of highly reliable methods for fold identification and homology modeling to a huge number of homologous proteins of low sequence similarity. Details of the methods, together with the results and implications for future development are presented.
Affiliation(s)
- Ziding Zhang
- Department of Molecular Biophysics, Center for Chemistry and Chemical Engineering, Lund University, P.O. Box 124, SE-221 00 Lund, Sweden
|
50
|
Abstract
The success of structural genomics initiatives requires the development and application of tools for structure analysis, prediction, and annotation. In this paper we review recent developments in these areas; specifically structure alignment, the detection of remote homologs and analogs, homology modeling and the use of structures to predict function. We also discuss various rationales for structural genomics initiatives. These include the structure-based clustering of sequence space and genome-wide function assignment. It is also argued that structural genomics can be integrated into more traditional biological research if specific biological questions are included in target selection strategies.
Affiliation(s)
- Sharon Goldsmith-Fischman
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA
|