1
|
Bitton M, Keasar C. Estimation of model accuracy by a unique set of features and tree-based regressor. Sci Rep 2022; 12:14074. [PMID: 35982086 PMCID: PMC9388490 DOI: 10.1038/s41598-022-17097-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2022] [Accepted: 07/20/2022] [Indexed: 11/26/2022] Open
Abstract
Computationally generated models of protein structures bridge the gap between the practically negligible price tag of sequencing and the high cost of experimental structure determination. By providing a low-cost (and often free) partial alternative to experimentally determined structures, these models help biologists design and interpret their experiments. Obviously, the more accurate the models the more useful they are. However, methods for protein structure prediction generate many structural models of various qualities, necessitating means for the estimation of their accuracy. In this work we present MESHI_consensus, a new method for the estimation of model accuracy. The method uses a tree-based regressor and a set of structural, target-based, and consensus-based features. The new method achieved high performance in the EMA (Estimation of Model Accuracy) track of the recent CASP14 community-wide experiment (https://predictioncenter.org/casp14/index.cgi). The tertiary structure prediction track of that experiment revealed an unprecedented leap in prediction performance by a single prediction group/method, namely AlphaFold2. This achievement would inevitably have a profound impact on the field of protein structure prediction, including the accuracy estimation sub-task. We conclude this manuscript with some speculations regarding the future role of accuracy estimation in a new era of accurate protein structure prediction.
Collapse
Affiliation(s)
- Mor Bitton
- Department of Computer Science, Ben Gurion University, Be'er Sheva, Israel.
| | - Chen Keasar
- Department of Computer Science, Ben Gurion University, Be'er Sheva, Israel.
| |
Collapse
|
2
|
Keller GLJ, Weiss LI, Baker BM. Physicochemical Heuristics for Identifying High Fidelity, Near-Native Structural Models of Peptide/MHC Complexes. Front Immunol 2022; 13:887759. [PMID: 35547730 PMCID: PMC9084917 DOI: 10.3389/fimmu.2022.887759] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2022] [Accepted: 03/29/2022] [Indexed: 11/13/2022] Open
Abstract
There is long-standing interest in accurately modeling the structural features of peptides bound and presented by class I MHC proteins. This interest has grown with the advent of rapid genome sequencing and the prospect of personalized, peptide-based cancer vaccines, as well as the development of molecular and cellular therapeutics based on T cell receptor recognition of peptide-MHC. However, while the speed and accessibility of peptide-MHC modeling has improved substantially over the years, improvements in accuracy have been modest. Accuracy is crucial in peptide-MHC modeling, as T cell receptors are highly sensitive to peptide conformation and capturing fine details is therefore necessary for useful models. Studying nonameric peptides presented by the common class I MHC protein HLA-A*02:01, here we addressed a key question common to modern modeling efforts: from a set of models (or decoys) generated through conformational sampling, which is best? We found that the common strategy of decoy selection by lowest energy can lead to substantial errors in predicted structures. We therefore adopted a data-driven approach and trained functions capable of predicting near native decoys with exceptionally high accuracy. Although our implementation is limited to nonamer/HLA-A*02:01 complexes, our results serve as an important proof of concept from which improvements can be made and, given the significance of HLA-A*02:01 and its preference for nonameric peptides, should have immediate utility in select immunotherapeutic and other efforts for which structural information would be advantageous.
Collapse
Affiliation(s)
- Grant L J Keller
- Department of Chemistry & Biochemistry and the Harper Cancer Research Institute, University of Notre Dame, Notre Dame, IN, United States
| | - Laura I Weiss
- Department of Chemistry & Biochemistry and the Harper Cancer Research Institute, University of Notre Dame, Notre Dame, IN, United States
| | - Brian M Baker
- Department of Chemistry & Biochemistry and the Harper Cancer Research Institute, University of Notre Dame, Notre Dame, IN, United States
| |
Collapse
|
3
|
Kumar SP, Dixit NY, Patel CN, Rawal RM, Pandya HA. PharmRF: A machine-learning scoring function to identify the best protein-ligand complexes for structure-based pharmacophore screening with high enrichments. J Comput Chem 2022; 43:847-863. [PMID: 35301752 DOI: 10.1002/jcc.26840] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2021] [Revised: 02/14/2022] [Accepted: 02/26/2022] [Indexed: 11/09/2022]
Abstract
Structure-based pharmacophore models are often developed by selecting a single protein-ligand complex with good resolution and better binding affinity data which prevents the analysis of other structures having a similar potential to act as better templates. PharmRF is a pharmacophore-based scoring function for selecting the best crystal structures with the potential to attain high enrichment rates in pharmacophore-based virtual screening prospectively. The PharmRF scoring function is trained and tested on the PDBbind v2018 protein-ligand complex dataset and employs a random forest regressor to correlate protein pocket descriptors and ligand pharmacophoric elements with binding affinity. PharmRF score represents the calculated binding affinity which identifies high-affinity ligands by thorough pruning of all the PDB entries available for a particular protein of interest with a high PharmRF score. Ligands with high PharmRF scores can provide a better basis for structure-based pharmacophore enumerations with a better enrichment rate. Evaluated on 10 protein-ligand systems of the DUD-E dataset, PharmRF achieved superior performance (average success rate: 77.61%, median success rate: 87.16%) than Vina docking score (75.47%, 79.39%). PharmRF was further evaluated using the CASF-2016 benchmark set yielding a moderate correlation of 0.591 with experimental binding affinity, similar in performance to 25 scoring functions tested on this dataset. Independent assessment of PharmRF on 8 protein-ligand systems of LIT-PCBA dataset exhibited average and median success rates of 57.55% and 74.72% with 4 targets attaining success rate > 90%. The PharmRF scoring model, scripts, and related resources can be accessed at https://github.com/Prasanth-Kumar87/PharmRF.
Collapse
Affiliation(s)
- Sivakumar Prasanth Kumar
- Institute of Defence Studies and Research, Gujarat University, Ahmedabad, India.,Department of Life Sciences, University School of Sciences, Gujarat University, Ahmedabad, India.,Department of Botany, Bioinformatics, and Climate Change Impacts Management, University School of Sciences, Gujarat University, Ahmedabad, India
| | - Nandan Y Dixit
- Department of Botany, Bioinformatics, and Climate Change Impacts Management, University School of Sciences, Gujarat University, Ahmedabad, India
| | - Chirag N Patel
- Department of Botany, Bioinformatics, and Climate Change Impacts Management, University School of Sciences, Gujarat University, Ahmedabad, India
| | - Rakesh M Rawal
- Institute of Defence Studies and Research, Gujarat University, Ahmedabad, India.,Department of Life Sciences, University School of Sciences, Gujarat University, Ahmedabad, India
| | - Himanshu A Pandya
- Institute of Defence Studies and Research, Gujarat University, Ahmedabad, India.,Department of Life Sciences, University School of Sciences, Gujarat University, Ahmedabad, India.,Department of Botany, Bioinformatics, and Climate Change Impacts Management, University School of Sciences, Gujarat University, Ahmedabad, India
| |
Collapse
|
4
|
An automated and combinative method for the predictive ranking of candidate effector proteins of fungal plant pathogens. Sci Rep 2021; 11:19731. [PMID: 34611252 PMCID: PMC8492765 DOI: 10.1038/s41598-021-99363-0] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2021] [Accepted: 09/16/2021] [Indexed: 01/29/2023] Open
Abstract
Fungal plant-pathogens promote infection of their hosts through the release of 'effectors'-a broad class of cytotoxic or virulence-promoting molecules. Effectors may be recognised by resistance or sensitivity receptors in the host, which can determine disease outcomes. Accurate prediction of effectors remains a major challenge in plant pathology, but if achieved will facilitate rapid improvements to host disease resistance. This study presents a novel tool and pipeline for the ranking of predicted effector candidates-Predector-which interfaces with multiple software tools and methods, aggregates disparate features that are relevant to fungal effector proteins, and applies a pairwise learning to rank approach. Predector outperformed a typical combination of secretion and effector prediction methods in terms of ranking performance when applied to a curated set of confirmed effectors derived from multiple species. We present Predector ( https://github.com/ccdmb/predector ) as a useful tool for the ranking of predicted effector candidates, which also aggregates and reports additional supporting information relevant to effector and secretome prediction in a simple, efficient, and reproducible manner.
Collapse
|
5
|
Alam FF, Shehu A. Unsupervised multi-instance learning for protein structure determination. J Bioinform Comput Biol 2021; 19:2140002. [PMID: 33568002 DOI: 10.1142/s0219720021400023] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Many regions of the protein universe remain inaccessible by wet-laboratory or computational structure determination methods. A significant challenge in elucidating these dark regions in silico relates to the ability to discriminate relevant structure(s) among many structures/decoys computed for a protein of interest, a problem known as decoy selection. Clustering decoys based on geometric similarity remains popular. However, it is unclear how exactly to exploit the groups of decoys revealed via clustering to select individual structures for prediction. In this paper, we provide an intuitive formulation of the decoy selection problem as an instance of unsupervised multi-instance learning. We address the problem in three stages, first organizing given decoys of a protein molecule into bags, then identifying relevant bags, and finally drawing individual instances from these bags to offer as prediction. We propose both non-parametric and parametric algorithms for drawing individual instances. Our evaluation utilizes two datasets, one benchmark dataset of ensembles of decoys for a varied list of protein molecules, and a dataset of decoy ensembles for targets drawn from recent CASP competitions. A comparative analysis with state-of-the-art methods reveals that the proposed approach outperforms existing methods, thus warranting further investigation of multi-instance learning to advance our treatment of decoy selection.
Collapse
Affiliation(s)
- Fardina Fathmiul Alam
- Department of Computer Science, George Mason University, Fairfax, Virginia 22030, USA
| | - Amarda Shehu
- Department of Computer Science, George Mason University, Fairfax, Virginia 22030, USA
| |
Collapse
|
6
|
Cao Y, Shen Y. Energy-based graph convolutional networks for scoring protein docking models. Proteins 2020; 88:1091-1099. [PMID: 32144844 DOI: 10.1002/prot.25888] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/29/2019] [Revised: 01/15/2020] [Accepted: 02/26/2020] [Indexed: 12/18/2022]
Abstract
Structural information about protein-protein interactions, often missing at the interactome scale, is important for mechanistic understanding of cells and rational discovery of therapeutics. Protein docking provides a computational alternative for such information. However, ranking near-native docked models high among a large number of candidates, often known as the scoring problem, remains a critical challenge. Moreover, estimating model quality, also known as the quality assessment problem, is rarely addressed in protein docking. In this study, the two challenging problems in protein docking are regarded as relative and absolute scoring, respectively, and addressed in one physics-inspired deep learning framework. We represent protein and complex structures as intra- and inter-molecular residue contact graphs with atom-resolution node and edge features. And we propose a novel graph convolutional kernel that aggregates interacting nodes' features through edges so that generalized interaction energies can be learned directly from 3D data. The resulting energy-based graph convolutional networks (EGCN) with multihead attention are trained to predict intra- and inter-molecular energies, binding affinities, and quality measures (interface RMSD) for encounter complexes. Compared to a state-of-the-art scoring function for model ranking, EGCN significantly improves ranking for a critical assessment of predicted interactions (CAPRI) test set involving homology docking; and is comparable or slightly better for Score_set, a CAPRI benchmark set generated by diverse community-wide docking protocols not known to training data. For Score_set quality assessment, EGCN shows about 27% improvement to our previous efforts. Directly learning from 3D structure data in graph representation, EGCN represents the first successful development of graph convolutional networks for protein docking.
Collapse
Affiliation(s)
- Yue Cao
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas
| | - Yang Shen
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas.,TEES-AgriLife Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, College Station, Texas
| |
Collapse
|
7
|
Akhter N, Chennupati G, Kabir KL, Djidjev H, Shehu A. Unsupervised and Supervised Learning over theEnergy Landscape for Protein Decoy Selection. Biomolecules 2019; 9:E607. [PMID: 31615116 PMCID: PMC6843838 DOI: 10.3390/biom9100607] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2019] [Revised: 10/03/2019] [Accepted: 10/04/2019] [Indexed: 11/17/2022] Open
Abstract
The energy landscape that organizes microstates of a molecular system and governs theunderlying molecular dynamics exposes the relationship between molecular form/structure, changesto form, and biological activity or function in the cell. However, several challenges stand in the wayof leveraging energy landscapes for relating structure and structural dynamics to function. Energylandscapes are high-dimensional, multi-modal, and often overly-rugged. Deep wells or basins inthem do not always correspond to stable structural states but are instead the result of inherentinaccuracies in semi-empirical molecular energy functions. Due to these challenges, energeticsis typically ignored in computational approaches addressing long-standing central questions incomputational biology, such as protein decoy selection. In the latter, the goal is to determine over apossibly large number of computationally-generated three-dimensional structures of a protein thosestructures that are biologically-active/native. In recent work, we have recast our attention on theprotein energy landscape and its role in helping us to advance decoy selection. Here, we summarizesome of our successes so far in this direction via unsupervised learning. More importantly, we furtheradvance the argument that the energy landscape holds valuable information to aid and advance thestate of protein decoy selection via novel machine learning methodologies that leverage supervisedlearning. Our focus in this article is on decoy selection for the purpose of a rigorous, quantitativeevaluation of how leveraging protein energy landscapes advances an important problem in proteinmodeling. However, the ideas and concepts presented here are generally useful to make discoveriesin studies aiming to relate molecular structure and structural dynamics to function.
Collapse
Affiliation(s)
- Nasrin Akhter
- Department of Computer Science, George Mason University, Fairfax, VA 22030, USA.
| | - Gopinath Chennupati
- Information Sciences (CCS-3) Group, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
| | - Kazi Lutful Kabir
- Department of Computer Science, George Mason University, Fairfax, VA 22030, USA.
| | - Hristo Djidjev
- Information Sciences (CCS-3) Group, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
| | - Amarda Shehu
- Department of Computer Science, George Mason University, Fairfax, VA 22030, USA.
- Center for Adaptive Human-Machine Partnership, George Mason University, Fairfax, VA 22030, USA.
- Department of Bioengineering, George Mason University, Fairfax, VA 22030, USA.
- School of Systems Biology, George Mason University, Fairfax, VA 22030, USA.
| |
Collapse
|
8
|
Mirzaei S, Sidi T, Keasar C, Crivelli S. Purely Structural Protein Scoring Functions Using Support Vector Machine and Ensemble Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1515-1523. [PMID: 28113636 DOI: 10.1109/tcbb.2016.2602269] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
The function of a protein is determined by its structure, which creates a need for efficient methods of protein structure determination to advance scientific and medical research. Because current experimental structure determination methods carry a high price tag, computational predictions are highly desirable. Given a protein sequence, computational methods produce numerous 3D structures known as decoys. Selection of the best quality decoys is both challenging and essential as the end users can handle only a few ones. Therefore, scoring functions are central to decoy selection. They combine measurable features into a single number indicator of decoy quality. Unfortunately, current scoring functions do not consistently select the best decoys. Machine learning techniques offer great potential to improve decoy scoring. This paper presents two machine-learning based scoring functions to predict the quality of proteins structures, i.e., the similarity between the predicted structure and the experimental one without knowing the latter. We use different metrics to compare these scoring functions against three state-of-the-art scores. This is a first attempt at comparing different scoring functions using the same non-redundant dataset for training and testing and the same features. The results show that adding informative features may be more significant than the method used.
Collapse
|
9
|
Khare S, Bhasin M, Sahoo A, Varadarajan R. Protein model discrimination attempts using mutational sensitivity, predicted secondary structure, and model quality information. Proteins 2019; 87:326-336. [PMID: 30615225 DOI: 10.1002/prot.25654] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/13/2018] [Revised: 12/22/2018] [Accepted: 01/02/2019] [Indexed: 01/02/2023]
Abstract
Structure prediction methods often generate a large number of models for a target sequence. Even if the correct fold for the target sequence is sampled in this dataset, it is difficult to distinguish it from other decoy structures. An attempt to solve this problem using experimental mutational sensitivity data for the CcdB protein was described previously by exploiting the correlation of residue depth with mutational sensitivity (r ~ 0.6). We now show that such a correlation extends to four other proteins with localized active sites, and for which saturation mutagenesis datasets exist. We also examine whether incorporation of predicted secondary structure information and the DOPE model quality assessment score, in addition to mutational sensitivity, improves the accuracy of model discrimination using a decoy dataset of 163 targets from CASP. Although most CASP models would have been subjected to model quality assessment prior to submission, we find that the DOPE score makes a substantial contribution to the observed improvement. We therefore also applied the approach to CcdB and four other proteins for which reliable experimental mutational data exist and observe that inclusion of experimental mutational data results in a small qualitative improvement in model discrimination relative to that seen with just the DOPE score. This is largely because of our limited ability to quantitatively predict effects of point mutations on in vivo protein activity. Further improvements in the methodology are required to facilitate improved utilization of single mutant data.
Collapse
Affiliation(s)
- Shruti Khare
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India
| | - Munmun Bhasin
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India
| | - Anusmita Sahoo
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India
| | - Raghavan Varadarajan
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, India.,Chemical Biology Unit, Jawaharlal Nehru Center for Advanced Scientific Research, Bangalore, India
| |
Collapse
|
10
|
An Energy Landscape Treatment of Decoy Selection in Template-Free Protein Structure Prediction. COMPUTATION 2018. [DOI: 10.3390/computation6020039] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
|
11
|
From Extraction of Local Structures of Protein Energy Landscapes to Improved Decoy Selection in Template-Free Protein Structure Prediction. Molecules 2018; 23:molecules23010216. [PMID: 29351266 PMCID: PMC6017496 DOI: 10.3390/molecules23010216] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2017] [Accepted: 12/11/2017] [Indexed: 11/17/2022] Open
Abstract
Due to the essential role that the three-dimensional conformation of a protein plays in regulating interactions with molecular partners, wet and dry laboratories seek biologically-active conformations of a protein to decode its function. Computational approaches are gaining prominence due to the labor and cost demands of wet laboratory investigations. Template-free methods can now compute thousands of conformations known as decoys, but selecting native conformations from the generated decoys remains challenging. Repeatedly, research has shown that the protein energy functions whose minima are sought in the generation of decoys are unreliable indicators of nativeness. The prevalent approach ignores energy altogether and clusters decoys by conformational similarity. Complementary recent efforts design protein-specific scoring functions or train machine learning models on labeled decoys. In this paper, we show that an informative consideration of energy can be carried out under the energy landscape view. Specifically, we leverage local structures known as basins in the energy landscape probed by a template-free method. We propose and compare various strategies of basin-based decoy selection that we demonstrate are superior to clustering-based strategies. The presented results point to further directions of research for improving decoy selection, including the ability to properly consider the multiplicity of native conformations of proteins.
Collapse
|
12
|
Najibi SM, Maadooliat M, Zhou L, Huang JZ, Gao X. Protein Structure Classification and Loop Modeling Using Multiple Ramachandran Distributions. Comput Struct Biotechnol J 2017; 15:243-254. [PMID: 28280526 PMCID: PMC5331158 DOI: 10.1016/j.csbj.2017.01.011] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2016] [Revised: 01/26/2017] [Accepted: 01/28/2017] [Indexed: 11/19/2022] Open
Abstract
Recently, the study of protein structures using angular representations has attracted much attention among structural biologists. The main challenge is how to efficiently model the continuous conformational space of the protein structures based on the differences and similarities between different Ramachandran plots. Despite the presence of statistical methods for modeling angular data of proteins, there is still a substantial need for more sophisticated and faster statistical tools to model the large-scale circular datasets. To address this need, we have developed a nonparametric method for collective estimation of multiple bivariate density functions for a collection of populations of protein backbone angles. The proposed method takes into account the circular nature of the angular data using trigonometric spline which is more efficient compared to existing methods. This collective density estimation approach is widely applicable when there is a need to estimate multiple density functions from different populations with common features. Moreover, the coefficients of adaptive basis expansion for the fitted densities provide a low-dimensional representation that is useful for visualization, clustering, and classification of the densities. The proposed method provides a novel and unique perspective to two important and challenging problems in protein structure research: structure-based protein classification and angular-sampling-based protein loop structure prediction.
Collapse
Affiliation(s)
| | - Mehdi Maadooliat
- Department of Mathematics, Statistics and Computer Science, Marquette University, WI 53201-1881, USA
- Center for Human Genetics, Marshfield Clinic Research Institute, Marshfield, WI 54449, USA
| | - Lan Zhou
- Department of Statistics, Texas A&M University, TX 77843-3143, USA
| | - Jianhua Z. Huang
- Department of Statistics, Texas A&M University, TX 77843-3143, USA
| | - Xin Gao
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Corresponding author.
| |
Collapse
|
13
|
Srivastav VK, Singh V, Tiwari M. Recent Advancements in Docking Methodologies. Oncology 2017. [DOI: 10.4018/978-1-5225-0549-5.ch033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Nowadays molecular docking has become an important methodology in CADD (Computer-Aided Drug Design)-assisted drug discovery process. It is an important computational tool widely used to predict binding mode, binding affinity and binding free energy of a protein-ligand complex. The important factors responsible for accurate results in docking studies are correct binding site prediction, use of suitable small-molecule databases, consistent docking pose, high dock score with good MD (Molecular Dynamics), clarity whether the compound is an inhibitor or agonist, etc. However, still there are several limitations which make it difficult to obtain accurate results from docking studies. In this chapter, the main focus is on recent advancements in various aspects of molecular docking such as ligand sampling, protein flexibility, scoring functions, fragment docking, post-processing, docking into homology models and protein-protein docking.
Collapse
Affiliation(s)
| | - Vineet Singh
- Shri Govindram Seksaria Institute of Technology and Science, India
| | - Meena Tiwari
- Shri Govindram Seksaria Institute of Technology and Science, India
| |
Collapse
|
14
|
Maadooliat M, Zhou L, Najibi SM, Gao X, Huang JZ. Collective Estimation of Multiple Bivariate Density Functions With Application to Angular-Sampling-Based Protein Loop Modeling. J Am Stat Assoc 2016. [DOI: 10.1080/01621459.2015.1099535] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/28/2023]
|
15
|
ProTSAV: A protein tertiary structure analysis and validation server. BIOCHIMICA ET BIOPHYSICA ACTA-PROTEINS AND PROTEOMICS 2015; 1864:11-9. [PMID: 26478257 DOI: 10.1016/j.bbapap.2015.10.004] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/15/2015] [Revised: 09/26/2015] [Accepted: 10/14/2015] [Indexed: 01/06/2023]
Abstract
Quality assessment of predicted model structures of proteins is as important as the protein tertiary structure prediction. A highly efficient quality assessment of predicted model structures directs further research on function. Here we present a new server ProTSAV, capable of evaluating predicted model structures based on some popular online servers and standalone tools. ProTSAV furnishes the user with a single quality score in case of individual protein structure along with a graphical representation and ranking in case of multiple protein structure assessment. The server is validated on ~64,446 protein structures including experimental structures from RCSB and predicted model structures for CASP targets and from public decoy sets. ProTSAV succeeds in predicting quality of protein structures with a specificity of 100% and a sensitivity of 98% on experimentally solved structures and achieves a specificity of 88%and a sensitivity of 91% on predicted protein structures of CASP11 targets under 2Å.The server overcomes the limitations of any single server/method and is seen to be robust in helping in quality assessment. ProTSAV is freely available at http://www.scfbio-iitd.res.in/software/proteomics/protsav.jsp.
Collapse
|
16
|
Manavalan B, Lee J, Lee J. Random forest-based protein model quality assessment (RFMQA) using structural features and potential energy terms. PLoS One 2014; 9:e106542. [PMID: 25222008 PMCID: PMC4164442 DOI: 10.1371/journal.pone.0106542] [Citation(s) in RCA: 70] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2014] [Accepted: 08/06/2014] [Indexed: 01/28/2023] Open
Abstract
Recently, predicting proteins three-dimensional (3D) structure from its sequence information has made a significant progress due to the advances in computational techniques and the growth of experimental structures. However, selecting good models from a structural model pool is an important and challenging task in protein structure prediction. In this study, we present the first application of random forest based model quality assessment (RFMQA) to rank protein models using its structural features and knowledge-based potential energy terms. The method predicts a relative score of a model by using its secondary structure, solvent accessibility and knowledge-based potential energy terms. We trained and tested the RFMQA method on CASP8 and CASP9 targets using 5-fold cross-validation. The correlation coefficient between the TM-score of the model selected by RFMQA (TMRF) and the best server model (TMbest) is 0.945. We benchmarked our method on recent CASP10 targets by using CASP8 and 9 server models as a training set. The correlation coefficient and average difference between TMRF and TMbest over 95 CASP10 targets are 0.984 and 0.0385, respectively. The test results show that our method works better in selecting top models when compared with other top performing methods. RFMQA is available for download from http://lee.kias.re.kr/RFMQA/RFMQA_eval.tar.gz.
Collapse
Affiliation(s)
- Balachandran Manavalan
- Center for In Silico Protein Science, School of Computational Sciences, Korea Institute for Advanced Study, Seoul, Korea
| | - Juyong Lee
- Center for In Silico Protein Science, School of Computational Sciences, Korea Institute for Advanced Study, Seoul, Korea
| | - Jooyoung Lee
- Center for In Silico Protein Science, School of Computational Sciences, Korea Institute for Advanced Study, Seoul, Korea
- * E-mail:
| |
Collapse
|
17
|
Nguyen SP, Shang Y, Xu D. DL-PRO: A Novel Deep Learning Method for Protein Model Quality Assessment. PROCEEDINGS OF ... INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS. INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS 2014; 2014:2071-2078. [PMID: 25392745 PMCID: PMC4226404 DOI: 10.1109/ijcnn.2014.6889891] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/01/2023]
Abstract
Computational protein structure prediction is very important for many applications in bioinformatics. In the process of predicting protein structures, it is essential to accurately assess the quality of generated models. Although many single-model quality assessment (QA) methods have been developed, their accuracy is not high enough for most real applications. In this paper, a new approach based on C-α atoms distance matrix and machine learning methods is proposed for single-model QA and the identification of native-like models. Different from existing energy/scoring functions and consensus approaches, this new approach is purely geometry based. Furthermore, a novel algorithm based on deep learning techniques, called DL-Pro, is proposed. For a protein model, DL-Pro uses its distance matrix that contains pairwise distances between two residues' C-α atoms in the model, which sometimes is also called contact map, as an orientation-independent representation. From training examples of distance matrices corresponding to good and bad models, DL-Pro learns a stacked autoencoder network as a classifier. In experiments on selected targets from the Critical Assessment of Structure Prediction (CASP) competition, DL-Pro obtained promising results, outperforming state-of-the-art energy/scoring functions, including OPUS-CA, DOPE, DFIRE, and RW.
Collapse
Affiliation(s)
- Son P. Nguyen
- Department of Computer Science, University of Missouri, Columbia, MO 65211 USA
| | - Yi Shang
- Department of Computer Science, University of Missouri, Columbia, MO 65211 USA
| | - Dong Xu
- Department of Computer Science, University of Missouri, Columbia, MO 65211 USA. Christopher S. Bond Life Science Center, University of Missouri at Columbia
| |
Collapse
|
18
|
Improvement in low-homology template-based modeling by employing a model evaluation method with focus on topology. PLoS One 2014; 9:e89935. [PMID: 24587135 PMCID: PMC3935967 DOI: 10.1371/journal.pone.0089935] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2013] [Accepted: 01/24/2014] [Indexed: 01/22/2023] Open
Abstract
Many template-based modeling (TBM) methods have been developed over the recent years that allow for protein structure prediction and for the study of structure-function relationships for proteins. One major problem all TBM algorithms face, however, is their unsatisfactory performance when proteins under consideration are low-homology. To improve the performance of TBM methods for such targets, a novel model evaluation method was developed here, and named MEFTop. Our novel method focuses on evaluating the topology by using two novel groups of features. These novel features included secondary structure element (SSE) contact information and 3-dimensional topology information. By combining MEFTop algorithm with FR-t5, a threading program developed by our group, we found that this modified TBM program, which was named FR-t5-M, exhibited significant improvements in predictive abilities for low-homology protein targets. We further showed that the MEFTop could be a generalized method to improve threading programs for low-homology protein targets. The softwares (FR-t5-M and MEFTop) are available to non-commercial users at our website: http://jianglab.ibp.ac.cn/lims/FRt5M/FRt5M.html.
Collapse
|
19
|
Khoury GA, Tamamis P, Pinnaduwage N, Smadbeck J, Kieslich CA, Floudas CA. Princeton_TIGRESS: protein geometry refinement using simulations and support vector machines. Proteins 2013; 82:794-814. [PMID: 24174311 DOI: 10.1002/prot.24459] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2013] [Revised: 10/18/2013] [Accepted: 10/22/2013] [Indexed: 12/30/2022]
Abstract
Protein structure refinement aims to perform a set of operations given a predicted structure to improve model quality and accuracy with respect to the native in a blind fashion. Despite the numerous computational approaches to the protein refinement problem reported in the previous three CASPs, an overwhelming majority of methods degrade models rather than improve them. We initially developed a method tested using blind predictions during CASP10 which was officially ranked in 5th place among all methods in the refinement category. Here, we present Princeton_TIGRESS, which when benchmarked on all CASP 7,8,9, and 10 refinement targets, simultaneously increased GDT_TS 76% of the time with an average improvement of 0.83 GDT_TS points per structure. The method was additionally benchmarked on models produced by top performing three-dimensional structure prediction servers during CASP10. The robustness of the Princeton_TIGRESS protocol was also tested for different random seeds. We make the Princeton_TIGRESS refinement protocol freely available as a web server at http://atlas.princeton.edu/refinement. Using this protocol, one can consistently refine a prediction to help bridge the gap between a predicted structure and the actual native structure.
Collapse
Affiliation(s)
- George A Khoury
- Department of Chemical and Biological Engineering, Princeton University, Princeton, New Jersey, 08540
| | | | | | | | | | | |
Collapse
|
20
|
Wang Q, Shang C, Xu D, Shang Y. NEW MDS AND CLUSTERING BASED ALGORITHMS FOR PROTEIN MODEL QUALITY ASSESSMENT AND SELECTION. INT J ARTIF INTELL T 2013; 22:1360006. [PMID: 24808625 DOI: 10.1142/s0218213013600063] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
In protein tertiary structure prediction, assessing the quality of predicted models is an essential task. Over the past years, many methods have been proposed for the protein model quality assessment (QA) and selection problem. Despite significant advances, the discerning power of current methods is still unsatisfactory. In this paper, we propose two new algorithms, CC-Select and MDS-QA, based on multidimensional scaling and k-means clustering. For the model selection problem, CC-Select combines consensus with clustering techniques to select the best models from a given pool. Given a set of predicted models, CC-Select first calculates a consensus score for each structure based on its average pairwise structural similarity to other models. Then, similar structures are grouped into clusters using multidimensional scaling and clustering algorithms. In each cluster, the one with the highest consensus score is selected as a candidate model. For the QA problem, MDS-QA combines single-model scoring functions with consensus to determine more accurate assessment score for every model in a given pool. Using extensive benchmark sets of a large collection of predicted models, we compare the two algorithms with existing state-of-the-art quality assessment methods and show significant improvement.
Collapse
Affiliation(s)
- Qingguo Wang
- Bioinformatics and Systems Medicine Laboratory, Vanderbilt University Nashville, TN 37203, USA
| | - Charles Shang
- Computer Science Department, University of Illinois at Urbana-Champaign Urbana, IL 61801, USA
| | - Dong Xu
- Computer Science Department, University of Missouri Columbia, MO 65211, USA
| | - Yi Shang
- Computer Science Department, University of Missouri Columbia, MO 65211, USA
| |
Collapse
|
21
|
Zhou J, Yan W, Hu G, Shen B. SVR_CAF: An integrated score function for detecting native protein structures among decoys. Proteins 2013; 82:556-64. [DOI: 10.1002/prot.24421] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/22/2013] [Revised: 08/30/2013] [Accepted: 09/10/2013] [Indexed: 01/05/2023]
Affiliation(s)
- Jianhong Zhou
- Center for Systems Biology; Soochow University; Suzhou Jiangsu 215006 China
| | - Wenying Yan
- Center for Systems Biology; Soochow University; Suzhou Jiangsu 215006 China
| | - Guang Hu
- Center for Systems Biology; Soochow University; Suzhou Jiangsu 215006 China
| | - Bairong Shen
- Center for Systems Biology; Soochow University; Suzhou Jiangsu 215006 China
| |
Collapse
|
22
|
He Z, Alazmi M, Zhang J, Xu D. Protein structural model selection by combining consensus and single scoring methods. PLoS One 2013; 8:e74006. [PMID: 24023923 PMCID: PMC3759460 DOI: 10.1371/journal.pone.0074006] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2013] [Accepted: 07/26/2013] [Indexed: 01/28/2023] Open
Abstract
Quality assessment (QA) for predicted protein structural models is an important and challenging research problem in protein structure prediction. Consensus Global Distance Test (CGDT) methods assess each decoy (predicted structural model) based on its structural similarity to all others in a decoy set and has been proved to work well when good decoys are in a majority cluster. Scoring functions evaluate each single decoy based on its structural properties. Both methods have their merits and limitations. In this paper, we present a novel method called PWCom, which consists of two neural networks sequentially to combine CGDT and single model scoring methods such as RW, DDFire and OPUS-Ca. Specifically, for every pair of decoys, the difference of the corresponding feature vectors is input to the first neural network which enables one to predict whether the decoy-pair are significantly different in terms of their GDT scores to the native. If yes, the second neural network is used to decide which one of the two is closer to the native structure. The quality score for each decoy in the pool is based on the number of winning times during the pairwise comparisons. Test results on three benchmark datasets from different model generation methods showed that PWCom significantly improves over consensus GDT and single scoring methods. The QA server (MUFOLD-Server) applying this method in CASP 10 QA category was ranked the second place in terms of Pearson and Spearman correlation performance.
Collapse
Affiliation(s)
- Zhiquan He
- Department of Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Missouri, United States of America
| | - Meshari Alazmi
- Department of Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Missouri, United States of America
| | - Jingfen Zhang
- Department of Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Missouri, United States of America
| | - Dong Xu
- Department of Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Missouri, United States of America
- * E-mail:
| |
Collapse
|
23
|
Wallhäußer E, Hussein W, Hussein M, Hinrichs J, Becker T. Detection of dairy fouling: Combining ultrasonic measurements and classification methods. Eng Life Sci 2013. [DOI: 10.1002/elsc.201200081] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022] Open
Affiliation(s)
- E. Wallhäußer
- (Bio-)Process Technology and Process Analysis; Life Science Engineering; Technische Universitaet Muenchen; Freising; Germany
| | - W.B. Hussein
- (Bio-)Process Technology and Process Analysis; Life Science Engineering; Technische Universitaet Muenchen; Freising; Germany
| | - M.A. Hussein
- (Bio-)Process Technology and Process Analysis; Life Science Engineering; Technische Universitaet Muenchen; Freising; Germany
| | - J. Hinrichs
- Animal Foodstuff Technology; Institute for Foodscience and Biotechnology; University of Hohenheim; Stuttgart; Germany
| | - T. Becker
- (Bio-)Process Technology and Process Analysis; Life Science Engineering; Technische Universitaet Muenchen; Freising; Germany
| |
Collapse
|
24
|
Determination of cleaning end of dairy protein fouling using an online system combining ultrasonic and classification methods. FOOD BIOPROCESS TECH 2013. [DOI: 10.1007/s11947-012-1041-0] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
|
25
|
Lü Q, Xia XY, Chen R, Miao DJ, Chen SS, Quan LJ, Li HO. When the lowest energy does not induce native structures: parallel minimization of multi-energy values by hybridizing searching intelligences. PLoS One 2012; 7:e44967. [PMID: 23028708 PMCID: PMC3460973 DOI: 10.1371/journal.pone.0044967] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2012] [Accepted: 08/16/2012] [Indexed: 12/03/2022] Open
Abstract
Background Protein structure prediction (PSP), which is usually modeled as a computational optimization problem, remains one of the biggest challenges in computational biology. PSP encounters two difficult obstacles: the inaccurate energy function problem and the searching problem. Even if the lowest energy has been luckily found by the searching procedure, the correct protein structures are not guaranteed to obtain. Results A general parallel metaheuristic approach is presented to tackle the above two problems. Multi-energy functions are employed to simultaneously guide the parallel searching threads. Searching trajectories are in fact controlled by the parameters of heuristic algorithms. The parallel approach allows the parameters to be perturbed during the searching threads are running in parallel, while each thread is searching the lowest energy value determined by an individual energy function. By hybridizing the intelligences of parallel ant colonies and Monte Carlo Metropolis search, this paper demonstrates an implementation of our parallel approach for PSP. 16 classical instances were tested to show that the parallel approach is competitive for solving PSP problem. Conclusions This parallel approach combines various sources of both searching intelligences and energy functions, and thus predicts protein conformations with good quality jointly determined by all the parallel searching threads and energy functions. It provides a framework to combine different searching intelligence embedded in heuristic algorithms. It also constructs a container to hybridize different not-so-accurate objective functions which are usually derived from the domain expertise.
Collapse
Affiliation(s)
- Qiang Lü
- School of Computer Science and Technology, Soochow University, Suzhou, China.
| | | | | | | | | | | | | |
Collapse
|
26
|
Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, Zecchina R, Sander C. Protein 3D structure computed from evolutionary sequence variation. PLoS One 2011; 6:e28766. [PMID: 22163331 PMCID: PMC3233603 DOI: 10.1371/journal.pone.0028766] [Citation(s) in RCA: 748] [Impact Index Per Article: 57.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2011] [Accepted: 11/14/2011] [Indexed: 11/19/2022] Open
Abstract
The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these constraints. Deciphering the evolutionary record held in these sequences and exploiting it for predictive and engineering purposes presents a formidable challenge. The potential benefit of solving this challenge is amplified by the advent of inexpensive high-throughput genomic sequencing. In this paper we ask whether we can infer evolutionary constraints from a set of sequence homologs of a protein. The challenge is to distinguish true co-evolution couplings from the noisy set of observed correlations. We address this challenge using a maximum entropy model of the protein sequence, constrained by the statistics of the multiple sequence alignment, to infer residue pair couplings. Surprisingly, we find that the strength of these inferred couplings is an excellent predictor of residue-residue proximity in folded structures. Indeed, the top-scoring residue couplings are sufficiently accurate and well-distributed to define the 3D protein fold with remarkable accuracy. We quantify this observation by computing, from sequence alone, all-atom 3D structures of fifteen test proteins from different fold classes, ranging in size from 50 to 260 residues., including a G-protein coupled receptor. These blinded inferences are de novo, i.e., they do not use homology modeling or sequence-similar fragments from known structures. The co-evolution signals provide sufficient information to determine accurate 3D protein structure to 2.7–4.8 Å Cα-RMSD error relative to the observed structure, over at least two-thirds of the protein (method called EVfold, details at http://EVfold.org). This discovery provides insight into essential interactions constraining protein evolution and will facilitate a comprehensive survey of the universe of protein structures, new strategies in protein and drug design, and the identification of functional genetic variants in normal and disease genomes.
Collapse
Affiliation(s)
- Debora S Marks
- Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, United States of America.
| | | | | | | | | | | | | |
Collapse
|
27
|
Wang Z, Cheng J. An iterative self-refining and self-evaluating approach for protein model quality estimation. Protein Sci 2011; 21:142-51. [PMID: 22057943 DOI: 10.1002/pro.764] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2011] [Revised: 09/30/2011] [Accepted: 10/31/2011] [Indexed: 11/11/2022]
Abstract
Evaluating or predicting the quality of protein models (i.e., predicted protein tertiary structures) without knowing their native structures is important for selecting and appropriately using protein models. We describe an iterative approach that improves the performances of protein Model Quality Assurance Programs (MQAPs). Given the initial quality scores of a list of models assigned by a MQAP, the method iteratively refines the scores until the ranking of the models does not change. We applied the method to the model quality assessment data generated by 30 MQAPs during the Eighth Critical Assessment of Techniques for Protein Structure Prediction. To various degrees, our method increased the average correlation between predicted and real quality scores of 25 out of 30 MQAPs and reduced the average loss (i.e., the difference between the top ranked model and the best model) for 28 MQAPs. Particularly, for MQAPs with low average correlations (<0.4), the correlation can be increased by several times. Similar experiments conducted on the CASP9 MQAPs also demonstrated the effectiveness of the method. Our method is a hybrid method that combines the original method of a MQAP and the pair-wise comparison clustering method. It can achieve a high accuracy similar to a full pair-wise clustering method, but with much less computation time when evaluating hundreds of models. Furthermore, without knowing native structures, the iterative refining method can evaluate the performance of a MQAP by analyzing its model quality predictions.
Collapse
Affiliation(s)
- Zheng Wang
- Department of Computer Science, University of Missouri, Columbia, MO 65211, USA
| | | |
Collapse
|
28
|
Wang Q, Vantasin K, Xu D, Shang Y. MUFOLD-WQA: A new selective consensus method for quality assessment in protein structure prediction. Proteins 2011; 79 Suppl 10:185-95. [PMID: 21997748 DOI: 10.1002/prot.23185] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2011] [Revised: 08/25/2011] [Accepted: 08/27/2011] [Indexed: 11/07/2022]
Abstract
Assessing the quality of predicted models is essential in protein tertiary structure prediction. In the past critical assessment of techniques for protein structure prediction (CASP) experiments, consensus quality assessment (QA) methods have shown to be very effective, outperforming single-model methods and other competing approaches by a large margin. In the consensus QA approach, the quality score of a model is typically estimated based on pair-wise structure similarity of it to a set of reference models. In CASP8, the differences among the top QA servers were mostly in the selection of the reference models. In this article, we present a new consensus method "SelCon" based on two key ideas: (1) to adaptively select appropriate reference models based on the attributes of the whole set of predicted models and (2) to weigh different reference models differently, and in particular not to use models that are too similar or too different from the candidate model as its references. We have developed several reference selection functions in SelCon and obtained improved QA results over existing QA methods in experiments using CASP7 and CASP8 data. In the recently completed CASP9 in 2010, the new method was implemented in our MUFOLD-WQA server. Both the official CASP9 assessment and our in-house evaluation showed that MUFOLD-WQA performed very well and achieved top performances in both the global structure QA and top-model selection category in CASP9.
Collapse
Affiliation(s)
- Qingguo Wang
- Department of Computer Science, University of Missouri, Columbia, MO 65211, USA
| | | | | | | |
Collapse
|
29
|
Shi X, Zhang J, He Z, Shang Y, Xu D. A sampling-based method for ranking protein structural models by integrating multiple scores and features. Curr Protein Pept Sci 2011; 12:540-8. [PMID: 21787308 PMCID: PMC4368063 DOI: 10.2174/138920311796957658] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2011] [Revised: 04/01/2011] [Accepted: 05/04/2011] [Indexed: 11/22/2022]
Abstract
One of the major challenges in protein tertiary structure prediction is structure quality assessment. In many cases, protein structure prediction tools generate good structural models, but fail to select the best models from a huge number of candidates as the final output. In this study, we developed a sampling-based machine-learning method to rank protein structural models by integrating multiple scores and features. First, features such as predicted secondary structure, solvent accessibility and residue-residue contact information are integrated by two Radial Basis Function (RBF) models trained from different datasets. Then, the two RBF scores and five selected scoring functions developed by others, i.e., Opus-CA, Opus-PSP, DFIRE, RAPDF, and Cheng Score are synthesized by a sampling method. At last, another integrated RBF model ranks the structural models according to the features of sampling distribution. We tested the proposed method by using two different datasets, including the CASP server prediction models of all CASP8 targets and a set of models generated by our in-house software MUFOLD. The test result shows that our method outperforms any individual scoring function on both best model selection, and overall correlation between the predicted ranking and the actual ranking of structural quality.
Collapse
Affiliation(s)
- Xiaohu Shi
- College of Computer Science and Technology, Jilin University, Jilin, Changchun 130012, China
| | | | | | | | | |
Collapse
|
30
|
Huang SY, Zou X. Statistical mechanics-based method to extract atomic distance-dependent potentials from protein structures. Proteins 2011; 79:2648-61. [PMID: 21732421 PMCID: PMC11108592 DOI: 10.1002/prot.23086] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2011] [Revised: 04/21/2011] [Accepted: 05/09/2011] [Indexed: 12/25/2022]
Abstract
In this study, we have developed a statistical mechanics-based iterative method to extract statistical atomic interaction potentials from known, nonredundant protein structures. Our method circumvents the long-standing reference state problem in deriving traditional knowledge-based scoring functions, by using rapid iterations through a physical, global convergence function. The rapid convergence of this physics-based method, unlike other parameter optimization methods, warrants the feasibility of deriving distance-dependent, all-atom statistical potentials to keep the scoring accuracy. The derived potentials, referred to as ITScore/Pro, have been validated using three diverse benchmarks: the high-resolution decoy set, the AMBER benchmark decoy set, and the CASP8 decoy set. Significant improvement in performance has been achieved. Finally, comparisons between the potentials of our model and potentials of a knowledge-based scoring function with a randomized reference state have revealed the reason for the better performance of our scoring function, which could provide useful insight into the development of other physical scoring functions. The potentials developed in this study are generally applicable for structural selection in protein structure prediction.
Collapse
Affiliation(s)
- Sheng-You Huang
- Department of Physics and Astronomy, Department of Biochemistry, Dalton Cardiovascular Research Center, and Informatics Institute, University of Missouri, Columbia, MO 65211
| | - Xiaoqin Zou
- Department of Physics and Astronomy, Department of Biochemistry, Dalton Cardiovascular Research Center, and Informatics Institute, University of Missouri, Columbia, MO 65211
| |
Collapse
|
31
|
Wang Z, Eickholt J, Cheng J. APOLLO: a quality assessment service for single and multiple protein models. Bioinformatics 2011; 27:1715-6. [PMID: 21546397 PMCID: PMC3106203 DOI: 10.1093/bioinformatics/btr268] [Citation(s) in RCA: 70] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Abstract
Summary: We built a web server named APOLLO, which can evaluate the absolute global and local qualities of a single protein model using machine learning methods or the global and local qualities of a pool of models using a pair-wise comparison approach. Based on our evaluations on 107 CASP9 (Critical Assessment of Techniques for Protein Structure Prediction) targets, the predicted quality scores generated from our machine learning and pair-wise methods have an average per-target correlation of 0.671 and 0.917, respectively, with the true model quality scores. Based on our test on 92 CASP9 targets, our predicted absolute local qualities have an average difference of 2.60 Å with the actual distances to native structure. Availability:http://sysbio.rnet.missouri.edu/apollo/. Single and pair-wise global quality assessment software is also available at the site. Contact:chengji@missouri.edu
Collapse
Affiliation(s)
- Zheng Wang
- Department of Computer Science, University of Missouri, Columbia, MO 65211, USA
| | | | | |
Collapse
|
32
|
Shirota M, Ishida T, Kinoshita K. Absolute quality evaluation of protein model structures using statistical potentials with respect to the native and reference states. Proteins 2011; 79:1550-63. [PMID: 21365682 DOI: 10.1002/prot.22982] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2010] [Revised: 11/19/2010] [Accepted: 12/19/2010] [Indexed: 11/06/2022]
Abstract
In protein structure prediction, it is crucial to evaluate the degree of native-likeness of given model structures. Statistical potentials extracted from protein structure data sets are widely used for such quality assessment problems, but they are only applicable for comparing different models of the same protein. Although various other methods, such as machine learning approaches, were developed to predict the absolute similarity of model structures to the native ones, they required a set of decoy structures in addition to the model structures. In this paper, we tried to reformulate the statistical potentials as absolute quality scores, without using the information from decoy structures. For this purpose, we regarded the native state and the reference state, which are necessary components of statistical potentials, as the good and bad standard states, respectively, and first showed that the statistical potentials can be regarded as the state functions, which relate a model structure to the native and reference states. Then, we proposed a standardized measure of protein structure, called native-likeness, by interpolating the score of a model structure between the native and reference state scores defined for each protein. The native-likeness correlated with the similarity to the native structures and discriminated the native structures from the models, with better accuracy than the raw score. Our results show that statistical potentials can quantify the native-like properties of protein structures, if they fully utilize the statistical information obtained from the data set.
Collapse
Affiliation(s)
- Matsuyuki Shirota
- Department of Applied Information Sciences, Graduate School of Information Science, Tohoku University, 6-3-09, Aoba, Aramaki, Aoba-Ku, Sendai, Miyagi 980-8579, Japan
| | | | | |
Collapse
|
33
|
Adamczak R, Pillardy J, Vallat BK, Meller J. Fast geometric consensus approach for protein model quality assessment. J Comput Biol 2011; 18:1807-18. [PMID: 21244273 DOI: 10.1089/cmb.2010.0170] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Model quality assessment (MQA) is an integral part of protein structure prediction methods that typically generate multiple candidate models. The challenge lies in ranking and selecting the best models using a variety of physical, knowledge-based, and geometric consensus (GC)-based scoring functions. In particular, 3D-Jury and related GC methods assume that well-predicted (sub-)structures are more likely to occur frequently in a population of candidate models, compared to incorrectly folded fragments. While this approach is very successful in the context of diversified sets of models, identifying similar substructures is computationally expensive since all pairs of models need to be superimposed using MaxSub or related heuristics for structure-to-structure alignment. Here, we consider a fast alternative, in which structural similarity is assessed using 1D profiles, e.g., consisting of relative solvent accessibilities and secondary structures of equivalent amino acid residues in the respective models. We show that the new approach, dubbed 1D-Jury, allows to implicitly compare and rank N models in O(N) time, as opposed to quadratic complexity of 3D-Jury and related clustering-based methods. In addition, 1D-Jury avoids computationally expensive 3D superposition of pairs of models. At the same time, structural similarity scores based on 1D profiles are shown to correlate strongly with those obtained using MaxSub. In terms of the ability to select the best models as top candidates 1D-Jury performs on par with other GC methods. Other potential applications of the new approach, including fast clustering of large numbers of intermediate structures generated by folding simulations, are discussed as well.
Collapse
Affiliation(s)
- Rafal Adamczak
- Department of Environmental Health, University of Cincinnati College of Medicine, Cincinnati, Ohio 45226, USA
| | | | | | | |
Collapse
|
34
|
Li Y, Rata I, Chiu SW, Jakobsson E. Improving predicted protein loop structure ranking using a Pareto-optimality consensus method. BMC STRUCTURAL BIOLOGY 2010; 10:22. [PMID: 20642859 PMCID: PMC2914074 DOI: 10.1186/1472-6807-10-22] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/27/2009] [Accepted: 07/20/2010] [Indexed: 11/10/2022]
Abstract
Background Accurate protein loop structure models are important to understand functions of many proteins. Identifying the native or near-native models by distinguishing them from the misfolded ones is a critical step in protein loop structure prediction. Results We have developed a Pareto Optimal Consensus (POC) method, which is a consensus model ranking approach to integrate multiple knowledge- or physics-based scoring functions. The procedure of identifying the models of best quality in a model set includes: 1) identifying the models at the Pareto optimal front with respect to a set of scoring functions, and 2) ranking them based on the fuzzy dominance relationship to the rest of the models. We apply the POC method to a large number of decoy sets for loops of 4- to 12-residue in length using a functional space composed of several carefully-selected scoring functions: Rosetta, DOPE, DDFIRE, OPLS-AA, and a triplet backbone dihedral potential developed in our lab. Our computational results show that the sets of Pareto-optimal decoys, which are typically composed of ~20% or less of the overall decoys in a set, have a good coverage of the best or near-best decoys in more than 99% of the loop targets. Compared to the individual scoring function yielding best selection accuracy in the decoy sets, the POC method yields 23%, 37%, and 64% less false positives in distinguishing the native conformation, indentifying a near-native model (RMSD < 0.5A from the native) as top-ranked, and selecting at least one near-native model in the top-5-ranked models, respectively. Similar effectiveness of the POC method is also found in the decoy sets from membrane protein loops. Furthermore, the POC method outperforms the other popularly-used consensus strategies in model ranking, such as rank-by-number, rank-by-rank, rank-by-vote, and regression-based methods. Conclusions By integrating multiple knowledge- and physics-based scoring functions based on Pareto optimality and fuzzy dominance, the POC method is effective in distinguishing the best loop models from the other ones within a loop model set.
Collapse
Affiliation(s)
- Yaohang Li
- Department of Computer Science, Old Dominion University, Norfolk, VA 23529, USA.
| | | | | | | |
Collapse
|
35
|
Girgis HZ, Corso JJ, Fischer D. On-line hierarchy of general linear models for selecting and ranking the best predicted protein structures. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2010; 2009:4949-53. [PMID: 19963875 DOI: 10.1109/iembs.2009.5332706] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
To predict the three dimensional structure of proteins, many computational methods sample the conformational space, generating a large number of candidate structures. Subsequently, such methods rank the generated structures using a variety of model quality assessment programs in order to obtain a small set of structures that are most likely to resemble the unknown experimentally determined structure. Model quality assessment programs suffer from two main limitations: (i) the rank-one structure is not always the best predicted structure; in other words, the best predicted structure could be ranked as the 10th structure (ii) no single assessment method can correctly rank the predicted structures for all target proteins. However, because often at least some of the methods achieve a good ranking, a model quality assessment method that is based on a consensus of a number of model quality assessment methods is likely to perform better. We have devised the STPdata algorithm, a consensus method based on five model quality assessment programs. We have applied it to build an on-line "custom-trained" hierarchy of general linear models to select and rank the best predicted structures. By "custom-trained", we mean for each target protein the STPdata algorithm trains a unique model on data related to the input target protein. To evaluate our method we participated in CASP8 as human predictors. In CASP8, the STPdata algorithm has trained 128 hierarchical models for each of the 128 target proteins. Based on the official results of CASP8 our method outperformed the best server by 6% and won the fourth position among human predictors. Our CASP results are purely based on computational methods without any human intervention.
Collapse
Affiliation(s)
- Hani Zakaria Girgis
- Computer Science Department, The Johns Hopkins University, Baltimore, MD, USA.
| | | | | |
Collapse
|
36
|
Chae MH, Krull F, Lorenzen S, Knapp EW. Predicting protein complex geometries with a neural network. Proteins 2010; 78:1026-39. [PMID: 19938153 DOI: 10.1002/prot.22626] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
Abstract
A major challenge of the protein docking problem is to define scoring functions that can distinguish near-native protein complex geometries from a large number of non-native geometries (decoys) generated with noncomplexed protein structures (unbound docking). In this study, we have constructed a neural network that employs the information from atom-pair distance distributions of a large number of decoys to predict protein complex geometries. We found that docking prediction can be significantly improved using two different types of polar hydrogen atoms. To train the neural network, 2000 near-native decoys of even distance distribution were used for each of the 185 considered protein complexes. The neural network normalizes the information from different protein complexes using an additional protein complex identity input neuron for each complex. The parameters of the neural network were determined such that they mimic a scoring funnel in the neighborhood of the native complex structure. The neural network approach avoids the reference state problem, which occurs in deriving knowledge-based energy functions for scoring. We show that a distance-dependent atom pair potential performs much better than a simple atom-pair contact potential. We have compared the performance of our scoring function with other empirical and knowledge-based scoring functions such as ZDOCK 3.0, ZRANK, ITScore-PP, EMPIRE, and RosettaDock. In spite of the simplicity of the method and its functional form, our neural network-based scoring function achieves a reasonable performance in rigid-body unbound docking of proteins. Proteins 2010. (c) 2009 Wiley-Liss, Inc.
Collapse
Affiliation(s)
- Myong-Ho Chae
- Department of Biology, University of Science, Unjong-District, Pyongyang, DPR Korea
| | | | | | | |
Collapse
|
37
|
Lobanov MY, Finkel’shtein AV. Analogy-based protein structure prediction: III. Optimizing the combination of the substitution matrix and pseudopotentials used to align protein sequences with spatial structures. Mol Biol 2010. [DOI: 10.1134/s0026893310010140] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022]
|
38
|
Wolff K, Vendruscolo M, Porto M. Efficient identification of near-native conformations in ab initio protein structure prediction using structural profiles. Proteins 2010; 78:249-58. [PMID: 19701942 DOI: 10.1002/prot.22533] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
One of the major bottlenecks in many ab initio protein structure prediction methods is currently the selection of a small number of candidate structures for high-resolution refinement from large sets of low-resolution decoys. This step often includes a scoring by low-resolution energy functions and a clustering of conformations by their pairwise root mean square deviations (RMSDs). As an efficient selection is crucial to reduce the overall computational cost of the predictions, any improvement in this direction can increase the overall performance of the predictions and the range of protein structures that can be predicted. We show here that the use of structural profiles, which can be predicted with good accuracy from the amino acid sequences of proteins, provides an efficient means to identify good candidate structures.
Collapse
Affiliation(s)
- Katrin Wolff
- Institut für Festkörperphysik, Technische Universität Darmstadt, 64289 Darmstadt, Germany
| | | | | |
Collapse
|
39
|
Cheng J, Wang Z, Tegge AN, Eickholt J. Prediction of global and local quality of CASP8 models by MULTICOM series. Proteins 2010; 77 Suppl 9:181-4. [PMID: 19544564 DOI: 10.1002/prot.22487] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
Evaluating the quality of protein structure models is important for selecting and using models. Here, we describe the MULTICOM series of model quality predictors which contains three predictors tested in the CASP8 experiments. We evaluated these predictors on 120 CASP8 targets. The average correlations between predicted and real GDT-TS scores of the two semi-clustering methods (MULTICOM and MULTICOM-CLUSTER) and the one single-model ab initio method (MULTICOM-CMFR) are 0.90, 0.89, and 0.74, respectively; and their average difference (or GDT-TS loss) between the global GDT-TS scores of the top-ranked models and the best models are 0.05, 0.06, and 0.07, respectively. The average correlation between predicted and real local quality scores of the semi-clustering methods is above 0.64. Our results show that the novel semi-clustering approach that compares a model with top ranked reference models can improve initial quality scores generated by the ab initio method and a simple meta approach.
Collapse
Affiliation(s)
- Jianlin Cheng
- Computer Science Department, University of Missouri, Columbia, Missouri 65211, USA.
| | | | | | | |
Collapse
|
40
|
Benkert P, Tosatto SCE, Schwede T. Global and local model quality estimation at CASP8 using the scoring functions QMEAN and QMEANclust. Proteins 2010; 77 Suppl 9:173-80. [PMID: 19705484 DOI: 10.1002/prot.22532] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
Identifying the best candidate model among an ensemble of alternatives is crucial in protein structure prediction. For this purpose, scoring functions have been developed which either calculate a quality estimate on the basis of a single model or derive a score from the information contained in the ensemble of models generated for a given sequence (i.e., consensus methods). At CASP7, consensus methods have performed considerably better than scoring functions operating on single models. However, consensus methods tend to fail if the best models are far from the center of the dominant structural cluster. At CASP8, we investigated whether our hybrid method QMEANclust may overcome this limitation by combining the QMEAN composite scoring function operating on single models with consensus information. We participated with four different scoring functions in the quality assessment category. The QMEANclust consensus scoring function turned out to be a successful method both for the ranking of entire models but especially for the estimation of the per-residue model quality. In this article, we briefly describe the two scoring functions QMEAN and QMEANclust and discuss their performance in the context of what went right and wrong at CASP8. Both scoring functions are publicly available at http://swissmodel.expasy.org/qmean/.
Collapse
Affiliation(s)
- Pascal Benkert
- Biozentrum, University of Basel, Basel 4056, Switzerland
| | | | | |
Collapse
|
41
|
Archie JG, Paluszewski M, Karplus K. Applying Undertaker to quality assessment. Proteins 2010; 77 Suppl 9:191-5. [PMID: 19639637 DOI: 10.1002/prot.22508] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
Our group tested three quality assessment functions in CASP8: a function which used only distance constraints derived from alignments (SAM-T08-MQAO), a function which added other single-model terms to the distance constraints (SAM-T08-MQAU), and a function which used both single-model and consensus terms (SAM-T08-MQAC). We analyzed the functions both for ranking models for a single target and for producing an accurate estimate of GDT_TS. Our functions were optimized for the ranking problem, so are perhaps more appropriate for metaserver applications than for providing trustworthiness estimates for single models. On the CASP8 test, the functions with more terms performed better. The MQAC consensus method was substantially better than either single-model function, and the MQAU function was substantially better than the MQAO function that used only constraints from alignments.
Collapse
Affiliation(s)
- John G Archie
- Department of Biomolecular Engineering, University of California, Santa Cruz, California 95064, USA
| | | | | |
Collapse
|
42
|
Abstract
Undertaker is a program designed to help predict protein structure using alignments to proteins of known structure and fragment assembly. The program generates conformations and uses cost functions to select the best structures from among the generated conformations. This paper describes the use of Undertaker's cost functions for model quality assessment. We achieve an accuracy that is similar to other methods, without using consensus-based techniques. Adding consensus-based features further improves our approach substantially. We report several correlation measures, including a new weighted version of Kendall's tau (tau(3)) and show model quality assessment results superior to previously published results on all correlation measures when using only models with no missing atoms.
Collapse
Affiliation(s)
- John Archie
- University of California at Santa Cruz, Biomolecular Engineering, Santa Cruz, CA, USA
| | | |
Collapse
|
43
|
Paluszewski M, Karplus K. Model quality assessment using distance constraints from alignments. Proteins 2009; 75:540-9. [PMID: 19003987 DOI: 10.1002/prot.22262] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Given a set of alternative models for a specific protein sequence, the model quality assessment (MQA) problem asks for an assignment of scores to each model in the set. A good MQA program assigns these scores such that they correlate well with real quality of the models, ideally scoring best that model which is closest to the true structure. In this article, we present a new approach for addressing the MQA problem. It is based on distance constraints extracted from alignments to templates of known structure, and is implemented in the Undertaker program for protein structure prediction. One novel feature is that we extract noncontact constraints as well as contact constraints. We describe how the distance constraint extraction is done and we show how they can be used to address the MQA problem. We have compared our method on CASP7 targets and the results show that our method is at least comparable with the best MQA methods that were assessed at CASP7. We also propose a new evaluation measure, Kendall's tau, that is more interpretable than conventional measures used for evaluating MQA methods (Pearson's r and Spearman's rho). We show clear examples where Kendall's tau agrees much more with our intuition of a correct MQA, and we therefore propose that Kendall's tau be used for future CASP MQA assessments.
Collapse
Affiliation(s)
- Martin Paluszewski
- Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
| | | |
Collapse
|
44
|
Wang Z, Tegge AN, Cheng J. Evaluating the absolute quality of a single protein model using structural features and support vector machines. Proteins 2009; 75:638-47. [PMID: 19004001 DOI: 10.1002/prot.22275] [Citation(s) in RCA: 78] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Abstract
Knowing the quality of a protein structure model is important for its appropriate usage. We developed a model evaluation method to assess the absolute quality of a single protein model using only structural features with support vector machine regression. The method assigns an absolute quantitative score (i.e. GDT-TS) to a model by comparing its secondary structure, relative solvent accessibility, contact map, and beta sheet structure with their counterparts predicted from its primary sequence. We trained and tested the method on the CASP6 dataset using cross-validation. The correlation between predicted and true scores is 0.82. On the independent CASP7 dataset, the correlation averaged over 95 protein targets is 0.76; the average correlation for template-based and ab initio targets is 0.82 and 0.50, respectively. Furthermore, the predicted absolute quality scores can be used to rank models effectively. The average difference (or loss) between the scores of the top-ranked models and the best models is 5.70 on the CASP7 targets. This method performs favorably when compared with the other methods used on the same dataset. Moreover, the predicted absolute quality scores are comparable across models for different proteins. These features make the method a valuable tool for model quality assurance and ranking.
Collapse
Affiliation(s)
- Zheng Wang
- Computer Science Department, Informatics Institute, University of Missouri, Columbia, MO 65211, USA
| | | | | |
Collapse
|
45
|
Song J, Tan H, Mahmood K, Law RHP, Buckle AM, Webb GI, Akutsu T, Whisstock JC. Prodepth: predict residue depth by support vector regression approach from protein sequences only. PLoS One 2009; 4:e7072. [PMID: 19759917 PMCID: PMC2742725 DOI: 10.1371/journal.pone.0007072] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/14/2009] [Accepted: 08/20/2009] [Indexed: 11/24/2022] Open
Abstract
Residue depth (RD) is a solvent exposure measure that complements the information provided by conventional accessible surface area (ASA) and describes to what extent a residue is buried in the protein structure space. Previous studies have established that RD is correlated with several protein properties, such as protein stability, residue conservation and amino acid types. Accurate prediction of RD has many potentially important applications in the field of structural bioinformatics, for example, facilitating the identification of functionally important residues, or residues in the folding nucleus, or enzyme active sites from sequence information. In this work, we introduce an efficient approach that uses support vector regression to quantify the relationship between RD and protein sequence. We systematically investigated eight different sequence encoding schemes including both local and global sequence characteristics and examined their respective prediction performances. For the objective evaluation of our approach, we used 5-fold cross-validation to assess the prediction accuracies and showed that the overall best performance could be achieved with a correlation coefficient (CC) of 0.71 between the observed and predicted RD values and a root mean square error (RMSE) of 1.74, after incorporating the relevant multiple sequence features. The results suggest that residue depth could be reliably predicted solely from protein primary sequences: local sequence environments are the major determinants, while global sequence features could influence the prediction performance marginally. We highlight two examples as a comparison in order to illustrate the applicability of this approach. We also discuss the potential implications of this new structural parameter in the field of protein structure prediction and homology modeling. This method might prove to be a powerful tool for sequence analysis.
Collapse
Affiliation(s)
- Jiangning Song
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto, Japan
- * E-mail: (JS); (JCW)
| | - Hao Tan
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
| | - Khalid Mahmood
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
- ARC Centre of Excellence for Structural and Functional Microbial Genomics, Monash University, Clayton, Melbourne, Victoria, Australia
| | - Ruby H. P. Law
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
| | - Ashley M. Buckle
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
| | - Geoffrey I. Webb
- Faculty of Information Technology, Monash University, Clayton, Melbourne, Victoria, Australia
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto, Japan
| | - James C. Whisstock
- Department of Biochemistry and Molecular Biology, Monash University, Clayton, Melbourne, Victoria, Australia
- ARC Centre of Excellence for Structural and Functional Microbial Genomics, Monash University, Clayton, Melbourne, Victoria, Australia
- * E-mail: (JS); (JCW)
| |
Collapse
|
46
|
Benkert P, Schwede T, Tosatto SC. QMEANclust: estimation of protein model quality by combining a composite scoring function with structural density information. BMC STRUCTURAL BIOLOGY 2009; 9:35. [PMID: 19457232 PMCID: PMC2709111 DOI: 10.1186/1472-6807-9-35] [Citation(s) in RCA: 112] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/21/2008] [Accepted: 05/20/2009] [Indexed: 11/10/2022]
Abstract
BACKGROUND The selection of the most accurate protein model from a set of alternatives is a crucial step in protein structure prediction both in template-based and ab initio approaches. Scoring functions have been developed which can either return a quality estimate for a single model or derive a score from the information contained in the ensemble of models for a given sequence. Local structural features occurring more frequently in the ensemble have a greater probability of being correct. Within the context of the CASP experiment, these so called consensus methods have been shown to perform considerably better in selecting good candidate models, but tend to fail if the best models are far from the dominant structural cluster. In this paper we show that model selection can be improved if both approaches are combined by pre-filtering the models used during the calculation of the structural consensus. RESULTS Our recently published QMEAN composite scoring function has been improved by including an all-atom interaction potential term. The preliminary model ranking based on the new QMEAN score is used to select a subset of reliable models against which the structural consensus score is calculated. This scoring function called QMEANclust achieves a correlation coefficient of predicted quality score and GDT_TS of 0.9 averaged over the 98 CASP7 targets and perform significantly better in selecting good models from the ensemble of server models than any other groups participating in the quality estimation category of CASP7. Both scoring functions are also benchmarked on the MOULDER test set consisting of 20 target proteins each with 300 alternatives models generated by MODELLER. QMEAN outperforms all other tested scoring functions operating on individual models, while the consensus method QMEANclust only works properly on decoy sets containing a certain fraction of near-native conformations. We also present a local version of QMEAN for the per-residue estimation of model quality (QMEANlocal) and compare it to a new local consensus-based approach. CONCLUSION Improved model selection is obtained by using a composite scoring function operating on single models in order to enrich higher quality models which are subsequently used to calculate the structural consensus. The performance of consensus-based methods such as QMEANclust highly depends on the composition and quality of the model ensemble to be analysed. Therefore, performance estimates for consensus methods based on large meta-datasets (e.g. CASP) might overrate their applicability in more realistic modelling situations with smaller sets of models based on individual methods.
Collapse
Affiliation(s)
- Pascal Benkert
- Swiss Institute of Bioinformatics, Biozentrum, University of Basel, Klingelbergstrasse 50/70, 4056 Basel, Switzerland.
| | | | | |
Collapse
|
47
|
Skolnick J, Brylinski M. FINDSITE: a combined evolution/structure-based approach to protein function prediction. Brief Bioinform 2009; 10:378-91. [PMID: 19324930 DOI: 10.1093/bib/bbp017] [Citation(s) in RCA: 72] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
A key challenge of the post-genomic era is the identification of the function(s) of all the molecules in a given organism. Here, we review the status of sequence and structure-based approaches to protein function inference and ligand screening that can provide functional insights for a significant fraction of the approximately 50% of ORFs of unassigned function in an average proteome. We then describe FINDSITE, a recently developed algorithm for ligand binding site prediction, ligand screening and molecular function prediction, which is based on binding site conservation across evolutionary distant proteins identified by threading. Importantly, FINDSITE gives comparable results when high-resolution experimental structures as well as predicted protein models are used.
Collapse
Affiliation(s)
- Jeffrey Skolnick
- Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology 250 14th St NW, Atlanta, GA 30318, USA.
| | | |
Collapse
|
48
|
Kryshtafovych A, Fidelis K. Protein structure prediction and model quality assessment. Drug Discov Today 2009; 14:386-93. [PMID: 19100336 DOI: 10.1016/j.drudis.2008.11.010] [Citation(s) in RCA: 65] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2008] [Revised: 11/05/2008] [Accepted: 11/18/2008] [Indexed: 01/02/2023]
Abstract
Protein structures have proven to be a crucial piece of information for biomedical research. Of the millions of currently sequenced proteins only a small fraction is experimentally solved for structure and the only feasible way to bridge the gap between sequence and structure data is computational modeling. Half a century has passed since it was shown that the amino acid sequence of a protein determines its shape, but a method to translate the sequence code reliably into the 3D structure still remains to be developed. This review summarizes modern protein structure prediction techniques with the emphasis on comparative modeling, and describes the recent advances in methods for theoretical model quality assessment.
Collapse
Affiliation(s)
- Andriy Kryshtafovych
- Protein Structure Prediction Center, Genome Center, University of California Davis, Davis, CA 95616, USA.
| | | |
Collapse
|
49
|
Eramian D, Eswar N, Shen MY, Sali A. How well can the accuracy of comparative protein structure models be predicted? Protein Sci 2008; 17:1881-93. [PMID: 18832340 DOI: 10.1110/ps.036061.108] [Citation(s) in RCA: 114] [Impact Index Per Article: 7.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
Abstract
Comparative structure models are available for two orders of magnitude more protein sequences than are experimentally determined structures. These models, however, suffer from two limitations that experimentally determined structures do not: They frequently contain significant errors, and their accuracy cannot be readily assessed. We have addressed the latter limitation by developing a protocol optimized specifically for predicting the Calpha root-mean-squared deviation (RMSD) and native overlap (NO3.5A) errors of a model in the absence of its native structure. In contrast to most traditional assessment scores that merely predict one model is more accurate than others, this approach quantifies the error in an absolute sense, thus helping to determine whether or not the model is suitable for intended applications. The assessment relies on a model-specific scoring function constructed by a support vector machine. This regression optimizes the weights of up to nine features, including various sequence similarity measures and statistical potentials, extracted from a tailored training set of models unique to the model being assessed: If possible, we use similarly sized models with the same fold; otherwise, we use similarly sized models with the same secondary structure composition. This protocol predicts the RMSD and NO3.5A errors for a diverse set of 580,317 comparative models of 6174 sequences with correlation coefficients (r) of 0.84 and 0.86, respectively, to the actual errors. This scoring function achieves the best correlation compared to 13 other tested assessment criteria that achieved correlations ranging from 0.35 to 0.71.
Collapse
Affiliation(s)
- David Eramian
- Graduate Group in Biophysics, University of California at San Francisco, California 94158, USA
| | | | | | | |
Collapse
|
50
|
Larsson P, Wallner B, Lindahl E, Elofsson A. Using multiple templates to improve quality of homology models in automated homology modeling. Protein Sci 2008; 17:990-1002. [PMID: 18441233 DOI: 10.1110/ps.073344908] [Citation(s) in RCA: 109] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
Abstract
When researchers build high-quality models of protein structure from sequence homology, it is today common to use several alternative target-template alignments. Several methods can, at least in theory, utilize information from multiple templates, and many examples of improved model quality have been reported. However, to our knowledge, thus far no study has shown that automatic inclusion of multiple alignments is guaranteed to improve models without artifacts. Here, we have carried out a systematic investigation of the potential of multiple templates to improving homology model quality. We have used test sets consisting of targets from both recent CASP experiments and a larger reference set. In addition to Modeller and Nest, a new method (Pfrag) for multiple template-based modeling is used, based on the segment-matching algorithm from Levitt's SegMod program. Our results show that all programs can produce multi-template models better than any of the single-template models, but a large part of the improvement is simply due to extension of the models. Most of the remaining improved cases were produced by Modeller. The most important factor is the existence of high-quality single-sequence input alignments. Because of the existence of models that are worse than any of the top single-template models, the average model quality does not improve significantly. However, by ranking models with a model quality assessment program such as ProQ, the average quality is improved by approximately 5% in the CASP7 test set.
Collapse
Affiliation(s)
- Per Larsson
- Center for Biomembrane Research, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden
| | | | | | | |
Collapse
|