1
Gavalda-Garcia J, Bickel D, Roca-Martinez J, Raimondi D, Orlando G, Vranken W. Data-driven probabilistic definition of the low energy conformational states of protein residues. NAR Genom Bioinform 2024; 6:lqae082. [PMID: 38984065 PMCID: PMC11231583 DOI: 10.1093/nargab/lqae082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2024] [Revised: 06/14/2024] [Accepted: 06/26/2024] [Indexed: 07/11/2024]
Abstract
Protein dynamics and related conformational changes are essential for their function but difficult to characterise and interpret. Amino acids in a protein behave according to their local energy landscape, which is determined by their local structural context and environmental conditions. The lowest energy state for a given residue can correspond to sharply defined conformations, e.g. in a stable helix, or can cover a wide range of conformations, e.g. in intrinsically disordered regions. A good definition of such low energy states is therefore important to describe the behaviour of a residue and how it changes with its environment. We propose a data-driven probabilistic definition of six low energy conformational states typically accessible for amino acid residues in proteins. This definition is based on solution NMR information of 1322 proteins through a combined analysis of structure ensembles with interpreted chemical shifts. We further introduce a conformational state variability parameter that captures, based on an ensemble of protein structures from molecular dynamics or other methods, how often a residue moves between these conformational states. The approach enables a different perspective on the local conformational behaviour of proteins that is complementary to their static interpretation from single structure models.
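As an illustration of the conformational state variability idea described in this abstract, the sketch below counts how often a residue's state label changes between consecutive members of an ordered ensemble (for example, frames of a molecular dynamics trajectory). The function name `state_variability`, the single-letter labels, and the transition-fraction formula are assumptions made for illustration; the abstract does not give the paper's actual definition.

```python
def state_variability(states):
    """Fraction of consecutive ensemble members in which the state label changes.

    `states` is an ordered list of per-frame state labels for ONE residue.
    This transition fraction is a hypothetical stand-in, not the published parameter.
    """
    if len(states) < 2:
        return 0.0  # a single frame cannot show any transition
    changes = sum(1 for prev, curr in zip(states, states[1:]) if prev != curr)
    return changes / (len(states) - 1)


# Hypothetical per-frame labels for two residues
rigid_residue = ["A", "A", "A", "A", "A"]
mobile_residue = ["A", "A", "B", "B", "B"]
print(state_variability(rigid_residue))
print(state_variability(mobile_residue))
```

A value near 0 would indicate a residue that stays in one state across the ensemble, and a value near 1 a residue that switches state at almost every step. The measure only makes sense when the order of the ensemble members is meaningful.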
Affiliation(s)
- Jose Gavalda-Garcia
  - Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, Belgium
  - Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
- David Bickel
  - Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, Belgium
  - Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
- Joel Roca-Martinez
  - Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, Belgium
  - Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
- Wim Vranken
  - Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, Belgium
  - Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium
2
Huang B, Kong L, Wang C, Ju F, Zhang Q, Zhu J, Gong T, Zhang H, Yu C, Zheng WM, Bu D. Protein Structure Prediction: Challenges, Advances, and the Shift of Research Paradigms. Genomics, Proteomics & Bioinformatics 2023; 21:913-925. [PMID: 37001856 PMCID: PMC10928435 DOI: 10.1016/j.gpb.2022.11.014] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 10/15/2022] [Revised: 11/23/2022] [Accepted: 11/30/2022] [Indexed: 03/31/2023]
Abstract
Protein structure prediction is an interdisciplinary research topic that has attracted researchers from multiple fields, including biochemistry, medicine, physics, mathematics, and computer science. These researchers adopt various research paradigms to attack the same structure prediction problem: biochemists and physicists attempt to reveal the principles governing protein folding; mathematicians, especially statisticians, usually start from assuming a probability distribution of protein structures given a target sequence and then find the most likely structure, while computer scientists formulate protein structure prediction as an optimization problem - finding the structural conformation with the lowest energy or minimizing the difference between predicted structure and native structure. These research paradigms fall into the two statistical modeling cultures proposed by Leo Breiman, namely, data modeling and algorithmic modeling. Recently, we have also witnessed the great success of deep learning in protein structure prediction. In this review, we present a survey of the efforts for protein structure prediction. We compare the research paradigms adopted by researchers from different fields, with an emphasis on the shift of research paradigms in the era of deep learning. In short, the algorithmic modeling techniques, especially deep neural networks, have considerably improved the accuracy of protein structure prediction; however, theories interpreting the neural networks and knowledge on protein folding are still highly desired.
Affiliation(s)
- Bin Huang
  - Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Lupeng Kong
  - Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; Changping Laboratory, Beijing 102206, China
- Chao Wang
  - Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- Fusong Ju
  - Microsoft Research AI4Science, Beijing 100080, China
- Qi Zhang
  - Huawei Noah's Ark Lab, Wuhan 430206, China
- Jianwei Zhu
  - Microsoft Research AI4Science, Beijing 100080, China
- Tiansu Gong
  - Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Haicang Zhang
  - Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Zhongke Big Data Academy, Zhengzhou 450046, China
- Chungong Yu
  - Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Zhongke Big Data Academy, Zhengzhou 450046, China
- Wei-Mou Zheng
  - Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China
- Dongbo Bu
  - Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Zhongke Big Data Academy, Zhengzhou 450046, China
3
Rockenfeller R, Müller A. Augmenting the Cobb angle: Three-dimensional analysis of whole spine shapes using Bézier curves. Computer Methods and Programs in Biomedicine 2022; 225:107075. [PMID: 35998481 DOI: 10.1016/j.cmpb.2022.107075] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/01/2022] [Revised: 07/15/2022] [Accepted: 08/11/2022] [Indexed: 06/15/2023]
Abstract
BACKGROUND AND OBJECTIVE: The identification and classification of pathological spinal deformities pose a major challenge to any diagnostician. First, available medical images are usually two-dimensional projections, obscuring elaborated spatial information. Second, several measurement techniques with different thresholds for certain clinical syndromes make it difficult to classify measured results. Here, a method is presented to augment and standardize the analysis of spinal shapes in three dimensions. METHODS: Regarding the first limitation, (semi-)automatic, three-dimensional segmentation techniques of medical images have already been developed. To overcome the second, we propose here a representation of the whole spine by a Bézier curve using the vertebral centers as control points. After normalization, a differential-geometric approach yields information on curvature and torsion at each spinal level as well as in between. RESULTS: Based on literature data and multi-body simulations, we show how these quantities alter with individual posture and during motion. Robustness with respect to missing data is investigated. Approaches towards the identification of spinal disorders are motivated. CONCLUSION: Our results emphasize the need for individualizable identification and classification of spinal deformities and give an outlook on how it might be achieved. The presented methodology constitutes the first fully three-dimensional analysis of spinal shapes, i.e. without the requirement of certain physiological planes (e.g. the sagittal plane) or landmarks (e.g. the apex vertebra).
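The differential-geometric step mentioned in this abstract can be sketched with the standard formulas for a space curve, curvature κ = |r′ × r″| / |r′|³ and torsion τ = (r′ × r″) · r‴ / |r′ × r″|², evaluated on a Bézier curve. The helper names and the three control points below are hypothetical, and the normalization step mentioned in the abstract is not reproduced; this is a minimal illustration, not the authors' implementation.

```python
import math


def de_casteljau(points, t):
    """Evaluate a Bezier curve with the given control points at parameter t."""
    pts = [tuple(p) for p in points]
    while len(pts) > 1:
        pts = [tuple((1 - t) * a + t * b for a, b in zip(p, q))
               for p, q in zip(pts, pts[1:])]
    return pts[0]


def bezier_derivative(points, t, order):
    """order-th derivative at t; each derivative is a Bezier curve of lower degree."""
    pts = [tuple(p) for p in points]
    for _ in range(order):
        if len(pts) < 2:
            return (0.0, 0.0, 0.0)  # derivative of a constant curve
        n = len(pts) - 1
        pts = [tuple(n * (b - a) for a, b in zip(p, q)) for p, q in zip(pts, pts[1:])]
    return de_casteljau(pts, t)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def curvature_torsion(points, t):
    """Return (curvature, torsion) of the 3D Bezier curve at parameter t."""
    d1 = bezier_derivative(points, t, 1)
    d2 = bezier_derivative(points, t, 2)
    d3 = bezier_derivative(points, t, 3)
    c = cross(d1, d2)
    c_norm = math.sqrt(dot(c, c))
    speed = math.sqrt(dot(d1, d1))
    if speed == 0.0:
        raise ValueError("curve is singular at this parameter value")
    kappa = c_norm / speed ** 3
    tau = dot(c, d3) / c_norm ** 2 if c_norm > 0.0 else 0.0
    return kappa, tau


# Hypothetical planar control points (stand-ins for vertebral centers)
control_points = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 0.0)]
print(curvature_torsion(control_points, 0.5))
```

Because the example curve lies in a plane, its torsion is zero everywhere. Non-zero torsion would only appear for control points that leave a common plane.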
Affiliation(s)
- Andreas Müller
  - Institute for Medical Engineering and Information Processing (MTI Mittelrhein), University Koblenz-Landau, Koblenz, Germany; Mechanical Systems Engineering, Swiss Federal Laboratories for Materials Science and Technology (EMPA), Duebendorf, Switzerland
4
Lin E, Lin CH, Lane HY. De Novo Peptide and Protein Design Using Generative Adversarial Networks: An Update. J Chem Inf Model 2022; 62:761-774. [DOI: 10.1021/acs.jcim.1c01361] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 12/11/2022]
Affiliation(s)
- Eugene Lin
  - Department of Biostatistics, University of Washington, Seattle, Washington 98195, United States
  - Department of Electrical & Computer Engineering, University of Washington, Seattle, Washington 98195, United States
  - Graduate Institute of Biomedical Sciences, China Medical University, Taichung 40402, Taiwan
- Chieh-Hsin Lin
  - Graduate Institute of Biomedical Sciences, China Medical University, Taichung 40402, Taiwan
  - Department of Psychiatry, Kaohsiung Chang Gung Memorial Hospital, Chang Gung University College of Medicine, Kaohsiung 83301, Taiwan
  - School of Medicine, Chang Gung University, Taoyuan 33302, Taiwan
- Hsien-Yuan Lane
  - Graduate Institute of Biomedical Sciences, China Medical University, Taichung 40402, Taiwan
  - Department of Psychiatry, China Medical University Hospital, Taichung 40447, Taiwan
  - Brain Disease Research Center, China Medical University Hospital, Taichung 40447, Taiwan
  - Department of Psychology, College of Medical and Health Sciences, Asia University, Taichung 41354, Taiwan
5
Kameda T, Awazu A, Togashi Y. Molecular dynamics analysis of biomolecular systems including nucleic acids. Biophys Physicobiol 2022; 19:e190027. [DOI: 10.2142/biophysico.bppb-v19.0027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2022] [Accepted: 08/18/2022] [Indexed: 12/01/2022]
Affiliation(s)
- Akinori Awazu
  - Graduate School of Integrated Sciences for Life, Hiroshima University
6
Wang J, Mei J, Ren G. Plant microRNAs: Biogenesis, Homeostasis, and Degradation. Frontiers in Plant Science 2019; 10:360. [PMID: 30972093 PMCID: PMC6445950 DOI: 10.3389/fpls.2019.00360] [Citation(s) in RCA: 131] [Impact Index Per Article: 26.2] [Received: 10/27/2018] [Accepted: 03/07/2019] [Indexed: 05/18/2023]
Abstract
MicroRNAs (miRNAs), a class of endogenous, tiny, non-coding RNAs, are master regulators of gene expression among most eukaryotes. Intracellular miRNA abundance is regulated under multiple levels of control including transcription, processing, RNA modification, RNA-induced silencing complex (RISC) assembly, miRNA-target interaction, and turnover. In this review, we summarize our current understanding of the molecular components and mechanisms that influence miRNA biogenesis, homeostasis, and degradation in plants. We also make comparisons with findings from other organisms where necessary.
Affiliation(s)
- Guodong Ren
  - State Key Laboratory of Genetic Engineering and Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, Institute of Plant Biology, School of Life Sciences, Fudan University, Shanghai, China
7
Panja AS, Bandopadhyay B, Nag A, Maiti S. Protein Secondary Structure Determination (PSSD): A New and Simple Approach. Curr Proteomics 2019. [DOI: 10.2174/1570164615666180911113251] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Background: The present investigation explored a computational algorithm for protein secondary structure prediction according to the property of evolutionary transient and a large number (50 each) of homologous mesophilic-thermophilic proteins.
Objectives: These mesophilic-thermophilic proteins were used for numerical measurement of helix-sheet-coil and turn tendency, for which each amino-acid residue was screened to build up the propensity table.
Methods: In the current study, two different propensity windows were introduced that allowed protein secondary structure to be predicted with more than 80% accuracy.
Results: Using this propensity matrix and a dynamic algorithm-based programme, a significant and decisive outcome in the determination of protein (both thermophilic and mesophilic) secondary structure was observed compared with the previous algorithm-based programme. This was demonstrated by comparison with other standard methods, including DSSP as adopted by the PDB, using multiple-comparison ANOVA and Dunnett's t-test.
Conclusion: The PSSD is of great importance in the prediction of structural features of any unknown, unresolved proteins. It is also useful in studies of protein structure-function relationships.
Affiliation(s)
- Anindya Sundar Panja
  - Post Graduate Department of Biotechnology, Oriental Institute of Science and Technology, Vidyasagar University, Midnapore-721102, West Bengal, India
- Bidyut Bandopadhyay
  - Post Graduate Department of Biotechnology, Oriental Institute of Science and Technology, Vidyasagar University, Midnapore-721102, West Bengal, India
- Akash Nag
  - Post Graduate Department of Computer Science, The University of Burdwan, Burdwan, 713104, West Bengal, India
- Smarajit Maiti
  - Post Graduate Department of Biochemistry and Biotechnology, Cell and Molecular Therapeutics Laboratory, Oriental Institute of Science and Technology, Vidyasagar University, Midnapore-721102, West Bengal, India
8
9
Simoncini D, Schiex T, Zhang KYJ. Balancing exploration and exploitation in population-based sampling improves fragment-based de novo protein structure prediction. Proteins 2017; 85:852-858. [PMID: 28066917 DOI: 10.1002/prot.25244] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 10/13/2016] [Revised: 11/29/2016] [Accepted: 12/18/2016] [Indexed: 01/17/2023]
Abstract
Conformational search space exploration remains a major bottleneck for protein structure prediction methods. Population-based meta-heuristics typically make it possible to control the search dynamics and to tune the balance between local energy minimization and search space exploration. EdaFold is a fragment-based approach that can guide search by periodically updating the probability distribution over the fragment libraries used during model assembly. We implement the EdaFold algorithm as a Rosetta protocol and provide two different probability update policies: a cluster-based variation (EdaRose_c) and an energy-based one (EdaRose_en). We analyze the search dynamics of our new Rosetta protocols and show that EdaRose_c is able to provide predictions with lower Cα RMSD to the native structure than EdaRose_en and the Rosetta AbInitio Relax protocol. Our software is freely available as a C++ patch for the Rosetta suite and can be downloaded from http://www.riken.jp/zhangiru/software/. Our protocols can easily be extended in order to create alternative probability update policies and generate new search dynamics.
Affiliation(s)
- David Simoncini
  - INRA MIAT, UR 875, Castanet-Tolosan Cedex, 31326, France
  - Structural Bioinformatics Team, Division of Structural and Synthetic Biology, Center for Life Science Technologies, RIKEN, 1-7-22 Suehiro, Yokohama, Kanagawa, 230-0045, Japan
- Thomas Schiex
  - INRA MIAT, UR 875, Castanet-Tolosan Cedex, 31326, France
- Kam Y J Zhang
  - Structural Bioinformatics Team, Division of Structural and Synthetic Biology, Center for Life Science Technologies, RIKEN, 1-7-22 Suehiro, Yokohama, Kanagawa, 230-0045, Japan
10
Najibi SM, Maadooliat M, Zhou L, Huang JZ, Gao X. Protein Structure Classification and Loop Modeling Using Multiple Ramachandran Distributions. Comput Struct Biotechnol J 2017; 15:243-254. [PMID: 28280526 PMCID: PMC5331158 DOI: 10.1016/j.csbj.2017.01.011] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 10/30/2016] [Revised: 01/26/2017] [Accepted: 01/28/2017] [Indexed: 11/19/2022]
Abstract
Recently, the study of protein structures using angular representations has attracted much attention among structural biologists. The main challenge is how to efficiently model the continuous conformational space of the protein structures based on the differences and similarities between different Ramachandran plots. Despite the presence of statistical methods for modeling angular data of proteins, there is still a substantial need for more sophisticated and faster statistical tools to model the large-scale circular datasets. To address this need, we have developed a nonparametric method for collective estimation of multiple bivariate density functions for a collection of populations of protein backbone angles. The proposed method takes into account the circular nature of the angular data using trigonometric spline which is more efficient compared to existing methods. This collective density estimation approach is widely applicable when there is a need to estimate multiple density functions from different populations with common features. Moreover, the coefficients of adaptive basis expansion for the fitted densities provide a low-dimensional representation that is useful for visualization, clustering, and classification of the densities. The proposed method provides a novel and unique perspective to two important and challenging problems in protein structure research: structure-based protein classification and angular-sampling-based protein loop structure prediction.
Affiliation(s)
- Mehdi Maadooliat
  - Department of Mathematics, Statistics and Computer Science, Marquette University, WI 53201-1881, USA
  - Center for Human Genetics, Marshfield Clinic Research Institute, Marshfield, WI 54449, USA
- Lan Zhou
  - Department of Statistics, Texas A&M University, TX 77843-3143, USA
- Jianhua Z. Huang
  - Department of Statistics, Texas A&M University, TX 77843-3143, USA
- Xin Gao
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
  - Corresponding author.
11
Bhattacharya D, Cao R, Cheng J. UniCon3D: de novo protein structure prediction using united-residue conformational search via stepwise, probabilistic sampling. Bioinformatics 2016; 32:2791-9. [PMID: 27259540 PMCID: PMC5018369 DOI: 10.1093/bioinformatics/btw316] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Received: 03/13/2016] [Accepted: 05/15/2016] [Indexed: 12/20/2022]
Abstract
MOTIVATION: Recent experimental studies have suggested that proteins fold via stepwise assembly of structural units named 'foldons' through the process of sequential stabilization. In parallel, recent computational developments based on probabilistic modeling have shown a promising direction for performing de novo protein conformational sampling from continuous space. However, existing computational approaches for de novo protein structure prediction often randomly sample protein conformational space as opposed to experimentally suggested stepwise sampling. RESULTS: Here, we develop a novel generative, probabilistic model that simultaneously captures local structural preferences of backbone and side chain conformational space of polypeptide chains in a united-residue representation and performs experimentally motivated conditional conformational sampling via stepwise synthesis and assembly of foldon units that minimizes a composite physics and knowledge-based energy function for de novo protein structure prediction. The proposed method, UniCon3D, has been found to (i) sample lower energy conformations with higher accuracy than traditional random sampling in a small benchmark of 6 proteins; (ii) perform comparably with the top five automated methods on 30 difficult target domains from the 11th Critical Assessment of Protein Structure Prediction (CASP) experiment and on 15 difficult target domains from the 10th CASP experiment; and (iii) outperform two state-of-the-art approaches and a baseline counterpart of UniCon3D that performs traditional random sampling for protein modeling aided by predicted residue-residue contacts on 45 targets from the 10th edition of CASP.
AVAILABILITY AND IMPLEMENTATION: Source code, executable versions, manuals and example data of UniCon3D for Linux and OSX are freely available to non-commercial users at http://sysbio.rnet.missouri.edu/UniCon3D/. CONTACT: chengji@missouri.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jianlin Cheng
  - Department of Computer Science Informatics Institute, University of Missouri, Columbia, MO 65211, USA
12
Maadooliat M, Zhou L, Najibi SM, Gao X, Huang JZ. Collective Estimation of Multiple Bivariate Density Functions With Application to Angular-Sampling-Based Protein Loop Modeling. J Am Stat Assoc 2016. [DOI: 10.1080/01621459.2015.1099535] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 01/28/2023]
13
3D protein structure prediction using Imperialist Competitive algorithm and half sphere exposure prediction. J Theor Biol 2016; 391:81-7. [PMID: 26718864 DOI: 10.1016/j.jtbi.2015.12.002] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 06/07/2015] [Revised: 11/22/2015] [Accepted: 12/01/2015] [Indexed: 11/23/2022]
Abstract
Predicting the native structure of proteins based on half-sphere exposure and contact numbers has been studied extensively in recent years. Online predictors of these vectors and of the secondary structures of amino acid sequences have made it possible to design a function for the folding process. By choosing variant structures and directs for each secondary structure, a random conformation can be generated, and a potential function can then be assigned. Minimizing the potential function using meta-heuristic algorithms is the final step in finding the native structure of a given amino acid sequence. In this work, the Imperialist Competitive algorithm was used to accelerate the minimization. Moreover, we used an adaptive procedure to apply revolutionary changes. Finally, we considered a more accurate tool for the prediction of secondary structure. The results of the computational experiments on a standard benchmark show the superiority of the new algorithm over previous methods with a similar potential function.
14
Bhattacharya D, Adhikari B, Li J, Cheng J. FRAGSION: ultra-fast protein fragment library generation by IOHMM sampling. Bioinformatics 2016; 32:2059-61. [PMID: 27153697 DOI: 10.1093/bioinformatics/btw067] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 09/07/2015] [Accepted: 01/30/2016] [Indexed: 11/13/2022]
Abstract
MOTIVATION: The speed, accuracy and robustness of building a protein fragment library have important implications in de novo protein structure prediction, since fragment-based methods are among the most successful approaches in template-free modeling (FM). The majority of existing fragment detection methods rely on database-driven search strategies to identify candidate fragments, which are inherently time-consuming and often hinder the possibility of locating longer fragments due to the limited sizes of databases. Also, it is difficult to alleviate the effect of noisy sequence-based predicted features, such as secondary structures, on fragment quality. RESULTS: Here, we present FRAGSION, a database-free method to efficiently generate a protein fragment library by sampling from an Input-Output Hidden Markov Model. FRAGSION offers some unique features compared to existing approaches in that it (i) is lightning-fast, consuming only a few seconds of CPU time to generate a fragment library for a protein of typical length (300 residues); (ii) can generate dynamic-size fragments of any length (even for the whole protein sequence); and (iii) offers ways to handle noise in predicted secondary structure during fragment sampling. On an FM dataset from the most recent Critical Assessment of Structure Prediction, we demonstrate that FRAGSION provides advantages over the state-of-the-art fragment picking protocol of the ROSETTA suite by speeding up computation by several orders of magnitude while achieving comparable performance in fragment quality. AVAILABILITY AND IMPLEMENTATION: Source code and executable versions of FRAGSION for Linux and MacOS are freely available to non-commercial users at http://sysbio.rnet.missouri.edu/FRAGSION/. It is bundled with a manual and example data. CONTACT: chengji@missouri.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jianlin Cheng
  - Department of Computer Science, Informatics Institute and C. Bond Life Science Center, University of Missouri, Columbia, MO 65211, USA
15
Yang Y, Zhou Y. Effective protein conformational sampling based on predicted torsion angles. J Comput Chem 2015; 37:976-80. [DOI: 10.1002/jcc.24285] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 08/31/2015] [Revised: 11/01/2015] [Accepted: 11/27/2015] [Indexed: 11/09/2022]
Affiliation(s)
- Yuedong Yang
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University; Queensland 4222 Australia
- Yaoqi Zhou
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University; Queensland 4222 Australia
16
De novo protein conformational sampling using a probabilistic graphical model. Sci Rep 2015; 5:16332. [PMID: 26541939 PMCID: PMC4635387 DOI: 10.1038/srep16332] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 06/10/2015] [Accepted: 10/13/2015] [Indexed: 11/08/2022]
Abstract
Efficient exploration of protein conformational space remains challenging, especially for large proteins, when assembling discretized structural fragments extracted from a protein structure database. We propose a fragment-free probabilistic graphical model, FUSION, for conformational sampling in continuous space and assess its accuracy using 'blind' protein targets with a length of up to 250 residues from the CASP11 structure prediction exercise. The method reduces sampling bottlenecks, exhibits strong convergence, and demonstrates better performance than the popular fragment assembly method, ROSETTA, on relatively larger proteins with a length of more than 150 residues in our benchmark set. FUSION is freely available through a web server at http://protein.rnet.missouri.edu/FUSION/.
17
Baudry JP. Estimation and model selection for model-based clustering with the conditional classification likelihood. Electron J Stat 2015. [DOI: 10.1214/15-ejs1026] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Indexed: 11/19/2022]
18
Shrestha R, Zhang KYJ. Improving fragment quality for de novo structure prediction. Proteins 2014; 82:2240-52. [PMID: 24753351 DOI: 10.1002/prot.24587] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 01/28/2014] [Revised: 04/03/2014] [Accepted: 04/15/2014] [Indexed: 11/08/2022]
Abstract
De novo structure prediction can be defined as a search in conformational space under the guidance of an energy function. The most successful de novo structure prediction methods, such as Rosetta, assemble the fragments from known structures to reduce the search space. Therefore, the fragment quality is an important factor in structure prediction. In our study, a method is proposed to generate a new set of fragments from the lowest energy de novo models. These fragments were subsequently used to predict the next-round of models. In a benchmark of 30 proteins, the new set of fragments showed better performance when used to predict de novo structures. The lowest energy model predicted using our method was closer to native structure than Rosetta for 22 proteins. Following a similar trend, the best model among top five lowest energy models predicted using our method was closer to native structure than Rosetta for 20 proteins. In addition, our experiment showed that the C-alpha root mean square deviation was improved from 5.99 to 5.03 Å on average compared to Rosetta when the lowest energy models were picked as the best predicted models.
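The C-alpha root mean square deviation quoted in this abstract measures the average distance between corresponding Cα atoms of a model and the native structure. The sketch below shows only the final averaging step and assumes the two coordinate sets are already optimally superimposed (in practice, a superposition such as the Kabsch algorithm is applied first). The function name `ca_rmsd` and the coordinates are hypothetical.

```python
import math


def ca_rmsd(model, native):
    """RMSD in the units of the input (e.g. angstroms) between matched C-alpha coordinates.

    Assumes both lists have the same residue order and are already superimposed.
    """
    if len(model) != len(native) or not model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((a - b) ** 2
                  for p, q in zip(model, native)
                  for a, b in zip(p, q))
    return math.sqrt(squared / len(model))


# Hypothetical two-residue example: the model is shifted 3 units along z
model_ca = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
native_ca = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]
print(ca_rmsd(model_ca, native_ca))
```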
Affiliation(s)
- Rojan Shrestha
  - Zhang Initiative Research Unit, Institute Laboratories, RIKEN, 2-1 Hirosawa, Wako, Saitama, 351-0198, Japan; Department of Computational Biology, Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba, 277-0882, Japan
19
Simoncini D, Zhang KYJ. Efficient sampling in fragment-based protein structure prediction using an estimation of distribution algorithm. PLoS One 2013; 8:e68954. [PMID: 23935913 PMCID: PMC3723781 DOI: 10.1371/journal.pone.0068954] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Received: 04/18/2013] [Accepted: 06/07/2013] [Indexed: 11/19/2022]
Abstract
Fragment assembly is a powerful method of protein structure prediction that builds protein models from a pool of candidate fragments taken from known structures. Stochastic sampling is subsequently used to refine the models. The structures are first represented as coarse-grained models and then as all-atom models for computational efficiency. Many models have to be generated independently due to the stochastic nature of the sampling methods used to search for the global minimum in a complex energy landscape. In this paper we present EdaFold(AA), a fragment-based approach which shares information between the generated models and steers the search towards native-like regions. A distribution over fragments is estimated from a pool of low energy all-atom models. This iteratively-refined distribution is used to guide the selection of fragments during the building of models for subsequent rounds of structure prediction. The use of an estimation of distribution algorithm enabled EdaFold(AA) to reach lower energy levels and to generate a higher percentage of near-native models. [Formula: see text] uses an all-atom energy function and produces models with atomic resolution. We observed an improvement in energy-driven blind selection of models on a benchmark of EdaFold(AA) in comparison with the [Formula: see text] AbInitioRelax protocol.
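The general idea of an estimation of distribution algorithm, as described in this abstract, can be sketched for a single sequence position: keep the lowest-energy models, count which fragment each one used, and blend those frequencies into the current fragment probabilities. The function name, the `keep_fraction` and `learning_rate` parameters, and the blending rule are assumptions for illustration and are not taken from EdaFold(AA).

```python
from collections import Counter


def update_fragment_distribution(prior, models, keep_fraction=0.5, learning_rate=0.5):
    """Blend the prior fragment probabilities with frequencies seen in low-energy models.

    prior:  dict mapping fragment id -> probability (for one sequence position)
    models: list of (energy, fragment_id) pairs, one per generated model
    All parameter names and the update rule are hypothetical.
    """
    ranked = sorted(models, key=lambda m: m[0])           # lowest energy first
    n_keep = max(1, int(len(ranked) * keep_fraction))     # size of the low-energy pool
    counts = Counter(frag for _, frag in ranked[:n_keep])
    updated = {
        frag: (1 - learning_rate) * p + learning_rate * counts[frag] / n_keep
        for frag, p in prior.items()
    }
    total = sum(updated.values())                         # renormalize to sum to 1
    return {frag: p / total for frag, p in updated.items()}


# Hypothetical data: fragment "a" appears in both low-energy models
prior = {"a": 0.5, "b": 0.5}
models = [(-10.0, "a"), (-9.0, "a"), (-1.0, "b"), (0.0, "b")]
print(update_fragment_distribution(prior, models))
```

Repeating this update over several rounds shifts sampling towards fragments that recur in low-energy models, while the `learning_rate` below 1 keeps some probability on the other fragments so that exploration does not stop.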
Affiliation(s)
- David Simoncini
- Zhang Initiative Research Unit, Institute Laboratories, RIKEN, Wako, Saitama, Japan
- Kam Y. J. Zhang
- Zhang Initiative Research Unit, Institute Laboratories, RIKEN, Wako, Saitama, Japan

20
Dhingra P, Jayaram B. A homology/ab initio hybrid algorithm for sampling near-native protein conformations. J Comput Chem 2013; 34:1925-36. [PMID: 23728619 DOI: 10.1002/jcc.23339] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 11/05/2012] [Revised: 03/09/2013] [Accepted: 04/21/2013] [Indexed: 12/19/2022]
Abstract
One of the major challenges for protein tertiary structure prediction strategies is the quality of conformational sampling algorithms, which must search the protein fold space effectively and readily to generate near-native conformations. In an effort to advance the field by making the best use of available homology as well as fold recognition approaches along with ab initio folding methods, we have developed Bhageerath-H Strgen, a homology/ab initio hybrid algorithm for protein conformational sampling. The methodology is tested on the benchmark CASP9 dataset of 116 targets. In 93% of the cases, a structure with TM-score ≥ 0.5 is generated in the pool of decoys. Further, Bhageerath-H Strgen performed efficiently in comparison with other decoy generation methods. The algorithm is available as the Bhageerath-H Strgen web tool, which is freely accessible for protein decoy generation (http://www.scfbio-iitd.res.in/software/Bhageerath-HStrgen1.jsp).
Affiliation(s)
- Priyanka Dhingra
- Department of Chemistry, Indian Institute of Technology, Hauz Khas, New Delhi, 110016, India

21
Boomsma W, Frellsen J, Harder T, Bottaro S, Johansson KE, Tian P, Stovgaard K, Andreetta C, Olsson S, Valentin JB, Antonov LD, Christensen AS, Borg M, Jensen JH, Lindorff-Larsen K, Ferkinghoff-Borg J, Hamelryck T. PHAISTOS: a framework for Markov chain Monte Carlo simulation and inference of protein structure. J Comput Chem 2013; 34:1697-705. [PMID: 23619610 DOI: 10.1002/jcc.23292] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Received: 12/06/2012] [Revised: 03/14/2013] [Accepted: 03/20/2013] [Indexed: 11/10/2022]
Abstract
We present a new software framework for Markov chain Monte Carlo sampling for simulation, prediction, and inference of protein structure. The software package contains implementations of recent advances in Monte Carlo methodology, such as efficient local updates and sampling from probabilistic models of local protein structure. These models form a probabilistic alternative to the widely used fragment and rotamer libraries. Combined with an easily extendible software architecture, this makes PHAISTOS well suited for Bayesian inference of protein structure from sequence and/or experimental data. Currently, two force-fields are available within the framework: PROFASI and OPLS-AA/L, the latter including the generalized Born surface area solvent model. A flexible command-line and configuration-file interface allows users quickly to set up simulations with the desired configuration. PHAISTOS is released under the GNU General Public License v3.0. Source code and documentation are freely available from http://phaistos.sourceforge.net. The software is implemented in C++ and has been tested on Linux and OSX platforms.
Affiliation(s)
- Wouter Boomsma
- Department of Biology, University of Copenhagen, Copenhagen, 2200, Denmark

22
Abstract
De novo protein structure prediction often generates a large population of candidates (models), and then selects near-native models through clustering. Existing structural model clustering methods are time-consuming due to pairwise distance calculation between models. In this paper, we present a novel method for fast model clustering without losing clustering accuracy. Instead of the commonly used pairwise root mean square deviation and TM-score values, we propose two new distance measures, Dscore1 and Dscore2, based on the comparison of the protein distance matrices for describing the difference and the similarity among models, respectively. The analysis indicates that both the correlation between Dscore1 and root mean square deviation and the correlation between Dscore2 and TM-score are high. Compared to the existing methods with calculation time quadratic in the number of models, our Dscore1-based clustering achieves linear time complexity while obtaining almost the same accuracy for near-native model selection. By using Dscore2 to select representatives of clusters, we can further improve the quality of the representatives with little increase in computing time. In addition, for large sets of models (~500 k), we can give a fast data visualization based on the Dscore distribution in seconds to minutes. Our method has been implemented in a package named MUFOLD-CL, available at http://mufold.org/clustering.php.
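The abstract does not define Dscore1 or Dscore2, so the sketch below does not reproduce them. It only illustrates the general idea of comparing models through their internal distance matrices, which needs no structural superposition. The dissimilarity used here (root-mean-square difference of the upper triangles) and the helper `score_against_mean`, which scores every model against one averaged matrix so that the cost grows linearly with the number of models, are assumptions made for illustration; whether MUFOLD-CL works this way is not stated in the abstract.

```python
import numpy as np

def distance_matrix(coords):
    """Pairwise Euclidean distances between residues; coords has shape (n, 3)."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def dm_dissimilarity(dm_a, dm_b):
    """Root-mean-square difference between two distance matrices (upper triangle).

    Assumed measure for illustration; it is not the paper's Dscore1 or Dscore2.
    """
    iu = np.triu_indices_from(dm_a, k=1)
    return float(np.sqrt(np.mean((dm_a[iu] - dm_b[iu]) ** 2)))

def score_against_mean(models):
    """Score each model against the average distance matrix of the whole set.

    One comparison per model, so the number of comparisons is linear in the
    number of models rather than quadratic as with all-pairs comparison.
    """
    dms = np.stack([distance_matrix(m) for m in models])
    mean_dm = dms.mean(axis=0)
    return np.array([dm_dissimilarity(dm, mean_dm) for dm in dms])
```

Because distance matrices are unchanged by rotation and translation, two copies of the same model in different orientations score a dissimilarity of zero without any alignment step.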
Affiliation(s)
- Jingfen Zhang
- Department of Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65201, USA

23
Maadooliat M, Gao X, Huang JZ. Assessing protein conformational sampling methods based on bivariate lag-distributions of backbone angles. Brief Bioinform 2012; 14:724-36. [PMID: 22926831 DOI: 10.1093/bib/bbs052] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 01/20/2023]
Abstract
Despite considerable progress in the past decades, protein structure prediction remains one of the major unsolved problems in computational biology. Angular-sampling-based methods have been extensively studied recently due to their ability to capture the continuous conformational space of protein structures. The literature has focused on using a variety of parametric models of the sequential dependencies between angle pairs along the protein chains. In this article, we present a thorough review of angular-sampling-based methods by assessing three main questions: What is the best distribution type to model the protein angles? What is a reasonable number of components in a mixture model that should be considered to accurately parameterize the joint distribution of the angles? And what is the order of the local sequence-structure dependency that should be considered by a prediction method? We assess the model fits for different methods using bivariate lag-distributions of the dihedral/planar angles. Moreover, the main information across the lags can be extracted using a technique called Lag singular value decomposition (LagSVD), which considers the joint distribution of the dihedral/planar angles over different lags using a nonparametric approach and monitors the behavior of the lag-distribution of the angles using singular value decomposition. As a result, we developed graphical tools and numerical measurements to compare and evaluate the performance of different model fits. Furthermore, we developed a web-tool (http://www.stat.tamu.edu/~madoliat/LagSVD) that can be used to produce informative animations.
Affiliation(s)
- Mehdi Maadooliat
- Mathematical and Computer Sciences and Engineering Division, 4700 King Abdullah University of Science and Technology, Thuwal 23955-6900, Kingdom of Saudi Arabia. Jianhua Z. Huang, Department of Statistics, 447 Blocker Building, Texas A&M University, 3143 TAMU, College Station, TX 77843-3143 (USA)

24
Simoncini D, Berenger F, Shrestha R, Zhang KYJ. A probabilistic fragment-based protein structure prediction algorithm. PLoS One 2012; 7:e38799. [PMID: 22829868 PMCID: PMC3400640 DOI: 10.1371/journal.pone.0038799] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Received: 02/09/2012] [Accepted: 05/10/2012] [Indexed: 11/23/2022]
Abstract
Conformational sampling is one of the bottlenecks in fragment-based protein structure prediction approaches. They generally start with a coarse-grained optimization where mainchain atoms and centroids of side chains are considered, followed by a fine-grained optimization with an all-atom representation of proteins. It is during this coarse-grained phase that fragment-based methods intensely sample the conformational space. If the native-like region is sampled more, the accuracy of the final all-atom predictions may be improved accordingly. In this work we present EdaFold, a new method for fragment-based protein structure prediction based on an Estimation of Distribution Algorithm. Fragment-based approaches build protein models by assembling short fragments from known protein structures. Whereas the probability mass functions over the fragment libraries are uniform in the usual case, we propose an algorithm that learns from previously generated decoys and steers the search toward native-like regions. A comparison with the Rosetta AbInitio protocol shows that EdaFold is able to generate models with lower energies and to enhance the percentage of near-native coarse-grained decoys on a benchmark of proteins. The best coarse-grained models produced by both methods were refined into all-atom models and used in molecular replacement. All-atom decoys produced from EdaFold’s decoy set reach high enough accuracy to solve the crystallographic phase problem by molecular replacement for some test proteins. EdaFold showed a higher success rate in molecular replacement when compared to Rosetta. Our study suggests that improving low-resolution coarse-grained decoys allows computational methods to avoid subsequent sampling issues during all-atom refinement and to produce better all-atom models. EdaFold can be downloaded from http://www.riken.jp/zhangiru/software/.
Affiliation(s)
- David Simoncini
- Zhang Initiative Research Unit, Advanced Science Institute, RIKEN, Wako, Saitama, Japan
- Francois Berenger
- Zhang Initiative Research Unit, Advanced Science Institute, RIKEN, Wako, Saitama, Japan
- Rojan Shrestha
- Zhang Initiative Research Unit, Advanced Science Institute, RIKEN, Wako, Saitama, Japan
- Department of Computational Biology, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba, Japan
- Kam Y. J. Zhang
- Zhang Initiative Research Unit, Advanced Science Institute, RIKEN, Wako, Saitama, Japan
- Department of Computational Biology, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba, Japan

25
Harder T, Borg M, Bottaro S, Boomsma W, Olsson S, Ferkinghoff-Borg J, Hamelryck T. An efficient null model for conformational fluctuations in proteins. Structure 2012; 20:1028-39. [DOI: 10.1016/j.str.2012.03.020] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 06/27/2011] [Revised: 03/08/2012] [Accepted: 03/12/2012] [Indexed: 10/28/2022]

26
Li SC, Bu D, Li M. Clustering 100,000 protein structure decoys in minutes. IEEE/ACM Trans Comput Biol Bioinform 2012; 9:765-773. [PMID: 22025764 DOI: 10.1109/tcbb.2011.142] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 05/31/2023]
Abstract
Ab initio protein structure prediction methods first generate large sets of structural conformations as candidates (called decoys), and then select the most representative decoys through clustering techniques. Classical clustering methods are inefficient due to the pairwise distance calculation, and thus become infeasible when the number of decoys is large. In addition, the existing clustering approaches suffer from the arbitrariness in determining a distance threshold for proteins within a cluster: a small distance threshold leads to many small clusters, while a large distance threshold results in the merging of several independent clusters into one cluster. In this paper, we propose an efficient clustering method that quickly estimates cluster centroids and efficiently prunes rotation spaces. The number of clusters is automatically detected by information distance criteria. A package named ONION, which can be downloaded freely, is implemented accordingly. Experimental results on benchmark data sets suggest that ONION is 14 times faster than existing tools, and that, compared to SPICKER’s selections, ONION obtains better selections for 31 targets and worse selections for 19 targets. On an average PC, ONION can cluster 100,000 decoys in around 12 minutes.
Affiliation(s)
- Shuai Cheng Li
- Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong.

27
Olsson S, Boomsma W, Frellsen J, Bottaro S, Harder T, Ferkinghoff-Borg J, Hamelryck T. Generative probabilistic models extend the scope of inferential structure determination. J Magn Reson 2011; 213:182-186. [PMID: 21993764 DOI: 10.1016/j.jmr.2011.08.039] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 06/20/2011] [Revised: 08/19/2011] [Accepted: 08/30/2011] [Indexed: 05/31/2023]
Abstract
Conventional methods for protein structure determination from NMR data rely on the ad hoc combination of physical forcefields and experimental data, along with heuristic determination of free parameters such as weight of experimental data relative to a physical forcefield. Recently, a theoretically rigorous approach was developed which treats structure determination as a problem of Bayesian inference. In this case, the forcefields are brought in as a prior distribution in the form of a Boltzmann factor. Due to high computational cost, the approach has been only sparsely applied in practice. Here, we demonstrate that the use of generative probabilistic models instead of physical forcefields in the Bayesian formalism is not only conceptually attractive, but also improves precision and efficiency. Our results open new vistas for the use of sophisticated probabilistic models of biomolecular structure in structure determination from experimental data.
Affiliation(s)
- Simon Olsson
- Bioinformatics Center, University of Copenhagen, Department of Biology, Ole Maaløes Vej 5, DK-2200 Copenhagen N, Denmark.

28
Penner RC, Knudsen M, Wiuf C, Andersen JE. An Algebro-topological description of protein domain structure. PLoS One 2011; 6:e19670. [PMID: 21629687 PMCID: PMC3101207 DOI: 10.1371/journal.pone.0019670] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 01/05/2011] [Accepted: 04/03/2011] [Indexed: 11/25/2022]
Abstract
The space of possible protein structures appears vast and continuous, and the relationship between primary, secondary and tertiary structure levels is complex. Protein structure comparison and classification is therefore a difficult but important task since structure is a determinant for molecular interaction and function. We introduce a novel mathematical abstraction based on geometric topology to describe protein domain structure. Using the locations of the backbone atoms and the hydrogen bonds, we build a combinatorial object – a so-called fatgraph. The description is discrete yet gives rise to a 2-dimensional mathematical surface. Thus, each protein domain corresponds to a particular mathematical surface with characteristic topological invariants, such as the genus (number of holes) and the number of boundary components. Both invariants are global fatgraph features reflecting the interconnectivity of the domain by hydrogen bonds. We introduce the notion of robust variables, that is, variables that are robust towards minor changes in the structure/fatgraph, and show that the genus and the number of boundary components are robust. Further, we investigate the distribution of different fatgraph variables and show that only four variables are capable of distinguishing different folds. We use local (secondary) and global (tertiary) fatgraph features to describe domain structures and illustrate that they are useful for classification of domains in CATH. In addition, we combine our method with two other methods, thereby using primary, secondary, and tertiary structure information, and show that we can identify a large percentage of new and unclassified structures in CATH.
Affiliation(s)
- Robert Clark Penner
- Center for the Topology and Quantization of Moduli Spaces, Department of Mathematical Sciences, Aarhus University, Aarhus, Denmark
- Departments of Mathematics and Physics/Astronomy, University of Southern California, Los Angeles, California, United States of America
- Michael Knudsen
- Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
- Carsten Wiuf
- Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
- Centre for Membrane Pumps in Cells and Disease, Aarhus University, Aarhus, Denmark
- Jørgen Ellegaard Andersen
- Center for the Topology and Quantization of Moduli Spaces, Department of Mathematical Sciences, Aarhus University, Aarhus, Denmark

29
Aydin Z, Singh A, Bilmes J, Noble WS. Learning sparse models for a dynamic Bayesian network classifier of protein secondary structure. BMC Bioinformatics 2011; 12:154. [PMID: 21569525 PMCID: PMC3118164 DOI: 10.1186/1471-2105-12-154] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Received: 08/02/2010] [Accepted: 05/13/2011] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Protein secondary structure prediction provides insight into protein function and is a valuable preliminary step for predicting the 3D structure of a protein. Dynamic Bayesian networks (DBNs) and support vector machines (SVMs) have been shown to provide state-of-the-art performance in secondary structure prediction. As the size of the protein database grows, it becomes feasible to use a richer model in an effort to capture subtle correlations among the amino acids and the predicted labels. In this context, it is beneficial to derive sparse models that discourage over-fitting and provide biological insight.
RESULTS: In this paper, we first show that we are able to obtain accurate secondary structure predictions. Our per-residue accuracy on a well established and difficult benchmark (CB513) is 80.3%, which is comparable to the state-of-the-art evaluated on this dataset. We then introduce an algorithm for sparsifying the parameters of a DBN. Using this algorithm, we can automatically remove up to 70-95% of the parameters of a DBN while maintaining the same level of predictive accuracy on the SD576 set. At 90% sparsity, we are able to compute predictions three times faster than a fully dense model evaluated on the SD576 set. We also demonstrate, using simulated data, that the algorithm is able to recover true sparse structures with high accuracy, and using real data, that the sparse model identifies known correlation structure (local and non-local) related to different classes of secondary structure elements.
CONCLUSIONS: We present a secondary structure prediction method that employs dynamic Bayesian networks and support vector machines. We also introduce an algorithm for sparsifying the parameters of the dynamic Bayesian network. The sparsification approach yields a significant speed-up in generating predictions, and we demonstrate that the amino acid correlations identified by the algorithm correspond to several known features of protein secondary structure. Datasets and source code used in this study are available at http://noble.gs.washington.edu/proj/pssp.
Affiliation(s)
- Zafer Aydin
- Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA

30
Zhao F, Peng J, Debartolo J, Freed KF, Sosnick TR, Xu J. A probabilistic and continuous model of protein conformational space for template-free modeling. J Comput Biol 2011; 17:783-98. [PMID: 20583926 DOI: 10.1089/cmb.2009.0235] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Indexed: 11/12/2022]
Abstract
One of the major challenges with protein template-free modeling is an efficient sampling algorithm that can explore a huge conformation space quickly. The popular fragment assembly method constructs a conformation by stringing together short fragments extracted from the Protein Data Bank (PDB). The discrete nature of this method may limit generated conformations to a subspace to which the native fold does not belong. Another worry is that a protein with a truly new fold may contain some fragments not in the PDB. This article presents a probabilistic model of protein conformational space to overcome the above two limitations. This probabilistic model employs directional statistics to model the distribution of backbone angles and second-order Conditional Random Fields (CRFs) to describe the sequence-angle relationship. Using this probabilistic model, we can sample protein conformations in a continuous space, as opposed to the widely used fragment assembly and lattice model methods that work in a discrete space. We show that when coupled with a simple energy function, this probabilistic method compares favorably with the fragment assembly method in the blind CASP8 evaluation, especially on alpha or small beta proteins. To our knowledge, this is the first probabilistic method that can search conformations in a continuous space and achieves favorable performance. Our method also generated three-dimensional (3D) models better than template-based methods for a couple of CASP8 hard targets. The method described in this article can also be applied to protein loop modeling, model refinement, and even RNA tertiary structure prediction.
Affiliation(s)
- Feng Zhao
- Toyota Technological Institute at Chicago, Chicago, Illinois 60637, USA

31
Zhou Y, Duan Y, Yang Y, Faraggi E, Lei H. Trends in template/fragment-free protein structure prediction. Theor Chem Acc 2011; 128:3-16. [PMID: 21423322 PMCID: PMC3030773 DOI: 10.1007/s00214-010-0799-2] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Received: 05/17/2010] [Accepted: 08/15/2010] [Indexed: 12/13/2022]
Abstract
Predicting the structure of a protein from its amino acid sequence is a long-standing unsolved problem in computational biology. Its solution would be of both fundamental and practical importance as the gap between the number of known sequences and the number of experimentally solved structures widens rapidly. Currently, the most successful approaches are based on fragment/template reassembly. The lack of progress in template-free structure prediction calls for novel ideas and approaches. This article reviews trends in the development of physical and specific knowledge-based energy functions as well as sampling techniques for fragment-free structure prediction. Recent physical- and knowledge-based studies demonstrated that it is possible to sample and predict highly accurate protein structures without borrowing native fragments from known protein structures. These emerging approaches with fully flexible sampling have the potential to move the field forward.
Affiliation(s)
- Yaoqi Zhou
- School of Informatics, Indiana Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indiana University Purdue University, 719 Indiana Ave #319, Walker Plaza Building, Indianapolis, IN 46202 USA
- Yong Duan
- UC Davis Genome Center and Department of Applied Science, University of California, One Shields Avenue, Davis, CA USA
- College of Physics, Huazhong University of Science and Technology, 1037 Luoyu Road, 430074 Wuhan, China
- Yuedong Yang
- School of Informatics, Indiana Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indiana University Purdue University, 719 Indiana Ave #319, Walker Plaza Building, Indianapolis, IN 46202 USA
- Eshel Faraggi
- School of Informatics, Indiana Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indiana University Purdue University, 719 Indiana Ave #319, Walker Plaza Building, Indianapolis, IN 46202 USA
- Hongxing Lei
- UC Davis Genome Center and Department of Applied Science, University of California, One Shields Avenue, Davis, CA USA
- Beijing Institute of Genomics, Chinese Academy of Sciences, 100029 Beijing, China

32
Accounting for conformational entropy in predicting binding free energies of protein-protein interactions. Proteins 2010; 79:444-62. [DOI: 10.1002/prot.22894] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.4] [Indexed: 11/07/2022]

33
Hamelryck T, Borg M, Paluszewski M, Paulsen J, Frellsen J, Andreetta C, Boomsma W, Bottaro S, Ferkinghoff-Borg J. Potentials of mean force for protein structure prediction vindicated, formalized and generalized. PLoS One 2010; 5:e13714. [PMID: 21103041 PMCID: PMC2978081 DOI: 10.1371/journal.pone.0013714] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.9] [Received: 07/07/2010] [Accepted: 10/04/2010] [Indexed: 11/26/2022]
Abstract
Understanding protein structure is of crucial importance in science, medicine and biotechnology. For about two decades, knowledge-based potentials based on pairwise distances – so-called “potentials of mean force” (PMFs) – have been center stage in the prediction and design of protein structure and the simulation of protein folding. However, the validity, scope and limitations of these potentials are still vigorously debated and disputed, and the optimal choice of the reference state – a necessary component of these potentials – is an unsolved problem. PMFs are loosely justified by analogy to the reversible work theorem in statistical physics, or by a statistical argument based on a likelihood function. Both justifications are insightful but leave many questions unanswered. Here, we show for the first time that PMFs can be seen as approximations to quantities that do have a rigorous probabilistic justification: they naturally arise when probability distributions over different features of proteins need to be combined. We call these quantities “reference ratio distributions” deriving from the application of the “reference ratio method.” This new view is not only of theoretical relevance but leads to many insights that are of direct practical use: the reference state is uniquely defined and does not require external physical insights; the approach can be generalized beyond pairwise distances to arbitrary features of protein structure; and it becomes clear for which purposes the use of these quantities is justified. We illustrate these insights with two applications, involving the radius of gyration and hydrogen bonding. In the latter case, we also show how the reference ratio method can be iteratively applied to sculpt an energy funnel. Our results considerably increase the understanding and scope of energy functions derived from known biomolecular structures.
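The central identity of the reference ratio method described in this abstract can be stated compactly. The notation below is assumed here for illustration and may differ from the paper's: $X$ is a detailed description of the structure, $Y = f(X)$ is a coarse-grained feature computed from it (for example, the radius of gyration), $Q(X)$ is an existing probabilistic model of the detailed structure, $Q(Y)$ is the distribution of the feature that $Q(X)$ implies, and $P(Y)$ is the desired distribution of the feature.

```latex
\begin{aligned}
P(X) &= \frac{P(Y)}{Q(Y)}\, Q(X), \qquad Y = f(X), \\
-\log P(X) &= -\log Q(X) \;-\; \log \frac{P(Y)}{Q(Y)}.
\end{aligned}
```

Read this way, the ratio term has the form of a potential of mean force in which $Q(Y)$ plays the role of the reference state, which is consistent with the abstract's statement that the reference state is uniquely defined rather than chosen from external physical insight.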
Affiliation(s)
- Thomas Hamelryck
- Bioinformatics Center, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Mikael Borg
- Bioinformatics Center, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Martin Paluszewski
- Bioinformatics Center, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Jonas Paulsen
- Bioinformatics Center, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Jes Frellsen
- Bioinformatics Center, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Christian Andreetta
- Bioinformatics Center, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Wouter Boomsma
- Biomedical Engineering, Technical University of Denmark (DTU) Elektro, Technical University of Denmark, Lyngby, Denmark
- Department of Chemistry, University of Cambridge, Cambridge, United Kingdom
- Sandro Bottaro
- Biomedical Engineering, Technical University of Denmark (DTU) Elektro, Technical University of Denmark, Lyngby, Denmark
- Jesper Ferkinghoff-Borg
- Biomedical Engineering, Technical University of Denmark (DTU) Elektro, Technical University of Denmark, Lyngby, Denmark

34
Abstract
Motivation: One of the major bottlenecks with ab initio protein folding is an effective conformation sampling algorithm that can generate native-like conformations quickly. The popular fragment assembly method generates conformations by restricting the local conformations of a protein to short structural fragments in the PDB. This method may limit conformations to a subspace to which the native fold does not belong because (i) a protein with a truly new fold may contain some structural fragments not in the PDB and (ii) the discrete nature of fragments may prevent them from building a native-like fold. Previously we have developed a conditional random fields (CRF) method for fragment-free protein folding that can sample conformations in a continuous space and demonstrated that this CRF method compares favorably to the popular fragment assembly method. However, the CRF method is still limited in its capability of generating conformations compatible with a sequence.
Results: We present a new fragment-free approach to protein folding using a recently invented probabilistic graphical model, conditional neural fields (CNF). This new CNF method is much more powerful than CRF in modeling the sophisticated protein sequence-structure relationship and thus enables us to generate native-like conformations more easily. We show that when coupled with a simple energy function and replica exchange Monte Carlo simulation, our CNF method can generate decoys much better than CRF on a variety of test proteins including the CASP8 free-modeling targets. In particular, our CNF method can predict a correct fold for T0496_D1, one of the two CASP8 targets with a truly new fold. Our predicted model for T0496 is significantly better than all the CASP8 models.
Contact: jinboxu@gmail.com
Affiliation(s)
- Feng Zhao
- Toyota Technological Institute, Chicago, IL 60637, USA

35
Stovgaard K, Andreetta C, Ferkinghoff-Borg J, Hamelryck T. Calculation of accurate small angle X-ray scattering curves from coarse-grained protein models. BMC Bioinformatics 2010; 11:429. [PMID: 20718956 PMCID: PMC2931518 DOI: 10.1186/1471-2105-11-429] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.0] [Received: 04/08/2010] [Accepted: 08/18/2010] [Indexed: 11/10/2022]
Abstract
BACKGROUND Genome sequencing projects have expanded the gap between the amount of known protein sequences and structures. The limitations of current high resolution structure determination methods make it unlikely that this gap will disappear in the near future. Small angle X-ray scattering (SAXS) is an established low resolution method for routinely determining the structure of proteins in solution. The purpose of this study is to develop a method for the efficient calculation of accurate SAXS curves from coarse-grained protein models. Such a method can for example be used to construct a likelihood function, which is paramount for structure determination based on statistical inference. RESULTS We present a method for the efficient calculation of accurate SAXS curves based on the Debye formula and a set of scattering form factors for dummy atom representations of amino acids. Such a method avoids the computationally costly iteration over all atoms. We estimated the form factors using generated data from a set of high quality protein structures. No ad hoc scaling or correction factors are applied in the calculation of the curves. Two coarse-grained representations of protein structure were investigated; two scattering bodies per amino acid led to significantly better results than a single scattering body. CONCLUSION We show that the obtained point estimates allow the calculation of accurate SAXS curves from coarse-grained protein models. The resulting curves are on par with the current state-of-the-art program CRYSOL, which requires full atomic detail. Our method was also comparable to CRYSOL in recognizing native structures among native-like decoys. As a proof-of-concept, we combined the coarse-grained Debye calculation with a previously described probabilistic model of protein structure, TorusDBN. This resulted in a significant improvement in the decoy recognition performance. 
In conclusion, the presented method shows great promise for use in statistical inference of protein structures from SAXS data.
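The Debye calculation over coarse-grained scattering bodies that this abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `debye_intensity`, `coords` and `form_factors` are hypothetical, and the form factors are assumed constant in q for simplicity, whereas the paper estimates form factors per amino acid dummy body.

```python
import math

def debye_intensity(coords, form_factors, q):
    """Scattering intensity I(q) from the Debye formula.

    coords: list of (x, y, z) positions of scattering bodies (e.g. dummy atoms).
    form_factors: one form factor per body; assumed constant in q here,
                  which is an illustrative simplification.
    q: magnitude of the scattering vector, in inverse units of the coordinates.
    """
    total = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(n):
            r = math.dist(coords[i], coords[j])
            x = q * r
            # sin(x)/x tends to 1 as x -> 0 (covers i == j and q == 0)
            sinc = 1.0 if x < 1e-12 else math.sin(x) / x
            total += form_factors[i] * form_factors[j] * sinc
    return total
```

The double loop runs over scattering bodies rather than atoms, so with one or two bodies per residue the quadratic cost applies to a much smaller set of points than an all-atom calculation.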
Affiliation(s)
- Kasper Stovgaard
- Department of Biology, University of Copenhagen, The Bioinformatics Centre, Denmark
|
36
|
Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. Beyond rotamers: a generative, probabilistic model of side chains in proteins. BMC Bioinformatics 2010; 11:306. [PMID: 20525384 PMCID: PMC2902450 DOI: 10.1186/1471-2105-11-306] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.1] [Received: 03/03/2010] [Accepted: 06/05/2010] [Indexed: 11/21/2022] Open
Abstract
Background Accurately covering the conformational space of amino acid side chains is essential for important applications such as protein design, docking and high resolution structure prediction. Today, the most common way to capture this conformational space is through rotamer libraries - discrete collections of side chain conformations derived from experimentally determined protein structures. The discretization can be exploited to efficiently search the conformational space. However, discretizing this naturally continuous space comes at the cost of losing detailed information that is crucial for certain applications. For example, rigorously combining rotamers with physical force fields is associated with numerous problems. Results In this work we present BASILISK: a generative, probabilistic model of the conformational space of side chains that makes it possible to sample in continuous space. In addition, sampling can be conditional upon the protein's detailed backbone conformation, again in continuous space - without involving discretization. Conclusions A careful analysis of the model and a comparison with various rotamer libraries indicates that the model forms an excellent, fully continuous model of side chain conformational space. We also illustrate how the model can be used for rigorous, unbiased sampling with a physical force field, and how it improves side chain prediction when used as a pseudo-energy term. In conclusion, BASILISK is an important step forward on the way to a rigorous probabilistic description of protein structure in continuous space and in atomic detail.
Affiliation(s)
- Tim Harder
- The Bioinformatics Section, Department of Biology, University of Copenhagen, Copenhagen, Denmark
|
37
|
Paluszewski M, Hamelryck T. Mocapy++--a toolkit for inference and learning in dynamic Bayesian networks. BMC Bioinformatics 2010; 11:126. [PMID: 20226024 PMCID: PMC2848649 DOI: 10.1186/1471-2105-11-126] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 10/02/2009] [Accepted: 03/12/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Mocapy++ is a toolkit for parameter learning and inference in dynamic Bayesian networks (DBNs). It supports a wide range of DBN architectures and probability distributions, including distributions from directional statistics (the statistics of angles, directions and orientations). RESULTS The program package is freely available under the GNU General Public Licence (GPL) from SourceForge http://sourceforge.net/projects/mocapy. The package contains the source for building the Mocapy++ library, several usage examples and the user manual. CONCLUSIONS Mocapy++ is especially suitable for constructing probabilistic models of biomolecular structure, due to its support for directional statistics. In particular, it supports the Kent distribution on the sphere and the bivariate von Mises distribution on the torus. These distributions have proven useful to formulate probabilistic models of protein and RNA structure in atomic detail.
|
38
|
Buck PM, Bystroff C. Simulating protein folding initiation sites using an alpha-carbon-only knowledge-based force field. Proteins 2010; 76:331-42. [PMID: 19137613 DOI: 10.1002/prot.22348] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/08/2022]
Abstract
Protein folding is a hierarchical process where structure forms locally first, then globally. Some short sequence segments initiate folding through strong structural preferences that are independent of their three-dimensional context in proteins. We have constructed a knowledge-based force field in which the energy functions are conditional on local sequence patterns, as expressed in the hidden Markov model for local structure (HMMSTR). Carbon-alpha force field (CALF) builds sequence specific statistical potentials based on database frequencies for alpha-carbon virtual bond opening and dihedral angles, pair-wise contacts and hydrogen bond donor-acceptor pairs, and simulates folding via Brownian dynamics. We introduce hydrogen bond donor and acceptor potentials as alpha-carbon probability fields that are conditional on the predicted local sequence. Constant temperature simulations were carried out using 27 peptides selected as putative folding initiation sites, each 12 residues in length, representing several different local structure motifs. Each 0.6 μs trajectory was clustered based on structure. Simulation convergence or representativeness was assessed by subdividing trajectories and comparing clusters. For 21 of the 27 sequences, the largest cluster made up more than half of the total trajectory. Of these 21 sequences, 14 had cluster centers that were at most 2.6 Å root mean square deviation (RMSD) from their native structure in the corresponding full-length protein. To assess the adequacy of the energy function on nonlocal interactions, 11 full length native structures were relaxed using Brownian dynamics simulations. Equilibrated structures deviated from their native states but retained their overall topology and compactness. A simple potential that folds proteins locally and stabilizes proteins globally may enable a more realistic understanding of hierarchical folding pathways.
Affiliation(s)
- Patrick M Buck
- Department of Biology, Center for Biotechnology and Interdisciplinary Studies, Rensselaer Polytechnic Institute, Troy, New York, USA
|
39
|
Fonseca R, Paluszewski M, Winter P. Protein Structure Prediction Using Bee Colony Optimization Metaheuristic. ACTA ACUST UNITED AC 2010. [DOI: 10.1007/s10852-010-9125-1] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 02/02/2023]
|
40
|
Li SC, Ng YK. Calibur: a tool for clustering large numbers of protein decoys. BMC Bioinformatics 2010; 11:25. [PMID: 20070892 PMCID: PMC2881085 DOI: 10.1186/1471-2105-11-25] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.6] [Received: 06/24/2009] [Accepted: 01/13/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Ab initio protein structure prediction methods generate numerous structural candidates, which are referred to as decoys. The decoy with the largest number of neighbors within a threshold distance is typically identified as the most representative decoy. However, the clustering of decoys needed for this criterion involves computations with runtimes that are at best quadratic in the number of decoys. As a result, there is currently no tool designed to exactly cluster very large numbers of decoys, which creates a bottleneck in the analysis. RESULTS Using three strategies aimed at enhancing performance (proximate decoys organization, preliminary screening via lower and upper bounds, outliers filtering) we designed and implemented a software tool for clustering decoys called Calibur. We show empirical results indicating the effectiveness of each of the strategies employed. The strategies are further fine-tuned according to their effectiveness. Calibur demonstrated the ability to scale well with respect to increases in the number of decoys. For a sample size of approximately 30 thousand decoys, Calibur completed the analysis in one third of the time required when the strategies are not used. For practical use, Calibur is able to automatically discover a suitable threshold distance for clustering from the input decoys. Several methods for this discovery are implemented in Calibur, and a very fast one is used by default. Using the default method, Calibur reported relatively good decoys in our tests. CONCLUSIONS Calibur's ability to handle very large protein decoy sets makes it a useful tool for clustering decoys in ab initio protein structure prediction. As the number of decoys generated by these methods increases, we believe Calibur will become important for progress in the field.
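The selection criterion stated in the background, picking the decoy with the most neighbors within a threshold distance, can be sketched as below. This is only the plain quadratic baseline that the abstract says becomes a bottleneck, not Calibur's accelerated strategies; the function name `most_representative` and the caller-supplied `distance` function (for example an RMSD routine) are hypothetical.

```python
def most_representative(decoys, distance, threshold):
    """Return (index, counts) for the decoy with the most neighbors.

    decoys: sequence of decoy structures in any representation.
    distance: symmetric function giving the distance between two decoys.
    threshold: two decoys are neighbors if their distance is <= threshold.

    This is the brute-force O(n^2) baseline, evaluating every pair once.
    """
    n = len(decoys)
    counts = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if distance(decoys[i], decoys[j]) <= threshold:
                counts[i] += 1
                counts[j] += 1
    # On ties, max() keeps the first index with the highest count
    best = max(range(n), key=counts.__getitem__)
    return best, counts
```

Every pairwise distance is computed here, which is why screening pairs with cheap lower and upper bounds, as the abstract mentions, can save substantial time on large decoy sets.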
Affiliation(s)
- Shuai Cheng Li
- David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada.
|
41
|
Abstract
Various topologies for representing 3D protein structures have been advanced for purposes ranging from prediction of folding rates to ab initio structure prediction. Examples include relative contact order, Delaunay tessellations, and backbone torsion angle distributions. Here, we introduce a new topology based on a novel means for operationalizing 3D proximities with respect to the underlying chain. The measure involves first interpreting a rank-based representation of the nearest neighbors of each residue as a permutation, then determining how perturbed this permutation is relative to an unfolded chain. We show that the resultant topology provides improved association with folding and unfolding rates determined for a set of two-state proteins under standardized conditions. Furthermore, unlike existing topologies, the proposed geometry exhibits fine scale structure with respect to sequence position along the chain, potentially providing insights into folding initiation and/or nucleation sites.
Affiliation(s)
- Mark R Segal
- Division of Biostatistics, University of California, San Francisco, California 94107, USA.
|
42
|
Frellsen J, Moltke I, Thiim M, Mardia KV, Ferkinghoff-Borg J, Hamelryck T. A probabilistic model of RNA conformational space. PLoS Comput Biol 2009; 5:e1000406. [PMID: 19543381 PMCID: PMC2691987 DOI: 10.1371/journal.pcbi.1000406] [Citation(s) in RCA: 62] [Impact Index Per Article: 4.1] [Received: 02/17/2009] [Accepted: 05/06/2009] [Indexed: 11/29/2022] Open
Abstract
The increasing importance of non-coding RNA in biology and medicine has led to a growing interest in the problem of RNA 3-D structure prediction. As is the case for proteins, RNA 3-D structure prediction methods require two key ingredients: an accurate energy function and a conformational sampling procedure. Both are only partly solved problems. Here, we focus on the problem of conformational sampling. The current state of the art solution is based on fragment assembly methods, which construct plausible conformations by stringing together short fragments obtained from experimental structures. However, the discrete nature of the fragments necessitates the use of carefully tuned, unphysical energy functions, and their non-probabilistic nature impairs unbiased sampling. We offer a solution to the sampling problem that removes these important limitations: a probabilistic model of RNA structure that allows efficient sampling of RNA conformations in continuous space, and with associated probabilities. We show that the model captures several key features of RNA structure, such as its rotameric nature and the distribution of the helix lengths. Furthermore, the model readily generates native-like 3-D conformations for 9 out of 10 test structures, solely using coarse-grained base-pairing information. In conclusion, the method provides a theoretical and practical solution for a major bottleneck on the way to routine prediction and simulation of RNA structure and dynamics in atomic detail.
Affiliation(s)
- Jes Frellsen
- The Bioinformatics Center, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Ida Moltke
- The Bioinformatics Center, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Martin Thiim
- The Bioinformatics Center, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- Kanti V. Mardia
- Department of Statistics, University of Leeds, Leeds, United Kingdom
- Thomas Hamelryck
- The Bioinformatics Center, Department of Biology, University of Copenhagen, Copenhagen, Denmark
|
43
|
Zhao F, Li S, Sterner BW, Xu J. Discriminative learning for protein conformation sampling. Proteins 2009; 73:228-40. [PMID: 18412258 DOI: 10.1002/prot.22057] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.1] [Indexed: 11/10/2022]
Abstract
Protein structure prediction without using templates (i.e., ab initio folding) is one of the most challenging problems in structural biology. In particular, conformation sampling is a major bottleneck of ab initio folding. This article presents CRFSampler, an extensible protein conformation sampler built on the probabilistic graphical model Conditional Random Fields (CRFs). Using a discriminative learning method, CRFSampler can automatically learn more than ten thousand parameters quantifying the relationship among primary sequence, secondary structure, and (pseudo) backbone angles. Using only compactness and self-avoiding constraints, CRFSampler can efficiently generate protein-like conformations from primary sequence and predicted secondary structure. CRFSampler is also very flexible in that a variety of model topologies and feature sets can be defined to model the sequence-structure relationship without worrying about parameter estimation. Our experimental results demonstrate that, using a simple set of features, CRFSampler can generate decoys of much higher quality than the most recent HMM model.
Affiliation(s)
- Feng Zhao
- Toyota Technological Institute at Chicago, Chicago, Illinois, USA
|
44
|
Hamelryck T. Probabilistic models and machine learning in structural bioinformatics. Stat Methods Med Res 2009; 18:505-26. [PMID: 19153168 DOI: 10.1177/0962280208099492] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 11/16/2022]
Abstract
Structural bioinformatics is concerned with the molecular structure of biomacromolecules on a genomic scale, using computational methods. Classic problems in structural bioinformatics include the prediction of protein and RNA structure from sequence, the design of artificial proteins or enzymes, and the automated analysis and comparison of biomacromolecules in atomic detail. The determination of macromolecular structure from experimental data (for example coming from nuclear magnetic resonance, X-ray crystallography or small angle X-ray scattering) has close ties with the field of structural bioinformatics. Recently, probabilistic models and machine learning methods based on Bayesian principles are providing efficient and rigorous solutions to challenging problems that were long regarded as intractable. In this review, I will highlight some important recent developments in the prediction, analysis and experimental determination of macromolecular structure that are based on such methods. These developments include generative models of protein structure, the estimation of the parameters of energy functions that are used in structure prediction, the superposition of macromolecules and structure determination methods that are based on inference. Although this review is not exhaustive, I believe the selected topics give a good impression of the exciting new, probabilistic road the field of structural bioinformatics is taking.
Affiliation(s)
- Thomas Hamelryck
- Bioinformatics Center, Department of Biology, University of Copenhagen, Copenhagen N, Denmark.
|
45
|
A Probabilistic Graphical Model for Ab Initio Folding. RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY : ... ANNUAL INTERNATIONAL CONFERENCE, RECOMB ... : PROCEEDINGS. RECOMB (CONFERENCE : 2005- ) 2009; 5541:59-73. [PMID: 23459639 PMCID: PMC3583211 DOI: 10.1007/978-3-642-02008-7_5] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 01/26/2023]
Abstract
Despite significant progress in recent years, ab initio folding is still one of the most challenging problems in structural biology. This paper presents a probabilistic graphical model for ab initio folding, which employs Conditional Random Fields (CRFs) and directional statistics to model the relationship between the primary sequence of a protein and its three-dimensional structure. Different from the widely-used fragment assembly method and the lattice model for protein folding, our graphical model can explore protein conformations in a continuous space according to their probability. The probability of a protein conformation reflects its stability and is estimated from PSI-BLAST sequence profile and predicted secondary structure. Experimental results indicate that this new method compares favorably with the fragment assembly method and the lattice model.
|
46
|
Paluszewski M, Winter P. Protein Decoy Generation Using Branch and Bound with Efficient Bounding. LECTURE NOTES IN COMPUTER SCIENCE 2008. [DOI: 10.1007/978-3-540-87361-7_32] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 12/11/2022]
|
47
|
Li SC, Bu D, Xu J, Li M. Fragment-HMM: a new approach to protein structure prediction. Protein Sci 2008; 17:1925-34. [PMID: 18723665 DOI: 10.1110/ps.036442.108] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.4] [Indexed: 10/21/2022]
Abstract
We designed a simple position-specific hidden Markov model to predict protein structure. Our new framework naturally repeats itself to converge to a final target, conglomerating fragment assembly, clustering, target selection, refinement, and consensus, all in one process. Our initial implementation of this theory converges to within 6 Å of the native structures for 100% of decoys on all six standard benchmark proteins used in ROSETTA (discussed by Simons and colleagues in a recent paper), which achieved only 14%-94% for the same data. The qualities of the best decoys and the final decoys our theory converges to are also notably better.
Affiliation(s)
- Shuai Cheng Li
- David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario N2L3G1, Canada
|
48
|
Abstract
MOTIVATION The 3D structure of a protein sequence can be assembled from the substructures corresponding to small segments of this sequence. For each small sequence segment, there are only a few more likely substructures. We call them the 'structural alphabet' for this segment. Classical approaches such as ROSETTA used sequence profile and secondary structure information to predict structural fragments. In contrast, we utilize more structural information, such as solvent accessibility and contact capacity, for finding structural fragments. RESULTS An integer linear programming technique is applied to derive the best combination of these sequence and structural information items. This approach generates significantly more accurate and succinct structural alphabets, with more than 50% improvement over the previous accuracies. With these novel structural alphabets, we are able to construct more accurate protein structures than state-of-the-art ab initio protein structure prediction programs such as ROSETTA. We are also able to reduce the size of Kolodny's library by a factor of 8 at the same accuracy. AVAILABILITY The online FRazor server is under construction.
Affiliation(s)
- Shuai Cheng Li
- David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ont. N2L 3G1, Canada.
|
49
|
Abstract
Despite significant progress in recent years, protein structure prediction maintains its status as one of the prime unsolved problems in computational biology. One of the key remaining challenges is an efficient probabilistic exploration of the structural space that correctly reflects the relative conformational stabilities. Here, we present a fully probabilistic, continuous model of local protein structure in atomic detail. The generative model makes efficient conformational sampling possible and provides a framework for the rigorous analysis of local sequence-structure correlations in the native state. Our method represents a significant theoretical and practical improvement over the widely used fragment assembly technique by avoiding the drawbacks associated with a discrete and nonprobabilistic approach.
|
50
|
Lin M, Chen R, Liang J. Statistical geometry of lattice chain polymers with voids of defined shapes: sampling with strong constraints. J Chem Phys 2008; 128:084903. [PMID: 18315083 DOI: 10.1063/1.2831905] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/14/2022] Open
Abstract
Proteins contain many voids, which are unfilled spaces enclosed in the interior. A few of them have shapes compatible with ligands and substrates and are important for protein functions. An important general question is how the need for maintaining functional voids is influenced by, and affects, other aspects of protein structures and properties (e.g., protein folding stability, kinetic accessibility, and evolutionary selection pressure). In this paper, we examine in detail the effects of maintaining voids of different shapes and sizes using two-dimensional lattice models. We study the propensity for conformations to form a void of specific shape, which is related to the entropic cost of void maintenance. We also study the location where voids of a specific shape and size tend to form, and the influence of compactness on the formation of such voids. As enumeration is infeasible for long chain polymers, a key development in this work is the design of a novel sequential Monte Carlo strategy for generating a large number of sample conformations under very constraining restrictions. Our method is validated by comparing results obtained from sampling and from enumeration for short polymer chains. We succeeded in accurately estimating the entropic cost of void maintenance, with and without an increasing number of restrictive conditions, such as loops of fixed length forming the wall of the void, with an additionally fixed starting position in the sequence. Additionally, we have identified the key structural properties of voids that are important in determining the entropic cost of void formation. We have further developed a parametric model to quantitatively predict void entropy. Our model is highly effective, and these results indicate that voids representing functional sites can be used as an improved model for studying the evolution of protein functions and how protein function relates to protein stability.
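To illustrate the general idea of sequential Monte Carlo sampling of lattice chains, the sketch below grows a two-dimensional self-avoiding chain one site at a time and tracks an importance weight (Rosenbluth-style growth). It is a textbook baseline chosen for illustration and does not implement the paper's strategy or its void constraints; the name `grow_self_avoiding_walk` is hypothetical.

```python
import random

def grow_self_avoiding_walk(n_steps, rng):
    """Grow one 2D square-lattice chain site by site.

    Returns (chain, weight); weight is 0.0 if the chain got trapped.
    Averaging the weights over many chains estimates the number of
    self-avoiding walks with n_steps steps.
    """
    chain = [(0, 0)]
    occupied = {(0, 0)}
    weight = 1.0
    for _ in range(n_steps):
        x, y = chain[-1]
        free = [p for p in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
                if p not in occupied]
        if not free:
            return chain, 0.0  # dead end: no unoccupied neighbor site
        weight *= len(free)  # importance weight corrects the growth bias
        step = rng.choice(free)
        chain.append(step)
        occupied.add(step)
    return chain, weight
```

For short chains the weighted average can be checked against exact enumeration, which mirrors the validation approach mentioned in the abstract.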
Affiliation(s)
- Ming Lin
- Department of Information & Decision Science, University of Illinois at Chicago, 845 S. Morgan St., Chicago, Illinois 60607, USA
|