1. Gupta A, Ma H, Ramanathan A, Zerze GH. A Deep Learning-Driven Sampling Technique to Explore the Phase Space of an RNA Stem-Loop. J Chem Theory Comput 2024. [PMID: 39374435 DOI: 10.1021/acs.jctc.4c00669] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/09/2024]
Abstract
The folding and unfolding of RNA stem-loops are critical biological processes; however, their computational studies are often hampered by the ruggedness of their folding landscape, necessitating long simulation times at the atomistic scale. Here, we adapted DeepDriveMD (DDMD), an advanced deep learning-driven sampling technique originally developed for protein folding, to address the challenges of RNA stem-loop folding. Although tempering- and order parameter-based techniques are commonly used for similar rare-event problems, the computational costs or the need for a priori knowledge about the system often present a challenge in their effective use. DDMD overcomes these challenges by adaptively learning from an ensemble of running MD simulations using generic contact maps as the raw input. DeepDriveMD enables on-the-fly learning of a low-dimensional latent representation and guides the simulation toward the undersampled regions while optimizing the resources to explore the relevant parts of the phase space. We showed that DDMD estimates the free energy landscape of the RNA stem-loop reasonably well at room temperature. Our simulation framework runs at a constant temperature without external biasing potential, hence preserving the information on transition rates, with a computational cost much lower than that of the simulations performed with external biasing potentials. We also introduced a reweighting strategy for obtaining unbiased free energy surfaces and presented a qualitative analysis of the latent space. This analysis showed that the latent space captures the relevant slow degrees of freedom for the RNA folding problem of interest. Finally, throughout the manuscript, we outlined how different parameters are selected and optimized to adapt DDMD for this system. We believe this compendium of decision-making processes will help new users adapt this technique for the rare-event sampling problems of their interest.
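The "generic contact maps" that the abstract names as raw input can be illustrated with a minimal sketch. It assumes one representative coordinate per residue or nucleotide and a distance cutoff of 8 Å; the function name `contact_map`, the cutoff, and the toy coordinates are illustrative assumptions and are not taken from the paper or from the DeepDriveMD code.

```python
import math

def contact_map(coords, cutoff=8.0):
    """Return a symmetric 0/1 matrix with 1 where two sites lie within `cutoff`.

    `coords` is a list of (x, y, z) tuples, one per site. The cutoff value is
    an assumption made for illustration only.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = 1
    return cmap

# Hypothetical toy coordinates (in angstroms) for four sites on a line.
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
cmap = contact_map(toy)
for row in cmap:
    print(row)
```

A matrix of this kind depends only on pairwise distances, so it needs no system-specific order parameter, which is the property the abstract relies on.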
Affiliation(s)
- Ayush Gupta
  - William A. Brookshire Department of Chemical and Biomolecular Engineering, University of Houston, Houston, Texas 77204, United States
- Heng Ma
  - Data Science and Learning Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Arvind Ramanathan
  - Data Science and Learning Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Gül H Zerze
  - William A. Brookshire Department of Chemical and Biomolecular Engineering, University of Houston, Houston, Texas 77204, United States

2. Fakhoury Z, Sosso GC, Habershon S. Contact-Map-Driven Exploration of Heterogeneous Protein-Folding Paths. J Chem Theory Comput 2024; 20. [PMID: 39228261 PMCID: PMC11428170 DOI: 10.1021/acs.jctc.4c00878] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2024] [Revised: 08/19/2024] [Accepted: 08/22/2024] [Indexed: 09/05/2024]
Abstract
We have recently shown how physically realizable protein-folding pathways can be generated using directed walks in the space of inter-residue contact-maps; combined with a back-transformation to move from protein contact-maps to Cartesian coordinates, we have demonstrated how this approach can generate protein-folding trajectory ensembles without recourse to molecular dynamics. In this article, we demonstrate that this framework can be used to study a challenging protein-folding problem that is known to exhibit two different folding paths which were previously identified through molecular dynamics simulation at several different temperatures. From the viewpoint of protein-folding mechanism prediction, this particular problem is extremely challenging to address, specifically involving folding to an identical nontrivial compact native structure along distinct pathways defined by heterogeneous secondary structural elements. Here, we show how our previously reported contact-map-based protein-folding strategy can be significantly enhanced to enable accurate and robust prediction of heterogeneous folding paths by (i) introducing a novel topologically informed metric for comparing two protein contact maps, (ii) reformulating our graph-represented folding path generation, and (iii) introducing a new and more reliable structural back-mapping algorithm. These changes improve the reliability of generating structurally sound folding intermediates and dramatically decrease the number of physically irrelevant folding intermediates generated by our previous simulation strategy. Most importantly, we demonstrate how our enhanced folding algorithm can successfully identify the alternative folding mechanisms of a multifolding-pathway protein, in line with direct molecular dynamics simulations.
Affiliation(s)
- Ziad Fakhoury
  - Department of Chemistry, University of Warwick, Coventry CV4 7AL, U.K.
- Gabriele C. Sosso
  - Department of Chemistry, University of Warwick, Coventry CV4 7AL, U.K.
- Scott Habershon
  - Department of Chemistry, University of Warwick, Coventry CV4 7AL, U.K.

3. Yu Z, Yu J, Wang H, Zhang S, Zhao L, Shi S. PhosAF: An integrated deep learning architecture for predicting protein phosphorylation sites with AlphaFold2 predicted structures. Anal Biochem 2024; 690:115510. [PMID: 38513769 DOI: 10.1016/j.ab.2024.115510] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/25/2023] [Revised: 03/14/2024] [Accepted: 03/18/2024] [Indexed: 03/23/2024]
Abstract
Phosphorylation is indispensable in comprehending biological processes, while biological experimental methods for identifying phosphorylation sites are tedious and arduous. With the rapid growth of biotechnology, deep learning methods have made significant progress in site prediction tasks. Nevertheless, most existing predictors consider only protein sequence information, which limits their capture of protein spatial information. Building upon the latest advancement in protein structure prediction by AlphaFold2, a novel integrated deep learning architecture, PhosAF, is developed to predict phosphorylation sites in human proteins by integrating CMA-Net and MFC-Net, which consider sequence and structure information predicted by AlphaFold2. Here, the CMA-Net module is composed of multiple convolutional neural network layers, with multi-head attention appended to capture the local and long-term dependencies of sequence features. Meanwhile, the MFC-Net module, composed of deep neural network layers, is used to capture the complex representations of evolutionary and structure features. Furthermore, different features are combined to predict the final phosphorylation sites. In addition, we put forward a new strategy to construct reliable negative samples via protein secondary structures. Experimental results on independent test data and a case study indicate that our model PhosAF surpasses the current most advanced methods in phosphorylation site prediction.
Affiliation(s)
- Ziyuan Yu
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, 330031, China.
- Jialin Yu
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, 330031, China.
- Hongmei Wang
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, 330031, China.
- Shuai Zhang
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, 330031, China.
- Long Zhao
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, 330031, China.
- Shaoping Shi
  - Department of Mathematics, School of Mathematics and Computer Sciences, Nanchang University, Nanchang, 330031, China; Institute of Mathematics and Interdisciplinary Sciences, Nanchang University, Nanchang, 330031, China.

4. Chu H, Liu T. Comprehensive Research on Druggable Proteins: From PSSM to Pre-Trained Language Models. Int J Mol Sci 2024; 25:4507. [PMID: 38674091 PMCID: PMC11049818 DOI: 10.3390/ijms25084507] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2024] [Revised: 04/15/2024] [Accepted: 04/17/2024] [Indexed: 04/28/2024] Open
Abstract
Identification of druggable proteins can greatly reduce the cost of discovering new potential drugs. Traditional experimental approaches to exploring these proteins are often costly, slow, and labor-intensive, making them impractical for large-scale research. In response, recent decades have seen a rise in computational methods. These alternatives support drug discovery by creating advanced predictive models. In this study, we proposed a fast and precise classifier for the identification of druggable proteins using a protein language model (PLM) with fine-tuned evolutionary scale modeling 2 (ESM-2) embeddings, achieving 95.11% accuracy on the benchmark dataset. Furthermore, we made a careful comparison to examine the predictive abilities of ESM-2 embeddings and position-specific scoring matrix (PSSM) features by using the same classifiers. The results suggest that ESM-2 embeddings outperformed PSSM features in terms of accuracy and efficiency. Recognizing the potential of language models, we also developed an end-to-end model based on the generative pre-trained transformers 2 (GPT-2) with modifications. To our knowledge, this is the first time a large language model (LLM) GPT-2 has been deployed for the recognition of druggable proteins. Additionally, a more up-to-date dataset, known as Pharos, was adopted to further validate the performance of the proposed model.
Affiliation(s)
- Taigang Liu
  - College of Information Technology, Shanghai Ocean University, Shanghai 201306, China

5. Busia A, Listgarten J. MBE: model-based enrichment estimation and prediction for differential sequencing data. Genome Biol 2023; 24:218. [PMID: 37784130 PMCID: PMC10544408 DOI: 10.1186/s13059-023-03058-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/27/2023] [Accepted: 09/14/2023] [Indexed: 10/04/2023] Open
Abstract
Characterizing differences in sequences between two conditions, such as with and without drug exposure, using high-throughput sequencing data is a prevalent problem involving quantifying changes in sequence abundances, and predicting such differences for unobserved sequences. A key shortcoming of current approaches is their extremely limited ability to share information across related but non-identical reads. Consequently, they cannot use sequencing data effectively, nor be directly applied in many settings of interest. We introduce model-based enrichment (MBE) to overcome this shortcoming. We evaluate MBE using both simulated and real data. Overall, MBE improves accuracy compared to current differential analysis methods.
Affiliation(s)
- Akosua Busia
  - Department of Electrical Engineering & Computer Science, University of California, Berkeley, Berkeley, 94720, CA, USA.
- Jennifer Listgarten
  - Department of Electrical Engineering & Computer Science, University of California, Berkeley, Berkeley, 94720, CA, USA.

6. Ponlachantra K, Suginta W, Robinson RC, Kitaoku Y. AlphaFold2: A versatile tool to predict the appearance of functional adaptations in evolution: Profilin interactions in uncultured Asgard archaea. Bioessays 2023; 45:e2200119. [PMID: 36461738 DOI: 10.1002/bies.202200119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2022] [Revised: 11/07/2022] [Accepted: 11/09/2022] [Indexed: 12/05/2022]
Abstract
The release of AlphaFold2 (AF2), a deep-learning-aided, open-source protein structure prediction program from DeepMind, opened a new era of molecular biology. The astonishing improvement in the accuracy of the structure predictions provides the opportunity to characterize protein systems from uncultured Asgard archaea, key organisms in evolutionary biology. Despite the accumulation of metagenomics-derived eukaryotic-like protein sequences from Asgard archaea, limited structural and biochemical information has restricted insight into their potential functions. In this review, we focus on profilin, an actin-dynamics regulating protein, which in eukaryotes modulates actin polymerization through (1) direct actin interaction, (2) polyproline binding, and (3) phospholipid binding. We assess AF2-predicted profilin structures for their potential ability to participate in these activities. We demonstrate that AF2 is a powerful new tool for understanding the emergence of biological functional traits in evolution.
Affiliation(s)
- Khongpon Ponlachantra
  - School of Biomolecular Science and Engineering (BSE), Vidyasirimedhi Institute of Science and Technology (VISTEC), Rayong, Thailand
- Wipa Suginta
  - School of Biomolecular Science and Engineering (BSE), Vidyasirimedhi Institute of Science and Technology (VISTEC), Rayong, Thailand
- Robert C Robinson
  - School of Biomolecular Science and Engineering (BSE), Vidyasirimedhi Institute of Science and Technology (VISTEC), Rayong, Thailand; Research Institute for Interdisciplinary Science (RIIS), Okayama University, Okayama, Japan
- Yoshihito Kitaoku
  - Research Institute for Interdisciplinary Science (RIIS), Okayama University, Okayama, Japan

7. Liu H, Chen Q. Computational protein design with data‐driven approaches: Recent developments and perspectives. WIRES COMPUTATIONAL MOLECULAR SCIENCE 2022. [DOI: 10.1002/wcms.1646] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Affiliation(s)
- Haiyan Liu
  - MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
  - Biomedical Sciences and Health Laboratory of Anhui Province, University of Science and Technology of China, Hefei, Anhui, China
  - School of Data Science, University of Science and Technology of China, Hefei, Anhui, China
- Quan Chen
  - MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
  - Biomedical Sciences and Health Laboratory of Anhui Province, University of Science and Technology of China, Hefei, Anhui, China

8. Sánchez IE, Galpern EA, Garibaldi MM, Ferreiro DU. Molecular Information Theory Meets Protein Folding. J Phys Chem B 2022; 126:8655-8668. [PMID: 36282961 DOI: 10.1021/acs.jpcb.2c04532] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 01/11/2023]
Abstract
We propose an application of molecular information theory to analyze the folding of single domain proteins. We analyze results from various areas of protein science, such as sequence-based potentials, reduced amino acid alphabets, backbone configurational entropy, secondary structure content, residue burial layers, and mutational studies of protein stability changes. We found that the average information contained in the sequences of evolved proteins is very close to the average information needed to specify a fold ∼2.2 ± 0.3 bits/(site·operation). The effective alphabet size in evolved proteins equals the effective number of conformations of a residue in the compact unfolded state at around 5. We calculated an energy-to-information conversion efficiency upon folding of around 50%, lower than the theoretical limit of 70%, but much higher than human-built macroscopic machines. We propose a simple mapping between molecular information theory and energy landscape theory and explore the connections between sequence evolution, configurational entropy, and the energetics of protein folding.
Affiliation(s)
- Ignacio E Sánchez
  - Facultad de Ciencias Exactas y Naturales, Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires CP1428, Argentina
- Ezequiel A Galpern
  - Facultad de Ciencias Exactas y Naturales, Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires CP1428, Argentina
- Martín M Garibaldi
  - Facultad de Ciencias Exactas y Naturales, Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires CP1428, Argentina
- Diego U Ferreiro
  - Facultad de Ciencias Exactas y Naturales, Laboratorio de Fisiología de Proteínas, Consejo Nacional de Investigaciones Científicas y Técnicas, Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN), Universidad de Buenos Aires, Buenos Aires CP1428, Argentina

9. Chou HH, Hsu CT, Hsu CW, Yao KH, Wang HC, Hsieh SY. Novel Algorithm for Improved Protein Classification Using Graph Similarity. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3135-3143. [PMID: 34748498 DOI: 10.1109/tcbb.2021.3125836] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/13/2023]
Abstract
Considerable sequence data are produced in genome annotation projects that relate to molecular levels, structural similarities, and molecular and biological functions. In structural genomics, the most essential task involves resolving protein structures efficiently with hardware or software, understanding these structures, and assigning their biological functions. Understanding the characteristics and functions of proteins enables the exploration of the molecular mechanisms of life. In this paper, we examine the problems of protein classification. Because they perform similar biological functions, proteins in the same family usually share similar structural characteristics. We employed this premise in designing a classification algorithm. In this algorithm, auxiliary graphs are used to represent proteins, with every amino acid in a protein mapped to a vertex in a graph and the links between amino acids corresponding to the edges between the vertices. The proposed algorithm classifies proteins according to the similarities in their graphical structures. It is efficient and accurate in distinguishing proteins from different families and outperformed related algorithms experimentally.

10. Tichkule S, Myung Y, Naung MT, Ansell BRE, Guy AJ, Srivastava N, Mehra S, Cacciò SM, Mueller I, Barry AE, van Oosterhout C, Pope B, Ascher DB, Jex AR. VIVID: a web application for variant interpretation and visualisation in multidimensional analyses. Mol Biol Evol 2022; 39:6697981. [PMID: 36103257 PMCID: PMC9514033 DOI: 10.1093/molbev/msac196] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 11/13/2022] Open
Abstract
Large-scale comparative genomics and population genetic studies generate enormous amounts of polymorphism data in the form of DNA variants. Ultimately, the goal of many of these studies is to associate genetic variants with phenotypes or fitness. We introduce VIVID, an interactive, user-friendly web application that integrates a wide range of approaches for encoding genotypic to phenotypic information in any organism or disease, from an individual or population, in three-dimensional (3D) space. It allows mutation mapping and annotation, calculation of interactions and conservation scores, prediction of harmful effects, analysis of diversity and selection, and 3D visualization of genotypic information encoded in Variant Call Format on AlphaFold2 protein models. VIVID enables the rapid assessment of genes of interest in the study of adaptive evolution and the genetic load, and it helps prioritize targets for experimental validation. We demonstrate the utility of VIVID by exploring the evolutionary genetics of the parasitic protist Plasmodium falciparum, revealing geographic variation in the signature of balancing selection in potential targets of functional antibodies.
Affiliation(s)
- Swapnil Tichkule
  - Population Health and Immunity, Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
  - Department of Medical Biology, University of Melbourne, Melbourne, Australia
- Yoochan Myung
  - Systems and Computational Biology, Bio21 Institute, University of Melbourne, Melbourne, Australia
  - Computational Biology and Clinical Informatics, Baker Heart and Diabetes, Melbourne, Australia
- Myo T Naung
  - Population Health and Immunity, Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
  - Department of Medical Biology, University of Melbourne, Melbourne, Australia
- Brendan R E Ansell
  - Population Health and Immunity, Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
- Andrew J Guy
  - School of Science, RMIT University, Melbourne, Australia
- Namrata Srivastava
  - Department of Data Science and AI, Monash University, Melbourne, Australia
- Somya Mehra
  - Life Sciences Discipline, Burnet Institute, Melbourne, Australia
- Simone M Cacciò
  - Department of Infectious Disease, Istituto Superiore di Sanità, Rome, Italy
- Ivo Mueller
  - Population Health and Immunity, Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
- Alyssa E Barry
  - Life Sciences Discipline, Burnet Institute, Melbourne, Australia
  - Institute of Mental and Physical Health and Clinical Translation (IMPACT) and School of Medicine, Deakin University, Geelong, Australia
- Cock van Oosterhout
  - School of Environmental Sciences, University of East Anglia, Norwich Research Park, Norwich, UK
- Bernard Pope
  - Melbourne Bioinformatics, University of Melbourne, Melbourne, Australia
  - Australian BioCommons, Sydney, Australia
  - Department of Clinical Pathology, University of Melbourne, Melbourne, Australia
  - Department of Surgery (Royal Melbourne Hospital), University of Melbourne, Melbourne, Australia
- David B Ascher
  - Systems and Computational Biology, Bio21 Institute, University of Melbourne, Melbourne, Australia
  - Computational Biology and Clinical Informatics, Baker Heart and Diabetes, Melbourne, Australia
- Aaron R Jex
  - Population Health and Immunity, Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
  - Faculty of Veterinary and Agricultural Sciences, University of Melbourne, Melbourne, Australia

11. Erman B. Gaussian network model revisited: Effects of mutation and ligand binding on protein behavior. Phys Biol 2022; 19. [PMID: 35105836 DOI: 10.1088/1478-3975/ac50ba] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/03/2021] [Accepted: 02/01/2022] [Indexed: 11/12/2022]
Abstract
The coarse-grained Gaussian Network Model (GNM) considers only the alpha carbons of the folded protein. Therefore, it is not directly applicable to the study of mutation or ligand binding problems where atomic detail is required. This shortcoming is addressed by including all atom pairs within the coordination shell of each other in the Kirchhoff Adjacency Matrix. Counting all contacts rather than only alpha carbon contacts diminishes the magnitude of fluctuations in the system. More importantly, it changes the graph-like connectivity structure, i.e., the Kirchhoff Adjacency Matrix of the protein. This change depends on amino acid type, which introduces amino acid-specific and position-specific information into the classical coarse-grained GNM, which was originally modelled in analogy with the phantom network model of rubber elasticity. With this modification, it is now possible to explain the consequences of mutation and ligand binding on residue fluctuations, their pair correlations, and the mutual information (MI) shared by each pair. We refer to the new model as 'all-atom GNM'. Using examples from published data, we show that the all-atom GNM gives B-factors that are in better agreement with experiment, can explain effects of mutation on long-range communication in PDZ domains, and can predict effects of GDP and GTP binding on the dimerization of KRAS.
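The idea of counting all atom pairs between residues can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's exact definition: each off-diagonal entry is taken as minus the number of atom pairs from two residues that lie within a cutoff, and each diagonal entry is the sum of those counts for that residue. The function name `all_atom_kirchhoff`, the 4.5 Å cutoff, and the toy coordinates are hypothetical.

```python
import math

def all_atom_kirchhoff(residues, cutoff=4.5):
    """Build a residue-level Kirchhoff-style matrix weighted by atom contacts.

    `residues` is a list of residues, each a list of (x, y, z) atom positions.
    Off-diagonal entry (i, j) is minus the number of atom pairs within `cutoff`;
    diagonal entry (i, i) is the total contact count of residue i.
    The cutoff is an illustrative assumption.
    """
    n = len(residues)
    gamma = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            count = sum(
                1
                for a in residues[i]
                for b in residues[j]
                if math.dist(a, b) <= cutoff
            )
            gamma[i][j] = gamma[j][i] = -count
            gamma[i][i] += count
            gamma[j][j] += count
    return gamma

# Hypothetical toy system: two nearby two-atom residues and one distant atom.
toy_residues = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(3.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
    [(10.0, 0.0, 0.0)],
]
gamma = all_atom_kirchhoff(toy_residues)
for row in gamma:
    print(row)
```

With a single atom per residue this reduces to the classical matrix with entries of -1 for contacting pairs, which is why the contact weights, and hence the connectivity, become dependent on residue size and type in the all-atom form.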
Affiliation(s)
- Burak Erman
  - Department of Chemical and Biological Engineering, Koc University, Rumeifeneri Yolu, Istanbul, Istanbul, 34450, TURKEY

12. Ovchinnikov S, Huang PS. Structure-based protein design with deep learning. Curr Opin Chem Biol 2021; 65:136-144. [PMID: 34547592 PMCID: PMC8671290 DOI: 10.1016/j.cbpa.2021.08.004] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Received: 06/07/2021] [Accepted: 08/13/2021] [Indexed: 12/11/2022]
Abstract
Since the first revelation of proteins functioning as macromolecular machines through their three dimensional structures, researchers have been intrigued by the marvelous ways the biochemical processes are carried out by proteins. The aspiration to understand protein structures has fueled extensive efforts across different scientific disciplines. In recent years, it has been demonstrated that proteins with new functionality or shapes can be designed via structure-based modeling methods, and the design strategies have combined all available information - but largely piece-by-piece - from sequence derived statistics to the detailed atomic-level modeling of chemical interactions. Despite the significant progress, incorporating data-derived approaches through the use of deep learning methods can be a game changer. In this review, we summarize current progress, compare the arc of developing the deep learning approaches with the conventional methods, and describe the motivation and concepts behind current strategies that may lead to potential future opportunities.
Affiliation(s)
- Sergey Ovchinnikov
  - John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA, 02138, USA.
- Po-Ssu Huang
  - Department of Bioengineering, Stanford University, Stanford, CA, 94305, USA.

13. Guo X, Du Y, Tadepalli S, Zhao L, Shehu A. Generating tertiary protein structures via interpretable graph variational autoencoders. BIOINFORMATICS ADVANCES 2021; 1:vbab036. [PMID: 36700110 PMCID: PMC9710582 DOI: 10.1093/bioadv/vbab036] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 08/26/2021] [Revised: 11/07/2021] [Accepted: 11/17/2021] [Indexed: 01/28/2023]
Abstract
Motivation: Modeling the structural plasticity of protein molecules remains challenging. Most research has focused on obtaining one biologically active structure. This includes the recent AlphaFold2, which has been hailed as a breakthrough for protein modeling. Computing one structure does not suffice to understand how proteins modulate their interactions and even evade our immune system. Revealing the structure space available to a protein remains challenging. Data-driven approaches that learn to generate tertiary structures are increasingly garnering attention. These approaches exploit the ability to represent tertiary structures as contact or distance maps and make direct analogies with images to harness convolution-based generative adversarial frameworks from computer vision. Since such opportunistic analogies do not allow capturing highly structured data, current deep models struggle to generate physically realistic tertiary structures.
Results: We present novel deep generative models that build upon the graph variational autoencoder framework. In contrast to existing literature, we represent tertiary structures as 'contact' graphs, which allow us to leverage graph-generative deep learning. Our models are able to capture rich, local and distal constraints and additionally compute disentangled latent representations that reveal the impact of individual latent factors. This elucidates what the factors control and makes our models more interpretable. Rigorous comparative evaluation along various metrics shows that the models we propose advance the state of the art. While there is still much ground to cover, the work presented here is an important first step, and graph-generative frameworks promise to get us to our goal of unraveling the exquisite structural complexity of protein molecules.
Availability and implementation: Code is available at https://github.com/anonymous1025/CO-VAE.
Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Xiaojie Guo
  - Department of Information Sciences and Technology, George Mason University, Fairfax, VA 22030, USA
- Yuanqi Du
  - Department of Computer Science, George Mason University, Fairfax, VA 22030, USA
- Sivani Tadepalli
  - Department of Computer Science, George Mason University, Fairfax, VA 22030, USA
- Liang Zhao
  - Department of Computer Science, Emory University, Atlanta, GA 30322, USA
- Amarda Shehu
  - Department of Bioengineering, George Mason University, Fairfax, VA 22030, USA. To whom correspondence should be addressed.

14. Zheng W, Zhang C, Li Y, Pearce R, Bell EW, Zhang Y. Folding non-homologous proteins by coupling deep-learning contact maps with I-TASSER assembly simulations. CELL REPORTS METHODS 2021; 1:100014. [PMID: 34355210 PMCID: PMC8336924 DOI: 10.1016/j.crmeth.2021.100014] [Citation(s) in RCA: 259] [Impact Index Per Article: 86.3] [Received: 01/25/2021] [Revised: 04/22/2021] [Accepted: 05/03/2021] [Indexed: 12/23/2022]
Abstract
Structure prediction for proteins lacking homologous templates in the Protein Data Bank (PDB) remains a significant unsolved problem. We developed a protocol, C-I-TASSER, to integrate interresidue contact maps from deep neural-network learning with the cutting-edge I-TASSER fragment assembly simulations. Large-scale benchmark tests showed that C-I-TASSER can fold more than twice as many non-homologous proteins as I-TASSER, which does not use contacts. When applied to a folding experiment on 8,266 unsolved Pfam families, C-I-TASSER successfully folded 4,162 domain families, including 504 folds that are not found in the PDB. Furthermore, it created correct folds for 85% of proteins in the SARS-CoV-2 genome, despite the quick mutation rate of the virus and sparse sequence profiles. The results demonstrated the critical importance of coupling whole-genome and metagenome-based evolutionary information with optimal structure assembly simulations for solving the problem of non-homologous protein structure prediction.
Affiliation(s)
- Wei Zheng
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Yang Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Robin Pearce
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Eric W. Bell
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
| |
|
15
|
Pakhrin SC, Shrestha B, Adhikari B, KC DB. Deep Learning-Based Advances in Protein Structure Prediction. Int J Mol Sci 2021; 22:5553. [PMID: 34074028 PMCID: PMC8197379 DOI: 10.3390/ijms22115553] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 04/02/2021] [Revised: 05/12/2021] [Accepted: 05/18/2021] [Indexed: 12/29/2022]
Abstract
Obtaining an accurate description of protein structure is a fundamental step toward understanding the underpinning of biology. Although recent advances in experimental approaches have greatly enhanced our capabilities to experimentally determine protein structures, the gap between the number of protein sequences and known protein structures is ever increasing. Computational protein structure prediction is one of the ways to fill this gap. Recently, the protein structure prediction field has witnessed a lot of advances due to Deep Learning (DL)-based approaches as evidenced by the success of AlphaFold2 in the most recent Critical Assessment of protein Structure Prediction (CASP14). In this article, we highlight important milestones and progress in the field of protein structure prediction due to DL-based methods as observed in CASP experiments. We describe advances in various steps of the protein structure prediction pipeline, viz. protein contact map prediction, protein distogram prediction, protein real-valued distance prediction, and quality assessment/refinement. We also highlight some end-to-end DL-based approaches for protein structure prediction. Additionally, as there have been some recent DL-based advances in protein structure determination using cryo-electron microscopy (cryo-EM), we also highlight some of the important progress in that field. Finally, we provide an outlook and possible future research directions for DL-based approaches in the protein structure prediction arena.
Affiliation(s)
- Subash C. Pakhrin
- Department of Electrical Engineering and Computer Science, Wichita State University, Wichita, KS 67260, USA
| | - Bikash Shrestha
- Department of Computer Science, University of Missouri-St. Louis, St. Louis, MO 63121, USA
| | - Badri Adhikari
- Department of Computer Science, University of Missouri-St. Louis, St. Louis, MO 63121, USA
| | - Dukka B. KC
- Department of Electrical Engineering and Computer Science, Wichita State University, Wichita, KS 67260, USA
| |
|
16
|
Deducing high-accuracy protein contact-maps from a triplet of coevolutionary matrices through deep residual convolutional networks. PLoS Comput Biol 2021; 17:e1008865. [PMID: 33770072 PMCID: PMC8026059 DOI: 10.1371/journal.pcbi.1008865] [Citation(s) in RCA: 52] [Impact Index Per Article: 17.3] [Received: 09/21/2020] [Revised: 04/07/2021] [Accepted: 03/10/2021] [Indexed: 12/24/2022]
Abstract
The topology of protein folds can be specified by the inter-residue contact-maps and accurate contact-map prediction can help ab initio structure folding. We developed TripletRes to deduce protein contact-maps from discretized distance profiles by end-to-end training of deep residual neural-networks. Compared to previous approaches, the major advantage of TripletRes is in its ability to learn and directly fuse a triplet of coevolutionary matrices extracted from the whole-genome and metagenome databases and therefore minimize the information loss during the course of contact model training. TripletRes was tested on a large set of 245 non-homologous proteins from CASP 11&12 and CAMEO experiments and outperformed other top methods from CASP12 by at least 58.4% for the CASP 11&12 targets and 44.4% for the CAMEO targets in the top-L long-range contact precision. On the 31 FM targets from the latest CASP13 challenge, TripletRes achieved the highest precision (71.6%) for the top-L/5 long-range contact predictions. It was also shown that a simple re-training of the TripletRes model with more proteins can lead to further improvement with precisions comparable to state-of-the-art methods developed after CASP13. These results demonstrate a novel efficient approach to extend the power of deep convolutional networks for high-accuracy medium- and long-range protein contact-map predictions starting from primary sequences, which are critical for constructing 3D structure of proteins that lack homologous templates in the PDB library.
Ab initio protein folding has been a major unsolved problem in computational biology for more than half a century. Recent community-wide Critical Assessment of Structure Prediction (CASP) experiments have witnessed exciting progress on ab initio structure prediction, which was mainly powered by the boosting of contact-map prediction as the latter can be used as constraints to guide ab initio folding simulations. In this work, we proposed a new open-source deep-learning architecture, TripletRes, built on the residual convolutional neural networks for high-accuracy contact prediction. The large-scale benchmark and blind test results demonstrate competitive performance of the proposed methods to other top approaches in predicting medium- and long-range contact-maps that are critical for guiding protein folding simulations. Detailed data analyses showed that the major advantage of TripletRes lies in the unique protocol to fuse multiple evolutionary feature matrices which are directly extracted from whole-genome and metagenome databases and therefore minimize the information loss during the contact model training.
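The abstract above reports results as "top-L" and "top-L/5" long-range contact precision. As a rough illustration of how such a metric is commonly computed, the following Python sketch ranks predicted residue pairs by score, keeps the L/k highest-scoring long-range pairs, and reports the fraction that are true contacts. The function name, the input layout, and the sequence-separation cutoff of 24 residues for "long-range" are assumptions made for this sketch; the abstract does not state them, and this is not the TripletRes evaluation code.

```python
def top_l_long_range_precision(predictions, native_contacts, seq_len, k=5, min_sep=24):
    """Precision of the top L/k long-range predictions (illustrative sketch).

    predictions     : iterable of (i, j, score) residue-pair predictions
    native_contacts : iterable of (i, j) pairs that are true contacts
    seq_len         : sequence length L
    k               : divisor, so k=5 gives "top-L/5" and k=1 gives "top-L"
    min_sep         : minimum |i - j| for a pair to count as long-range
                      (24 is an assumed, commonly used cutoff)
    """
    # Store native pairs in a canonical (low, high) order for lookup.
    native = {(min(i, j), max(i, j)) for i, j in native_contacts}

    # Keep only long-range predictions, in the same canonical order.
    long_range = [
        (score, min(i, j), max(i, j))
        for i, j, score in predictions
        if abs(i - j) >= min_sep
    ]

    # Highest-scoring pairs first, then keep the top L/k of them.
    long_range.sort(key=lambda item: item[0], reverse=True)
    top = long_range[: max(1, seq_len // k)]
    if not top:
        return 0.0

    hits = sum(1 for _, i, j in top if (i, j) in native)
    return hits / len(top)


# Toy example with a short sequence and a reduced separation cutoff.
toy_predictions = [(0, 5, 0.9), (1, 8, 0.8), (2, 3, 0.99), (0, 9, 0.1)]
toy_native = [(0, 5), (0, 9)]
toy_precision = top_l_long_range_precision(
    toy_predictions, toy_native, seq_len=10, k=5, min_sep=3
)
```

In this sketch the denominator is the number of pairs actually selected, which can be smaller than L/k when few long-range predictions are available; some evaluations divide by L/k regardless.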
|
17
|
Hu J, Yang W, Dong R, Li Y, Li X, Li S, Siriwardane EMD. Contact map based crystal structure prediction using global optimization. CrystEngComm 2021. [DOI: 10.1039/d0ce01714k] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 11/21/2022]
Abstract
Crystal structure prediction is now playing an increasingly important role in the discovery of new materials or crystal engineering.
Affiliation(s)
- Jianjun Hu
- Department of Computer Science and Engineering, University of South Carolina, Columbia, USA
| | - Wenhui Yang
- School of Mechanical Engineering, Guizhou University, Guiyang 550050, China
| | - Rongzhi Dong
- School of Mechanical Engineering, Guizhou University, Guiyang 550050, China
| | - Yuxin Li
- School of Mechanical Engineering, Guizhou University, Guiyang 550050, China
| | - Xiang Li
- School of Mechanical Engineering, Guizhou University, Guiyang 550050, China
| | - Shaobo Li
- School of Mechanical Engineering, Guizhou University, Guiyang 550050, China
| | | |
|
18
|
Hu J, Yang W, Dilanga Siriwardane EM. Distance Matrix-Based Crystal Structure Prediction Using Evolutionary Algorithms. J Phys Chem A 2020; 124:10909-10919. [DOI: 10.1021/acs.jpca.0c08775] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/28/2022]
Affiliation(s)
- Jianjun Hu
- Department of Computer Science and Engineering, University of South Carolina, Columbia, South Carolina 29201, United States
| | - Wenhui Yang
- School of Mechanical Engineering, Guizhou University, Guiyang 550050, China
| | | |
|
19
|
Xia CQ, Pan X, Shen HB. Protein-ligand binding residue prediction enhancement through hybrid deep heterogeneous learning of sequence and structure data. Bioinformatics 2020; 36:3018-3027. [PMID: 32091580 DOI: 10.1093/bioinformatics/btaa110] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 10/17/2019] [Revised: 01/19/2020] [Accepted: 02/18/2020] [Indexed: 01/02/2023]
Abstract
Motivation: Knowledge of protein-ligand binding residues is important for understanding the functions of proteins and their interaction mechanisms. Accurately identifying the potential binding sites of a specific ligand on an experimentally solved protein structure is still a challenging problem. Compared with structure-alignment-based methods, machine learning algorithms provide an alternative flexible solution which is less dependent on annotated homogeneous protein structures. Several factors are important for an efficient protein-ligand prediction model, e.g. discriminative feature representation and effective learning architecture to deal with both the large-scale and severely imbalanced data. Results: In this study, we propose a novel deep-learning-based method called DELIA for protein-ligand binding residue prediction. In DELIA, a hybrid deep neural network is designed to integrate 1D sequence-based features with 2D structure-based amino acid distance matrices. To overcome the problem of severe data imbalance between the binding and nonbinding residues, strategies of oversampling in mini-batch, random undersampling and stacking ensemble are designed to enhance the model. Experimental results on five benchmark datasets demonstrate the effectiveness of the proposed DELIA pipeline. Availability and implementation: The web server of DELIA is available at www.csbio.sjtu.edu.cn/bioinf/delia/. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chun-Qiu Xia
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University
- Key Laboratory of System Control and Information Processing, Ministry of Education of China, 200240 Shanghai, China
| | - Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University
- Key Laboratory of System Control and Information Processing, Ministry of Education of China, 200240 Shanghai, China
| | - Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University
- Key Laboratory of System Control and Information Processing, Ministry of Education of China, 200240 Shanghai, China
| |
|
20
|
Acharya V, Chakraborty HJ, Rout AK, Balabantaray S, Behera BK, Das BK. Structural Characterization of Open Reading Frame-Encoded Functional Genes from Tilapia Lake Virus (TiLV). Mol Biotechnol 2020; 61:945-957. [PMID: 31664705 DOI: 10.1007/s12033-019-00217-y] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Indexed: 12/25/2022]
Abstract
In recent years, large-scale mortalities have been observed in tilapia due to infection with a novel orthomyxo-like virus named tilapia lake virus (TiLV), which is considered a severe threat to the global tilapia industry. Currently, there are knowledge gaps relating to antiviral peptides, and no affordable vaccines or drugs are yet available against TiLV. To understand the spread of TiLV infection in different organs of Oreochromis niloticus, RT-PCR analysis was carried out. The gene segments of TiLV were retrieved from the NCBI database for computational biology analysis. Fourteen functional genes were predicted from the 10 gene segments of TiLV. Phylogenetic analysis was employed to better understand the evolution of tilapia lake virus genes. Out of 14 proteins, only six show a transmembrane helix region. Moreover, molecular modeling and molecular dynamics simulations of the predicted proteins revealed that the protein structures stabilized after 10 ns of simulation. Overall, our study provides basic bioinformatics on the functional proteome of TiLV. Further, this study could be useful for the development of novel peptide-based therapeutics to control TiLV infection.
Affiliation(s)
- Varsha Acharya
- Biotechnology Laboratory, ICAR-Central Inland Fisheries Research Institute, Barrackpore, Kolkata, West Bengal, 700120, India
| | - Hirak Jyoti Chakraborty
- Biotechnology Laboratory, ICAR-Central Inland Fisheries Research Institute, Barrackpore, Kolkata, West Bengal, 700120, India
| | - Ajaya Kumar Rout
- Biotechnology Laboratory, ICAR-Central Inland Fisheries Research Institute, Barrackpore, Kolkata, West Bengal, 700120, India
| | - Sucharita Balabantaray
- Department of Bioinformatics, Odisha University of Agriculture and Technology, Bhubaneswar, Odisha, 751003, India
| | - Bijay Kumar Behera
- Biotechnology Laboratory, ICAR-Central Inland Fisheries Research Institute, Barrackpore, Kolkata, West Bengal, 700120, India
| | - Basanta Kumar Das
- Biotechnology Laboratory, ICAR-Central Inland Fisheries Research Institute, Barrackpore, Kolkata, West Bengal, 700120, India.
| |
|
21
|
Torrisi M, Pollastri G, Le Q. Deep learning methods in protein structure prediction. Comput Struct Biotechnol J 2020; 18:1301-1310. [PMID: 32612753 PMCID: PMC7305407 DOI: 10.1016/j.csbj.2019.12.011] [Citation(s) in RCA: 116] [Impact Index Per Article: 29.0] [Received: 10/15/2019] [Revised: 12/19/2019] [Accepted: 12/20/2019] [Indexed: 01/01/2023]
Abstract
Protein Structure Prediction is a central topic in Structural Bioinformatics. Since the '60s statistical methods, followed by increasingly complex Machine Learning and recently Deep Learning methods, have been employed to predict protein structural information at various levels of detail. In this review, we briefly introduce the problem of protein structure prediction and essential elements of Deep Learning (such as Convolutional Neural Networks, Recurrent Neural Networks and basic feed-forward Neural Networks they are founded on), after which we discuss the evolution of predictive methods for one-dimensional and two-dimensional Protein Structure Annotations, from the simple statistical methods of the early days, to the computationally intensive highly-sophisticated Deep Learning algorithms of the last decade. In the process, we review the growth of the databases these algorithms are based on, and how this has impacted our ability to leverage knowledge about evolution and co-evolution to achieve improved predictions. We conclude this review outlining the current role of Deep Learning techniques within the wider pipelines to predict protein structures and trying to anticipate what challenges and opportunities may arise next.
Affiliation(s)
- Mirko Torrisi
- School of Computer Science, University College Dublin, Ireland
| | | | - Quan Le
- Centre for Applied Data Analytics Research, University College Dublin, Ireland
| |
|
22
|
Wozniak PP, Pelc J, Skrzypecki M, Vriend G, Kotulska M. Bio-knowledge-based filters improve residue-residue contact prediction accuracy. Bioinformatics 2019; 34:3675-3683. [PMID: 29850768 DOI: 10.1093/bioinformatics/bty416] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 11/17/2017] [Accepted: 05/19/2018] [Indexed: 11/13/2022]
Abstract
Motivation: Residue-residue contact prediction through direct coupling analysis has reached impressive accuracy, but yet higher accuracy will be needed to allow for routine modelling of protein structures. One way to improve the prediction accuracy is to filter predicted contacts using knowledge about the particular protein of interest or knowledge about protein structures in general. Results: We focus on the latter and discuss a set of filters that can be used to remove false positive contact predictions. Each filter depends on one or a few cut-off parameters for which the filter performance was investigated. Combining all filters while using default parameters resulted, for a test set of 851 protein domains, in the removal of 29% of the predictions, of which 92% were indeed false positives. Availability and implementation: All data and scripts are available at http://comprec-lin.iiar.pwr.edu.pl/FPfilter/. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- P P Wozniak
- Department of Biomedical Engineering, Faculty of Fundamental Problems of Technology, Wroclaw University of Science and Technology, Wroclaw, Poland
| | - J Pelc
- Department of Biomedical Engineering, Faculty of Fundamental Problems of Technology, Wroclaw University of Science and Technology, Wroclaw, Poland
| | - M Skrzypecki
- Department of Biomedical Engineering, Faculty of Fundamental Problems of Technology, Wroclaw University of Science and Technology, Wroclaw, Poland
| | - G Vriend
- Centre for Molecular and Biomolecular Informatics, Radboud University Medical Centre, Nijmegen, The Netherlands
| | - M Kotulska
- Department of Biomedical Engineering, Faculty of Fundamental Problems of Technology, Wroclaw University of Science and Technology, Wroclaw, Poland
| |
|
23
|
Wu Q, Peng Z, Anishchenko I, Cong Q, Baker D, Yang J. Protein contact prediction using metagenome sequence data and residual neural networks. Bioinformatics 2019; 36:41-48. [PMID: 31173061 PMCID: PMC8792440 DOI: 10.1093/bioinformatics/btz477] [Citation(s) in RCA: 51] [Impact Index Per Article: 10.2] [Received: 11/20/2018] [Revised: 05/30/2019] [Accepted: 06/04/2019] [Indexed: 01/31/2023]
Abstract
Motivation: Almost all protein residue contact prediction methods rely on the availability of deep multiple sequence alignments (MSAs). However, many proteins from poorly populated families do not have a sufficient number of homologs in the conventional UniProt database. Here we aim to solve this issue by exploring the rich sequence data from the metagenome sequencing projects. Results: Based on the improved MSA constructed from the metagenome sequence data, we developed MapPred, a new deep learning-based contact prediction method. MapPred consists of two component methods, DeepMSA and DeepMeta, both trained with the residual neural networks. DeepMSA was inspired by the recent method DeepCov, which was trained on 441 matrices of covariance features. By considering the symmetry of the contact map, we reduced the number of matrices to 231, which makes the training more efficient in DeepMSA. Experiments show that DeepMSA outperforms DeepCov by 10-13% in precision. DeepMeta works by combining predicted contacts and other sequence profile features. Experiments on three benchmark datasets suggest that the contribution from the metagenome sequence data is significant, with P-values less than 4.04E-17. MapPred is shown to be complementary and comparable to the state-of-the-art methods. The success of MapPred is attributed to three factors: the deeper MSA from the metagenome sequence data, improved feature design in DeepMSA and optimized training by the residual neural networks. Availability and implementation: http://yanglab.nankai.edu.cn/mappred/. Supplementary information: Supplementary data are available at Bioinformatics online.
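The reduction from 441 to 231 feature matrices quoted above matches a simple counting argument, sketched here in Python. It assumes a 21-letter alphabet (20 amino acids plus a gap symbol), which is not stated in the abstract: 21 × 21 ordered residue-type pairs give 441 matrices, and treating the pairs (a, b) and (b, a) as equivalent leaves 21 × 22 / 2 = 231. The function name is hypothetical, and this is only one reading consistent with the numbers, not the authors' feature code.

```python
def covariance_channel_counts(alphabet_size=21):
    """Count ordered and unordered residue-type pairs.

    alphabet_size=21 is an assumption (20 amino acids + gap).
    Returns (ordered_pairs, unordered_pairs_including_self_pairs).
    """
    ordered = alphabet_size * alphabet_size            # all (a, b) pairs
    unordered = alphabet_size * (alphabet_size + 1) // 2  # (a, b) == (b, a)
    return ordered, unordered


full_channels, symmetric_channels = covariance_channel_counts()
```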
Affiliation(s)
- Qi Wu
- School of Mathematical Sciences, Nankai University, Tianjin 300071, China
| | - Zhenling Peng
- To whom correspondence should be addressed.
| | - Ivan Anishchenko
- Department of Biochemistry, Seattle, WA 98105, USA; Institute for Protein Design, University of Washington, Seattle, WA 98105, USA
| | - Qian Cong
- Department of Biochemistry, Seattle, WA 98105, USA; Institute for Protein Design, University of Washington, Seattle, WA 98105, USA
| | - David Baker
- Department of Biochemistry, Seattle, WA 98105, USA; Institute for Protein Design, University of Washington, Seattle, WA 98105, USA
| | - Jianyi Yang
- To whom correspondence should be addressed.
| |
|
24
|
Jing X, Dong Q, Lu R, Dong Q. Protein Inter-Residue Contacts Prediction: Methods, Performances and Applications. Curr Bioinform 2019. [DOI: 10.2174/1574893613666181109130430] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/22/2022]
Abstract
Background: Protein inter-residue contacts prediction plays an important role in the field of protein structure and function research. As a low-dimensional representation of protein tertiary structure, protein inter-residue contacts could greatly help de novo protein structure prediction methods to reduce the conformational search space. Over the past two decades, various methods have been developed for protein inter-residue contacts prediction. Objective: We provide a comprehensive and systematic review of protein inter-residue contacts prediction methods. Results: Protein inter-residue contacts prediction methods are roughly classified into five categories: correlated mutations methods, machine-learning methods, fusion methods, template-based methods and 3D model-based methods. In this paper, we first describe the common definition of protein inter-residue contacts and show the typical application of protein inter-residue contacts. Then, we present a comprehensive review of the three main categories for protein inter-residue contacts prediction: correlated mutations methods, machine-learning methods and fusion methods. In addition, we analyze the constraints for each category. Furthermore, we compare several representative methods on the CASP11 dataset and discuss performances of these methods in detail. Conclusion: Correlated mutations methods achieve better performances for long-range contacts, while the machine-learning method performs well for short-range contacts. Fusion methods could take advantage of the machine-learning and correlated mutations methods. Employing a more effective fusion strategy could help to further improve the performances of fusion methods.
Affiliation(s)
- Xiaoyang Jing
- School of Computer Science, Fudan University, Shanghai, China
| | - Qimin Dong
- Vocational and Technical Education Center of Linxi County, Chifeng, Inner Mongolia, China
| | - Ruqian Lu
- School of Computer Science, Fudan University, Shanghai, China
| | - Qiwen Dong
- Faculty of Education, East China Normal University, Shanghai, China
| |
|
25
|
Wuyun Q, Zheng W, Peng Z, Yang J. A large-scale comparative assessment of methods for residue-residue contact prediction. Brief Bioinform 2019; 19:219-230. [PMID: 27802931 DOI: 10.1093/bib/bbw106] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 07/19/2016] [Indexed: 11/14/2022]
Abstract
Sequence-based prediction of residue-residue contact in proteins is becoming increasingly important for improving protein structure prediction in the big data era. In this study, we performed a large-scale comparative assessment of 15 locally installed contact predictors. To assess these methods, we collected a big data set consisting of 680 nonredundant proteins covering different structural classes and target difficulties. We investigated a wide range of factors that may influence the precision of contact prediction, including target difficulty, structural class, the alignment depth and distribution of contact pairs in a protein structure. We found that: (1) the machine learning-based methods outperform the direct-coupling-based methods for short-range contact prediction, while the latter are significantly better for long-range contact prediction. The consensus-based methods, which combine machine learning and direct-coupling methods, perform the best. (2) The target difficulty does not have a clear influence on the machine learning-based methods, while it does affect the direct-coupling and consensus-based methods significantly. (3) The alignment depth has a relatively weak effect on the machine learning-based methods. However, for the direct-coupling-based methods and consensus-based methods, the predicted contacts for targets with deeper alignment tend to be more accurate. (4) All methods perform relatively better on β and α + β proteins than on α proteins. (5) Residues buried in the core of protein structure are more prone to be in contact than residues on the surface (22 versus 6%). We believe these are useful results for guiding future development of new approaches to contact prediction.
Affiliation(s)
- Qiqige Wuyun
- School of Mathematical Sciences, Nankai University, Tianjin, China
| | - Wei Zheng
- School of Mathematical Sciences, Nankai University, Tianjin, China
| | - Zhenling Peng
- Center for Applied Mathematics, Tianjin University, Tianjin, China
| | - Jianyi Yang
- School of Mathematical Sciences, Nankai University, Tianjin, China
| |
|
26
|
Abstract
Background: We examine the problem of clustering biomolecular simulations using deep learning techniques. Since biomolecular simulation datasets are inherently high dimensional, it is often necessary to build low dimensional representations that can be used to extract quantitative insights into the atomistic mechanisms that underlie complex biological processes. Results: We use a convolutional variational autoencoder (CVAE) to learn low dimensional, biophysically relevant latent features from long time-scale protein folding simulations in an unsupervised manner. We demonstrate our approach on three model protein folding systems, namely Fs-peptide (14 μs aggregate sampling), villin head piece (single trajectory of 125 μs) and β-β-α (BBA) protein (223 + 102 μs sampling across two independent trajectories). In these systems, we show that the CVAE latent features learned correspond to distinct conformational substates along the protein folding pathways. The CVAE model predicts, on average, nearly 89% of all contacts within the folding trajectories correctly, while being able to extract folded, unfolded and potentially misfolded states in an unsupervised manner. Further, the CVAE model can be used to learn latent features of protein folding that can be applied to other independent trajectories, making it particularly attractive for identifying intrinsic features that correspond to conformational substates that share similar structural features. Conclusions: Together, we show that the CVAE model can quantitatively describe complex biophysical processes such as protein folding.
Affiliation(s)
- Debsindhu Bhowmik
- Computational Science and Engineering Division, Oak Ridge National Laboratory, One Bethel Valley Road, MS6085, Oak Ridge, TN, USA
| | - Shang Gao
- Computational Science and Engineering Division, Oak Ridge National Laboratory, One Bethel Valley Road, MS6085, Oak Ridge, TN, USA
| | - Michael T Young
- Computational Science and Engineering Division, Oak Ridge National Laboratory, One Bethel Valley Road, MS6085, Oak Ridge, TN, USA
| | - Arvind Ramanathan
- Computational Science and Engineering Division, Oak Ridge National Laboratory, One Bethel Valley Road, MS6085, Oak Ridge, TN, USA.
| |
|
27
|
Amala A, Emerson IA. Understanding contact patterns of protein structures from protein contact map and investigation of unique patterns in the globin-like folded domains. J Cell Biochem 2018; 120:9877-9886. [PMID: 30525229 DOI: 10.1002/jcb.28270] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 07/11/2018] [Accepted: 10/24/2018] [Indexed: 11/06/2022]
Abstract
Proteins are biochemical compounds made up of one or more polypeptides in a specific order, typically folded into a functionally active form. Proteins are categorized into four different structural classes according to the topology of α-helices and β-strands. In this study, we modeled these four structural classes as an undirected network depicting amino acids as nodes and interactions between them as edges. Results indicate that basic protein classes can be easily recognized as well as distinguished by utilizing protein contact maps (PCM). In studying the globin-like fold, the helix-loop-helix region contacts were seen to follow a unique pattern, and these were retained in all the folds. Further, the averaged diagonal contacts were analyzed, and the contacts in α/β proteins were found to be higher in comparison with the other classes. Interestingly, we noticed that anti-parallel beta sheets were dominant in all-β and α + β classes, which leads to similar diagonal patterns. Network properties of all four basic classes were analyzed and found to possess the small-world property. Findings indicate that PCM may help classify protein structure classes and also helps in evaluating predicted protein structures.
Affiliation(s)
- Arumugam Amala
- Bioinformatics Programming Laboratory, Department of Biotechnology, School of Bio Sciences and Technology, Vellore Institute of Technology, Tamil Nadu, India
| | - Isaac Arnold Emerson
- Bioinformatics Programming Laboratory, Department of Biotechnology, School of Bio Sciences and Technology, Vellore Institute of Technology, Tamil Nadu, India
| |
|
28
|
Abstract
Since the 1980s, deep learning and biomedical data have been coevolving and feeding each other. The breadth, complexity, and rapidly expanding size of biomedical data have stimulated the development of novel deep learning methods, and application of these methods to biomedical data has led to scientific discoveries and practical solutions. This overview provides technical and historical pointers to the field, and surveys current applications of deep learning to biomedical data organized around five subareas, roughly of increasing spatial scale: chemoinformatics, proteomics, genomics and transcriptomics, biomedical imaging, and health care. The black box problem of deep learning methods is also briefly discussed.
Affiliation(s)
- Pierre Baldi
- Department of Computer Science, Institute for Genomics and Bioinformatics, and Center for Machine Learning and Intelligent Systems, University of California, Irvine, California 92697, USA
| |
|
29
|
Kurczynska M, Kotulska M. Automated method to differentiate between native and mirror protein models obtained from contact maps. PLoS One 2018; 13:e0196993. [PMID: 29787567 PMCID: PMC5963800 DOI: 10.1371/journal.pone.0196993] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2017] [Accepted: 04/24/2018] [Indexed: 11/23/2022]
Abstract
Mirror protein structures are often considered as artifacts in modeling protein structures. However, they may soon become a new branch of biochemistry. Moreover, methods of protein structure reconstruction, based on their residue-residue contact maps, need methodology to differentiate between models of native and mirror orientation, especially regarding the reconstructed backbones. We analyzed 130,500 structural protein models obtained from contact maps of 1,305 SCOP domains belonging to all 7 structural classes. On average, the same numbers of native and mirror models were obtained among 100 models generated for each domain. Since their structural features are often not sufficient for differentiating between the two types of model orientations, we proposed to apply various energy terms (ETs) from PyRosetta to separate native and mirror models. To automate the procedure for differentiating these models, the k-means clustering algorithm was applied. Using total energy did not yield appropriate clusters; the accuracy of the clustering for class A (all helices) was no more than 0.52. Therefore, we tested a series of different k-means clusterings based on various combinations of ETs. Finally, applying the two most differentiating ETs for each class gave satisfactory results. To unify the method for differentiating between native and mirror models, independent of their structural class, the two best ETs for each class were considered. Finally, the k-means clustering algorithm used three common ETs: probability of amino acid assuming certain values of dihedral angles Φ and Ψ, Ramachandran preferences and Coulomb interactions. The accuracies of clustering with these ETs were in the range between 0.68 and 0.76, with sensitivity and selectivity in the range between 0.68 and 0.87, depending on the structural class. The method can be applied to all fully-automated tools for protein structure reconstruction based on contact maps, especially those analyzing big sets of models.
Affiliation(s)
- Monika Kurczynska
- Department of Biomedical Engineering, Faculty of Fundamental Problems of Technology, Wroclaw University of Science and Technology, Wroclaw, Poland
- Malgorzata Kotulska
- Department of Biomedical Engineering, Faculty of Fundamental Problems of Technology, Wroclaw University of Science and Technology, Wroclaw, Poland
|
30
|
He B, Mortuza SM, Wang Y, Shen HB, Zhang Y. NeBcon: protein contact map prediction using neural network training coupled with naïve Bayes classifiers. Bioinformatics 2018; 33:2296-2306. [PMID: 28369334 DOI: 10.1093/bioinformatics/btx164] [Citation(s) in RCA: 53] [Impact Index Per Article: 8.8] [Received: 10/14/2016] [Accepted: 03/21/2017] [Indexed: 12/12/2022]
Abstract
Motivation: Recent CASP experiments have witnessed exciting progress on folding large-size non-homologous proteins with the assistance of co-evolution based contact predictions. The success is however anecdotal because the contact prediction methods require a high volume of sequence homologs, which are not available for most non-homologous protein targets. Development of efficient methods that can generate balanced and reliable contact maps for different types of protein targets is essential to enhance the success rate of ab initio protein structure prediction. Results: We developed a new pipeline, NeBcon, which uses the naïve Bayes classifier (NBC) theorem to combine eight state-of-the-art contact methods that are built from co-evolution and machine learning approaches. The posterior probabilities of the NBC model are then trained with intrinsic structural features through neural network learning for the final contact map prediction. NeBcon was tested on 98 non-redundant proteins, where it improves the accuracy of the best co-evolution based meta-server predictor by 22%; the magnitude of the improvement increases to 45% for the hard targets that lack sequence and structural homologs in the databases. Detailed data analysis showed that the major contribution to the improvement is due to the optimized NBC combination of the complementary information from both co-evolution and machine learning predictions. The neural network training also helps to improve the coupling of the NBC posterior probability and the intrinsic structural features, which were found particularly important for the proteins that do not have a sufficient number of homologous sequences to derive reliable co-evolution profiles. Availability and Implementation: On-line server and standalone package of the program are available at http://zhanglab.ccmb.med.umich.edu/NeBcon/ . Contact: zhng@umich.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Baoji He
- Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China; School of Physical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China; Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- S M Mortuza
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Yanting Wang
- Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China; School of Physical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Hong-Bin Shen
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China
- Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
|
31
|
Guo QH, Sun LH. Combinatorics of Contacts in Protein Contact Maps. Bull Math Biol 2017; 80:385-403. [PMID: 29230703 DOI: 10.1007/s11538-017-0380-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 11/04/2016] [Accepted: 12/05/2017] [Indexed: 10/18/2022]
Abstract
Contacts play a fundamental role in the study of protein structure and folding problems. The contact map of a protein can be represented by arranging its amino acids on a horizontal line and drawing an arc between two residues if they form a contact. In this paper, we are mainly concerned with the combinatorial enumeration of the arcs in m-regular linear stacks, an elementary structure of the protein contact map, which was introduced by Chen et al. (J Comput Biol 21(12):915-935, 2014). We modify the generating function for m-regular linear stacks by introducing a new variable y that tracks the number of arcs and obtain an equation satisfied by the generating function for m-regular linear stacks with n vertices and k arcs. Consequently, we also derive an equation satisfied by the generating function of the overall number of arcs in m-regular linear stacks with n vertices.
Affiliation(s)
- Qiang-Hui Guo
- Center for Combinatorics, LPMC, Nankai University, Tianjin, 300071, People's Republic of China
- Lisa H Sun
- Center for Combinatorics, LPMC, Nankai University, Tianjin, 300071, People's Republic of China
|
32
|
Wozniak PP, Konopka BM, Xu J, Vriend G, Kotulska M. Forecasting residue-residue contact prediction accuracy. Bioinformatics 2017; 33:3405-3414. [PMID: 29036497 DOI: 10.1093/bioinformatics/btx416] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 02/27/2017] [Accepted: 06/22/2017] [Indexed: 11/14/2022]
Abstract
Motivation: Apart from meta-predictors, most of today's methods for residue-residue contact prediction are based entirely on Direct Coupling Analysis (DCA) of correlated mutations in multiple sequence alignments (MSAs). These methods are on average ∼40% correct for the 100 strongest predicted contacts in each protein. The end-user who works on a single protein of interest will not know whether predictions are much more or much less correct than 40%, which is especially a problem if contacts are predicted to steer experimental research on that protein. Results: We designed a regression model that forecasts the accuracy of residue-residue contact prediction for individual proteins with an average error of 7 percentage points. Contacts were predicted with two DCA methods (gplmDCA and PSICOV). The models were built on parameters that describe the MSA, the predicted secondary structure, the predicted solvent accessibility and the contact prediction scores for the target protein. Results show that our models can also be applied to the meta-methods, which was tested on RaptorX. Availability and implementation: All data and scripts are available from http://comprec-lin.iiar.pwr.edu.pl/dcaQ/. Contact: malgorzata.kotulska@pwr.edu.pl. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- P P Wozniak
- Department of Biomedical Engineering, Faculty of Fundamental Problems of Technology, Wroclaw University of Science and Technology, Wroclaw, Poland
- B M Konopka
- Department of Biomedical Engineering, Faculty of Fundamental Problems of Technology, Wroclaw University of Science and Technology, Wroclaw, Poland
- J Xu
- Toyota Technological Institute at Chicago, Chicago, IL 60637, USA
- G Vriend
- Centre for Molecular and Biomolecular Informatics, Radboud University Medical Centre, GA 6525, Nijmegen, The Netherlands
- M Kotulska
- Department of Biomedical Engineering, Faculty of Fundamental Problems of Technology, Wroclaw University of Science and Technology, Wroclaw, Poland
|
33
|
Adhikari B, Cheng J. Improved protein structure reconstruction using secondary structures, contacts at higher distance thresholds, and non-contacts. BMC Bioinformatics 2017; 18:380. [PMID: 28851269 PMCID: PMC5576353 DOI: 10.1186/s12859-017-1807-5] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 05/01/2017] [Accepted: 08/22/2017] [Indexed: 11/12/2022]
Abstract
Background: Residue-residue contacts are key features for accurate de novo protein structure prediction. For the optimal utilization of these predicted contacts in folding proteins accurately, it is important to study the challenges of reconstructing protein structures using true contacts. Because the contact-guided protein modeling approach is valuable for predicting the folds of proteins that do not have structural templates, it is necessary for reconstruction studies to focus on hard-to-predict protein structures. Results: Using a dataset consisting of 496 structural domains released in recent CASP experiments and a dataset of 150 representative protein structures, in this work, we discuss three techniques to improve the reconstruction accuracy using true contacts: adding secondary structures, increasing contact distance thresholds, and adding non-contacts. We find that reconstruction using secondary structures and contacts can deliver accuracy higher than using full contact maps. Similarly, we demonstrate that non-contacts can improve reconstruction accuracy not only when the used non-contacts are true but also when they are predicted. On the dataset consisting of 150 proteins, we find that simply using low-ranked predicted contacts as non-contacts and adding them as additional restraints can increase the reconstruction accuracy by 5% when the reconstructed models are evaluated using TM-score. Conclusions: Our findings suggest that secondary structures are invaluable companions of contacts for accurate reconstruction. Confirming some earlier findings, we also find that larger distance thresholds are useful for folding many protein structures which cannot be folded using the standard definition of contacts. Our findings also suggest that for more accurate reconstruction using predicted contacts it is useful to predict contacts at higher distance thresholds (beyond 8 Å) and predict non-contacts.
Electronic supplementary material: The online version of this article (10.1186/s12859-017-1807-5) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Badri Adhikari
- Department of Mathematics and Computer Science, University of Missouri-St. Louis, St. Louis, MO, 63121, USA
- Jianlin Cheng
- Department of Electrical Engineering & Computer Science, Informatics Institute, University of Missouri, Columbia, MO, 65211, USA
|
34
|
Ochoa-Montaño B, Blundell TL. XSuLT: a web server for structural annotation and representation of sequence-structure alignments. Nucleic Acids Res 2017; 45:W381-W387. [PMID: 28510698 PMCID: PMC5793734 DOI: 10.1093/nar/gkx421] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/21/2017] [Accepted: 05/04/2017] [Indexed: 12/16/2022]
Abstract
The web server XSuLT, an enhanced version of the protein alignment annotation program JoY, formats a submitted multiple-sequence alignment using three-dimensional (3D) structural information in order to assist in the comparative analysis of protein evolution and in the optimization of alignments for comparative modelling and construct design. In addition to the features analysed by JoY, which include secondary structure, solvent accessibility and sidechain hydrogen bonds, XSuLT annotates each amino acid residue with residue depth, chain and ligand interactions, inter-residue contacts, sequence entropy, root mean square deviation and secondary structure and disorder prediction. It is also now integrated with built-in 3D visualization which interacts with the formatted alignment to facilitate inspection and understanding. Results can be downloaded as stand-alone HTML for the formatted alignment and as XML with the underlying annotation data. XSuLT is freely available at http://structure.bioc.cam.ac.uk/xsult/.
Affiliation(s)
- Tom L Blundell
- Department of Biochemistry, University of Cambridge, Cambridge CB2 1GA, UK
|
35
|
Stahl K, Schneider M, Brock O. EPSILON-CP: using deep learning to combine information from multiple sources for protein contact prediction. BMC Bioinformatics 2017; 18:303. [PMID: 28623886 PMCID: PMC5474060 DOI: 10.1186/s12859-017-1713-x] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 09/27/2016] [Accepted: 05/30/2017] [Indexed: 01/12/2023]
Abstract
BACKGROUND: Accurately predicted contacts make it possible to compute the 3D structure of a protein. Since the solution space of native residue-residue contact pairs is very large, it is necessary to leverage information to identify relevant regions of the solution space, i.e. correct contacts. Every additional source of information can contribute to narrowing down candidate regions. Therefore, recent methods combined evolutionary and sequence-based information as well as evolutionary and physicochemical information. We develop a new contact predictor (EPSILON-CP) that goes beyond current methods by combining evolutionary, physicochemical, and sequence-based information. The problems resulting from the increased dimensionality and complexity of the learning problem are combated with a careful feature analysis, which results in a drastically reduced feature set. The different information sources are combined using deep neural networks. RESULTS: On 21 hard CASP11 FM targets, EPSILON-CP achieves a mean precision of 35.7% for top-L/10 predicted long-range contacts, which is 11% better than the CASP11 winning version of MetaPSICOV. The improvement on 1.5L is 17%. Furthermore, in this study we find that the amino acid composition, a commonly used feature, is rendered ineffective in the context of meta approaches. The size of the refined feature set decreased by 75%, enabling a significant increase in training data for machine learning, contributing significantly to the observed improvements. CONCLUSIONS: Exploiting as much, and as diverse, information as possible is key to accurate contact prediction. Simply merging the information introduces new challenges. Our study suggests that critical feature analysis can improve the performance of contact prediction methods that combine multiple information sources. EPSILON-CP is available as a webservice: http://compbio.robotics.tu-berlin.de/epsilon/.
Affiliation(s)
- Kolja Stahl
- Robotics and Biology Laboratory, Department of Electrical Engineering and Computer Science, Technische Universität Berlin, Marchstraße 23, Berlin, 10587 Germany
- Michael Schneider
- Robotics and Biology Laboratory, Department of Electrical Engineering and Computer Science, Technische Universität Berlin, Marchstraße 23, Berlin, 10587 Germany
- Oliver Brock
- Robotics and Biology Laboratory, Department of Electrical Engineering and Computer Science, Technische Universität Berlin, Marchstraße 23, Berlin, 10587 Germany
|
36
|
Chu JW, Yang H. Identifying the structural and kinetic elements in protein large-amplitude conformational motions. Int Rev Phys Chem 2017. [DOI: 10.1080/0144235x.2017.1283885] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 01/20/2023]
|
37
|
Wozniak PP, Vriend G, Kotulska M. Correlated mutations select misfolded from properly folded proteins. Bioinformatics 2017; 33:1497-1504. [PMID: 28203707 DOI: 10.1093/bioinformatics/btx013] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 05/12/2016] [Accepted: 01/11/2017] [Indexed: 11/14/2022]
Affiliation(s)
- P P Wozniak
- Faculty of Fundamental Problems of Technology, Department of Biomedical Engineering, Wrocław University of Science and Technology, Wrocław, Poland
- G Vriend
- Centre for Molecular and Biomolecular Informatics, Radboud University Medical Centre, Nijmegen, The Netherlands
- M Kotulska
- Faculty of Fundamental Problems of Technology, Department of Biomedical Engineering, Wrocław University of Science and Technology, Wrocław, Poland
|
38
|
Regular Simple Queues of Protein Contact Maps. Bull Math Biol 2016; 79:21-35. [PMID: 27844300 DOI: 10.1007/s11538-016-0212-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/24/2016] [Accepted: 09/21/2016] [Indexed: 10/20/2022]
Abstract
A protein fold can be viewed as a self-avoiding walk in a certain lattice model, and its contact map is a graph that represents the patterns of contacts in the fold. Goldman, Istrail, and Papadimitriou showed that a contact map in the 2D square lattice can be decomposed into at most two stacks and one queue. In the terminology of combinatorics, stacks and queues are noncrossing and nonnesting partitions, respectively. In this paper, we are concerned with 2-regular and 3-regular simple queues, for which the degree of each vertex is at most one and the arc lengths are at least 2 and 3, respectively. We show that 2-regular simple queues are in one-to-one correspondence with hill-free Motzkin paths, which have been enumerated by Barcucci, Pergola, Pinzani, and Rinaldi by using the Enumerating Combinatorial Objects method. We derive a recurrence relation for the generating function of Motzkin paths with [Formula: see text] peaks at level i, which reduces to the generating function for hill-free Motzkin paths. Moreover, we show that 3-regular simple queues are in one-to-one correspondence with Motzkin paths avoiding certain patterns. Then we obtain a formula for the generating function of 3-regular simple queues. Asymptotic formulas for 2-regular and 3-regular simple queues are derived based on the generating functions.
|
39
|
Simkovic F, Thomas JMH, Keegan RM, Winn MD, Mayans O, Rigden DJ. Residue contacts predicted by evolutionary covariance extend the application of ab initio molecular replacement to larger and more challenging protein folds. IUCrJ 2016; 3:259-70. [PMID: 27437113 PMCID: PMC4937781 DOI: 10.1107/s2052252516008113] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 02/10/2016] [Accepted: 05/18/2016] [Indexed: 05/05/2023]
Abstract
For many protein families, the deluge of new sequence information together with new statistical protocols now allow the accurate prediction of contacting residues from sequence information alone. This offers the possibility of more accurate ab initio (non-homology-based) structure prediction. Such models can be used in structure solution by molecular replacement (MR) where the target fold is novel or is only distantly related to known structures. Here, AMPLE, an MR pipeline that assembles search-model ensembles from ab initio structure predictions ('decoys'), is employed to assess the value of contact-assisted ab initio models to the crystallographer. It is demonstrated that evolutionary covariance-derived residue-residue contact predictions improve the quality of ab initio models and, consequently, the success rate of MR using search models derived from them. For targets containing β-structure, decoy quality and MR performance were further improved by the use of a β-strand contact-filtering protocol. Such contact-guided decoys achieved 14 structure solutions from 21 attempted protein targets, compared with nine for simple Rosetta decoys. Previously encountered limitations were superseded in two key respects. Firstly, much larger targets of up to 221 residues in length were solved, which is far larger than the previously benchmarked threshold of 120 residues. Secondly, contact-guided decoys significantly improved success with β-sheet-rich proteins. Overall, the improved performance of contact-guided decoys suggests that MR is now applicable to a significantly wider range of protein targets than were previously tractable, and points to a direct benefit to structural biology from the recent remarkable advances in sequencing.
Affiliation(s)
- Felix Simkovic
- Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
- Jens M. H. Thomas
- Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
- Ronan M. Keegan
- Research Complex at Harwell, STFC Rutherford Appleton Laboratory, Didcot OX11 0FA, England
- Martyn D. Winn
- Science and Technology Facilities Council, Daresbury Laboratory, Warrington WA4 4AD, England
- Olga Mayans
- Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
- Daniel J. Rigden
- Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
|
40
|
Abstract
The contact map of a protein fold in the two-dimensional (2D) square lattice has arc length at least 3, and each internal vertex has degree at most 2, whereas the two terminal vertices have degree at most 3. Recently, Chen, Guo, Sun, and Wang studied the enumeration of [Formula: see text]-regular linear stacks, where each arc has length at least [Formula: see text] and the degree of each vertex is bounded by 2. Since the two terminal points in a protein fold in the 2D square lattice may form contacts with at most three adjacent lattice points, we are led to the study of extended [Formula: see text]-regular linear stacks, in which the degree of each terminal point is bounded by 3. This model is close to real protein contact maps. Denote the generating functions of the [Formula: see text]-regular linear stacks and the extended [Formula: see text]-regular linear stacks by [Formula: see text] and [Formula: see text], respectively. We show that [Formula: see text] can be written as a rational function of [Formula: see text]. For a certain [Formula: see text], by eliminating [Formula: see text], we obtain an equation satisfied by [Formula: see text] and derive the asymptotic formula of the numbers of [Formula: see text]-regular linear stacks of length [Formula: see text].
Affiliation(s)
- Qiang-Hui Guo
- Center for Combinatorics, LPMC-TJKLC, Nankai University, Tianjin, P.R. China
- Lisa H Sun
- Center for Combinatorics, LPMC-TJKLC, Nankai University, Tianjin, P.R. China
- Jian Wang
- Center for Combinatorics, LPMC-TJKLC, Nankai University, Tianjin, P.R. China
|
41
|
Morlot JB, Mozziconacci J, Lesne A. Network concepts for analyzing 3D genome structure from chromosomal contact maps. EPJ Nonlinear Biomed Phys 2016. [DOI: 10.1140/epjnbp/s40366-016-0029-5] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Indexed: 12/24/2022]
|
42
|
Kurczynska M, Kania E, Konopka BM, Kotulska M. Applying PyRosetta molecular energies to separate properly oriented protein models from mirror models, obtained from contact maps. J Mol Model 2016; 22:111. [PMID: 27107578 PMCID: PMC4842210 DOI: 10.1007/s00894-016-2975-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/12/2015] [Accepted: 04/05/2016] [Indexed: 11/30/2022]
Abstract
Reconstructing protein structure based on contact maps leads to two types of models: properly oriented models and mirror models. This is due to the fact that contact maps do not include information on protein chirality. Therefore, both types of model orientations share the same contact map and are geometrically allowed. In this work, we verified the hypothesis that some of the energy terms calculated by PyRosetta could be useful to distinguish between properly oriented and mirror models. We studied 440 models of all-alpha protein domains reconstructed manually from their contact maps, where 50 % of the models were properly oriented and 50 % had mirror orientation. We showed that dihedral angles and energy terms, based on the probability of specific geometrical arrangement of the residues, differed significantly for properly oriented and mirror models.
Affiliation(s)
- Monika Kurczynska
- Faculty of Fundamental Problems of Technology, Department of Biomedical Engineering, Wroclaw University of Science and Technology, Wybrzeze Wyspianskiego 27, 50-370, Wroclaw, Poland
- Ewa Kania
- Faculty of Fundamental Problems of Technology, Department of Biomedical Engineering, Wroclaw University of Science and Technology, Wybrzeze Wyspianskiego 27, 50-370, Wroclaw, Poland; Biotechnology Center, Dresden University of Technology, Tatzberg 47/49, 01307, Dresden, Germany
- Bogumil M Konopka
- Faculty of Fundamental Problems of Technology, Department of Biomedical Engineering, Wroclaw University of Science and Technology, Wybrzeze Wyspianskiego 27, 50-370, Wroclaw, Poland
- Malgorzata Kotulska
- Faculty of Fundamental Problems of Technology, Department of Biomedical Engineering, Wroclaw University of Science and Technology, Wybrzeze Wyspianskiego 27, 50-370, Wroclaw, Poland
|
43
|
Kandathil SM, Handl J, Lovell SC. Toward a detailed understanding of search trajectories in fragment assembly approaches to protein structure prediction. Proteins 2016; 84:411-26. [PMID: 26799916 PMCID: PMC4982100 DOI: 10.1002/prot.24987] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 07/07/2015] [Revised: 12/03/2015] [Accepted: 12/31/2015] [Indexed: 11/30/2022]
Abstract
Energy functions, fragment libraries, and search methods constitute three key components of fragment‐assembly methods for protein structure prediction, which are all crucial for their ability to generate high‐accuracy predictions. All of these components are tightly coupled; efficient searching becomes more important as the quality of fragment libraries decreases. Given these relationships, there is currently a poor understanding of the strengths and weaknesses of the sampling approaches currently used in fragment‐assembly techniques. Here, we determine how the performance of search techniques can be assessed in a meaningful manner, given the above problems. We describe a set of techniques that aim to reduce the impact of the energy function, and assess exploration in view of the search space defined by a given fragment library. We illustrate our approach using Rosetta and EdaFold, and show how certain features of these methods encourage or limit conformational exploration. We demonstrate that individual trajectories of Rosetta are susceptible to local minima in the energy landscape, and that this can be linked to non‐uniform sampling across the protein chain. We show that EdaFold's novel approach can help balance broad exploration with locating good low‐energy conformations. This occurs through two mechanisms which cannot be readily differentiated using standard performance measures: exclusion of false minima, followed by an increasingly focused search in low‐energy regions of conformational space. Measures such as ours can be helpful in characterizing new fragment‐based methods in terms of the quality of conformational exploration realized. Proteins 2016; 84:411–426. © 2016 The Authors Proteins: Structure, Function, and Bioinformatics Published by Wiley Periodicals, Inc.
Affiliation(s)
- Shaun M Kandathil
- Faculty of Life Sciences, the University of Manchester, Manchester, M13 9PL, United Kingdom
- Julia Handl
- Alliance Manchester Business School, Faculty of Humanities, the University of Manchester, Manchester, M13 9PL, United Kingdom
- Simon C Lovell
- Faculty of Life Sciences, the University of Manchester, Manchester, M13 9PL, United Kingdom
|
44
|
Sawle L, Ghosh K. Convergence of Molecular Dynamics Simulation of Protein Native States: Feasibility vs Self-Consistency Dilemma. J Chem Theory Comput 2016; 12:861-9. [DOI: 10.1021/acs.jctc.5b00999] [Citation(s) in RCA: 43] [Impact Index Per Article: 5.4] [Indexed: 12/15/2022]
Affiliation(s)
- Lucas Sawle
- Department of Physics and Astronomy, University of Denver, Denver, Colorado 80209, United States
- Kingshuk Ghosh
- Department of Physics and Astronomy, University of Denver, Denver, Colorado 80209, United States
|
45
|
Zhang H, Huang Q, Bei Z, Wei Y, Floudas CA. COMSAT: Residue contact prediction of transmembrane proteins based on support vector machines and mixed integer linear programming. Proteins 2016; 84:332-48. [PMID: 26756402 DOI: 10.1002/prot.24979] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 06/25/2015] [Revised: 11/19/2015] [Accepted: 12/10/2015] [Indexed: 12/28/2022]
Abstract
In this article, we present COMSAT, a hybrid framework for residue contact prediction of transmembrane (TM) proteins, integrating a support vector machine (SVM) method and a mixed integer linear programming (MILP) method. COMSAT consists of two modules: COMSAT_SVM which is trained mainly on position-specific scoring matrix features, and COMSAT_MILP which is an ab initio method based on optimization models. Contacts predicted by the SVM model are ranked by SVM confidence scores, and a threshold is trained to improve the reliability of the predicted contacts. For TM proteins with no contacts above the threshold, COMSAT_MILP is used. The proposed hybrid contact prediction scheme was tested on two independent TM protein sets based on the contact definition of 14 Å between Cα-Cα atoms. First, using a rigorous leave-one-protein-out cross validation on the training set of 90 TM proteins, an accuracy of 66.8%, a coverage of 12.3%, a specificity of 99.3% and a Matthews' correlation coefficient (MCC) of 0.184 were obtained for residue pairs that are at least six amino acids apart. Second, when tested on a test set of 87 TM proteins, the proposed method showed a prediction accuracy of 64.5%, a coverage of 5.3%, a specificity of 99.4% and a MCC of 0.106. COMSAT shows satisfactory results when compared with 12 other state-of-the-art predictors, and is more robust in terms of prediction accuracy as the length and complexity of TM protein increase. COMSAT is freely accessible at http://hpcc.siat.ac.cn/COMSAT/.
Affiliation(s)
- Huiling Zhang
- Centre for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Qingsheng Huang
- Centre for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Zhendong Bei
- Center for Cloud Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Yanjie Wei
- Centre for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Christodoulos A Floudas
- Department of Chemical Engineering, Texas A&M University, College Station, Texas, 77843; Texas A&M Energy Institute, Texas A&M University, College Station, Texas, 77843
|
46
|
Sneha P, Doss CGP. Molecular Dynamics: New Frontier in Personalized Medicine. Adv Protein Chem Struct Biol 2015; 102:181-224. [PMID: 26827606 DOI: 10.1016/bs.apcsb.2015.09.004] [Citation(s) in RCA: 109] [Impact Index Per Article: 12.1] [Indexed: 12/15/2022]
Abstract
Drug discovery has developed rapidly over the last decade, driven by the demand for novel, efficient lead compounds. Although the development of novel compounds has seen a high failure rate, the establishment of personalized medicine might be a breakthrough in this area. Personalized medicine has grown rapidly and become a hot topic since the successful completion of the Human Genome Project and the 1000 Genomes pilot project. Genomic variants such as SNPs play a vital role in inter-individual differences in disease susceptibility and drug response. Hence, such genetic variants have to be identified before a drug is administered. This process requires high-end techniques to understand the complexity of the molecules, which might bring insight into the compounds at the molecular level. To support this, the field of bioinformatics plays a crucial role in revealing the molecular mechanism of a mutation and thereby in designing a drug for an individual in a fast and affordable manner. High-end computational methods such as molecular dynamics (MD) simulation have proved to be a constitutive approach to detecting the minor changes associated with an SNP for a better understanding of the structure-function relationship. The parameters used in MD simulation elucidate different properties of a macromolecule, such as protein stability and flexibility. MD along with docking analysis can reveal the synergetic effect of an SNP on protein-ligand interaction and provides a foundation for designing a particular drug molecule for an individual. This compelling application of computational power and the advent of other technologies have paved a promising way toward personalized medicine. In this in-depth review, we highlight the different facets of MD in personalized medicine.
Affiliation(s)
- P Sneha: Medical Biotechnology Division, School of Biosciences and Technology, VIT University, Vellore, Tamil Nadu, India
- C George Priya Doss: Medical Biotechnology Division, School of Biosciences and Technology, VIT University, Vellore, Tamil Nadu, India
|
47
|
Nowotny J, Ahmed S, Xu L, Oluwadare O, Chen H, Hensley N, Trieu T, Cao R, Cheng J. Iterative reconstruction of three-dimensional models of human chromosomes from chromosomal contact data. BMC Bioinformatics 2015; 16:338. [PMID: 26493399 PMCID: PMC4619219 DOI: 10.1186/s12859-015-0772-0] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 05/30/2015] [Accepted: 10/13/2015] [Indexed: 11/10/2022]
Abstract
Background: The entire collection of genetic information resides within the chromosomes, which themselves reside within almost every cell nucleus of eukaryotic organisms. Each individual chromosome is found to have its own preferred three-dimensional (3D) structure independent of the other chromosomes. The structure of each chromosome plays vital roles in controlling certain genome operations, including gene interaction and gene regulation. As a result, knowing the structure of chromosomes assists in the understanding of how the genome functions. Fortunately, the 3D structure of chromosomes proves possible to construct through computational methods via contact data recorded from the chromosome. We developed a unique computational approach based on optimization procedures known as adaptation, simulated annealing, and genetic algorithm to construct 3D models of human chromosomes, using chromosomal contact data. Results: Our models were evaluated using a percentage-based scoring function. Analysis of the scores of the final 3D models demonstrated their effective construction from our computational approach. Specifically, the models resulting from our approach yielded an average score of 80.41%, with a high of 91%, across models for all chromosomes of a normal human B-cell. Comparisons made with other methods affirmed the effectiveness of our strategy. Particularly, juxtaposition with models generated through the publicly available method Markov chain Monte Carlo 5C (MCMC5C) illustrated the outperformance of our approach, as seen through a higher average score for all chromosomes. Our methodology was further validated using two consistency checking techniques known as convergence testing and robustness checking, which both proved successful. Conclusions: The pursuit of constructing accurate 3D chromosomal structures is fueled by the benefits revealed by the findings as well as any possible future areas of study that arise. This motivation has led to the development of our computational methodology. The implementation of our approach proved effective in constructing 3D chromosome models and proved consistent with, and more effective than, some other methods thereby achieving our goal of creating a tool to help advance certain research efforts. The source code, test data, test results, and documentation of our method, Gen3D, are available at our sourceforge site at: http://sourceforge.net/projects/gen3d/.
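As a rough illustration of how simulated annealing can be paired with a percentage-based score, the sketch below moves beads in 3D so that contacting pairs end up close and non-contacting pairs end up apart. It is not the Gen3D implementation. The score definition (percentage of bead pairs whose modeled proximity agrees with the contact data), the distance threshold, the cooling schedule, and the names `satisfaction_score` and `anneal` are all assumptions made for this example.

```python
import math
import random

def satisfaction_score(coords, contacts, threshold=1.5):
    """Percentage of bead pairs whose modeled proximity agrees with the
    contact data (hypothetical score; `contacts` holds (i, j) with i < j)."""
    n = len(coords)
    agree = total = 0
    for i in range(n):
        for j in range(i + 1, n):
            close = math.dist(coords[i], coords[j]) <= threshold
            agree += close == ((i, j) in contacts)
            total += 1
    return 100.0 * agree / total

def anneal(contacts, n, steps=2000, t_start=5.0, t_end=0.05, seed=0):
    """Simulated annealing over bead coordinates; returns the initial score,
    the best score seen, and the coordinates that achieved it."""
    rng = random.Random(seed)
    coords = [[rng.uniform(-2.0, 2.0) for _ in range(3)] for _ in range(n)]
    current = initial = satisfaction_score(coords, contacts)
    best, best_coords = current, [c[:] for c in coords]
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / (steps - 1))
        i = rng.randrange(n)
        old = coords[i][:]
        coords[i] = [x + rng.gauss(0.0, 0.3) for x in old]
        new = satisfaction_score(coords, contacts)
        # Metropolis rule: always accept improvements, sometimes accept losses.
        if new >= current or rng.random() < math.exp((new - current) / temp):
            current = new
            if current > best:
                best, best_coords = current, [c[:] for c in coords]
        else:
            coords[i] = old   # reject the move
    return initial, best, best_coords
```

Tracking the best configuration separately from the current one means the returned score can never be worse than the starting score, even though the walk itself accepts some downhill moves at high temperature.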
Affiliation(s)
- Jackson Nowotny: Department of Computer Science, Informatics Institute, University of Missouri, Columbia, MO, 65211, USA
- Sharif Ahmed: Department of Computer Science, Informatics Institute, University of Missouri, Columbia, MO, 65211, USA
- Lingfei Xu: Department of Computer Science, Informatics Institute, University of Missouri, Columbia, MO, 65211, USA
- Oluwatosin Oluwadare: Department of Computer Science, Informatics Institute, University of Missouri, Columbia, MO, 65211, USA
- Hannah Chen: Department of Computer Science, Informatics Institute, University of Missouri, Columbia, MO, 65211, USA
- Noelan Hensley: Department of Computer Science, Informatics Institute, University of Missouri, Columbia, MO, 65211, USA
- Tuan Trieu: Department of Computer Science, Informatics Institute, University of Missouri, Columbia, MO, 65211, USA
- Renzhi Cao: Department of Computer Science, Informatics Institute, University of Missouri, Columbia, MO, 65211, USA
- Jianlin Cheng: Department of Computer Science, Informatics Institute, University of Missouri, Columbia, MO, 65211, USA
|
48
|
Dabrowski-Tumanski P, Jarmolinska AI, Sulkowska JI. Prediction of the optimal set of contacts to fold the smallest knotted protein. J Phys Condens Matter 2015; 27:354109. [PMID: 26291339 DOI: 10.1088/0953-8984/27/35/354109] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Indexed: 06/04/2023]
Abstract
Knotted protein chains represent a new motif in protein folds. They have been linked to various diseases, and recent extensive analysis of the Protein Data Bank shows that they constitute 1.5% of all deposited protein structures. Despite thorough theoretical and experimental investigations, the role of knots in proteins still remains elusive. Nonetheless, it is believed that knots play an important role in the mechanical and thermal stability of proteins. Here, we perform a comprehensive analysis of native, shadow-specific and non-native interactions which describe the free energy landscape of the smallest knotted protein (PDB ID 2efv). We show that the addition of shadow-specific contacts in the loop region greatly enhances folding kinetics, while the addition of shadow-specific contacts along the C-terminal region (H3 or H4) results in a new folding route with slower kinetics. By means of direct coupling analysis (DCA) we predict non-native contacts that can also accelerate kinetics. Next, we show that the length of the C-terminal knot tail is responsible for the shape of the free energy barrier, while the influence of the elongation of the N-terminus is not significant. Finally, we develop a concept of a minimal contact map sufficient for the 2efv protein to fold and analyze properties of this protein using this map.
Affiliation(s)
- P Dabrowski-Tumanski: Centre of New Technologies, University of Warsaw, Banacha 2c, 02-097 Warsaw, Poland; Faculty of Chemistry, University of Warsaw, Pasteura 1, 02-093 Warsaw, Poland
|
49
|
Adhikari B, Bhattacharya D, Cao R, Cheng J. CONFOLD: Residue-residue contact-guided ab initio protein folding. Proteins 2015; 83:1436-49. [PMID: 25974172 PMCID: PMC4509844 DOI: 10.1002/prot.24829] [Citation(s) in RCA: 98] [Impact Index Per Article: 10.9] [Received: 01/29/2015] [Revised: 04/11/2015] [Accepted: 05/02/2015] [Indexed: 12/20/2022]
Abstract
Predicted protein residue-residue contacts can be used to build three-dimensional models and consequently to predict protein folds from scratch. A considerable amount of effort is currently being spent to improve contact prediction accuracy, whereas few methods are available to construct protein tertiary structures from predicted contacts. Here, we present an ab initio protein folding method to build three-dimensional models using predicted contacts and secondary structures. Our method first translates contacts and secondary structures into distance, dihedral angle, and hydrogen bond restraints according to a set of new conversion rules, and then provides these restraints as input for a distance geometry algorithm to build tertiary structure models. The initially reconstructed models are used to regenerate a set of physically realistic contact restraints and detect secondary structure patterns, which are then used to reconstruct final structural models. This unique two-stage modeling approach of integrating contacts and secondary structures improves the quality and accuracy of structural models and in particular generates better β-sheets than other algorithms. We validate our method on two standard benchmark datasets using true contacts and secondary structures. Our method improves the TM-score of reconstructed protein models by 45% and 42% over the existing method on the two datasets, respectively. On the dataset for benchmarking reconstruction methods with predicted contacts and secondary structures, the average TM-score of the best models reconstructed by our method is 0.59, 5.5% higher than the existing method. The CONFOLD web server is available at http://protein.rnet.missouri.edu/confold/.
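The TM-score used above to compare reconstructed models can be written out for reference. For a target of length L, TM = (1/L) Σ 1/(1 + (d_i/d0)^2) with d0 = 1.24 (L - 15)^(1/3) - 1.8 Å, where d_i is the distance between the i-th pair of aligned residues. The sketch below evaluates this sum for two structures that are assumed to be already superposed and aligned residue by residue; the full TM-score also searches for the superposition that maximizes the sum, which is omitted here. The function names and the 0.5 Å floor on d0 for short chains are assumptions of this example.

```python
import numpy as np

def tm_d0(length):
    """Length-dependent distance scale d0 in angstroms.
    The 0.5 floor for very short chains is a guard assumed for this sketch."""
    return max(1.24 * float(np.cbrt(length - 15)) - 1.8, 0.5)

def tm_score(model, native):
    """TM-score sum for pre-superposed, residue-aligned coordinates.
    No superposition search is done, so this is a lower bound on the
    score that an optimizing implementation would report."""
    model = np.asarray(model, dtype=float)
    native = np.asarray(native, dtype=float)
    if model.shape != native.shape:
        raise ValueError("model and native must have the same shape")
    d0 = tm_d0(len(native))
    d = np.linalg.norm(model - native, axis=1)   # per-residue deviation
    return float(np.mean(1.0 / (1.0 + (d / d0) ** 2)))
```

A uniform per-residue deviation equal to d0 gives a score of 0.5, which shows how d0 sets the length-dependent scale that makes scores comparable across proteins of different sizes.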
Affiliation(s)
- Badri Adhikari: Department of Computer Science, University of Missouri, Columbia, MO 65211 USA
- Renzhi Cao: Department of Computer Science, University of Missouri, Columbia, MO 65211 USA
- Jianlin Cheng: Department of Computer Science, University of Missouri, Columbia, MO 65211 USA
|
50
|
Pietal MJ, Bujnicki JM, Kozlowski LP. GDFuzz3D: a method for protein 3D structure reconstruction from contact maps, based on a non-Euclidean distance function. Bioinformatics 2015; 31:3499-505. [PMID: 26130575 DOI: 10.1093/bioinformatics/btv390] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 10/12/2014] [Accepted: 06/23/2015] [Indexed: 11/13/2022]
Abstract
Motivation: To date, only a few distinct successful approaches have been introduced to reconstruct a protein 3D structure from a map of contacts between its amino acid residues (a 2D contact map). Current algorithms can infer structures from information-rich contact maps that contain a limited fraction of erroneous predictions. However, it is difficult to reconstruct 3D structures from predicted contact maps that usually contain a high fraction of false contacts. Results: We describe a new, multi-step protocol that predicts protein 3D structures from the predicted contact maps. The method is based on a novel distance function acting on a fuzzy residue proximity graph, which predicts a 2D distance map from a 2D predicted contact map. The application of a Multi-Dimensional Scaling algorithm transforms that predicted 2D distance map into a coarse 3D model, which is further refined by typical modeling programs into an all-atom representation. We tested our approach on contact maps predicted de novo by MULTICOM, the top contact map predictor according to CASP10. We show that our method outperforms FT-COMAR, the state-of-the-art method for 3D structure reconstruction from 2D maps. For all predicted 2D contact maps of relatively low sensitivity (60-84%), GDFuzz3D generates more accurate 3D models, with the average improvement of 4.87 Å in terms of RMSD. Availability and implementation: GDFuzz3D server and standalone version are freely available at http://iimcb.genesilico.pl/gdserver/GDFuzz3D/. Contact: iamb@genesilico.pl. Supplementary information: Supplementary data are available at Bioinformatics online.
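The pipeline described here (contact map, then distance map, then multidimensional scaling to 3D) can be sketched in a much simplified form. The example below is not GDFuzz3D: it replaces the paper's fuzzy proximity graph and non-Euclidean distance function with plain shortest-path hop counts on the contact graph, and it uses classical (Torgerson) MDS. The function names and the unit step length are assumptions, and the contact graph is assumed to be connected.

```python
import numpy as np

def shortest_path_distances(contacts, step=1.0):
    """Distance map from a symmetric boolean contact map: each contact is an
    edge of length `step`, and distances are graph shortest paths
    (Floyd-Warshall). A stand-in for the paper's distance function."""
    n = contacts.shape[0]
    dist = np.where(contacts, step, np.inf)
    np.fill_diagonal(dist, 0.0)
    for k in range(n):
        # Relax every pair through intermediate node k.
        dist = np.minimum(dist, dist[:, [k]] + dist[[k], :])
    return dist

def classical_mds(dist, dim=3):
    """Classical MDS: embed a finite distance matrix into `dim` dimensions."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering   # double centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:dim]             # largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None)) # drop negative modes
    return eigvecs[:, order] * scale
```

The resulting coordinates are only a coarse model, defined up to rotation and reflection; in the protocol described in the abstract, a model of this kind is then refined into an all-atom representation by standard modeling programs.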
Affiliation(s)
- Michal J Pietal: Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, Warsaw, Poland; Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Warsaw, Poland
- Janusz M Bujnicki: Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, Warsaw, Poland; Bioinformatics Laboratory, Institute of Molecular Biology and Biotechnology, Adam Mickiewicz University, Poznan, Poland
- Lukasz P Kozlowski: Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, Warsaw, Poland
|