1
Cocco S, Posani L, Monasson R. Functional effects of mutations in proteins can be predicted and interpreted by guided selection of sequence covariation information. Proc Natl Acad Sci U S A 2024; 121:e2312335121. [PMID: 38889151; PMCID: PMC11214004; DOI: 10.1073/pnas.2312335121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2023] [Accepted: 04/21/2024] [Indexed: 06/20/2024] Open
Abstract
Predicting the effects of one or more mutations on the in vivo or in vitro properties of a wild-type protein is a major computational challenge, due to the presence of epistasis, that is, of interactions between amino acids in the sequence. We introduce a computationally efficient procedure to build minimal epistatic models to predict mutational effects by combining evolutionary (homologous sequence) data with a small amount of mutational-scan data. Mutagenesis measurements guide the selection of links in a sparse graphical model, while the parameters on the nodes and the edges are inferred from sequence data. We show, on 10 mutational scans, that our pipeline performs comparably to state-of-the-art deep networks trained on much more data, while requiring far fewer parameters and hence being more interpretable. In particular, the identified interactions adapt to the wild-type protein and to the fitness or biochemical property experimentally measured, mostly focus on key functional sites, and are not necessarily related to structural contacts. Therefore, our method is able to extract information relevant for one mutational experiment from homologous sequence data reflecting the multitude of structural and functional constraints acting on proteins throughout evolution.
Affiliation(s)
- Simona Cocco
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
- Lorenzo Posani
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
- Rémi Monasson
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
2
Martin NS, Ahnert SE. The Boltzmann distributions of molecular structures predict likely changes through random mutations. Biophys J 2023; 122:4467-4475. [PMID: 37897043; PMCID: PMC10698324; DOI: 10.1016/j.bpj.2023.10.024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2023] [Revised: 08/19/2023] [Accepted: 10/20/2023] [Indexed: 10/29/2023] Open
Abstract
New folded molecular structures can only evolve after arising through mutations. This aspect is modeled using genotype-phenotype maps, which connect sequence changes through mutations to changes in molecular structures. Previous work has shown that the likelihood of appearing through mutations can differ by orders of magnitude from structure to structure and that this can affect the outcomes of evolutionary processes. Thus, we focus on the phenotypic mutation probabilities φqp, i.e., the likelihood that a random mutation changes structure p into structure q. For both RNA secondary structures and the HP protein model, we show that a simple biophysical principle can explain and predict how this likelihood depends on the new structure q: φqp is high if sequences that fold into p as the minimum-free-energy structure are likely to have q as an alternative structure with high Boltzmann frequency. This generalizes the existing concept of plastogenetic congruence from individual sequences to the entire neutral spaces of structures. Our result helps us understand why some structural changes are more likely than others, may be useful for estimating these likelihoods via sampling and makes a connection to alternative structures with high Boltzmann frequency, which could be relevant in evolutionary processes.
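As a minimal illustration of the Boltzmann frequency this abstract refers to, the sketch below converts the free energies of a sequence's candidate structures into Boltzmann frequencies. The function name, the toy energies, and the choice of units (kT = 1) are illustrative assumptions and are not taken from the paper.

```python
import math


def boltzmann_frequencies(energies, kT=1.0):
    """Map {structure: free energy} to {structure: Boltzmann frequency}.

    Energies are shifted by their minimum before exponentiation, which
    leaves the frequencies unchanged but avoids overflow.
    """
    e_min = min(energies.values())
    weights = {s: math.exp(-(e - e_min) / kT) for s, e in energies.items()}
    partition = sum(weights.values())
    return {s: w / partition for s, w in weights.items()}


# Hypothetical sequence whose minimum-free-energy structure is "p",
# with "q" as a low-lying alternative and "r" as a high-energy one.
freqs = boltzmann_frequencies({"p": -3.0, "q": -2.0, "r": 0.0}, kT=1.0)
for structure, freq in sorted(freqs.items()):
    print(structure, round(freq, 4))
```

In this toy case "q" is the most frequent alternative to "p"; the abstract's claim is that, averaged over sequences folding into p, such high-frequency alternatives are the structures most likely to be reached by a random mutation.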
Affiliation(s)
- Nora S Martin
  - Rudolf Peierls Centre for Theoretical Physics, University of Oxford, Oxford, United Kingdom; Theory of Condensed Matter Group, Cavendish Laboratory, University of Cambridge, Cambridge, United Kingdom; Sainsbury Laboratory, University of Cambridge, Cambridge, United Kingdom
- Sebastian E Ahnert
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge, United Kingdom; The Alan Turing Institute, London, United Kingdom
3
Posani L, Rizzato F, Monasson R, Cocco S. Infer global, predict local: Quantity-relevance trade-off in protein fitness predictions from sequence data. PLoS Comput Biol 2023; 19:e1011521. [PMID: 37883593; PMCID: PMC10645369; DOI: 10.1371/journal.pcbi.1011521] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/04/2023] [Revised: 11/14/2023] [Accepted: 09/15/2023] [Indexed: 10/28/2023] Open
Abstract
Predicting the effects of mutations on protein function is an important issue in evolutionary biology and biomedical applications. Computational approaches, ranging from graphical models to deep-learning architectures, can capture the statistical properties of sequence data and predict the outcome of high-throughput mutagenesis experiments probing the fitness landscape around some wild-type protein. However, how the complexity of the models and the characteristics of the data combine to determine the predictive performance remains unclear. Here, based on a theoretical analysis of the prediction error, we propose descriptors of the sequence data, characterizing their quantity and relevance relative to the model. Our theoretical framework identifies a trade-off between these two quantities, and determines the optimal subset of data for the prediction task, showing that simple models can outperform complex ones when inferred from adequately-selected sequences. We also show how repeated subsampling of the sequence data is informative about how much epistasis in the fitness landscape is not captured by the computational model. Our approach is illustrated on several protein families, as well as on in silico solvable protein models.
Affiliation(s)
- Lorenzo Posani
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 & PSL Research, Sorbonne Université, Paris, France
- Francesca Rizzato
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 & PSL Research, Sorbonne Université, Paris, France
- Rémi Monasson
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 & PSL Research, Sorbonne Université, Paris, France
- Simona Cocco
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 & PSL Research, Sorbonne Université, Paris, France
4
Mauri E, Cocco S, Monasson R. Transition paths in Potts-like energy landscapes: General properties and application to protein sequence models. Phys Rev E 2023; 108:024141. [PMID: 37723761; DOI: 10.1103/physreve.108.024141] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2023] [Accepted: 08/07/2023] [Indexed: 09/20/2023]
Abstract
We study transition paths in energy landscapes over multicategorical Potts configurations using the mean-field approach introduced by Mauri et al. [Phys. Rev. Lett. 130, 158402 (2023); doi:10.1103/PhysRevLett.130.158402]. Paths interpolate between two fixed configurations or are anchored at one extremity only. We characterize the properties of "good" transition paths realizing a trade-off between exploring low-energy regions in the landscape and not being too long, such as their entropy or the probability of escape from a region of the landscape. We unveil the existence of a phase transition separating a regime in which paths are stretched in between their anchors from another regime where paths can explore the energy landscape more globally to minimize the energy. This phase transition is first illustrated and studied in detail on a mathematically tractable Hopfield-Potts toy model, then studied in energy landscapes inferred from protein sequence data.
Affiliation(s)
- Eugenio Mauri
  - Laboratory of Physics, École Normale Supérieure, CNRS UMR 8023, and PSL Research, Sorbonne Université, 24 Rue Lhomond, 75231 Paris Cedex 05, France
- Simona Cocco
  - Laboratory of Physics, École Normale Supérieure, CNRS UMR 8023, and PSL Research, Sorbonne Université, 24 Rue Lhomond, 75231 Paris Cedex 05, France
- Rémi Monasson
  - Laboratory of Physics, École Normale Supérieure, CNRS UMR 8023, and PSL Research, Sorbonne Université, 24 Rue Lhomond, 75231 Paris Cedex 05, France
5
Ziegler C, Martin J, Sinner C, Morcos F. Latent generative landscapes as maps of functional diversity in protein sequence space. Nat Commun 2023; 14:2222. [PMID: 37076519; PMCID: PMC10113739; DOI: 10.1038/s41467-023-37958-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 08/19/2022] [Accepted: 04/05/2023] [Indexed: 04/21/2023] Open
Abstract
Variational autoencoders are unsupervised learning models with generative capabilities. When applied to protein data, they classify sequences by phylogeny and generate de novo sequences which preserve statistical properties of protein composition. While previous studies focus on clustering and generative features, here we evaluate the underlying latent manifold in which sequence information is embedded. To investigate properties of the latent manifold, we utilize direct coupling analysis and a Potts Hamiltonian model to construct a latent generative landscape. We showcase how this landscape captures phylogenetic groupings and functional and fitness properties of several systems, including globins, β-lactamases, ion channels, and transcription factors. We provide support on how the landscape helps us understand the effects of sequence variability observed in experimental data and provides insights on directed and natural protein evolution. We propose that combining generative properties and functional predictive power of variational autoencoders and coevolutionary analysis could be beneficial in applications for protein engineering and design.
Affiliation(s)
- Cheyenne Ziegler
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
- Jonathan Martin
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
- Claude Sinner
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
- Faruck Morcos
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
  - Department of Bioengineering, University of Texas at Dallas, Richardson, TX, 75080, USA
  - Center for Systems Biology, University of Texas at Dallas, Richardson, TX, 75080, USA
6
Mauri E, Cocco S, Monasson R. Mutational Paths with Sequence-Based Models of Proteins: From Sampling to Mean-Field Characterization. Phys Rev Lett 2023; 130:158402. [PMID: 37115874; DOI: 10.1103/physrevlett.130.158402] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2022] [Accepted: 03/16/2023] [Indexed: 06/19/2023]
Abstract
Identifying and characterizing mutational paths is an important issue in evolutionary biology, with potential applications to bioengineering. We here propose an algorithm to sample mutational paths, which we benchmark on exactly solvable models of proteins in silico, and apply to data-driven models of natural proteins learned from sequence data with restricted Boltzmann machines. We then use mean-field theory to characterize paths for different mutational dynamics of interest, and to extend Kimura's estimate of evolutionary distances to sequence-based epistatic models of selection.
Affiliation(s)
- Eugenio Mauri
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR 8023 and PSL Research, Sorbonne Université, 24 rue Lhomond, 75231 Paris cedex 05, France
- Simona Cocco
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR 8023 and PSL Research, Sorbonne Université, 24 rue Lhomond, 75231 Paris cedex 05, France
- Rémi Monasson
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR 8023 and PSL Research, Sorbonne Université, 24 rue Lhomond, 75231 Paris cedex 05, France
7
Kleeorin Y, Russ WP, Rivoire O, Ranganathan R. Undersampling and the inference of coevolution in proteins. Cell Syst 2023; 14:210-219.e7. [PMID: 36693377; PMCID: PMC10911952; DOI: 10.1016/j.cels.2022.12.013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2021] [Revised: 01/02/2022] [Accepted: 12/23/2022] [Indexed: 01/24/2023]
Abstract
Protein structure, function, and evolution depend on local and collective epistatic interactions between amino acids. A powerful approach to defining these interactions is to construct models of couplings between amino acids that reproduce the empirical statistics (frequencies and correlations) observed in sequences comprising a protein family. The top couplings are then interpreted. Here, we show that as currently implemented, this inference unequally represents epistatic interactions, a problem that fundamentally arises from limited sampling of sequences in the context of distinct scales at which epistasis occurs in proteins. We show that these issues explain the ability of current approaches to predict tertiary contacts between amino acids and the inability to obviously expose larger networks of functionally relevant, collectively evolving residues called sectors. This work provides a necessary foundation for more deeply understanding and improving evolution-based models of proteins.
Affiliation(s)
- Yaakov Kleeorin
  - Center for Physics of Evolving Systems, Department of Biochemistry & Molecular Biology, University of Chicago, Chicago, IL 60637, USA
- William P Russ
  - Green Center for Systems Biology, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
- Olivier Rivoire
  - Center for Interdisciplinary Research in Biology (CIRB), Collège de France, CNRS, INSERM, PSL Research University, 75005 Paris, France
- Rama Ranganathan
  - Center for Physics of Evolving Systems, Department of Biochemistry & Molecular Biology, University of Chicago, Chicago, IL 60637, USA; The Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL 60637, USA
8
Colberg M, Schofield J. Configurational entropy, transition rates, and optimal interactions for rapid folding in coarse-grained model proteins. J Chem Phys 2022; 157:125101. [PMID: 36182418; DOI: 10.1063/5.0098612] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/14/2022] Open
Abstract
Under certain conditions, the dynamics of coarse-grained models of solvated proteins can be described using a Markov state model, which tracks the evolution of populations of configurations. The transition rates among states that appear in the Markov model can be determined by computing the relative entropy of states and their mean first passage times. In this paper, we present an adaptive method to evaluate the configurational entropy and the mean first passage times for linear chain models with discontinuous potentials. The approach is based on event-driven dynamical sampling in a massively parallel architecture. Using the fact that the transition rate matrix can be calculated for any choice of interaction energies at any temperature, it is demonstrated how each state's energy can be chosen such that the average time to transition between any two states is minimized. The methods are used to analyze the optimization of the folding process of two protein systems: the crambin protein and a model with frustration and misfolding. It is shown that the folding pathways for both systems are comprised of two regimes: first, the rapid establishment of local bonds, followed by the subsequent formation of more distant contacts. The state energies that lead to the most rapid folding encourage multiple pathways, and they either penalize folding pathways through kinetic traps by raising the energies of trapping states or establish an escape route from the trapping states by lowering free energy barriers to other states that rapidly reach the native state.
Affiliation(s)
- Margarita Colberg
  - Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada
- Jeremy Schofield
  - Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada
9
Miyazawa S. Boltzmann Machine Learning and Regularization Methods for Inferring Evolutionary Fields and Couplings From a Multiple Sequence Alignment. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:328-342. [PMID: 32396099; DOI: 10.1109/tcbb.2020.2993232] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
The inverse Potts problem of inferring a Boltzmann distribution for homologous protein sequences from their single-site and pairwise amino acid frequencies has recently attracted a great deal of attention in studies of protein structure and evolution. We study regularization and learning methods and how to tune regularization parameters to correctly infer interactions in Boltzmann machine learning. With L2 regularization for fields, group L1 regularization for couplings is shown to be very effective for sparse couplings in comparison with L2 and L1. Two regularization parameters are tuned to yield equal values for both the sample and ensemble averages of evolutionary energy. Both averages smoothly change and converge, but their learning profiles are very different between learning methods. The Adam method is modified to make the step size proportional to the gradient for sparse couplings and to use a soft-thresholding function for group L1. It is shown by first inferring interactions from protein sequences and then from Monte Carlo samples that the fields and couplings can be well recovered, but that recovering the pairwise correlations in the resolution of a total energy is harder for the natural proteins than for the protein-like sequences. The selective temperature for folding/structural constraints in protein evolution is also estimated.
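To make the soft-thresholding step for group L1 more concrete, here is a minimal sketch of the standard group-lasso proximal operator applied to one coupling block J_ij (a small matrix over amino acid pairs). This is the textbook operator, shown under the assumption that each site pair forms one group; the exact update used in the paper, and the names below, may differ.

```python
import math


def group_soft_threshold(block, threshold):
    """Shrink a coupling block toward zero by its Frobenius norm.

    If the block's norm is at most `threshold`, the whole block is set to
    zero (the pair is pruned); otherwise every entry is scaled by
    (1 - threshold / norm).
    """
    norm = math.sqrt(sum(v * v for row in block for v in row))
    if norm <= threshold:
        return [[0.0 for _ in row] for row in block]
    scale = 1.0 - threshold / norm
    return [[scale * v for v in row] for row in block]


# Hypothetical 1x2 block with norm 5: a weak penalty shrinks it,
# a strong penalty removes the coupling entirely.
shrunk = group_soft_threshold([[3.0, 4.0]], threshold=1.0)
pruned = group_soft_threshold([[3.0, 4.0]], threshold=6.0)
print(shrunk, pruned)
```

Because a block is zeroed as a whole, this penalty produces sparsity at the level of site pairs rather than individual amino acid pairs, which is the sense in which the abstract speaks of sparse couplings.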
10
Roussel C, Cocco S, Monasson R. Barriers and dynamical paths in alternating Gibbs sampling of restricted Boltzmann machines. Phys Rev E 2021; 104:034109. [PMID: 34654094; DOI: 10.1103/physreve.104.034109] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 07/12/2021] [Accepted: 08/18/2021] [Indexed: 11/07/2022]
Abstract
Restricted Boltzmann machines (RBM) are bilayer neural networks used for the unsupervised learning of model distributions from data. The bipartite architecture of RBM naturally defines an elegant sampling procedure, called alternating Gibbs sampling (AGS), where the configurations of the latent-variable layer are sampled conditional to the data-variable layer and vice versa. We study here the performance of AGS on several analytically tractable models borrowed from statistical mechanics. We show that standard AGS is not more efficient than classical Metropolis-Hastings (MH) sampling of the effective energy landscape defined on the data layer. However, RBM can identify meaningful representations of training data in their latent space. Furthermore, using these representations and combining Gibbs sampling with the MH algorithm in the latent space can enhance the sampling performance of the RBM when the hidden units encode weakly dependent features of the data. We illustrate our findings on three datasets: Bars and Stripes and MNIST, well known in machine learning, and the so-called lattice proteins dataset, introduced in theoretical biology to study the sequence-to-structure mapping in proteins.
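The alternating Gibbs sampling procedure described in this abstract can be sketched as follows for an RBM with binary units. The binary-unit choice, the function names, and the weight layout (`W[i][m]` coupling visible unit i to hidden unit m) are assumptions made for illustration; the paper also considers other models.

```python
import math
import random


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sample_layer(inputs, weights, biases, rng):
    """Sample one layer of binary units conditional on the other layer.

    weights[k][n] couples input unit k to output unit n.
    """
    out = []
    for n, bias in enumerate(biases):
        activation = bias + sum(weights[k][n] * x for k, x in enumerate(inputs))
        out.append(1 if rng.random() < sigmoid(activation) else 0)
    return out


def alternating_gibbs(v, W, b_visible, c_hidden, steps, rng):
    """Sample h given v once, then alternate v|h and h|v for `steps` rounds."""
    W_t = [list(col) for col in zip(*W)]  # hidden-to-visible layout
    h = sample_layer(v, W, c_hidden, rng)
    for _ in range(steps):
        v = sample_layer(h, W_t, b_visible, rng)
        h = sample_layer(v, W, c_hidden, rng)
    return v, h


# Toy RBM with 3 visible and 2 hidden units and arbitrary small weights.
rng = random.Random(0)
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
v, h = alternating_gibbs([0, 1, 0], W, [0.0, 0.0, 0.0], [0.0, 0.0], steps=10, rng=rng)
print(v, h)
```

Each layer is sampled in one block because, in the bipartite architecture, units within a layer are conditionally independent given the other layer.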
Affiliation(s)
- Clément Roussel
  - Laboratory of Physics of the École Normale Supérieure, CNRS UMR 8023 & PSL Research, Sorbonne Université, 75005 Paris, France
- Simona Cocco
  - Laboratory of Physics of the École Normale Supérieure, CNRS UMR 8023 & PSL Research, Sorbonne Université, 75005 Paris, France
- Rémi Monasson
  - Laboratory of Physics of the École Normale Supérieure, CNRS UMR 8023 & PSL Research, Sorbonne Université, 75005 Paris, France
11
Haldane A, Levy RM. Mi3-GPU: MCMC-based Inverse Ising Inference on GPUs for protein covariation analysis. Comput Phys Commun 2021; 260:107312. [PMID: 33716309; PMCID: PMC7944406; DOI: 10.1016/j.cpc.2020.107312] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Indexed: 06/12/2023]
Abstract
Inverse Ising inference is a method for inferring the coupling parameters of a Potts/Ising model based on observed site-covariation, which has found important applications in protein physics for detecting interactions between residues in protein families. We introduce Mi3-GPU ("mee-three", for MCMC Inverse Ising Inference) software for solving the inverse Ising problem for protein-sequence datasets with few analytic approximations, by parallel Markov-Chain Monte-Carlo sampling on GPUs. We also provide tools for analysis and preparation of protein-family Multiple Sequence Alignments (MSAs) to account for finite-sampling issues, which are a major source of error or bias in inverse Ising inference. Our method is "generative" in the sense that the inferred model can be used to generate synthetic MSAs whose mutational statistics (marginals) can be verified to match the dataset MSA statistics up to the limits imposed by the effects of finite sampling. Our GPU implementation enables the construction of models which reproduce the covariation patterns of the observed MSA with a precision that is not possible with more approximate methods. The main components of our method are a GPU-optimized algorithm to greatly accelerate MCMC sampling, combined with a multi-step Quasi-Newton parameter-update scheme using a "Zwanzig reweighting" technique. We demonstrate the ability of this software to produce generative models on typical protein family datasets for sequence lengths L ~ 300 with 21 residue types with tens of millions of inferred parameters in short running times.
Affiliation(s)
- Allan Haldane
  - Center for Biophysics and Computational Biology and Department of Physics, Temple University, Philadelphia, Pennsylvania 19122
- Ronald M. Levy
  - Center for Biophysics and Computational Biology and Department of Chemistry, Temple University, Philadelphia, Pennsylvania 19122
12
ELIHKSIR Web Server: Evolutionary Links Inferred for Histidine Kinase Sensors Interacting with Response Regulators. Entropy 2021; 23:e23020170. [PMID: 33573110; PMCID: PMC7911359; DOI: 10.3390/e23020170] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/17/2020] [Revised: 01/21/2021] [Accepted: 01/26/2021] [Indexed: 12/03/2022]
Abstract
Two-component systems (TCS) are signaling machinery that consist of a histidine kinase (HK) and a response regulator (RR). When an environmental change is detected, the HK phosphorylates its cognate RR. While cognate interactions were considered orthogonal, experimental evidence shows the prevalence of crosstalk interactions between non-cognate HK–RR pairs. Currently, crosstalk interactions have been demonstrated for TCS proteins in a limited number of organisms. By providing specificity predictions across entire TCS networks for a large variety of organisms, the ELIHKSIR web server assists users in identifying interactions for TCS proteins and their mutants. To generate specificity scores, a global probabilistic model was used to identify interfacial couplings and local fields from sequence information. These couplings and local fields were then used to construct Hamiltonian scores for positions with encoded specificity, resulting in the specificity score. These methods were applied to 6676 organisms available on the ELIHKSIR web server. Because users can mutate proteins and display the resulting network changes, there are nearly endless combinations of TCS networks to analyze using ELIHKSIR. The functionality of ELIHKSIR allows users to perform a variety of TCS network analyses and visualizations to support TCS research efforts.
13
Bravi B, Ravasio R, Brito C, Wyart M. Direct coupling analysis of epistasis in allosteric materials. PLoS Comput Biol 2020; 16:e1007630. [PMID: 32119660; PMCID: PMC7067494; DOI: 10.1371/journal.pcbi.1007630] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 01/15/2019] [Revised: 03/12/2020] [Accepted: 01/03/2020] [Indexed: 11/22/2022] Open
Abstract
In allosteric proteins, the binding of a ligand modifies function at a distant active site. Such allosteric pathways can be used as targets for drug design, generating considerable interest in inferring them from sequence alignment data. Currently, different methods lead to conflicting results, in particular on the existence of long-range evolutionary couplings between distant amino acids mediating allostery. Here we propose a resolution of this conundrum, by studying epistasis and its inference in models where an allosteric material is evolved in silico to perform a mechanical task. We find in our model the four types of epistasis (Synergistic, Sign, Antagonistic, Saturation), which can be either short- or long-range and have a simple mechanical interpretation. We perform a Direct Coupling Analysis (DCA) and find that DCA predicts well the cost of point mutations but is a rather poor generative model. Strikingly, it can predict short-range epistasis but fails to capture long-range epistasis, consistent with empirical findings. We propose that such failure is generic when function requires subparts to work in concert. We illustrate this idea with a simple model, which suggests that other methods may be better suited to capture long-range effects.
Allostery in proteins is the property of highly specific responses to ligand binding at a distant site. To inform protocols of de novo drug design, it is fundamental to understand the impact of mutations on allosteric regulation and whether it can be predicted from evolutionary correlations. In this work we consider allosteric architectures artificially evolved to optimize the cooperativity of binding at the allosteric and active sites. We first characterize the emergent pattern of epistasis as well as the underlying mechanical phenomena, finding the four types of epistasis (Synergistic, Sign, Antagonistic, Saturation), which can be either short- or long-range. The numerical evolution of these allosteric architectures allows us to benchmark Direct Coupling Analysis, a method which relies on co-evolution in sequence data to infer direct evolutionary couplings, in connection to allostery. We show that Direct Coupling Analysis quantitatively predicts point mutation costs but underestimates strong long-range epistasis. We provide an argument, based on a simplified model, illustrating the reasons for this discrepancy. Our analysis suggests neural networks as a more promising tool to measure epistasis.
Affiliation(s)
- Barbara Bravi
  - Institute of Physics, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Riccardo Ravasio
  - Institute of Physics, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Carolina Brito
  - Instituto de Física, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
- Matthieu Wyart
  - Institute of Physics, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
14
Rivoire O. Parsimonious evolutionary scenario for the origin of allostery and coevolution patterns in proteins. Phys Rev E 2020; 100:032411. [PMID: 31640027; DOI: 10.1103/physreve.100.032411] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 12/10/2018] [Indexed: 12/16/2022]
Abstract
Proteins display generic properties that are challenging to explain by direct selection, notably allostery, the capacity to be regulated through long-range effects, and evolvability, the capacity to adapt to new selective pressures. An evolutionary scenario is proposed where proteins acquire these two features indirectly as a by-product of their selection for a more fundamental property, exquisite discrimination, the capacity to bind discriminatively very similar ligands. Achieving this task is shown to typically require proteins to undergo a conformational change. We argue that physical and evolutionary constraints impel this change to be controlled by a group of sites extending from the binding site. Proteins can thus acquire a latent potential for allosteric regulation and evolutionary adaptation because of long-range effects that initially arise as evolutionary spandrels. This scenario accounts for the groups of conserved and coevolving residues observed in multiple sequence alignments. However, we propose that most pairs of coevolving and contacting residues inferred from such alignments have a different origin, related to thermal stability. A physical model is presented that illustrates this evolutionary scenario and its implications. The scenario can be implemented in experiments of protein evolution to directly test its predictions.
Affiliation(s)
- Olivier Rivoire
  - Center for Interdisciplinary Research in Biology, Collège de France, Centre National de la Recherche Scientifique, INSERM, PSL Research University, 75005 Paris, France
15
Baryakova TH, Ritter SC, Tresnak DT, Hackel BJ. Computationally Aided Discovery of LysEFm5 Variants with Improved Catalytic Activity and Stability. Appl Environ Microbiol 2020; 86:e02051-19. [PMID: 31811034; PMCID: PMC6997734; DOI: 10.1128/aem.02051-19] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/05/2019] [Accepted: 11/30/2019] [Indexed: 01/21/2023] Open
Abstract
Bacteriophage-derived lysin proteins are potentially effective antimicrobials that would benefit from engineered improvements to their bioavailability and specific activity. Here, the catalytic domain of LysEFm5, a lysin with activity against vancomycin-resistant Enterococcus faecium (VRE), was subjected to site-saturation mutagenesis at positions whose selection was guided by sequence and structural information from homologous proteins. A second-order Potts model with parameters inferred from large sets of homologous sequence information was used to predict the average change in the statistical fitness for mutant libraries with diversity at pairs of sites within the secondary catalytic shell. Guided by the statistical fitness, nine double mutant saturation libraries were created and plated on agar containing autoclaved VRE to quickly identify and segregate catalytically active (halo-forming) and inactive (non-halo-forming) variants. High-throughput DNA sequencing of 873 unique variants showed that the statistical fitness was predictive of the retention or loss of catalytic activity (area under the curve [AUC], 0.840 to 0.894), with the inclusion of more diverse sequences in the starting multiple-sequence alignment improving the classification accuracy when pairwise amino acid couplings (epistasis) were considered. Of eight random halo-forming variants selected for more sensitive testing, one showed a 1.8 (±0.4)-fold improvement in specific activity and an 11.5 ± 0.8°C increase in melting temperature compared to those of the wild type. Our results demonstrate that a computationally informed approach employing homologous protein information coupled with a mid-throughput screening assay allows for the expedited discovery of lysin variants with improved properties.
IMPORTANCE Broad-spectrum antibiotics can indiscriminately kill most bacteria, including commensal species that are a part of the normal human flora. This can potentially lead to the proliferation of drug-resistant bacteria upon elimination of competing species and to unwanted autoimmune effects in patients. Bacteriophage-derived lysin proteins are an alternative to conventional antibiotics that have coevolved alongside specific bacterial hosts. Lysins are capable of targeting conserved substrates in the bacterial cell wall essential for its viability. To engineer these proteins to exhibit improved therapeutically relevant properties, homology-guided statistical approaches can be used to identify compelling sites for mutation and to quantify the functional constraints acting on these sites to direct mutagenic library creation. The platform described herein couples this informed approach with a visual plate assay that can be used to simultaneously screen hundreds of mutants for catalytic activity, allowing for the streamlined identification of improved lysin variants.
Affiliation(s)
- Tsvetelina H Baryakova
  - Department of Chemical Engineering and Materials Science, University of Minnesota-Twin Cities, Minneapolis, Minnesota, USA
- Seth C Ritter
  - Department of Chemical Engineering and Materials Science, University of Minnesota-Twin Cities, Minneapolis, Minnesota, USA
- Daniel T Tresnak
  - Department of Chemical Engineering and Materials Science, University of Minnesota-Twin Cities, Minneapolis, Minnesota, USA
- Benjamin J Hackel
  - Department of Chemical Engineering and Materials Science, University of Minnesota-Twin Cities, Minneapolis, Minnesota, USA
|
16
|
Biswas A, Haldane A, Arnold E, Levy RM. Epistasis and entrenchment of drug resistance in HIV-1 subtype B. eLife 2019; 8:e50524. [PMID: 31591964 PMCID: PMC6783267 DOI: 10.7554/elife.50524] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2019] [Accepted: 09/09/2019] [Indexed: 12/17/2022] Open
Abstract
The development of drug resistance in HIV is the result of primary mutations whose effects on viral fitness depend on the entire genetic background, a phenomenon called 'epistasis'. Based on protein sequences derived from drug-experienced patients in the Stanford HIV database, we use a co-evolutionary (Potts) Hamiltonian model to provide direct confirmation of epistasis involving many simultaneous mutations. Building on earlier work, we show that primary mutations leading to drug resistance can become highly favored (or entrenched) by the complex mutation patterns arising in response to drug therapy despite being disfavored in the wild-type background, and provide the first confirmation of entrenchment for all three drug-target proteins: protease, reverse transcriptase, and integrase; a comparative analysis reveals that NNRTI-induced mutations behave differently from the others. We further show that the likelihood of resistance mutations can vary widely in patient populations, and from the population average compared to specific molecular clones.
Affiliation(s)
- Avik Biswas
  - Center for Biophysics and Computational Biology, Temple University, Philadelphia, United States
  - Department of Physics, Temple University, Philadelphia, United States
- Allan Haldane
  - Center for Biophysics and Computational Biology, Temple University, Philadelphia, United States
  - Department of Physics, Temple University, Philadelphia, United States
- Eddy Arnold
  - Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, United States
  - Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, United States
- Ronald M Levy
  - Center for Biophysics and Computational Biology, Temple University, Philadelphia, United States
  - Department of Physics, Temple University, Philadelphia, United States
  - Department of Chemistry, Temple University, Philadelphia, United States
|
17
|
Tubiana J, Cocco S, Monasson R. Learning Compositional Representations of Interacting Systems with Restricted Boltzmann Machines: Comparative Study of Lattice Proteins. Neural Comput 2019; 31:1671-1717. [DOI: 10.1162/neco_a_01210] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/04/2022]
Abstract
A restricted Boltzmann machine (RBM) is an unsupervised machine learning bipartite graphical model that jointly learns a probability distribution over data and extracts their relevant statistical features. RBMs were recently proposed for characterizing the patterns of coevolution between amino acids in protein sequences and for designing new sequences. Here, we study how the nature of the features learned by RBM changes with its defining parameters, such as the dimensionality of the representations (size of the hidden layer) and the sparsity of the features. We show that for adequate values of these parameters, RBMs operate in a so-called compositional phase in which visible configurations sampled from the RBM are obtained by recombining these features. We then compare the performance of RBM with other standard representation learning algorithms, including principal or independent component analysis (PCA, ICA), autoencoders (AE), variational autoencoders (VAE), and their sparse variants. We show that RBMs, due to the stochastic mapping between data configurations and representations, better capture the underlying interactions in the system and are significantly more robust with respect to sample size than deterministic methods such as PCA or ICA. In addition, this stochastic mapping is not prescribed a priori as in VAE, but learned from data, which allows RBMs to show good performance even with shallow architectures. All numerical results are illustrated on synthetic lattice protein data that share similar statistical features with real protein sequences and for which ground-truth interactions are known.
Affiliation(s)
- Jérôme Tubiana
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS and PSL Research, 75005 Paris, France
- Simona Cocco
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS and PSL Research, 75005 Paris, France
- Rémi Monasson
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS and PSL Research, 75005 Paris, France
|
18
|
Haldane A, Flynn WF, He P, Levy RM. Coevolutionary Landscape of Kinase Family Proteins: Sequence Probabilities and Functional Motifs. Biophys J 2018; 114:21-31. [PMID: 29320688 DOI: 10.1016/j.bpj.2017.10.028] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2017] [Revised: 09/11/2017] [Accepted: 10/17/2017] [Indexed: 01/25/2023] Open
Abstract
The protein kinase catalytic domain is one of the most abundant domains across all branches of life. Although kinases share a common core function of phosphoryl-transfer, they also have wide functional diversity and play varied roles in cell signaling networks, and for this reason are implicated in a number of human diseases. This functional diversity is primarily achieved through sequence variation, and uncovering the sequence-function relationships for the kinase family is a major challenge. In this study we use a statistical inference technique inspired by statistical physics, which builds a coevolutionary "Potts" Hamiltonian model of sequence variation in a protein family. We show how this model has sufficient power to predict the probability of specific subsequences in the highly diverged kinase family, which we verify by comparing the model's predictions with experimental observations in the Uniprot database. We show that the pairwise (residue-residue) interaction terms of the statistical model are necessary and sufficient to capture higher-than-pairwise mutation patterns of natural kinase sequences. We observe that previously identified functional sets of residues have much stronger correlated interaction scores than are typical.
Affiliation(s)
- Allan Haldane
  - Center for Biophysics and Computational Biology, Department of Chemistry, and Institute for Computational Molecular Science, Temple University, Philadelphia, Pennsylvania
- William F Flynn
  - Center for Biophysics and Computational Biology, Department of Chemistry, and Institute for Computational Molecular Science, Temple University, Philadelphia, Pennsylvania
  - Department of Physics and Astronomy, Rutgers, The State University of New Jersey, Piscataway, New Jersey
- Peng He
  - Center for Biophysics and Computational Biology, Department of Chemistry, and Institute for Computational Molecular Science, Temple University, Philadelphia, Pennsylvania
- Ronald M Levy
  - Center for Biophysics and Computational Biology, Department of Chemistry, and Institute for Computational Molecular Science, Temple University, Philadelphia, Pennsylvania
|
19
|
Facco E, Pagnani A, Russo ET, Laio A. The intrinsic dimension of protein sequence evolution. PLoS Comput Biol 2019; 15:e1006767. [PMID: 30958823 PMCID: PMC6472826 DOI: 10.1371/journal.pcbi.1006767] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2018] [Revised: 04/18/2019] [Accepted: 12/25/2018] [Indexed: 01/22/2023] Open
Abstract
It is well known that, in order to preserve its structure and function, a protein cannot change its sequence at random, but only by mutations occurring preferentially at specific locations. We here investigate quantitatively the amount of variability that is allowed in protein sequence evolution, by computing the intrinsic dimension (ID) of the sequences belonging to a selection of protein families. The ID is a measure of the number of independent directions that evolution can take starting from a given sequence. We find that the ID is practically constant for sequences belonging to the same family, and moreover it is very similar in different families, with values ranging between 6 and 12. These values are significantly smaller than the raw number of amino acids, confirming the importance of correlations between mutations in different sites. However, we demonstrate that correlations are not sufficient to explain the small value of the ID we observe in protein families. Indeed, we show that the ID of a set of protein sequences generated by maximum entropy models, an approach in which correlations are accounted for, is typically significantly larger than the value observed in natural protein families. We further prove that a critical factor to reproduce the natural ID is to take into consideration the phylogeny of sequences.
Affiliation(s)
- Andrea Pagnani
  - DISAT, Politecnico di Torino, Torino, Italy
  - IIGM, Italian Institute for Genomic Medicine, Torino, Italy
  - INFN, Sezione di Torino, Torino, Italy
- Alessandro Laio
  - SISSA, Trieste, Italy
  - ICTP, International Centre for Theoretical Physics, Trieste, Italy
|
20
|
Tubiana J, Cocco S, Monasson R. Learning protein constitutive motifs from sequence data. eLife 2019; 8:e39397. [PMID: 30857591 PMCID: PMC6436896 DOI: 10.7554/elife.39397] [Citation(s) in RCA: 55] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2018] [Accepted: 02/24/2019] [Indexed: 12/11/2022] Open
Abstract
Statistical analysis of evolutionary-related protein sequences provides information about their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information. We here apply RBM to 20 protein families, and present detailed results for two short protein domains (Kunitz and WW), one long chaperone protein (Hsp70), and synthetic lattice proteins for benchmarking. The features inferred by the RBM are biologically interpretable: they are related to structure (residue-residue tertiary contacts, extended secondary motifs (α-helixes and β-sheets) and intrinsically disordered regions), to function (activity and ligand specificity), or to phylogenetic identity. In addition, we use RBM to design new protein sequences with putative properties by composing and 'turning up' or 'turning down' the different modes at will. Our work therefore shows that RBM are versatile and practical tools that can be used to unveil and exploit the genotype-phenotype relationship for protein families.
Affiliation(s)
- Jérôme Tubiana
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR 8023 & PSL Research, Paris, France
- Simona Cocco
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR 8023 & PSL Research, Paris, France
- Rémi Monasson
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR 8023 & PSL Research, Paris, France
|
21
|
Haldane A, Levy RM. Influence of multiple-sequence-alignment depth on Potts statistical models of protein covariation. Phys Rev E 2019; 99:032405. [PMID: 30999494 PMCID: PMC6508952 DOI: 10.1103/physreve.99.032405] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2018] [Indexed: 02/02/2023]
Abstract
Potts statistical models have become a popular and promising way to analyze mutational covariation in protein multiple sequence alignments (MSAs) in order to understand protein structure, function, and fitness. But the statistical limitations of these models, which can have millions of parameters and are fit to MSAs of only thousands or hundreds of effective sequences using a procedure known as inverse Ising inference, are incompletely understood. In this work we predict how model quality degrades as a function of the number of sequences N, sequence length L, amino-acid alphabet size q, and the degree of conservation of the MSA, in different applications of the Potts models: in "fitness" predictions of individual protein sequences, in predictions of the effects of single-point mutations, in "double mutant cycle" predictions of epistasis, and in 3D contact prediction in protein structure. We show how as MSA depth N decreases an "overfitting" effect occurs such that sequences in the training MSA have overestimated fitness, and we predict the magnitude of this effect and discuss how regularization can help correct for it, using a regularization procedure motivated by statistical analysis of the effects of finite sampling. We find that as N decreases the quality of point-mutation effect predictions degrade least, fitness and epistasis predictions degrade more rapidly, and contact predictions are most affected. However, overfitting becomes negligible for MSA depths of more than a few thousand effective sequences, as often used in practice, and regularization becomes less necessary. We discuss the implications of these results for users of Potts covariation analysis.
Affiliation(s)
- Allan Haldane
  - Center for Biophysics and Computational Biology, Department of Physics, and Institute for Computational Molecular Science, Temple University, Philadelphia, Pennsylvania 19122
- Ronald M. Levy
  - Center for Biophysics and Computational Biology, Department of Chemistry, and Institute for Computational Molecular Science, Temple University, Philadelphia, Pennsylvania 19122
|
22
|
Nelson ED, Grishin NV. Inference of epistatic effects in a key mitochondrial protein. Phys Rev E 2018; 97:062404. [PMID: 30011480 DOI: 10.1103/physreve.97.062404] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2017] [Indexed: 12/17/2022]
Abstract
We use Potts model inference to predict pair epistatic effects in a key mitochondrial protein, cytochrome c oxidase subunit 2, for ray-finned fishes. We examine the effect of phylogenetic correlations on our predictions using a simple exact fitness model, and we find that, although epistatic effects are underpredicted, they maintain a roughly linear relationship to their true (model) values. After accounting for this correction, epistatic effects in the protein are still relatively weak, leading to fitness valleys of depth 2Ns ≃ -5 in compensatory double mutants. Interestingly, positive epistasis is more pronounced than negative epistasis, and the strongest positive effects capture nearly all sites subject to positive selection in fishes, similar to virus proteins evolving under selection pressure in the context of drug therapy.
Affiliation(s)
- Erik D Nelson
  - Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, 6001 Forest Park Blvd., Room ND10.124, Dallas, Texas 75235-9050, USA
- Nick V Grishin
  - Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, 6001 Forest Park Blvd., Room ND10.124, Dallas, Texas 75235-9050, USA
|
23
|
Cocco S, Feinauer C, Figliuzzi M, Monasson R, Weigt M. Inverse statistical physics of protein sequences: a key issues review. REPORTS ON PROGRESS IN PHYSICS. PHYSICAL SOCIETY (GREAT BRITAIN) 2018; 81:032601. [PMID: 29120346 DOI: 10.1088/1361-6633/aa9965] [Citation(s) in RCA: 111] [Impact Index Per Article: 18.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
In the course of evolution, proteins undergo important changes in their amino acid sequences, while their three-dimensional folded structure and their biological function remain remarkably conserved. Thanks to modern sequencing techniques, sequence data accumulate at unprecedented pace. This provides large sets of so-called homologous, i.e. evolutionarily related protein sequences, to which methods of inverse statistical physics can be applied. Using sequence data as the basis for the inference of Boltzmann distributions from samples of microscopic configurations or observables, it is possible to extract information about evolutionary constraints and thus protein function and structure. Here we give an overview over some biologically important questions, and how statistical-mechanics inspired modeling approaches can help to answer them. Finally, we discuss some open questions, which we expect to be addressed over the next years.
Affiliation(s)
- Simona Cocco
  - Laboratoire de Physique Statistique de l'Ecole Normale Supérieure-UMR 8549, CNRS and PSL Research, Sorbonne Universités UPMC, Paris, France
|
24
|
Abstract
Covariance analysis of protein sequence alignments uses coevolving pairs of sequence positions to predict features of protein structure and function. However, current methods ignore the phylogenetic relationships between sequences, potentially corrupting the identification of covarying positions. Here, we use random matrix theory to demonstrate the existence of a power law tail that distinguishes the spectrum of covariance caused by phylogeny from that caused by structural interactions. The power law is essentially independent of the phylogenetic tree topology, depending on just two parameters: the sequence length and the average branch length. We demonstrate that these power law tails are ubiquitous in the large protein sequence alignments used to predict contacts in 3D structure, as predicted by our theory. This suggests that to decouple phylogenetic effects from the interactions between sequence distal sites that control biological function, it is necessary to remove or down-weight the eigenvectors of the covariance matrix with largest eigenvalues. We confirm that truncating these eigenvectors improves contact prediction.
|
25
|
Prediction of Structures and Interactions from Genome Information. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2018; 1105:123-152. [DOI: 10.1007/978-981-13-2200-6_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
|
26
|
Selection originating from protein stability/foldability: Relationships between protein folding free energy, sequence ensemble, and fitness. J Theor Biol 2017; 433:21-38. [DOI: 10.1016/j.jtbi.2017.08.018] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2017] [Revised: 07/27/2017] [Accepted: 08/21/2017] [Indexed: 11/19/2022]
|
27
|
Tian P, Best RB. How Many Protein Sequences Fold to a Given Structure? A Coevolutionary Analysis. Biophys J 2017; 113:1719-1730. [PMID: 29045866 PMCID: PMC5647607 DOI: 10.1016/j.bpj.2017.08.039] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2017] [Revised: 08/03/2017] [Accepted: 08/08/2017] [Indexed: 12/23/2022] Open
Abstract
Quantifying the relationship between protein sequence and structure is key to understanding the protein universe. A fundamental measure of this relationship is the total number of amino acid sequences that can fold to a target protein structure, known as the "sequence capacity," which has been suggested as a proxy for how designable a given protein fold is. Although sequence capacity has been extensively studied using lattice models and theory, numerical estimates for real protein structures are currently lacking. In this work, we have quantitatively estimated the sequence capacity of 10 proteins with a variety of different structures using a statistical model based on residue-residue co-evolution to capture the variation of sequences from the same protein family. Remarkably, we find that even for the smallest protein folds, such as the WW domain, the number of foldable sequences is extremely large, exceeding the Avogadro constant. In agreement with earlier theoretical work, the calculated sequence capacity is positively correlated with the size of the protein, or better, the density of contacts. This allows the absolute sequence capacity of a given protein to be approximately predicted from its structure. On the other hand, the relative sequence capacity, i.e., normalized by the total number of possible sequences, is an extremely tiny number and is strongly anti-correlated with the protein length. Thus, although there may be more foldable sequences for larger proteins, it will be much harder to find them. Lastly, we have correlated the evolutionary age of proteins in the CATH database with their sequence capacity as predicted by our model. The results suggest a trade-off between the opposing requirements of high designability and the likelihood of a novel fold emerging by chance.
Affiliation(s)
- Pengfei Tian
  - Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland
- Robert B Best
  - Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland
|
28
|
Flynn WF, Haldane A, Torbett BE, Levy RM. Inference of Epistatic Effects Leading to Entrenchment and Drug Resistance in HIV-1 Protease. Mol Biol Evol 2017; 34:1291-1306. [PMID: 28369521 PMCID: PMC5435099 DOI: 10.1093/molbev/msx095] [Citation(s) in RCA: 39] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022] Open
Abstract
Understanding the complex mutation patterns that give rise to drug resistant viral strains provides a foundation for developing more effective treatment strategies for HIV/AIDS. Multiple sequence alignments of drug-experienced HIV-1 protease sequences contain networks of many pair correlations which can be used to build a (Potts) Hamiltonian model of these mutation patterns. Using this Hamiltonian model, we translate HIV-1 protease sequence covariation data into quantitative predictions for the probability of observing specific mutation patterns which are in agreement with the observed sequence statistics. We find that the statistical energies of the Potts model are correlated with the fitness of individual proteins containing therapy-associated mutations as estimated by in vitro measurements of protein stability and viral infectivity. We show that the penalty for acquiring primary resistance mutations depends on the epistatic interactions with the sequence background. Primary mutations which lead to drug resistance can become highly advantageous (or entrenched) by the complex mutation patterns which arise in response to drug therapy despite being destabilizing in the wildtype background. Anticipating epistatic effects is important for the design of future protease inhibitor therapies.
Affiliation(s)
- William F. Flynn
  - Department of Physics and Astronomy, Rutgers University, New Brunswick, NJ
  - Center for Biophysics and Computational Biology, Temple University, Philadelphia, PA
- Allan Haldane
  - Center for Biophysics and Computational Biology, Temple University, Philadelphia, PA
  - Department of Chemistry, Temple University, Philadelphia, PA
- Bruce E. Torbett
  - Department of Molecular and Experimental Medicine, The Scripps Research Institute, La Jolla, CA
- Ronald M. Levy
  - Center for Biophysics and Computational Biology, Temple University, Philadelphia, PA
  - Department of Chemistry, Temple University, Philadelphia, PA
|
29
|
The Role of Evolutionary Selection in the Dynamics of Protein Structure Evolution. Biophys J 2017; 112:1350-1365. [PMID: 28402878 DOI: 10.1016/j.bpj.2017.02.029] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2016] [Revised: 02/16/2017] [Accepted: 02/22/2017] [Indexed: 02/05/2023] Open
Abstract
Homology modeling is a powerful tool for predicting a protein's structure. This approach is successful because proteins whose sequences are only 30% identical still adopt the same structure, while structure similarity rapidly deteriorates beyond the 30% threshold. By studying the divergence of protein structure as sequence evolves in real proteins and in evolutionary simulations, we show that this nonlinear sequence-structure relationship emerges as a result of selection for protein folding stability in divergent evolution. Fitness constraints prevent the emergence of unstable protein evolutionary intermediates, thereby enforcing evolutionary paths that preserve protein structure despite broad sequence divergence. However, on longer timescales, evolution is punctuated by rare events where the fitness barriers obstructing structure evolution are overcome and discovery of new structures occurs. We outline biophysical and evolutionary rationale for broad variation in protein family sizes, prevalence of compact structures among ancient proteins, and more rapid structure evolution of proteins with lower packing density.
|
30
|
Skwark MJ, Croucher NJ, Puranen S, Chewapreecha C, Pesonen M, Xu YY, Turner P, Harris SR, Beres SB, Musser JM, Parkhill J, Bentley SD, Aurell E, Corander J. Interacting networks of resistance, virulence and core machinery genes identified by genome-wide epistasis analysis. PLoS Genet 2017; 13:e1006508. [PMID: 28207813 PMCID: PMC5312804 DOI: 10.1371/journal.pgen.1006508] [Citation(s) in RCA: 74] [Impact Index Per Article: 10.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2016] [Accepted: 11/24/2016] [Indexed: 12/05/2022] Open
Abstract
Recent advances in the scale and diversity of population genomic datasets for bacteria now provide the potential for genome-wide patterns of co-evolution to be studied at the resolution of individual bases. Here we describe a new statistical method, genomeDCA, which uses recent advances in computational structural biology to identify the polymorphic loci under the strongest co-evolutionary pressures. We apply genomeDCA to two large population data sets representing the major human pathogens Streptococcus pneumoniae (pneumococcus) and Streptococcus pyogenes (group A Streptococcus). For pneumococcus we identified 5,199 putative epistatic interactions between 1,936 sites. Over three-quarters of the links were between sites within the pbp2x, pbp1a and pbp2b genes, the sequences of which are critical in determining non-susceptibility to beta-lactam antibiotics. A network-based analysis found these genes were also coupled to that encoding dihydrofolate reductase, changes to which underlie trimethoprim resistance. Distinct from these antibiotic resistance genes, a large network component of 384 protein coding sequences encompassed many genes critical in basic cellular functions, while another distinct component included genes associated with virulence. The group A Streptococcus (GAS) data set population represents a clonal population with relatively little genetic variation and a high level of linkage disequilibrium across the genome. Despite this, we were able to pinpoint two RNA pseudouridine synthases, which were each strongly linked to a separate set of loci across the chromosome, representing biologically plausible targets of co-selection. The population genomic analysis method applied here identifies statistically significantly co-evolving locus pairs, potentially arising from fitness selection interdependence reflecting underlying protein-protein interactions, or genes whose product activities contribute to the same phenotype. 
This discovery approach greatly enhances the future potential of epistasis analysis for systems biology, and can complement genome-wide association studies as a means of formulating hypotheses for targeted experimental work.
Epistatic interactions between polymorphisms in DNA are recognized as important drivers of evolution in numerous organisms. Study of epistasis in bacteria has been hampered by the lack of densely sampled population genomic data, suitable statistical models and inference algorithms sufficiently powered for extremely high-dimensional parameter spaces. We introduce the first model-based method for genome-wide epistasis analysis and use two of the largest available bacterial population genome data sets on Streptococcus pneumoniae (the pneumococcus) and Streptococcus pyogenes (group A Streptococcus) to demonstrate its potential for biological discovery. Our approach reveals interacting networks of resistance, virulence and core machinery genes in the pneumococcus, which highlights putative candidates for novel drug targets. We also discover a number of plausible targets of co-selection in S. pyogenes linked to RNA pseudouridine synthases. Our method significantly enhances the future potential of epistasis analysis for systems biology, and can complement genome-wide association studies as a means of formulating hypotheses for targeted experimental work.
Affiliation(s)
- Marcin J Skwark
  - Department of Chemistry, Vanderbilt University, Nashville, TN, United States of America
- Nicholas J Croucher
  - Department of Infectious Disease Epidemiology, Imperial College London, London, United Kingdom
- Santeri Puranen
  - Department of Computer Science, Aalto University, Espoo, Finland
- Maiju Pesonen
  - Department of Computer Science, Aalto University, Espoo, Finland
- Ying Ying Xu
  - Department of Computer Science, Aalto University, Espoo, Finland
- Paul Turner
  - Shoklo Malaria Research Unit, Mahidol-Oxford Tropical Medicine Research Unit, Faculty of Tropical Medicine, Mahidol University, Mae Sot, Thailand
  - Centre for Tropical Medicine, Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom
- Simon R Harris
  - Pathogen Genomics, Wellcome Trust Sanger Institute, Cambridge, United Kingdom
- Stephen B Beres
  - Center for Molecular and Translational Human Infectious Diseases Research, Department of Pathology and Genomic Medicine, Houston Methodist Research Institute, and Houston Methodist Hospital, Houston, Texas, United States of America
- James M Musser
  - Center for Molecular and Translational Human Infectious Diseases Research, Department of Pathology and Genomic Medicine, Houston Methodist Research Institute, and Houston Methodist Hospital, Houston, Texas, United States of America
  - Departments of Pathology and Laboratory Medicine and Microbiology and Immunology, Weill Cornell Medical College, New York, New York, United States of America
- Julian Parkhill
  - Pathogen Genomics, Wellcome Trust Sanger Institute, Cambridge, United Kingdom
- Stephen D Bentley
  - Pathogen Genomics, Wellcome Trust Sanger Institute, Cambridge, United Kingdom
- Erik Aurell
  - Department of Computational Biology, KTH-Royal Institute of Technology, Stockholm, Sweden
  - Departments of Applied Physics and Computer Science, Aalto University, Espoo, Finland
  - Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing, China
- Jukka Corander
  - Pathogen Genomics, Wellcome Trust Sanger Institute, Cambridge, United Kingdom
  - Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland
  - Department of Biostatistics, University of Oslo, Oslo, Norway
  - Department of Veterinary Medicine, University of Cambridge, Cambridge, United Kingdom
|
31
|
Levy RM, Haldane A, Flynn WF. Potts Hamiltonian models of protein co-variation, free energy landscapes, and evolutionary fitness. Curr Opin Struct Biol 2016; 43:55-62. [PMID: 27870991 DOI: 10.1016/j.sbi.2016.11.004] [Citation(s) in RCA: 56] [Impact Index Per Article: 7.0] [Received: 08/19/2016] [Accepted: 11/03/2016] [Indexed: 11/17/2022]
Abstract
Potts Hamiltonian models of protein sequence co-variation are statistical models constructed from the pair correlations observed in a multiple sequence alignment (MSA) of a protein family. These models are powerful because they capture higher order correlations induced by mutations evolving under constraints and help quantify the connections between protein sequence, structure, and function maintained through evolution. We review recent work with Potts models to predict protein structure and sequence-dependent conformational free energy landscapes, to survey protein fitness landscapes and to explore the effects of epistasis on fitness. We also comment on the numerical methods used to infer these models for each application.
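To make the abstract above concrete, here is a minimal sketch of how a pairwise Potts model scores a sequence and a point mutation. The array layout (`h[i, a]` for fields, `J[i, j, a, b]` for couplings), the sign convention E = -sum(h) - sum(J), and the function names are illustrative assumptions, not taken from the review.

```python
import numpy as np

def potts_energy(seq, h, J):
    """Potts energy E(a) = -sum_i h[i, a_i] - sum_{i<j} J[i, j, a_i, a_j].

    seq : sequence of integer residue states, length L
    h   : array of shape (L, q), single-site fields (assumed layout)
    J   : array of shape (L, L, q, q), pairwise couplings (assumed layout);
          only entries with i < j are read
    """
    L = len(seq)
    energy = -sum(h[i, seq[i]] for i in range(L))
    for i in range(L):
        for j in range(i + 1, L):
            energy -= J[i, j, seq[i], seq[j]]
    return energy

def mutation_delta_energy(seq, pos, new_state, h, J):
    """Energy change E(mutant) - E(wild type) for one substitution at `pos`."""
    mutant = list(seq)
    mutant[pos] = new_state
    return potts_energy(mutant, h, J) - potts_energy(seq, h, J)

# Toy example: 3 sites, 2 states, one favourable coupling between sites 0 and 1.
h = np.zeros((3, 2))
J = np.zeros((3, 3, 2, 2))
J[0, 1, 0, 0] = 1.0
wild_type = [0, 0, 0]
print(mutation_delta_energy(wild_type, 0, 1, h, J))  # breaks the coupled pair
print(mutation_delta_energy(wild_type, 2, 1, h, J))  # uncoupled site
```

Under this convention a positive energy change marks a mutation the model disfavours. Epistasis appears because the change at one site depends, through `J`, on the residues present at the other sites.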
Affiliation(s)
- Ronald M Levy
  - Center for Biophysics and Computational Biology, Department of Chemistry, and Institute for Computational Molecular Science, Temple University, Philadelphia, PA 19122, United States
- Allan Haldane
  - Center for Biophysics and Computational Biology, Department of Chemistry, and Institute for Computational Molecular Science, Temple University, Philadelphia, PA 19122, United States
- William F Flynn
  - Center for Biophysics and Computational Biology, Department of Chemistry, and Institute for Computational Molecular Science, Temple University, Philadelphia, PA 19122, United States
  - Department of Physics and Astronomy, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, United States
|
32
|
Jacquin H, Rançon A. Resummed mean-field inference for strongly coupled data. Phys Rev E 2016; 94:042118. [PMID: 27841631 DOI: 10.1103/physreve.94.042118] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 10/21/2015] [Indexed: 11/07/2022]
Abstract
We present a resummed mean-field approximation for inferring the parameters of an Ising or a Potts model from empirical, noisy, one- and two-point correlation functions. Based on a resummation of a class of diagrams of the small correlation expansion of the log-likelihood, the method outperforms standard mean-field inference methods, even when they are regularized. The inference is stable with respect to sampling noise, in contrast to previous works based either on the small correlation expansion, on the Bethe free energy, or on the mean-field and Gaussian models. Because it is mostly analytic, its complexity is still very low, requiring an iterative algorithm to solve for N auxiliary variables that relies only on matrix inversions and multiplications. We test our algorithm on the Sherrington-Kirkpatrick model subjected to a random external field and large random couplings, and demonstrate that even without regularization, the inference is stable across the whole phase diagram. In addition, the calculation leads to a consistent estimation of the entropy of the data and allows us to sample from the inferred distribution to obtain artificial data that are consistent with the empirical distribution.
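For context, the "standard mean-field inference" that this abstract uses as a point of comparison can be sketched in a few lines. The snippet below is the textbook naive mean-field inversion for ±1 Ising spins (couplings from the negative inverse of the connected correlation matrix); it is not the resummed method of the paper. The function name, the diagonal regularizer `pseudo`, and the clipping threshold are assumptions made for illustration.

```python
import numpy as np

def naive_mean_field_ising(samples, pseudo=1e-3):
    """Naive mean-field estimate of Ising fields and couplings.

    samples : array of shape (M, N) with entries +1 / -1
    pseudo  : small diagonal regularizer (an assumption of this sketch)
              that keeps the correlation matrix invertible
    Returns (h, J) with J symmetric and zero on the diagonal.
    """
    s = np.asarray(samples, dtype=float)
    m = s.mean(axis=0)                        # magnetizations <s_i>
    C = np.cov(s, rowvar=False, bias=True)    # connected correlations
    C = C + pseudo * np.eye(C.shape[0])
    J = -np.linalg.inv(C)                     # J_ij = -(C^-1)_ij for i != j
    np.fill_diagonal(J, 0.0)
    # Mean-field self-consistency: m_i = tanh(h_i + sum_j J_ij m_j)
    h = np.arctanh(np.clip(m, -0.999, 0.999)) - J @ m
    return h, J

# Toy data: spins 0 and 1 agree 90% of the time, spin 2 is independent.
rng = np.random.default_rng(0)
s0 = rng.choice([-1, 1], size=5000)
s1 = np.where(rng.random(5000) < 0.9, s0, -s0)
s2 = rng.choice([-1, 1], size=5000)
h, J = naive_mean_field_ising(np.column_stack([s0, s1, s2]))
print(np.round(J, 2))
```

On the toy data the estimated coupling between spins 0 and 1 comes out clearly positive, while the couplings to spin 2 stay near zero. With strongly correlated or sparsely sampled data the matrix inversion becomes unstable, which is the regime the abstract says the resummed approach is designed to handle.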
Affiliation(s)
- Hugo Jacquin
  - Laboratoire de Physique Statistique, École Normale Supérieure, UMR CNRS 8550, 24 rue Lhomond, 75005 Paris, France
- A Rançon
  - Université de Lyon, ENS de Lyon, Université Claude Bernard, CNRS, Laboratoire de Physique, F-69342 Lyon, France
|
33
|
Coucke A, Uguzzoni G, Oteri F, Cocco S, Monasson R, Weigt M. Direct coevolutionary couplings reflect biophysical residue interactions in proteins. J Chem Phys 2016; 145:174102. [DOI: 10.1063/1.4966156] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Indexed: 01/30/2023] Open
Affiliation(s)
- Alice Coucke
  - Laboratoire de Physique Théorique, Ecole Normale Supérieure and CNRS-UMR8549, PSL Research University, Sorbonne Universités UPMC, 24 Rue Lhomond, 75005 Paris, France
  - Sorbonne Universités, UPMC, Institut de Biologie Paris-Seine, CNRS, Laboratoire de Biologie Computationnelle et Quantitative UMR 7238, 75005 Paris, France
- Guido Uguzzoni
  - Sorbonne Universités, UPMC, Institut de Biologie Paris-Seine, CNRS, Laboratoire de Biologie Computationnelle et Quantitative UMR 7238, 75005 Paris, France
- Francesco Oteri
  - Sorbonne Universités, UPMC, Institut de Biologie Paris-Seine, CNRS, Laboratoire de Biologie Computationnelle et Quantitative UMR 7238, 75005 Paris, France
- Simona Cocco
  - Laboratoire de Physique Statistique, Ecole Normale Supérieure and CNRS-UMR8550, PSL Research University, Sorbonne Universités UPMC, 24 Rue Lhomond, 75005 Paris, France
- Remi Monasson
  - Laboratoire de Physique Théorique, Ecole Normale Supérieure and CNRS-UMR8549, PSL Research University, Sorbonne Universités UPMC, 24 Rue Lhomond, 75005 Paris, France
- Martin Weigt
  - Sorbonne Universités, UPMC, Institut de Biologie Paris-Seine, CNRS, Laboratoire de Biologie Computationnelle et Quantitative UMR 7238, 75005 Paris, France
|
34
|
Abstract
Specific protein-protein interactions are crucial in the cell, both to ensure the formation and stability of multiprotein complexes and to enable signal transduction in various pathways. Functional interactions between proteins result in coevolution between the interaction partners, causing their sequences to be correlated. Here we exploit these correlations to accurately identify, from sequence data alone, which proteins are specific interaction partners. Our general approach, which employs a pairwise maximum entropy model to infer couplings between residues, has been successfully used to predict the 3D structures of proteins from sequences. Thus inspired, we introduce an iterative algorithm to predict specific interaction partners from two protein families whose members are known to interact. We first assess the algorithm's performance on histidine kinases and response regulators from bacterial two-component signaling systems. We obtain a striking 0.93 true positive fraction on our complete dataset without any a priori knowledge of interaction partners, and we uncover the origin of this success. We then apply the algorithm to proteins from ATP-binding cassette (ABC) transporter complexes, and obtain accurate predictions in these systems as well. Finally, we present two metrics that accurately distinguish interacting protein families from noninteracting ones, using only sequence data.
|
35
|
Barton JP, De Leonardis E, Coucke A, Cocco S. ACE: adaptive cluster expansion for maximum entropy graphical model inference. Bioinformatics 2016; 32:3089-3097. [PMID: 27329863 DOI: 10.1093/bioinformatics/btw328] [Citation(s) in RCA: 63] [Impact Index Per Article: 7.9] [Received: 03/25/2016] [Accepted: 05/18/2016] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Graphical models are often employed to interpret patterns of correlations observed in data through a network of interactions between the variables. Recently, Ising/Potts models, also known as Markov random fields, have been productively applied to diverse problems in biology, including the prediction of structural contacts from protein sequence data and the description of neural activity patterns. However, inference of such models is a challenging computational problem that cannot be solved exactly. Here, we describe the adaptive cluster expansion (ACE) method to quickly and accurately infer Ising or Potts models based on correlation data. ACE avoids overfitting by constructing a sparse network of interactions sufficient to reproduce the observed correlation data within the statistical error expected due to finite sampling. When convergence of the ACE algorithm is slow, we combine it with a Boltzmann Machine Learning algorithm (BML). We illustrate this method on a variety of biological and artificial datasets and compare it to state-of-the-art approximate methods such as Gaussian and pseudo-likelihood inference. RESULTS We show that ACE accurately reproduces the true parameters of the underlying model when they are known, and yields accurate statistical descriptions of both biological and artificial data. Models inferred by ACE more accurately describe the statistics of the data, including both the constrained low-order correlations and unconstrained higher-order correlations, compared to those obtained by faster Gaussian and pseudo-likelihood methods. These alternative approaches can recover the structure of the interaction network but typically not the correct strength of interactions, resulting in less accurate generative models. 
AVAILABILITY AND IMPLEMENTATION The ACE source code, user manual and tutorials with the example data and filtered correlations described herein are freely available on GitHub at https://github.com/johnbarton/ACE. CONTACTS jpbarton@mit.edu, cocco@lps.ens.fr SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
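The stopping idea mentioned in this abstract, reproducing the observed correlations "within the statistical error expected due to finite sampling", can be illustrated with a simple check. The sketch below compares model and empirical frequencies against the binomial standard error sqrt(p(1-p)/B) for B independent samples. It is a simplified stand-in, not the criterion implemented in the ACE code; the function name and the `n_sigma` tolerance are assumptions.

```python
import numpy as np

def within_sampling_error(p_data, p_model, n_samples, n_sigma=1.0):
    """Flag which model frequencies match the data within sampling noise.

    p_data    : empirical one- or two-point frequencies, values in [0, 1]
    p_model   : the same frequencies computed from the inferred model
    n_samples : number of independent samples B behind p_data
    n_sigma   : tolerance in units of the standard error (assumed default)
    Returns a boolean array, True where |p_model - p_data| <= n_sigma * sigma.
    """
    p_data = np.asarray(p_data, dtype=float)
    p_model = np.asarray(p_model, dtype=float)
    # Binomial standard error of a frequency estimated from B samples.
    # Note: it is zero when p_data is exactly 0 or 1, so only an exact
    # match passes in that case.
    sigma = np.sqrt(p_data * (1.0 - p_data) / n_samples)
    return np.abs(p_model - p_data) <= n_sigma * sigma

ok = within_sampling_error([0.5, 0.2], [0.51, 0.3], n_samples=1000)
print(ok)  # first frequency is within one standard error, second is not
```

A fit that satisfies this check for every constrained frequency explains the data as well as the sampling depth allows, so adding further interactions beyond that point would mostly fit noise.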
Affiliation(s)
- J P Barton
  - Departments of Chemical Engineering and Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
  - Ragon Institute of Massachusetts General Hospital, Massachusetts Institute of Technology and Harvard, Cambridge, MA 02139, USA
- E De Leonardis
  - Laboratoire de Physique Statistique de L'Ecole Normale Supérieure, CNRS, Ecole Normale Supérieure & Université P.&M. Curie, Paris, France
  - Computational and Quantitative Biology, UPMC, UMR 7238, Sorbonne Université, Paris, France
- A Coucke
  - Computational and Quantitative Biology, UPMC, UMR 7238, Sorbonne Université, Paris, France
  - Laboratoire de Physique Théorique de L'Ecole Normale Supérieure, CNRS, Ecole Normale Supérieure & Université P.&M. Curie, Paris, France
- S Cocco
  - Laboratoire de Physique Statistique de L'Ecole Normale Supérieure, CNRS, Ecole Normale Supérieure & Université P.&M. Curie, Paris, France
|