1
|
Schmelkin L, Carnevale V, Haldane A, Townsend JP, Chung S, Levy RM, Kumar S. Entrenchment and contingency in neutral protein evolution with epistasis. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.01.09.632266. [PMID: 39868204 PMCID: PMC11761135 DOI: 10.1101/2025.01.09.632266] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/28/2025]
Abstract
Protein sequence evolution in the presence of epistasis makes many previously acceptable amino acid residues at a site unfavorable over time. This phenomenon of entrenchment has also been observed with neutral substitutions using Potts Hamiltonian models. Here, we show that simulations using these models often evolve non-neutral proteins. We introduce a Neutral-with-Epistasis (N×E) model that incorporates purifying selection to conserve fitness, a requirement of neutral evolution. N×E protein evolution revealed a surprising lack of entrenchment, with site-specific amino-acid preferences remaining remarkably conserved, in biologically realistic time frames despite extensive residue coupling. Moreover, we found that the overdispersion of the molecular clock is caused by rate variation across sites introduced by epistasis in individual lineages, rather than by historical contingency. Therefore, substitutional entrenchment and rate contingency may indicate that adaptive and other non-neutral evolutionary processes were at play during protein evolution.
Collapse
Affiliation(s)
- Lisa Schmelkin
- Institute for Genomics and Evolutionary Medicine, Temple University; Philadelphia, PA 19122, USA
- Department of Biology, Temple University; Philadelphia, PA 19122, USA
| | - Vincenzo Carnevale
- Institute for Genomics and Evolutionary Medicine, Temple University; Philadelphia, PA 19122, USA
- Department of Biology, Temple University; Philadelphia, PA 19122, USA
- Institute of Computational Molecular Science, Temple University; Philadelphia, PA 19122, USA
| | - Allan Haldane
- Institute of Computational Molecular Science, Temple University; Philadelphia, PA 19122, USA
- Department of Chemistry, Temple University; Philadelphia, Pennsylvania 19122, USA
- Center for Biophysics and Computational Biology, Temple University; Philadelphia, Pennsylvania 19122, USA
| | | | - Sarah Chung
- Institute for Genomics and Evolutionary Medicine, Temple University; Philadelphia, PA 19122, USA
- Department of Biology, Temple University; Philadelphia, PA 19122, USA
| | - Ronald M. Levy
- Institute of Computational Molecular Science, Temple University; Philadelphia, PA 19122, USA
- Department of Chemistry, Temple University; Philadelphia, Pennsylvania 19122, USA
- Center for Biophysics and Computational Biology, Temple University; Philadelphia, Pennsylvania 19122, USA
| | - Sudhir Kumar
- Institute for Genomics and Evolutionary Medicine, Temple University; Philadelphia, PA 19122, USA
- Department of Biology, Temple University; Philadelphia, PA 19122, USA
| |
Collapse
|
2
|
Mazzocato Y, Frasson N, Sample M, Fregonese C, Pavan A, Caregnato A, Simeoni M, Scarso A, Cendron L, Šulc P, Angelini A. Combination of Coevolutionary Information and Supervised Learning Enables Generation of Cyclic Peptide Inhibitors with Enhanced Potency from a Small Data Set. ACS CENTRAL SCIENCE 2024; 10:2242-2252. [PMID: 39735311 PMCID: PMC11672547 DOI: 10.1021/acscentsci.4c01428] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 09/03/2024] [Revised: 10/26/2024] [Accepted: 11/07/2024] [Indexed: 12/31/2024]
Abstract
Computational generation of cyclic peptide inhibitors using machine learning models requires large size training data sets often difficult to generate experimentally. Here we demonstrated that sequential combination of Random Forest Regression with the pseudolikelihood maximization Direct Coupling Analysis method and Monte Carlo simulation can effectively enhance the design pipeline of cyclic peptide inhibitors of a tumor-associated protease even for small experimental data sets. Further in vitro studies showed that such in silico-evolved cyclic peptides are more potent than the best peptide inhibitors previously developed to this target. Crystal structure of the cyclic peptides in complex with the protease resembled those of protein complexes, with large interaction surfaces, constrained peptide backbones, and multiple inter- and intramolecular interactions, leading to good binding affinity and selectivity.
Collapse
Affiliation(s)
- Ylenia Mazzocato
- Department
of Molecular Sciences and Nanosystems, Ca’
Foscari University of Venice, Via Torino 155, 30172 Mestre, Italy
| | - Nicola Frasson
- Department
of Molecular Sciences and Nanosystems, Ca’
Foscari University of Venice, Via Torino 155, 30172 Mestre, Italy
| | - Matthew Sample
- School
of Molecular Sciences and Centre for Molecular Design and Biomimetics,
The Biodesign Institute, Arizona State University, 1001 South McAllister Avenue, Tempe, Arizona 85281, United States
- School
for Engineering of Matter, Transport, and Energy, Arizona State University, Tempe, Arizona 85287, United States
| | - Cristian Fregonese
- Department
of Molecular Sciences and Nanosystems, Ca’
Foscari University of Venice, Via Torino 155, 30172 Mestre, Italy
| | - Angela Pavan
- Department
of Biology, University of Padua, Viale G. Colombo 3, 35131 Padua, Italy
| | - Alberto Caregnato
- Department
of Molecular Sciences and Nanosystems, Ca’
Foscari University of Venice, Via Torino 155, 30172 Mestre, Italy
| | - Marta Simeoni
- Department
of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University of Venice, Via Torino 155, 30172 Mestre, Italy
- European
Centre for Living Technology (ECLT), Ca’ Bottacin, Dorsoduro 3911,
Calle Crosera, 30123 Venice, Italy
| | - Alessandro Scarso
- Department
of Molecular Sciences and Nanosystems, Ca’
Foscari University of Venice, Via Torino 155, 30172 Mestre, Italy
| | - Laura Cendron
- Department
of Biology, University of Padua, Viale G. Colombo 3, 35131 Padua, Italy
| | - Petr Šulc
- School
of Molecular Sciences and Centre for Molecular Design and Biomimetics,
The Biodesign Institute, Arizona State University, 1001 South McAllister Avenue, Tempe, Arizona 85281, United States
- Department
of Bioscience − School of Natural Sciences, Technical University of Munich (TUM), Boltzmannstraße 10, 85748 Garching, Germany
| | - Alessandro Angelini
- Department
of Molecular Sciences and Nanosystems, Ca’
Foscari University of Venice, Via Torino 155, 30172 Mestre, Italy
- European
Centre for Living Technology (ECLT), Ca’ Bottacin, Dorsoduro 3911,
Calle Crosera, 30123 Venice, Italy
| |
Collapse
|
3
|
Di Bari L, Bisardi M, Cotogno S, Weigt M, Zamponi F. Emergent time scales of epistasis in protein evolution. Proc Natl Acad Sci U S A 2024; 121:e2406807121. [PMID: 39325427 PMCID: PMC11459137 DOI: 10.1073/pnas.2406807121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2024] [Accepted: 08/17/2024] [Indexed: 09/27/2024] Open
Abstract
We introduce a data-driven epistatic model of protein evolution, capable of generating evolutionary trajectories spanning very different time scales reaching from individual mutations to diverged homologs. Our in silico evolution encompasses random nucleotide mutations, insertions and deletions, and models selection using a fitness landscape, which is inferred via a generative probabilistic model for protein families. We show that the proposed framework accurately reproduces the sequence statistics of both short-time (experimental) and long-time (natural) protein evolution, suggesting applicability also to relatively data-poor intermediate evolutionary time scales, which are currently inaccessible to evolution experiments. Our model uncovers a highly collective nature of epistasis, gradually changing the fitness effect of mutations in a diverging sequence context, rather than acting via strong interactions between individual mutations. This collective nature triggers the emergence of a long evolutionary time scale, separating fast mutational processes inside a given sequence context, from the slow evolution of the context itself. The model quantitatively reproduces epistatic phenomena such as contingency and entrenchment, as well as the loss of predictability in protein evolution observed in deep mutational scanning experiments of distant homologs. It thereby deepens our understanding of the interplay between mutation and selection in shaping protein diversity and functions, allows one to statistically forecast evolution, and challenges the prevailing independent-site models of protein evolution, which are unable to capture the fundamental importance of epistasis.
Collapse
Affiliation(s)
- Leonardo Di Bari
- Dipartimento Scienza Applicata e Tecnologia, Politecnico di Torino, I-10129Torino, Italy
| | - Matteo Bisardi
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative, ParisF-75005, France
| | - Sabrina Cotogno
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative, ParisF-75005, France
| | - Martin Weigt
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative, ParisF-75005, France
| | - Francesco Zamponi
- Dipartimento di Fisica, Sapienza Università di Roma, 00185Rome, Italy
| |
Collapse
|
4
|
Chen JZ, Bisardi M, Lee D, Cotogno S, Zamponi F, Weigt M, Tokuriki N. Understanding epistatic networks in the B1 β-lactamases through coevolutionary statistical modeling and deep mutational scanning. Nat Commun 2024; 15:8441. [PMID: 39349467 PMCID: PMC11442494 DOI: 10.1038/s41467-024-52614-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2024] [Accepted: 09/16/2024] [Indexed: 10/02/2024] Open
Abstract
Throughout evolution, protein families undergo substantial sequence divergence while preserving structure and function. Although most mutations are deleterious, evolution can explore sequence space via epistatic networks of intramolecular interactions that alleviate the harmful mutations. However, comprehensive analysis of such epistatic networks across protein families remains limited. Thus, we conduct a family wide analysis of the B1 metallo-β-lactamases, combining experiments (deep mutational scanning, DMS) on two distant homologs (NDM-1 and VIM-2) and computational analyses (in silico DMS based on Direct Coupling Analysis, DCA) of 100 homologs. The methods jointly reveal and quantify prevalent epistasis, as ~1/3rd of equivalent mutations are epistatic in DMS. From DCA, half of the positions have a >6.5 fold difference in effective number of tolerated mutations across the entire family. Notably, both methods locate residues with the strongest epistasis in regions of intermediate residue burial, suggesting a balance of residue packing and mutational freedom in forming epistatic networks. We identify entrenched WT residues between NDM-1 and VIM-2 in DMS, which display statistically distinct behaviors in DCA from non-entrenched residues. Entrenched residues are not easily compensated by changes in single nearby interactions, reinforcing existing findings where a complex epistatic network compounds smaller effects from many interacting residues.
Collapse
Affiliation(s)
- J Z Chen
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, Canada
| | - M Bisardi
- Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005, Paris, France
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, F-75005, Paris, France
| | - D Lee
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, Canada
| | - S Cotogno
- Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005, Paris, France
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, F-75005, Paris, France
| | - F Zamponi
- Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005, Paris, France
- Dipartimento di Fisica, Sapienza Università di Roma, I-00185, Rome, Italy
| | - M Weigt
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, F-75005, Paris, France
| | - N Tokuriki
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, Canada.
| |
Collapse
|
5
|
Cueno ME, Kamio N, Imai K. Avian influenza A H5N1 hemagglutinin protein models have distinct structural patterns re-occurring across the 1959-2023 strains. Biosystems 2024; 246:105347. [PMID: 39349133 DOI: 10.1016/j.biosystems.2024.105347] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2024] [Revised: 09/26/2024] [Accepted: 09/27/2024] [Indexed: 10/02/2024]
Abstract
Influenza A H5N1 hemagglutinin (HA) plays a crucial role in viral pathogenesis and changes in the HA receptor binding domain (RBD) have been attributed to alterations in viral pathogenesis. Mutations often occur within the HA which in-turn results in HA structural changes that consequently contribute to protein evolution. However, the possible occurrence of mutations that results to reversion of the HA protein (going back to an ancestral protein conformation) which in-turn creates distinct HA structural patterns across the 1959-2023 H5N1 viral evolution has never been investigated. Here, we generated and verified the quality of the HA models, identified similar HA structural patterns, and elucidated the possible variations in HA RBD structural dynamics. Our results show that there are 7 distinct structural patterns occurring among the 1959-2023 H5N1 HA models which suggests that reversion of the HA protein putatively occurs during viral evolution. Similarly, we found that the HA RBD structural dynamics vary among the 7 distinct structural patterns possibly affecting viral pathogenesis.
Collapse
Affiliation(s)
- Marni E Cueno
- Department of Microbiology and Immunology, Nihon University School of Dentistry, Tokyo, 101-8310, Japan.
| | - Noriaki Kamio
- Department of Microbiology and Immunology, Nihon University School of Dentistry, Tokyo, 101-8310, Japan
| | - Kenichi Imai
- Department of Microbiology and Immunology, Nihon University School of Dentistry, Tokyo, 101-8310, Japan
| |
Collapse
|
6
|
Martin J, Lequerica Mateos M, Onuchic JN, Coluzza I, Morcos F. Machine learning in biological physics: From biomolecular prediction to design. Proc Natl Acad Sci U S A 2024; 121:e2311807121. [PMID: 38913893 PMCID: PMC11228481 DOI: 10.1073/pnas.2311807121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/26/2024] Open
Abstract
Machine learning has been proposed as an alternative to theoretical modeling when dealing with complex problems in biological physics. However, in this perspective, we argue that a more successful approach is a proper combination of these two methodologies. We discuss how ideas coming from physical modeling neuronal processing led to early formulations of computational neural networks, e.g., Hopfield networks. We then show how modern learning approaches like Potts models, Boltzmann machines, and the transformer architecture are related to each other, specifically, through a shared energy representation. We summarize recent efforts to establish these connections and provide examples on how each of these formulations integrating physical modeling and machine learning have been successful in tackling recent problems in biomolecular structure, dynamics, function, evolution, and design. Instances include protein structure prediction; improvement in computational complexity and accuracy of molecular dynamics simulations; better inference of the effects of mutations in proteins leading to improved evolutionary modeling and finally how machine learning is revolutionizing protein engineering and design. Going beyond naturally existing protein sequences, a connection to protein design is discussed where synthetic sequences are able to fold to naturally occurring motifs driven by a model rooted in physical principles. We show that this model is "learnable" and propose its future use in the generation of unique sequences that can fold into a target structure.
Collapse
Affiliation(s)
- Jonathan Martin
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX75080
| | - Marcos Lequerica Mateos
- BCMaterials, Basque Center for Materials, Applications and Nanostructures, Universidad del País Vasco/Euskal Herriko Unibertsitatea Science Park, Leioa48940, Spain
| | - José N. Onuchic
- Center for Theoretical Biological Physics, Rice University, Houston, TX77005
- Department of Physics and Astronomy, Rice University, Houston, TX77005
- Department of Chemistry, Rice University, Houston, TX77005
- Department of BioSciences, Rice University, Houston, TX77005
| | - Ivan Coluzza
- BCMaterials, Basque Center for Materials, Applications and Nanostructures, Universidad del País Vasco/Euskal Herriko Unibertsitatea Science Park, Leioa48940, Spain
- Basque Foundation for Science, Ikerbasque, Bilbao48940, Spain
| | - Faruck Morcos
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX75080
- Department of Bioengineering, Center for Systems Biology, University of Texas at Dallas, Richardson, TX75080
| |
Collapse
|
7
|
Calvanese F, Lambert CN, Nghe P, Zamponi F, Weigt M. Towards parsimonious generative modeling of RNA families. Nucleic Acids Res 2024; 52:5465-5477. [PMID: 38661206 PMCID: PMC11162787 DOI: 10.1093/nar/gkae289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2023] [Revised: 03/05/2024] [Accepted: 04/05/2024] [Indexed: 04/26/2024] Open
Abstract
Generative probabilistic models emerge as a new paradigm in data-driven, evolution-informed design of biomolecular sequences. This paper introduces a novel approach, called Edge Activation Direct Coupling Analysis (eaDCA), tailored to the characteristics of RNA sequences, with a strong emphasis on simplicity, efficiency, and interpretability. eaDCA explicitly constructs sparse coevolutionary models for RNA families, achieving performance levels comparable to more complex methods while utilizing a significantly lower number of parameters. Our approach demonstrates efficiency in generating artificial RNA sequences that closely resemble their natural counterparts in both statistical analyses and SHAPE-MaP experiments, and in predicting the effect of mutations. Notably, eaDCA provides a unique feature: estimating the number of potential functional sequences within a given RNA family. For example, in the case of cyclic di-AMP riboswitches (RF00379), our analysis suggests the existence of approximately 1039 functional nucleotide sequences. While huge compared to the known <4000 natural sequences, this number represents only a tiny fraction of the vast pool of nearly 1082 possible nucleotide sequences of the same length (136 nucleotides). These results underscore the promise of sparse and interpretable generative models, such as eaDCA, in enhancing our understanding of the expansive RNA sequence space.
Collapse
Affiliation(s)
- Francesco Calvanese
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative – LCQB, Paris, France
- Laboratoire de Biophysique et Evolution, UMR CNRS-ESPCI 8231 Chimie Biologie Innovation, PSL University, Paris, France
| | - Camille N Lambert
- Laboratoire de Biophysique et Evolution, UMR CNRS-ESPCI 8231 Chimie Biologie Innovation, PSL University, Paris, France
| | - Philippe Nghe
- Laboratoire de Biophysique et Evolution, UMR CNRS-ESPCI 8231 Chimie Biologie Innovation, PSL University, Paris, France
| | - Francesco Zamponi
- Dipartimento di Fisica, Sapienza Università di Roma, Rome, Italy
- Laboratoire de Physique de l’Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, Paris, France
| | - Martin Weigt
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative – LCQB, Paris, France
| |
Collapse
|
8
|
Zheng W. Predicting hotspots for disease-causing single nucleotide variants using sequences-based coevolution, network analysis, and machine learning. PLoS One 2024; 19:e0302504. [PMID: 38743747 PMCID: PMC11093321 DOI: 10.1371/journal.pone.0302504] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2023] [Accepted: 04/05/2024] [Indexed: 05/16/2024] Open
Abstract
To enable personalized medicine, it is important yet highly challenging to accurately predict disease-causing mutations in target proteins at high throughput. Previous computational methods have been developed using evolutionary information in combination with various biochemical and structural features of protein residues to discriminate neutral vs. deleterious mutations. However, the power of these methods is often limited because they either assume known protein structures or treat residues independently without fully considering their interactions. To address the above limitations, we build upon recent progress in machine learning, network analysis, and protein language models, and develop a sequences-based variant site prediction workflow based on the protein residue contact networks: 1. We employ and integrate various methods of building protein residue networks using state-of-the-art coevolution analysis tools (RaptorX, DeepMetaPSICOV, and SPOT-Contact) powered by deep learning. 2. We use machine learning algorithms (Random Forest, Gradient Boosting, and Extreme Gradient Boosting) to optimally combine 20 network centrality scores to jointly predict key residues as hot spots for disease mutations. 3. Using a dataset of 107 proteins rich in disease mutations, we rigorously evaluate the network scores individually and collectively (via machine learning). This work supports a promising strategy of combining an ensemble of network scores based on different coevolution analysis methods (and optionally predictive scores from other methods) via machine learning to predict hotspot sites of disease mutations, which will inform downstream applications of disease diagnosis and targeted drug design.
Collapse
Affiliation(s)
- Wenjun Zheng
- Department of Physics, State University of New York at Buffalo, Buffalo, NY, United States of America
| |
Collapse
|
9
|
Ose NJ, Campitelli P, Modi T, Kazan IC, Kumar S, Ozkan SB. Some mechanistic underpinnings of molecular adaptations of SARS-COV-2 spike protein by integrating candidate adaptive polymorphisms with protein dynamics. eLife 2024; 12:RP92063. [PMID: 38713502 PMCID: PMC11076047 DOI: 10.7554/elife.92063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/08/2024] Open
Abstract
We integrate evolutionary predictions based on the neutral theory of molecular evolution with protein dynamics to generate mechanistic insight into the molecular adaptations of the SARS-COV-2 spike (S) protein. With this approach, we first identified candidate adaptive polymorphisms (CAPs) of the SARS-CoV-2 S protein and assessed the impact of these CAPs through dynamics analysis. Not only have we found that CAPs frequently overlap with well-known functional sites, but also, using several different dynamics-based metrics, we reveal the critical allosteric interplay between SARS-CoV-2 CAPs and the S protein binding sites with the human ACE2 (hACE2) protein. CAPs interact far differently with the hACE2 binding site residues in the open conformation of the S protein compared to the closed form. In particular, the CAP sites control the dynamics of binding residues in the open state, suggesting an allosteric control of hACE2 binding. We also explored the characteristic mutations of different SARS-CoV-2 strains to find dynamic hallmarks and potential effects of future mutations. Our analyses reveal that Delta strain-specific variants have non-additive (i.e., epistatic) interactions with CAP sites, whereas the less pathogenic Omicron strains have mostly additive mutations. Finally, our dynamics-based analysis suggests that the novel mutations observed in the Omicron strain epistatically interact with the CAP sites to help escape antibody binding.
Collapse
Affiliation(s)
- Nicholas James Ose
- Department of Physics and Center for Biological Physics, Arizona State UniversityTempeUnited States
| | - Paul Campitelli
- Department of Physics and Center for Biological Physics, Arizona State UniversityTempeUnited States
| | - Tushar Modi
- Department of Physics and Center for Biological Physics, Arizona State UniversityTempeUnited States
| | - I Can Kazan
- Department of Physics and Center for Biological Physics, Arizona State UniversityTempeUnited States
| | - Sudhir Kumar
- Institute for Genomics and Evolutionary Medicine, Temple UniversityPhiladelphiaUnited States
- Department of Biology, Temple UniversityPhiladelphiaUnited States
- Center for Genomic Medicine Research, King Abdulaziz UniversityJeddahSaudi Arabia
| | - Sefika Banu Ozkan
- Department of Physics and Center for Biological Physics, Arizona State UniversityTempeUnited States
| |
Collapse
|
10
|
Biswas A, Choudhuri I, Arnold E, Lyumkis D, Haldane A, Levy RM. Kinetic coevolutionary models predict the temporal emergence of HIV-1 resistance mutations under drug selection pressure. Proc Natl Acad Sci U S A 2024; 121:e2316662121. [PMID: 38557187 PMCID: PMC11009627 DOI: 10.1073/pnas.2316662121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2023] [Accepted: 02/23/2024] [Indexed: 04/04/2024] Open
Abstract
Drug resistance in HIV type 1 (HIV-1) is a pervasive problem that affects the lives of millions of people worldwide. Although records of drug-resistant mutations (DRMs) have been extensively tabulated within public repositories, our understanding of the evolutionary kinetics of DRMs and how they evolve together remains limited. Epistasis, the interaction between a DRM and other residues in HIV-1 protein sequences, is key to the temporal evolution of drug resistance. We use a Potts sequence-covariation statistical-energy model of HIV-1 protein fitness under drug selection pressure, which captures epistatic interactions between all positions, combined with kinetic Monte-Carlo simulations of sequence evolutionary trajectories, to explore the acquisition of DRMs as they arise in an ensemble of drug-naive patient protein sequences. We follow the time course of 52 DRMs in the enzymes protease, RT, and integrase, the primary targets of antiretroviral therapy. The rates at which DRMs emerge are highly correlated with their observed acquisition rates reported in the literature when drug pressure is applied. This result highlights the central role of epistasis in determining the kinetics governing DRM emergence. Whereas rapidly acquired DRMs begin to accumulate as soon as drug pressure is applied, slowly acquired DRMs are contingent on accessory mutations that appear only after prolonged drug pressure. We provide a foundation for using computational methods to determine the temporal evolution of drug resistance using Potts statistical potentials, which can be used to gain mechanistic insights into drug resistance pathways in HIV-1 and other infectious agents.
Collapse
Affiliation(s)
- Avik Biswas
- Center for Biophysics and Computational Biology, College of Science and Technology, Temple University, Philadelphia, PA19122
- Laboratory of Genetics, The Salk Institute for Biological Studies, La Jolla, CA92037
- Department of Physics, University of California San Diego, La Jolla, CA92093
| | - Indrani Choudhuri
- Center for Biophysics and Computational Biology, College of Science and Technology, Temple University, Philadelphia, PA19122
- Department of Chemistry, Temple University, Philadelphia, PA19122
| | - Eddy Arnold
- Department of Chemistry and Chemical Biology, Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, NJ08854
| | - Dmitry Lyumkis
- Laboratory of Genetics, The Salk Institute for Biological Studies, La Jolla, CA92037
- Graduate School of Biological Sciences, Department of Molecular Biology, University of California San Diego, La Jolla, CA92093
| | - Allan Haldane
- Center for Biophysics and Computational Biology, College of Science and Technology, Temple University, Philadelphia, PA19122
- Department of Physics, Temple University, Philadelphia, PA19122
| | - Ronald M. Levy
- Center for Biophysics and Computational Biology, College of Science and Technology, Temple University, Philadelphia, PA19122
- Department of Chemistry, Temple University, Philadelphia, PA19122
| |
Collapse
|
11
|
Nartey C, Koo HJ, Laurendon C, Shaik HZ, O’maille P, Noel JP, Morcos F. Coevolutionary Information Captures Catalytic Functions and Reveals Divergent Roles of Terpene Synthase Interdomain Connections. Biochemistry 2024; 63:355-366. [PMID: 38206111 PMCID: PMC10851433 DOI: 10.1021/acs.biochem.3c00578] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2023] [Revised: 12/22/2023] [Accepted: 12/27/2023] [Indexed: 01/12/2024]
Abstract
Inferring the historical and biophysical causes of diversity within protein families is a complex puzzle. A key to unraveling this problem is characterizing the rugged topography of sequence-function adaptive landscapes. Using biochemical data from a 29 = 512 combinatorial library of tobacco 5-epi-aristolochene synthase (TEAS) mutants engineered to make the native major product of Egyptian henbane premnaspirodiene synthase (HPS) and a complementary 512 mutant HPS library, we address the question of how product specificity is controlled. These data sets reveal that HPS is far more robust and resistant to mutations than TEAS, where most mutants are promiscuous. We also combine experimental data with a sequence Potts Hamiltonian model and direct coupling analysis to quantify mutant fitness. Our results demonstrate that the Hamiltonian captures variation in product outputs across both libraries, clusters native family members based on their substrate specificities, and exposes the divergent catalytic roles of couplings between the catalytic and noncatalytic domains of TEAS versus HPS. Specifically, we found that the role of the interdomain connectivities in specifying product output is more important in TEAS than connectivities within the catalytic domain. Despite being 75% identical, this property is not shared by HPS, where connectivities within the catalytic domain are more important for specificity. By solving the X-ray crystal structure of HPS, we assessed structural bases for their interdomain network differences. Last, we calculate the product profile Shannon entropies of the two libraries, which showcases that site-site connectivities also play divergent roles in catalytic accuracy.
Collapse
Affiliation(s)
- Charisse
M. Nartey
- Department
of Biological Sciences, The University of
Texas at Dallas, Richardson, Texas 75080, United States
| | - Hyun Jo Koo
- Howard
Hughes Medical Institute, The Salk Institute for Biological Studies, Jack H. Skirball Center for Chemical Biology and Proteomics, 10010 North Torrey Pines Road, La Jolla, California 92037, United States
| | - Caroline Laurendon
- John
Innes Centre, Department of Metabolic Biology, Norwich Research Park, Norwich NR4 7UH, U.K.
| | - Hana Z. Shaik
- Department
of Bioengineering, The University of Texas
at Dallas, Richardson, Texas 75080, United States
| | - Paul O’maille
- John
Innes Centre, Institute of Food Research, Food & Health Programme, Norwich Research Park, Norwich NR4 7UA, U.K.
| | - Joseph P. Noel
- Howard
Hughes Medical Institute, The Salk Institute for Biological Studies, Jack H. Skirball Center for Chemical Biology and Proteomics, 10010 North Torrey Pines Road, La Jolla, California 92037, United States
| | - Faruck Morcos
- Department
of Biological Sciences, The University of
Texas at Dallas, Richardson, Texas 75080, United States
- Department
of Bioengineering, The University of Texas
at Dallas, Richardson, Texas 75080, United States
- Center for
Systems Biology, The University of Texas
at Dallas, Richardson, Texas 75080, United States
| |
Collapse
|
12
|
Alvarez S, Nartey CM, Mercado N, de la Paz JA, Huseinbegovic T, Morcos F. In vivo functional phenotypes from a computational epistatic model of evolution. Proc Natl Acad Sci U S A 2024; 121:e2308895121. [PMID: 38285950 PMCID: PMC10861889 DOI: 10.1073/pnas.2308895121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2023] [Accepted: 12/19/2023] [Indexed: 01/31/2024] Open
Abstract
Computational models of evolution are valuable for understanding the dynamics of sequence variation, to infer phylogenetic relationships or potential evolutionary pathways and for biomedical and industrial applications. Despite these benefits, few have validated their propensities to generate outputs with in vivo functionality, which would enhance their value as accurate and interpretable evolutionary algorithms. We demonstrate the power of epistasis inferred from natural protein families to evolve sequence variants in an algorithm we developed called sequence evolution with epistatic contributions (SEEC). Utilizing the Hamiltonian of the joint probability of sequences in the family as fitness metric, we sampled and experimentally tested for in vivo [Formula: see text]-lactamase activity in Escherichia coli TEM-1 variants. These evolved proteins can have dozens of mutations dispersed across the structure while preserving sites essential for both catalysis and interactions. Remarkably, these variants retain family-like functionality while being more active than their wild-type predecessor. We found that depending on the inference method used to generate the epistatic constraints, different parameters simulate diverse selection strengths. Under weaker selection, local Hamiltonian fluctuations reliably predict relative changes to variant fitness, recapitulating neutral evolution. SEEC has the potential to explore the dynamics of neofunctionalization, characterize viral fitness landscapes, and facilitate vaccine development.
Collapse
Affiliation(s)
- Sophia Alvarez
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX75080
| | - Charisse M. Nartey
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX75080
| | - Nicholas Mercado
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX75080
| | | | - Tea Huseinbegovic
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX75080
| | - Faruck Morcos
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX75080
- Department of Bioengineering, University of Texas at Dallas, Richardson, TX75080
- Center for Systems Biology, University of Texas at Dallas, Richardson, TX75080
| |
Collapse
|
13
|
Sesta L, Pagnani A, Fernandez-de-Cossio-Diaz J, Uguzzoni G. Inference of annealed protein fitness landscapes with AnnealDCA. PLoS Comput Biol 2024; 20:e1011812. [PMID: 38377054 PMCID: PMC10878520 DOI: 10.1371/journal.pcbi.1011812] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2023] [Accepted: 01/08/2024] [Indexed: 02/22/2024] Open
Abstract
The design of proteins with specific tasks is a major challenge in molecular biology with important diagnostic and therapeutic applications. High-throughput screening methods have been developed to systematically evaluate protein activity, but only a small fraction of possible protein variants can be tested using these techniques. Computational models that explore the sequence space in-silico to identify the fittest molecules for a given function are needed to overcome this limitation. In this article, we propose AnnealDCA, a machine-learning framework to learn the protein fitness landscape from sequencing data derived from a broad range of experiments that use selection and sequencing to quantify protein activity. We demonstrate the effectiveness of our method by applying it to antibody Rep-Seq data of immunized mice and screening experiments, assessing the quality of the fitness landscape reconstructions. Our method can be applied to several experimental cases where a population of protein variants undergoes various rounds of selection and sequencing, without relying on the computation of variants enrichment ratios, and thus can be used even in cases of disjoint sequence samples.
Collapse
Affiliation(s)
- Luca Sesta
- Department of Applied Science and Technology, Politecnico di Torino, Torino, Italy
| | - Andrea Pagnani
- Department of Applied Science and Technology, Politecnico di Torino, Torino, Italy
- Italian Institute for Genomic Medicine, Torino, Italy
- INFN, Sezione di Torino, Torino, Italy
| | | | | |
Collapse
|
14
|
Ose NJ, Campitelli P, Modi T, Can Kazan I, Kumar S, Banu Ozkan S. Some mechanistic underpinnings of molecular adaptations of SARS-COV-2 spike protein by integrating candidate adaptive polymorphisms with protein dynamics. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.09.14.557827. [PMID: 37745560 PMCID: PMC10515954 DOI: 10.1101/2023.09.14.557827] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/26/2023]
Abstract
We integrate evolutionary predictions based on the neutral theory of molecular evolution with protein dynamics to generate mechanistic insight into the molecular adaptations of the SARS-COV-2 Spike (S) protein. With this approach, we first identified Candidate Adaptive Polymorphisms (CAPs) of the SARS-CoV-2 Spike protein and assessed the impact of these CAPs through dynamics analysis. Not only have we found that CAPs frequently overlap with well-known functional sites, but also, using several different dynamics-based metrics, we reveal the critical allosteric interplay between SARS-CoV-2 CAPs and the S protein binding sites with the human ACE2 (hACE2) protein. CAPs interact far differently with the hACE2 binding site residues in the open conformation of the S protein compared to the closed form. In particular, the CAP sites control the dynamics of binding residues in the open state, suggesting an allosteric control of hACE2 binding. We also explored the characteristic mutations of different SARS-CoV-2 strains to find dynamic hallmarks and potential effects of future mutations. Our analyses reveal that Delta strain-specific variants have non-additive (i.e., epistatic) interactions with CAP sites, whereas the less pathogenic Omicron strains have mostly additive mutations. Finally, our dynamics-based analysis suggests that the novel mutations observed in the Omicron strain epistatically interact with the CAP sites to help escape antibody binding.
Collapse
Affiliation(s)
- Nicholas J. Ose
- Department of Physics and Center for Biological Physics, Arizona State University, Tempe, Arizona, United States of America
| | - Paul Campitelli
- Department of Physics and Center for Biological Physics, Arizona State University, Tempe, Arizona, United States of America
| | - Tushar Modi
- Department of Physics and Center for Biological Physics, Arizona State University, Tempe, Arizona, United States of America
| | - I. Can Kazan
- Department of Physics and Center for Biological Physics, Arizona State University, Tempe, Arizona, United States of America
| | - Sudhir Kumar
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, Pennsylvania, United States of America
- Department of Biology, Temple University, Philadelphia, Pennsylvania, United States of America
- Center for Genomic Medicine Research, King Abdulaziz University, Jeddah, Saudi Arabia
| | - S. Banu Ozkan
- Department of Physics and Center for Biological Physics, Arizona State University, Tempe, Arizona, United States of America
| |
Collapse
|
15
|
Ose NJ, Campitelli P, Patel R, Kumar S, Ozkan SB. Protein dynamics provide mechanistic insights about epistasis among common missense polymorphisms. Biophys J 2023; 122:2938-2947. [PMID: 36726312 PMCID: PMC10398253 DOI: 10.1016/j.bpj.2023.01.037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2022] [Revised: 12/20/2022] [Accepted: 01/26/2023] [Indexed: 02/03/2023] Open
Abstract
Sequencing of the protein coding genome has revealed many different missense mutations of human proteins and different population frequencies of corresponding haplotypes, which consist of different sets of those mutations. Here, we present evidence for pairwise intramolecular epistasis (i.e., nonadditive interactions) between many such mutations through an analysis of protein dynamics. We suggest that functional compensation for conserving protein dynamics is a likely evolutionary mechanism that maintains high-frequency mutations that are individually nonneutral but epistatically compensating within proteins. This analysis is the first of its type to look at human proteins with specific high population frequency mutations and examine the relationship between mutations that make up that observed high-frequency protein haplotype. Importantly, protein dynamics revealed a separation between high and low frequency haplotypes within a target protein cytochrome P450 2A7, with the high-frequency haplotypes showing behavior closer to the wild-type protein. Common protein haplotypes containing two mutations display dynamic compensation in which one mutation can correct for the dynamic effects of the other. We also utilize a dynamics-based metric, EpiScore, that evaluates the epistatic interactions and allows us to see dynamic compensation within many other proteins.
Collapse
Affiliation(s)
- Nicholas J Ose
- Department of Physics and Center for Biological Physics, Arizona State University, Tempe, Arizona
| | - Paul Campitelli
- Department of Physics and Center for Biological Physics, Arizona State University, Tempe, Arizona
| | - Ravi Patel
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, Pennsylvania; Department of Biology, Temple University, Philadelphia, Pennsylvania
| | - Sudhir Kumar
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, Pennsylvania; Department of Biology, Temple University, Philadelphia, Pennsylvania; Center for Genomic Medicine Research, King Abdulaziz University, Jeddah, Saudi Arabia.
| | - S Banu Ozkan
- Department of Physics and Center for Biological Physics, Arizona State University, Tempe, Arizona.
| |
Collapse
|
16
|
Alvarez S, Nartey CM, Mercado N, de la Paz A, Huseinbegovic T, Morcos F. In vivo functional phenotypes from a computational epistatic model of evolution. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.24.542176. [PMID: 37292895 PMCID: PMC10245989 DOI: 10.1101/2023.05.24.542176] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Computational models of evolution are valuable for understanding the dynamics of sequence variation, to infer phylogenetic relationships or potential evolutionary pathways and for biomedical and industrial applications. Despite these benefits, few have validated their propensities to generate outputs with in vivo functionality, which would enhance their value as accurate and interpretable evolutionary algorithms. We demonstrate the power of epistasis inferred from natural protein families to evolve sequence variants in an algorithm we developed called Sequence Evolution with Epistatic Contributions. Utilizing the Hamiltonian of the joint probability of sequences in the family as fitness metric, we sampled and experimentally tested for in vivo β -lactamase activity in E. coli TEM-1 variants. These evolved proteins can have dozens of mutations dispersed across the structure while preserving sites essential for both catalysis and interactions. Remarkably, these variants retain family-like functionality while being more active than their WT predecessor. We found that depending on the inference method used to generate the epistatic constraints, different parameters simulate diverse selection strengths. Under weaker selection, local Hamiltonian fluctuations reliably predict relative changes to variant fitness, recapitulating neutral evolution. SEEC has the potential to explore the dynamics of neofunctionalization, characterize viral fitness landscapes and facilitate vaccine development.
Collapse
Affiliation(s)
- Sophia Alvarez
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080, USA
| | - Charisse M. Nartey
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080, USA
| | - Nicholas Mercado
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080, USA
| | - Alberto de la Paz
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080, USA
| | - Tea Huseinbegovic
- School of Natural Sciences and Mathematics, University of Texas at Dallas, Richardson, TX 75080, USA
| | - Faruck Morcos
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080, USA
- Department of Bioengineering, University of Texas at Dallas, Richardson, TX 75080, USA
- Center for Systems Biology, University of Texas at Dallas, Richardson, TX 75080, USA
| |
Collapse
|
17
|
Ziegler C, Martin J, Sinner C, Morcos F. Latent generative landscapes as maps of functional diversity in protein sequence space. Nat Commun 2023; 14:2222. [PMID: 37076519 PMCID: PMC10113739 DOI: 10.1038/s41467-023-37958-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/19/2022] [Accepted: 04/05/2023] [Indexed: 04/21/2023] Open
Abstract
Variational autoencoders are unsupervised learning models with generative capabilities, when applied to protein data, they classify sequences by phylogeny and generate de novo sequences which preserve statistical properties of protein composition. While previous studies focus on clustering and generative features, here, we evaluate the underlying latent manifold in which sequence information is embedded. To investigate properties of the latent manifold, we utilize direct coupling analysis and a Potts Hamiltonian model to construct a latent generative landscape. We showcase how this landscape captures phylogenetic groupings, functional and fitness properties of several systems including Globins, β-lactamases, ion channels, and transcription factors. We provide support on how the landscape helps us understand the effects of sequence variability observed in experimental data and provides insights on directed and natural protein evolution. We propose that combining generative properties and functional predictive power of variational autoencoders and coevolutionary analysis could be beneficial in applications for protein engineering and design.
Collapse
Affiliation(s)
- Cheyenne Ziegler
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
| | - Jonathan Martin
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
| | - Claude Sinner
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
| | - Faruck Morcos
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA.
- Department of Bioengineering, University of Texas at Dallas, Richardson, TX, 75080, USA.
- Center for Systems Biology, University of Texas at Dallas, Richardson, TX, 75080, USA.
| |
Collapse
|
18
|
Mauri E, Cocco S, Monasson R. Mutational Paths with Sequence-Based Models of Proteins: From Sampling to Mean-Field Characterization. PHYSICAL REVIEW LETTERS 2023; 130:158402. [PMID: 37115874 DOI: 10.1103/physrevlett.130.158402] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/20/2022] [Accepted: 03/16/2023] [Indexed: 06/19/2023]
Abstract
Identifying and characterizing mutational paths is an important issue in evolutionary biology, with potential applications to bioengineering. We here propose an algorithm to sample mutational paths, which we benchmark on exactly solvable models of proteins in silico, and apply to data-driven models of natural proteins learned from sequence data with restricted Boltzmann machines. We then use mean-field theory to characterize paths for different mutational dynamics of interest, and to extend Kimura's estimate of evolutionary distances to sequence-based epistatic models of selection.
Collapse
Affiliation(s)
- Eugenio Mauri
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR 8023 and PSL Research, Sorbonne Université, 24 rue Lhomond, 75231 Paris cedex 05, France
| | - Simona Cocco
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR 8023 and PSL Research, Sorbonne Université, 24 rue Lhomond, 75231 Paris cedex 05, France
| | - Rémi Monasson
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR 8023 and PSL Research, Sorbonne Université, 24 rue Lhomond, 75231 Paris cedex 05, France
| |
Collapse
|
19
|
Malbranke C, Bikard D, Cocco S, Monasson R, Tubiana J. Machine learning for evolutionary-based and physics-inspired protein design: Current and future synergies. Curr Opin Struct Biol 2023; 80:102571. [PMID: 36947951 DOI: 10.1016/j.sbi.2023.102571] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/01/2022] [Revised: 01/29/2023] [Accepted: 02/07/2023] [Indexed: 03/24/2023]
Abstract
Computational protein design facilitates the discovery of novel proteins with prescribed structure and functionality. Exciting designs were recently reported using novel data-driven methodologies that can be roughly divided into two categories: evolutionary-based and physics-inspired approaches. The former infer characteristic sequence features shared by sets of evolutionary-related proteins, such as conserved or coevolving positions, and recombine them to generate candidates with similar structure and function. The latter approaches estimate key biochemical properties, such as structure free energy, conformational entropy, or binding affinities using machine learning surrogates, and optimize them to yield improved designs. Here, we review recent progress along both tracks, discuss their strengths and weaknesses, and highlight opportunities for synergistic approaches.
Collapse
Affiliation(s)
- Cyril Malbranke
- Laboratory of Physics of the Ecole Normale Supérieure, PSL Research, CNRS UMR 8023, Sorbonne Université, Université de Paris, Paris, France; Institut Pasteur, Université Paris Cité, CNRS UMR 6047, Synthetic Biology, 75015 Paris, France.
| | - David Bikard
- Institut Pasteur, Université Paris Cité, CNRS UMR 6047, Synthetic Biology, 75015 Paris, France
| | - Simona Cocco
- Laboratory of Physics of the Ecole Normale Supérieure, PSL Research, CNRS UMR 8023, Sorbonne Université, Université de Paris, Paris, France
| | - Rémi Monasson
- Laboratory of Physics of the Ecole Normale Supérieure, PSL Research, CNRS UMR 8023, Sorbonne Université, Université de Paris, Paris, France
| | - Jérôme Tubiana
- Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel.
| |
Collapse
|
20
|
Inferring protein fitness landscapes from laboratory evolution experiments. PLoS Comput Biol 2023; 19:e1010956. [PMID: 36857380 PMCID: PMC10010530 DOI: 10.1371/journal.pcbi.1010956] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/05/2022] [Revised: 03/13/2023] [Accepted: 02/16/2023] [Indexed: 03/02/2023] Open
Abstract
Directed laboratory evolution applies iterative rounds of mutation and selection to explore the protein fitness landscape and provides rich information regarding the underlying relationships between protein sequence, structure, and function. Laboratory evolution data consist of protein sequences sampled from evolving populations over multiple generations and this data type does not fit into established supervised and unsupervised machine learning approaches. We develop a statistical learning framework that models the evolutionary process and can infer the protein fitness landscape from multiple snapshots along an evolutionary trajectory. We apply our modeling approach to dihydrofolate reductase (DHFR) laboratory evolution data and the resulting landscape parameters capture important aspects of DHFR structure and function. We use the resulting model to understand the structure of the fitness landscape and find numerous examples of epistasis but an overall global peak that is evolutionarily accessible from most starting sequences. Finally, we use the model to perform an in silico extrapolation of the DHFR laboratory evolution trajectory and computationally design proteins from future evolutionary rounds.
Collapse
|
21
|
Neverov AD, Fedonin G, Popova A, Bykova D, Bazykin G. Coordinated evolution at amino acid sites of SARS-CoV-2 spike. eLife 2023; 12:e82516. [PMID: 36752391 PMCID: PMC9908078 DOI: 10.7554/elife.82516] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2022] [Accepted: 01/15/2023] [Indexed: 02/05/2023] Open
Abstract
SARS-CoV-2 has adapted in a stepwise manner, with multiple beneficial mutations accumulating in a rapid succession at origins of VOCs, and the reasons for this are unclear. Here, we searched for coordinated evolution of amino acid sites in the spike protein of SARS-CoV-2. Specifically, we searched for concordantly evolving site pairs (CSPs) for which changes at one site were rapidly followed by changes at the other site in the same lineage. We detected 46 sites which formed 45 CSP. Sites in CSP were closer to each other in the protein structure than random pairs, indicating that concordant evolution has a functional basis. Notably, site pairs carrying lineage defining mutations of the four VOCs that circulated before May 2021 are enriched in CSPs. For the Alpha VOC, the enrichment is detected even if Alpha sequences are removed from analysis, indicating that VOC origin could have been facilitated by positive epistasis. Additionally, we detected nine discordantly evolving pairs of sites where mutations at one site unexpectedly rarely occurred on the background of a specific allele at another site, for example on the background of wild-type D at site 614 (four pairs) or derived Y at site 501 (three pairs). Our findings hint that positive epistasis between accumulating mutations could have delayed the assembly of advantageous combinations of mutations comprising at least some of the VOCs.
Collapse
Affiliation(s)
- Alexey Dmitrievich Neverov
- HSE UniversityMoscowRussian Federation
- Central Research Institute for EpidemiologyMoscowRussian Federation
| | - Gennady Fedonin
- Central Research Institute for EpidemiologyMoscowRussian Federation
- Moscow Institute of Physics and Technology (National Research University)MoscowRussian Federation
- Institute for Information Transmission Problems (Kharkevich Institute) of the Russian Academy of SciencesMoscowRussian Federation
| | - Anfisa Popova
- Central Research Institute for EpidemiologyMoscowRussian Federation
| | - Daria Bykova
- Central Research Institute for EpidemiologyMoscowRussian Federation
- Lomonosov Moscow State UniversityMoscowRussian Federation
| | - Georgii Bazykin
- Institute for Information Transmission Problems (Kharkevich Institute) of the Russian Academy of SciencesMoscowRussian Federation
- Skolkovo Institute of Science and TechnologyMoscowRussian Federation
| |
Collapse
|
22
|
Evolutionary entrenchment in immune proteins selects against stabilizing mutations. Proc Natl Acad Sci U S A 2022; 119:e2216087119. [PMID: 36449544 PMCID: PMC9894131 DOI: 10.1073/pnas.2216087119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/05/2022] Open
|
23
|
Wittmund M, Cadet F, Davari MD. Learning Epistasis and Residue Coevolution Patterns: Current Trends and Future Perspectives for Advancing Enzyme Engineering. ACS Catal 2022. [DOI: 10.1021/acscatal.2c01426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Affiliation(s)
- Marcel Wittmund
- Department of Bioorganic Chemistry, Leibniz Institute of Plant Biochemistry, Weinberg 3, 06120 Halle, Germany
| | - Frederic Cadet
- Laboratory of Excellence LABEX GR, DSIMB, Inserm UMR S1134, University of Paris city & University of Reunion, Paris 75014, France
| | - Mehdi D. Davari
- Department of Bioorganic Chemistry, Leibniz Institute of Plant Biochemistry, Weinberg 3, 06120 Halle, Germany
| |
Collapse
|
24
|
Ravishankar K, Jiang X, Leddin EM, Morcos F, Cisneros GA. Computational compensatory mutation discovery approach: Predicting a PARP1 variant rescue mutation. Biophys J 2022; 121:3663-3673. [PMID: 35642254 PMCID: PMC9617126 DOI: 10.1016/j.bpj.2022.05.036] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2021] [Revised: 05/20/2022] [Accepted: 05/23/2022] [Indexed: 11/02/2022] Open
Abstract
The prediction of protein mutations that affect function may be exploited for multiple uses. In the context of disease variants, the prediction of compensatory mutations that reestablish functional phenotypes could aid in the development of genetic therapies. In this work, we present an integrated approach that combines coevolutionary analysis and molecular dynamics (MD) simulations to discover functional compensatory mutations. This approach is employed to investigate possible rescue mutations of a poly(ADP-ribose) polymerase 1 (PARP1) variant, PARP1 V762A, associated with lung cancer and follicular lymphoma. MD simulations show PARP1 V762A exhibits noticeable changes in structural and dynamical behavior compared with wild-type (WT) PARP1. Our integrated approach predicts A755E as a possible compensatory mutation based on coevolutionary information, and molecular simulations indicate that the PARP1 A755E/V762A double mutant exhibits similar structural and dynamical behavior to WT PARP1. Our methodology can be broadly applied to a large number of systems where single-nucleotide polymorphisms have been identified as connected to disease and can shed light on the biophysical effects of such changes as well as provide a way to discover potential mutants that could restore WT-like functionality. This can, in turn, be further utilized in the design of molecular therapeutics that aim to mimic such compensatory effect.
Collapse
Affiliation(s)
| | - Xianli Jiang
- Department of Biological Sciences, The University of Texas at Dallas, Richardson, Texas; Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, Texas
| | - Emmett M Leddin
- Department of Chemistry, University of North Texas, Denton, Texas
| | - Faruck Morcos
- Department of Biological Sciences, The University of Texas at Dallas, Richardson, Texas; Department of Bioengineering, The University of Texas at Dallas, Richardson, Texas; Center for Systems Biology, The University of Texas at Dallas, Richardson, Texas.
| | - G Andrés Cisneros
- Department of Chemistry, University of North Texas, Denton, Texas; Department of Physics, The University of Texas at Dallas, Richardson, Texas; Department of Chemistry, The University of Texas at Dallas, Richardson, Texas.
| |
Collapse
|
25
|
Di Gioacchino A, Procyk J, Molari M, Schreck JS, Zhou Y, Liu Y, Monasson R, Cocco S, Šulc P. Generative and interpretable machine learning for aptamer design and analysis of in vitro sequence selection. PLoS Comput Biol 2022; 18:e1010561. [PMID: 36174101 PMCID: PMC9553063 DOI: 10.1371/journal.pcbi.1010561] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2022] [Revised: 10/11/2022] [Accepted: 09/12/2022] [Indexed: 12/03/2022] Open
Abstract
Selection protocols such as SELEX, where molecules are selected over multiple rounds for their ability to bind to a target of interest, are popular methods for obtaining binders for diagnostic and therapeutic purposes. We show that Restricted Boltzmann Machines (RBMs), an unsupervised two-layer neural network architecture, can successfully be trained on sequence ensembles from single rounds of SELEX experiments for thrombin aptamers. RBMs assign scores to sequences that can be directly related to their fitnesses estimated through experimental enrichment ratios. Hence, RBMs trained from sequence data at a given round can be used to predict the effects of selection at later rounds. Moreover, the parameters of the trained RBMs are interpretable and identify functional features contributing most to sequence fitness. To exploit the generative capabilities of RBMs, we introduce two different training protocols: one taking into account sequence counts, capable of identifying the few best binders, and another based on unique sequences only, generating more diverse binders. We then use RBMs model to generate novel aptamers with putative disruptive mutations or good binding properties, and validate the generated sequences with gel shift assay experiments. Finally, we compare the RBM’s performance with different supervised learning approaches that include random forests and several deep neural network architectures. We show that two-layer neural networks, Restricted Boltzmann Machines (RBM), can be successfully trained on sequence ensemble datasets from selection-amplification experiments. We train the RBM using datasets from aptamer selection experiments on thrombin protein, and show that the model can successfully generalize to the test set to predict binders and non-binders. The log-likelihood assigned to a sequence by the RBM is correlated with the sequence fitness as quantified by the amplification between different rounds of selection. We further show that that the model is interpretable and by inspecting the weights of the model, we can identify structural motifs that are characteristic of the good binders. We explore the usage of the RBMs to identify which of the possible protein exosites the aptamers bind to. We show that the RBM can also be used for unsupervised clustering. Finally, we use RBMs to generate novel aptamers, and we experimentally verify predicted binding and non-binding sequences generated from the RBM.
Collapse
Affiliation(s)
- Andrea Di Gioacchino
- Laboratoire de Physique de l’Ecole Normale Supérieure, PSL & CNRS UMR8063, Sorbonne Université, Université de Paris, Paris, France
| | - Jonah Procyk
- School of Molecular Sciences and Center for Molecular Design and Biomimetics, The Biodesign Institute, Arizona State University, Tempe, Arizona, United States of America
| | - Marco Molari
- Laboratoire de Physique de l’Ecole Normale Supérieure, PSL & CNRS UMR8063, Sorbonne Université, Université de Paris, Paris, France
- Biozentrum, University of Basel, Basel, Switzerland
- Swiss Institute of Bioinformatics, Basel, Switzerland
| | - John S. Schreck
- National Center for Atmospheric Research, Computational and Information Systems Laboratory, Boulder, Colorado, United States of America
| | - Yu Zhou
- School of Molecular Sciences and Center for Molecular Design and Biomimetics, The Biodesign Institute, Arizona State University, Tempe, Arizona, United States of America
| | - Yan Liu
- School of Molecular Sciences and Center for Molecular Design and Biomimetics, The Biodesign Institute, Arizona State University, Tempe, Arizona, United States of America
| | - Rémi Monasson
- Laboratoire de Physique de l’Ecole Normale Supérieure, PSL & CNRS UMR8063, Sorbonne Université, Université de Paris, Paris, France
- * E-mail: (RM); (SC); (PŠ)
| | - Simona Cocco
- Laboratoire de Physique de l’Ecole Normale Supérieure, PSL & CNRS UMR8063, Sorbonne Université, Université de Paris, Paris, France
- * E-mail: (RM); (SC); (PŠ)
| | - Petr Šulc
- School of Molecular Sciences and Center for Molecular Design and Biomimetics, The Biodesign Institute, Arizona State University, Tempe, Arizona, United States of America
- * E-mail: (RM); (SC); (PŠ)
| |
Collapse
|
26
|
Vigué L, Croce G, Petitjean M, Ruppé E, Tenaillon O, Weigt M. Deciphering polymorphism in 61,157 Escherichia coli genomes via epistatic sequence landscapes. Nat Commun 2022; 13:4030. [PMID: 35821377 PMCID: PMC9276797 DOI: 10.1038/s41467-022-31643-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2021] [Accepted: 06/27/2022] [Indexed: 12/05/2022] Open
Abstract
Characterizing the effect of mutations is key to understand the evolution of protein sequences and to separate neutral amino-acid changes from deleterious ones. Epistatic interactions between residues can lead to a context dependence of mutation effects. Context dependence constrains the amino-acid changes that can contribute to polymorphism in the short term, and the ones that can accumulate between species in the long term. We use computational approaches to accurately predict the polymorphisms segregating in a panel of 61,157 Escherichia coli genomes from the analysis of distant homologues. By comparing a context-aware Direct-Coupling Analysis modelling to a non-epistatic approach, we show that the genetic context strongly constrains the tolerable amino acids in 30% to 50% of amino-acid sites. The study of more distant species suggests the gradual build-up of genetic context over long evolutionary timescales by the accumulation of small epistatic contributions.
Collapse
Affiliation(s)
- Lucile Vigué
- Université Paris Cité and Université Sorbonne Paris Nord, Inserm, IAME, F-75018, Paris, France
| | - Giancarlo Croce
- Department of Oncology, Ludwig Institute for Cancer Research Lausanne, University of Lausanne, Lausanne, Switzerland
- Swiss Institute of Bioinformatics-SIB, Lausanne, Switzerland
| | - Marie Petitjean
- Université Paris Cité and Université Sorbonne Paris Nord, Inserm, IAME, F-75018, Paris, France
| | - Etienne Ruppé
- Université Paris Cité and Université Sorbonne Paris Nord, Inserm, IAME, F-75018, Paris, France
- Laboratoire de Bactériologie, Hôpital Bichat, APHP, Paris, France
| | - Olivier Tenaillon
- Université Paris Cité and Université Sorbonne Paris Nord, Inserm, IAME, F-75018, Paris, France.
| | - Martin Weigt
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Computational and Quantitative Biology-LCQB, Paris, France.
| |
Collapse
|
27
|
Learning the local landscape of protein structures with convolutional neural networks. J Biol Phys 2021; 47:435-454. [PMID: 34751854 DOI: 10.1007/s10867-021-09593-6] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2021] [Accepted: 10/18/2021] [Indexed: 10/19/2022] Open
Abstract
One fundamental problem of protein biochemistry is to predict protein structure from amino acid sequence. The inverse problem, predicting either entire sequences or individual mutations that are consistent with a given protein structure, has received much less attention even though it has important applications in both protein engineering and evolutionary biology. Here, we ask whether 3D convolutional neural networks (3D CNNs) can learn the local fitness landscape of protein structure to reliably predict either the wild-type amino acid or the consensus in a multiple sequence alignment from the local structural context surrounding site of interest. We find that the network can predict wild type with good accuracy, and that network confidence is a reliable measure of whether a given prediction is likely going to be correct or not. Predictions of consensus are less accurate and are primarily driven by whether or not the consensus matches the wild type. Our work suggests that high-confidence mis-predictions of the wild type may identify sites that are primed for mutation and likely targets for protein engineering.
Collapse
|