1
|
Di Bari L, Bisardi M, Cotogno S, Weigt M, Zamponi F. Emergent time scales of epistasis in protein evolution. Proc Natl Acad Sci U S A 2024; 121:e2406807121. [PMID: 39325427 PMCID: PMC11459137 DOI: 10.1073/pnas.2406807121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2024] [Accepted: 08/17/2024] [Indexed: 09/27/2024] Open
Abstract
We introduce a data-driven epistatic model of protein evolution, capable of generating evolutionary trajectories spanning very different time scales reaching from individual mutations to diverged homologs. Our in silico evolution encompasses random nucleotide mutations, insertions and deletions, and models selection using a fitness landscape, which is inferred via a generative probabilistic model for protein families. We show that the proposed framework accurately reproduces the sequence statistics of both short-time (experimental) and long-time (natural) protein evolution, suggesting applicability also to relatively data-poor intermediate evolutionary time scales, which are currently inaccessible to evolution experiments. Our model uncovers a highly collective nature of epistasis, gradually changing the fitness effect of mutations in a diverging sequence context, rather than acting via strong interactions between individual mutations. This collective nature triggers the emergence of a long evolutionary time scale, separating fast mutational processes inside a given sequence context, from the slow evolution of the context itself. The model quantitatively reproduces epistatic phenomena such as contingency and entrenchment, as well as the loss of predictability in protein evolution observed in deep mutational scanning experiments of distant homologs. It thereby deepens our understanding of the interplay between mutation and selection in shaping protein diversity and functions, allows one to statistically forecast evolution, and challenges the prevailing independent-site models of protein evolution, which are unable to capture the fundamental importance of epistasis.
Collapse
Affiliation(s)
- Leonardo Di Bari
- Dipartimento Scienza Applicata e Tecnologia, Politecnico di Torino, I-10129Torino, Italy
| | - Matteo Bisardi
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative, ParisF-75005, France
| | - Sabrina Cotogno
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative, ParisF-75005, France
| | - Martin Weigt
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative, ParisF-75005, France
| | - Francesco Zamponi
- Dipartimento di Fisica, Sapienza Università di Roma, 00185Rome, Italy
| |
Collapse
|
2
|
Chen JZ, Bisardi M, Lee D, Cotogno S, Zamponi F, Weigt M, Tokuriki N. Understanding epistatic networks in the B1 β-lactamases through coevolutionary statistical modeling and deep mutational scanning. Nat Commun 2024; 15:8441. [PMID: 39349467 PMCID: PMC11442494 DOI: 10.1038/s41467-024-52614-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2024] [Accepted: 09/16/2024] [Indexed: 10/02/2024] Open
Abstract
Throughout evolution, protein families undergo substantial sequence divergence while preserving structure and function. Although most mutations are deleterious, evolution can explore sequence space via epistatic networks of intramolecular interactions that alleviate the harmful mutations. However, comprehensive analysis of such epistatic networks across protein families remains limited. Thus, we conduct a family wide analysis of the B1 metallo-β-lactamases, combining experiments (deep mutational scanning, DMS) on two distant homologs (NDM-1 and VIM-2) and computational analyses (in silico DMS based on Direct Coupling Analysis, DCA) of 100 homologs. The methods jointly reveal and quantify prevalent epistasis, as ~1/3rd of equivalent mutations are epistatic in DMS. From DCA, half of the positions have a >6.5 fold difference in effective number of tolerated mutations across the entire family. Notably, both methods locate residues with the strongest epistasis in regions of intermediate residue burial, suggesting a balance of residue packing and mutational freedom in forming epistatic networks. We identify entrenched WT residues between NDM-1 and VIM-2 in DMS, which display statistically distinct behaviors in DCA from non-entrenched residues. Entrenched residues are not easily compensated by changes in single nearby interactions, reinforcing existing findings where a complex epistatic network compounds smaller effects from many interacting residues.
Collapse
Affiliation(s)
- J Z Chen
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, Canada
| | - M Bisardi
- Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005, Paris, France
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, F-75005, Paris, France
| | - D Lee
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, Canada
| | - S Cotogno
- Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005, Paris, France
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, F-75005, Paris, France
| | - F Zamponi
- Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005, Paris, France
- Dipartimento di Fisica, Sapienza Università di Roma, I-00185, Rome, Italy
| | - M Weigt
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative LCQB, F-75005, Paris, France
| | - N Tokuriki
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, Canada.
| |
Collapse
|
3
|
Martin J, Lequerica Mateos M, Onuchic JN, Coluzza I, Morcos F. Machine learning in biological physics: From biomolecular prediction to design. Proc Natl Acad Sci U S A 2024; 121:e2311807121. [PMID: 38913893 PMCID: PMC11228481 DOI: 10.1073/pnas.2311807121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/26/2024] Open
Abstract
Machine learning has been proposed as an alternative to theoretical modeling when dealing with complex problems in biological physics. However, in this perspective, we argue that a more successful approach is a proper combination of these two methodologies. We discuss how ideas coming from physical modeling neuronal processing led to early formulations of computational neural networks, e.g., Hopfield networks. We then show how modern learning approaches like Potts models, Boltzmann machines, and the transformer architecture are related to each other, specifically, through a shared energy representation. We summarize recent efforts to establish these connections and provide examples on how each of these formulations integrating physical modeling and machine learning have been successful in tackling recent problems in biomolecular structure, dynamics, function, evolution, and design. Instances include protein structure prediction; improvement in computational complexity and accuracy of molecular dynamics simulations; better inference of the effects of mutations in proteins leading to improved evolutionary modeling and finally how machine learning is revolutionizing protein engineering and design. Going beyond naturally existing protein sequences, a connection to protein design is discussed where synthetic sequences are able to fold to naturally occurring motifs driven by a model rooted in physical principles. We show that this model is "learnable" and propose its future use in the generation of unique sequences that can fold into a target structure.
Collapse
Affiliation(s)
- Jonathan Martin
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX75080
| | - Marcos Lequerica Mateos
- BCMaterials, Basque Center for Materials, Applications and Nanostructures, Universidad del País Vasco/Euskal Herriko Unibertsitatea Science Park, Leioa48940, Spain
| | - José N. Onuchic
- Center for Theoretical Biological Physics, Rice University, Houston, TX77005
- Department of Physics and Astronomy, Rice University, Houston, TX77005
- Department of Chemistry, Rice University, Houston, TX77005
- Department of BioSciences, Rice University, Houston, TX77005
| | - Ivan Coluzza
- BCMaterials, Basque Center for Materials, Applications and Nanostructures, Universidad del País Vasco/Euskal Herriko Unibertsitatea Science Park, Leioa48940, Spain
- Basque Foundation for Science, Ikerbasque, Bilbao48940, Spain
| | - Faruck Morcos
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX75080
- Department of Bioengineering, Center for Systems Biology, University of Texas at Dallas, Richardson, TX75080
| |
Collapse
|
4
|
Jaafari H, Bueno C, Schafer NP, Martin J, Morcos F, Wolynes PG. The physical and evolutionary energy landscapes of devolved protein sequences corresponding to pseudogenes. Proc Natl Acad Sci U S A 2024; 121:e2322428121. [PMID: 38739795 PMCID: PMC11127006 DOI: 10.1073/pnas.2322428121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2023] [Accepted: 03/26/2024] [Indexed: 05/16/2024] Open
Abstract
Protein evolution is guided by structural, functional, and dynamical constraints ensuring organismal viability. Pseudogenes are genomic sequences identified in many eukaryotes that lack translational activity due to sequence degradation and thus over time have undergone "devolution." Previously pseudogenized genes sometimes regain their protein-coding function, suggesting they may still encode robust folding energy landscapes despite multiple mutations. We study both the physical folding landscapes of protein sequences corresponding to human pseudogenes using the Associative Memory, Water Mediated, Structure and Energy Model, and the evolutionary energy landscapes obtained using direct coupling analysis (DCA) on their parent protein families. We found that generally mutations that have occurred in pseudogene sequences have disrupted their native global network of stabilizing residue interactions, making it harder for them to fold if they were translated. In some cases, however, energetic frustration has apparently decreased when the functional constraints were removed. We analyzed this unexpected situation for Cyclophilin A, Profilin-1, and Small Ubiquitin-like Modifier 2 Protein. Our analysis reveals that when such mutations in the pseudogene ultimately stabilize folding, at the same time, they likely alter the pseudogenes' former biological activity, as estimated by DCA. We localize most of these stabilizing mutations generally to normally frustrated regions required for binding to other partners.
Collapse
Affiliation(s)
- Hana Jaafari
- Center for Theoretical Biophysics, Rice University, Houston, TX77005
- Applied Physics Graduate Program, Smalley-Curl Institute, Rice University, Houston, TX77005
- Department of Chemistry, Rice University, Houston, TX77005
| | - Carlos Bueno
- Center for Theoretical Biophysics, Rice University, Houston, TX77005
| | | | - Jonathan Martin
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX75080
| | - Faruck Morcos
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX75080
- Department of Bioengineering, University of Texas at Dallas, Richardson, TX75080
- Center for Systems Biology, University of Texas at Dallas, Richardson, TX75080
| | - Peter G. Wolynes
- Center for Theoretical Biophysics, Rice University, Houston, TX77005
- Department of Chemistry, Rice University, Houston, TX77005
- Department of Physics and Astronomy, Rice University, Houston, TX77005
- Department of Biochemistry and Cell Biology, Rice University, Houston, TX77005
| |
Collapse
|
5
|
Alvarez S, Nartey CM, Mercado N, de la Paz JA, Huseinbegovic T, Morcos F. In vivo functional phenotypes from a computational epistatic model of evolution. Proc Natl Acad Sci U S A 2024; 121:e2308895121. [PMID: 38285950 PMCID: PMC10861889 DOI: 10.1073/pnas.2308895121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2023] [Accepted: 12/19/2023] [Indexed: 01/31/2024] Open
Abstract
Computational models of evolution are valuable for understanding the dynamics of sequence variation, to infer phylogenetic relationships or potential evolutionary pathways and for biomedical and industrial applications. Despite these benefits, few have validated their propensities to generate outputs with in vivo functionality, which would enhance their value as accurate and interpretable evolutionary algorithms. We demonstrate the power of epistasis inferred from natural protein families to evolve sequence variants in an algorithm we developed called sequence evolution with epistatic contributions (SEEC). Utilizing the Hamiltonian of the joint probability of sequences in the family as fitness metric, we sampled and experimentally tested for in vivo [Formula: see text]-lactamase activity in Escherichia coli TEM-1 variants. These evolved proteins can have dozens of mutations dispersed across the structure while preserving sites essential for both catalysis and interactions. Remarkably, these variants retain family-like functionality while being more active than their wild-type predecessor. We found that depending on the inference method used to generate the epistatic constraints, different parameters simulate diverse selection strengths. Under weaker selection, local Hamiltonian fluctuations reliably predict relative changes to variant fitness, recapitulating neutral evolution. SEEC has the potential to explore the dynamics of neofunctionalization, characterize viral fitness landscapes, and facilitate vaccine development.
Collapse
Affiliation(s)
- Sophia Alvarez
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX75080
| | - Charisse M. Nartey
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX75080
| | - Nicholas Mercado
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX75080
| | | | - Tea Huseinbegovic
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX75080
| | - Faruck Morcos
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX75080
- Department of Bioengineering, University of Texas at Dallas, Richardson, TX75080
- Center for Systems Biology, University of Texas at Dallas, Richardson, TX75080
| |
Collapse
|
6
|
Vigué L, Tenaillon O. Predicting the effect of mutations to investigate recent events of selection across 60,472 Escherichia coli strains. Proc Natl Acad Sci U S A 2023; 120:e2304177120. [PMID: 37487088 PMCID: PMC10401003 DOI: 10.1073/pnas.2304177120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2023] [Accepted: 05/25/2023] [Indexed: 07/26/2023] Open
Abstract
Microbial genomics studies focusing on the dynamics of selection have often used a small number of distant genomes. As a result, they could only analyze mutations that had become fixed during the divergence between species. However, thousands of genomes of some species are now available in public databases, thanks to high-throughput sequencing. These data provide a more complete picture of the polymorphisms segregating within a species, offering a unique insight into the processes that shape the recent evolution of a species. In this study, we present GLASS (Gene-Level Amino-acid Score Shift), a selection test that is based on the predicted effects of amino acid changes. By comparing the distribution of effects of mutations observed in a gene to the expectation in the absence of selection, GLASS can quantify the intensity of selection. We applied GLASS to a dataset of 60,472 Escherichia coli strains and used this to reexamine the longstanding debate about the role of essentiality versus expression level in the rate of protein evolution. We found that selection has contrasting short-term and long-term dynamics, with essential genes being subject to strong purifying selection in the short term, while expression level determines the rate of gene evolution in the long term. GLASS also found an overrepresentation of inactivating mutations in specific transcription factors, such as efflux pump repressors, which is consistent with selection for antibiotic resistance. These gene-inactivating polymorphisms do not reach fixation, suggesting another contrast between short-term fitness gains and long-term counterselection.
Collapse
Affiliation(s)
- Lucile Vigué
- Université Paris Cité and Université Sorbonne Paris Nord, Inserm, Infection, Antimicrobials, Modelling, Evolution, F-75018Paris, France
| | - Olivier Tenaillon
- Université Paris Cité and Université Sorbonne Paris Nord, Inserm, Infection, Antimicrobials, Modelling, Evolution, F-75018Paris, France
| |
Collapse
|
7
|
Alvarez S, Nartey CM, Mercado N, de la Paz A, Huseinbegovic T, Morcos F. In vivo functional phenotypes from a computational epistatic model of evolution. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.24.542176. [PMID: 37292895 PMCID: PMC10245989 DOI: 10.1101/2023.05.24.542176] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Computational models of evolution are valuable for understanding the dynamics of sequence variation, to infer phylogenetic relationships or potential evolutionary pathways and for biomedical and industrial applications. Despite these benefits, few have validated their propensities to generate outputs with in vivo functionality, which would enhance their value as accurate and interpretable evolutionary algorithms. We demonstrate the power of epistasis inferred from natural protein families to evolve sequence variants in an algorithm we developed called Sequence Evolution with Epistatic Contributions. Utilizing the Hamiltonian of the joint probability of sequences in the family as fitness metric, we sampled and experimentally tested for in vivo β -lactamase activity in E. coli TEM-1 variants. These evolved proteins can have dozens of mutations dispersed across the structure while preserving sites essential for both catalysis and interactions. Remarkably, these variants retain family-like functionality while being more active than their WT predecessor. We found that depending on the inference method used to generate the epistatic constraints, different parameters simulate diverse selection strengths. Under weaker selection, local Hamiltonian fluctuations reliably predict relative changes to variant fitness, recapitulating neutral evolution. SEEC has the potential to explore the dynamics of neofunctionalization, characterize viral fitness landscapes and facilitate vaccine development.
Collapse
Affiliation(s)
- Sophia Alvarez
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080, USA
| | - Charisse M. Nartey
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080, USA
| | - Nicholas Mercado
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080, USA
| | - Alberto de la Paz
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080, USA
| | - Tea Huseinbegovic
- School of Natural Sciences and Mathematics, University of Texas at Dallas, Richardson, TX 75080, USA
| | - Faruck Morcos
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080, USA
- Department of Bioengineering, University of Texas at Dallas, Richardson, TX 75080, USA
- Center for Systems Biology, University of Texas at Dallas, Richardson, TX 75080, USA
| |
Collapse
|
8
|
Zeng HL, Liu Y, Dichio V, Aurell E. Temporal epistasis inference from more than 3 500 000 SARS-CoV-2 genomic sequences. Phys Rev E 2022; 106:044409. [PMID: 36397507 DOI: 10.1103/physreve.106.044409] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2022] [Accepted: 09/19/2022] [Indexed: 06/16/2023]
Abstract
We use direct coupling analysis (DCA) to determine epistatic interactions between loci of variability of the SARS-CoV-2 virus, segmenting genomes by month of sampling. We use full-length, high-quality genomes from the GISAID repository up to October 2021 for a total of over 3 500 000 genomes. We find that DCA terms are more stable over time than correlations but nevertheless change over time as mutations disappear from the global population or reach fixation. Correlations are enriched for phylogenetic effects, and in particularly statistical dependencies at short genomic distances, while DCA brings out links at longer genomic distance. We discuss the validity of a DCA analysis under these conditions in terms of a transient auasilinkage equilibrium state. We identify putative epistatic interaction mutations involving loci in spike.
Collapse
Affiliation(s)
- Hong-Li Zeng
- School of Science, Nanjing University of Posts and Telecommunications, New Energy Technology Engineering Laboratory of Jiangsu Province, Nanjing 210023, China
| | - Yue Liu
- School of Science, Nanjing University of Posts and Telecommunications, New Energy Technology Engineering Laboratory of Jiangsu Province, Nanjing 210023, China
| | - Vito Dichio
- Inria Paris, Aramis Project Team, Paris 75013, France
- Institut du Cerveau, ICM, Inserm U 1127, CNRS UMR 7225, Sorbonne Université, Paris, France
| | - Erik Aurell
- Department of Computational Science and Technology, AlbaNova University Center, SE-106 91 Stockholm, Sweden
| |
Collapse
|