1
|
Cocco S, Posani L, Monasson R. Functional effects of mutations in proteins can be predicted and interpreted by guided selection of sequence covariation information. Proc Natl Acad Sci U S A 2024; 121:e2312335121. [PMID: 38889151 PMCID: PMC11214004 DOI: 10.1073/pnas.2312335121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2023] [Accepted: 04/21/2024] [Indexed: 06/20/2024] Open
Abstract
Predicting the effects of one or more mutations to the in vivo or in vitro properties of a wild-type protein is a major computational challenge, due to the presence of epistasis, that is, of interactions between amino acids in the sequence. We introduce a computationally efficient procedure to build minimal epistatic models to predict mutational effects by combining evolutionary (homologous sequence) and few mutational-scan data. Mutagenesis measurements guide the selection of links in a sparse graphical model, while the parameters on the nodes and the edges are inferred from sequence data. We show, on 10 mutational scans, that our pipeline exhibits performances comparable to state-of-the-art deep networks trained on many more data, while requiring much less parameters and being hence more interpretable. In particular, the identified interactions adapt to the wild-type protein and to the fitness or biochemical property experimentally measured, mostly focus on key functional sites, and are not necessarily related to structural contacts. Therefore, our method is able to extract information relevant for one mutational experiment from homologous sequence data reflecting the multitude of structural and functional constraints acting on proteins throughout evolution.
Collapse
Affiliation(s)
- Simona Cocco
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005Paris, France
| | - Lorenzo Posani
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005Paris, France
| | - Rémi Monasson
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005Paris, France
| |
Collapse
|
2
|
Ose NJ, Campitelli P, Modi T, Kazan IC, Kumar S, Ozkan SB. Some mechanistic underpinnings of molecular adaptations of SARS-COV-2 spike protein by integrating candidate adaptive polymorphisms with protein dynamics. eLife 2024; 12:RP92063. [PMID: 38713502 PMCID: PMC11076047 DOI: 10.7554/elife.92063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/08/2024] Open
Abstract
We integrate evolutionary predictions based on the neutral theory of molecular evolution with protein dynamics to generate mechanistic insight into the molecular adaptations of the SARS-COV-2 spike (S) protein. With this approach, we first identified candidate adaptive polymorphisms (CAPs) of the SARS-CoV-2 S protein and assessed the impact of these CAPs through dynamics analysis. Not only have we found that CAPs frequently overlap with well-known functional sites, but also, using several different dynamics-based metrics, we reveal the critical allosteric interplay between SARS-CoV-2 CAPs and the S protein binding sites with the human ACE2 (hACE2) protein. CAPs interact far differently with the hACE2 binding site residues in the open conformation of the S protein compared to the closed form. In particular, the CAP sites control the dynamics of binding residues in the open state, suggesting an allosteric control of hACE2 binding. We also explored the characteristic mutations of different SARS-CoV-2 strains to find dynamic hallmarks and potential effects of future mutations. Our analyses reveal that Delta strain-specific variants have non-additive (i.e., epistatic) interactions with CAP sites, whereas the less pathogenic Omicron strains have mostly additive mutations. Finally, our dynamics-based analysis suggests that the novel mutations observed in the Omicron strain epistatically interact with the CAP sites to help escape antibody binding.
Collapse
Affiliation(s)
- Nicholas James Ose
- Department of Physics and Center for Biological Physics, Arizona State UniversityTempeUnited States
| | - Paul Campitelli
- Department of Physics and Center for Biological Physics, Arizona State UniversityTempeUnited States
| | - Tushar Modi
- Department of Physics and Center for Biological Physics, Arizona State UniversityTempeUnited States
| | - I Can Kazan
- Department of Physics and Center for Biological Physics, Arizona State UniversityTempeUnited States
| | - Sudhir Kumar
- Institute for Genomics and Evolutionary Medicine, Temple UniversityPhiladelphiaUnited States
- Department of Biology, Temple UniversityPhiladelphiaUnited States
- Center for Genomic Medicine Research, King Abdulaziz UniversityJeddahSaudi Arabia
| | - Sefika Banu Ozkan
- Department of Physics and Center for Biological Physics, Arizona State UniversityTempeUnited States
| |
Collapse
|
3
|
Posani L, Rizzato F, Monasson R, Cocco S. Infer global, predict local: Quantity-relevance trade-off in protein fitness predictions from sequence data. PLoS Comput Biol 2023; 19:e1011521. [PMID: 37883593 PMCID: PMC10645369 DOI: 10.1371/journal.pcbi.1011521] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2023] [Revised: 11/14/2023] [Accepted: 09/15/2023] [Indexed: 10/28/2023] Open
Abstract
Predicting the effects of mutations on protein function is an important issue in evolutionary biology and biomedical applications. Computational approaches, ranging from graphical models to deep-learning architectures, can capture the statistical properties of sequence data and predict the outcome of high-throughput mutagenesis experiments probing the fitness landscape around some wild-type protein. However, how the complexity of the models and the characteristics of the data combine to determine the predictive performance remains unclear. Here, based on a theoretical analysis of the prediction error, we propose descriptors of the sequence data, characterizing their quantity and relevance relative to the model. Our theoretical framework identifies a trade-off between these two quantities, and determines the optimal subset of data for the prediction task, showing that simple models can outperform complex ones when inferred from adequately-selected sequences. We also show how repeated subsampling of the sequence data is informative about how much epistasis in the fitness landscape is not captured by the computational model. Our approach is illustrated on several protein families, as well as on in silico solvable protein models.
Collapse
Affiliation(s)
- Lorenzo Posani
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 & PSL Research, Sorbonne Université, Paris, France
| | - Francesca Rizzato
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 & PSL Research, Sorbonne Université, Paris, France
| | - Rémi Monasson
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 & PSL Research, Sorbonne Université, Paris, France
| | - Simona Cocco
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 & PSL Research, Sorbonne Université, Paris, France
| |
Collapse
|
4
|
Ose NJ, Campitelli P, Patel R, Kumar S, Ozkan SB. Protein dynamics provide mechanistic insights about epistasis among common missense polymorphisms. Biophys J 2023; 122:2938-2947. [PMID: 36726312 PMCID: PMC10398253 DOI: 10.1016/j.bpj.2023.01.037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2022] [Revised: 12/20/2022] [Accepted: 01/26/2023] [Indexed: 02/03/2023] Open
Abstract
Sequencing of the protein coding genome has revealed many different missense mutations of human proteins and different population frequencies of corresponding haplotypes, which consist of different sets of those mutations. Here, we present evidence for pairwise intramolecular epistasis (i.e., nonadditive interactions) between many such mutations through an analysis of protein dynamics. We suggest that functional compensation for conserving protein dynamics is a likely evolutionary mechanism that maintains high-frequency mutations that are individually nonneutral but epistatically compensating within proteins. This analysis is the first of its type to look at human proteins with specific high population frequency mutations and examine the relationship between mutations that make up that observed high-frequency protein haplotype. Importantly, protein dynamics revealed a separation between high and low frequency haplotypes within a target protein cytochrome P450 2A7, with the high-frequency haplotypes showing behavior closer to the wild-type protein. Common protein haplotypes containing two mutations display dynamic compensation in which one mutation can correct for the dynamic effects of the other. We also utilize a dynamics-based metric, EpiScore, that evaluates the epistatic interactions and allows us to see dynamic compensation within many other proteins.
Collapse
Affiliation(s)
- Nicholas J Ose
- Department of Physics and Center for Biological Physics, Arizona State University, Tempe, Arizona
| | - Paul Campitelli
- Department of Physics and Center for Biological Physics, Arizona State University, Tempe, Arizona
| | - Ravi Patel
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, Pennsylvania; Department of Biology, Temple University, Philadelphia, Pennsylvania
| | - Sudhir Kumar
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, Pennsylvania; Department of Biology, Temple University, Philadelphia, Pennsylvania; Center for Genomic Medicine Research, King Abdulaziz University, Jeddah, Saudi Arabia.
| | - S Banu Ozkan
- Department of Physics and Center for Biological Physics, Arizona State University, Tempe, Arizona.
| |
Collapse
|
5
|
Kleeorin Y, Russ WP, Rivoire O, Ranganathan R. Undersampling and the inference of coevolution in proteins. Cell Syst 2023; 14:210-219.e7. [PMID: 36693377 PMCID: PMC10911952 DOI: 10.1016/j.cels.2022.12.013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2021] [Revised: 01/02/2022] [Accepted: 12/23/2022] [Indexed: 01/24/2023]
Abstract
Protein structure, function, and evolution depend on local and collective epistatic interactions between amino acids. A powerful approach to defining these interactions is to construct models of couplings between amino acids that reproduce the empirical statistics (frequencies and correlations) observed in sequences comprising a protein family. The top couplings are then interpreted. Here, we show that as currently implemented, this inference unequally represents epistatic interactions, a problem that fundamentally arises from limited sampling of sequences in the context of distinct scales at which epistasis occurs in proteins. We show that these issues explain the ability of current approaches to predict tertiary contacts between amino acids and the inability to obviously expose larger networks of functionally relevant, collectively evolving residues called sectors. This work provides a necessary foundation for more deeply understanding and improving evolution-based models of proteins.
Collapse
Affiliation(s)
- Yaakov Kleeorin
- Center for Physics of Evolving Systems, Department of Biochemistry & Molecular Biology, University of Chicago, Chicago, IL 60637, USA
| | - William P Russ
- Green Center for Systems Biology, University of Texas Southwestern Medical Center, Dallas, TX 75390, USA
| | - Olivier Rivoire
- Center for Interdisciplinary Research in Biology (CIRB), Collège de France, CNRS, INSERM, PSL Research University, 75005 Paris, France.
| | - Rama Ranganathan
- Center for Physics of Evolving Systems, Department of Biochemistry & Molecular Biology, University of Chicago, Chicago, IL 60637, USA; The Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL 60637, USA.
| |
Collapse
|
6
|
Barrat-Charlaix P, Muntoni AP, Shimagaki K, Weigt M, Zamponi F. Sparse generative modeling via parameter reduction of Boltzmann machines: Application to protein-sequence families. Phys Rev E 2021; 104:024407. [PMID: 34525554 DOI: 10.1103/physreve.104.024407] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2021] [Accepted: 07/19/2021] [Indexed: 11/07/2022]
Abstract
Boltzmann machines (BMs) are widely used as generative models. For example, pairwise Potts models (PMs), which are instances of the BM class, provide accurate statistical models of families of evolutionarily related protein sequences. Their parameters are the local fields, which describe site-specific patterns of amino acid conservation, and the two-site couplings, which mirror the coevolution between pairs of sites. This coevolution reflects structural and functional constraints acting on protein sequences during evolution. The most conservative choice to describe the coevolution signal is to include all possible two-site couplings into the PM. This choice, typical of what is known as Direct Coupling Analysis, has been successful for predicting residue contacts in the three-dimensional structure, mutational effects, and generating new functional sequences. However, the resulting PM suffers from important overfitting effects: many couplings are small, noisy, and hardly interpretable; the PM is close to a critical point, meaning that it is highly sensitive to small parameter perturbations. In this work, we introduce a general parameter-reduction procedure for BMs, via a controlled iterative decimation of the less statistically significant couplings, identified by an information-based criterion that selects either weak or statistically unsupported couplings. For several protein families, our procedure allows one to remove more than 90% of the PM couplings, while preserving the predictive and generative properties of the original dense PM, and the resulting model is far away from criticality, hence more robust to noise.
Collapse
Affiliation(s)
- Pierre Barrat-Charlaix
- Biozentrum, Universität Basel, Switzerland, Swiss Institute of Bioinformatics, Basel 4056, Switzerland
| | - Anna Paola Muntoni
- Department of Applied Science and Technology (DISAT), Politecnico di Torino, Corso Duca degli Abruzzi 24, Torino 10129, Italy.,Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060 Candiolo (TO), Italy.,Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative-LCQB, F-75005 Paris, France.,Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France
| | - Kai Shimagaki
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative-LCQB, F-75005 Paris, France
| | - Martin Weigt
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative-LCQB, F-75005 Paris, France
| | - Francesco Zamponi
- Laboratoire de Physique de l'Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, F-75005 Paris, France
| |
Collapse
|
7
|
Anton B, Besalú M, Fornes O, Bonet J, Molina A, Molina-Fernandez R, De Las Cuevas G, Fernandez-Fuentes N, Oliva B. On the use of direct-coupling analysis with a reduced alphabet of amino acids combined with super-secondary structure motifs for protein fold prediction. NAR Genom Bioinform 2021; 3:lqab027. [PMID: 33937764 PMCID: PMC8061457 DOI: 10.1093/nargab/lqab027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2020] [Revised: 02/27/2021] [Accepted: 03/26/2021] [Indexed: 11/12/2022] Open
Abstract
Direct-coupling analysis (DCA) for studying the coevolution of residues in proteins has been widely used to predict the three-dimensional structure of a protein from its sequence. We present RADI/raDIMod, a variation of the original DCA algorithm that groups chemically equivalent residues combined with super-secondary structure motifs to model protein structures. Interestingly, the simplification produced by grouping amino acids into only two groups (polar and non-polar) is still representative of the physicochemical nature that characterizes the protein structure and it is in line with the role of hydrophobic forces in protein-folding funneling. As a result of a compressed alphabet, the number of sequences required for the multiple sequence alignment is reduced. The number of long-range contacts predicted is limited; therefore, our approach requires the use of neighboring sequence-positions. We use the prediction of secondary structure and motifs of super-secondary structures to predict local contacts. We use RADI and raDIMod, a fragment-based protein structure modelling, achieving near native conformations when the number of super-secondary motifs covers >30-50% of the sequence. Interestingly, although different contacts are predicted with different alphabets, they produce similar structures.
Collapse
Affiliation(s)
- Bernat Anton
- Structural Bioinformatics Lab (GRIB-IMIM), Department of Experimental and Health Science, University Pompeu Fabra, Barcelona 08005, Catalonia, Spain
| | - Mireia Besalú
- Departament de Genètica, Microbiologia i Estadística, Universitat de Barcelona, Barcelona 08028, Catalonia, Spain
| | - Oriol Fornes
- Structural Bioinformatics Lab (GRIB-IMIM), Department of Experimental and Health Science, University Pompeu Fabra, Barcelona 08005, Catalonia, Spain
| | - Jaume Bonet
- Structural Bioinformatics Lab (GRIB-IMIM), Department of Experimental and Health Science, University Pompeu Fabra, Barcelona 08005, Catalonia, Spain
| | - Alexis Molina
- Electronic and Atomic Protein Modeling, Life Sciences, Barcelona Supercomputing Center, Barcelona 08034, Catalonia, Spain
| | - Ruben Molina-Fernandez
- Structural Bioinformatics Lab (GRIB-IMIM), Department of Experimental and Health Science, University Pompeu Fabra, Barcelona 08005, Catalonia, Spain
| | - Gemma De Las Cuevas
- Institut für Theoritische Physik, School of Mathematics, Computer Science and Physics, Universität Innsbruck. A-6020 Innsbruck, Austria
| | - Narcis Fernandez-Fuentes
- Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, SY233EB Aberystwyth, United Kingdom
| | - Baldo Oliva
- Structural Bioinformatics Lab (GRIB-IMIM), Department of Experimental and Health Science, University Pompeu Fabra, Barcelona 08005, Catalonia, Spain
| |
Collapse
|
8
|
Russ WP, Figliuzzi M, Stocker C, Barrat-Charlaix P, Socolich M, Kast P, Hilvert D, Monasson R, Cocco S, Weigt M, Ranganathan R. An evolution-based model for designing chorismate mutase enzymes. Science 2020; 369:440-445. [PMID: 32703877 DOI: 10.1126/science.aba3304] [Citation(s) in RCA: 126] [Impact Index Per Article: 31.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/24/2019] [Accepted: 05/13/2020] [Indexed: 02/02/2023]
Abstract
The rational design of enzymes is an important goal for both fundamental and practical reasons. Here, we describe a process to learn the constraints for specifying proteins purely from evolutionary sequence data, design and build libraries of synthetic genes, and test them for activity in vivo using a quantitative complementation assay. For chorismate mutase, a key enzyme in the biosynthesis of aromatic amino acids, we demonstrate the design of natural-like catalytic function with substantial sequence diversity. Further optimization focuses the generative model toward function in a specific genomic context. The data show that sequence-based statistical models suffice to specify proteins and provide access to an enormous space of functional sequences. This result provides a foundation for a general process for evolution-based design of artificial proteins.
Collapse
Affiliation(s)
- William P Russ
- University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Matteo Figliuzzi
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Laboratoire de Biologie Computationnelle and Quantitative, Paris, France
| | | | - Pierre Barrat-Charlaix
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Laboratoire de Biologie Computationnelle and Quantitative, Paris, France.,Biozentrum, University of Basel, Basel, Switzerland
| | - Michael Socolich
- Center for Physics of Evolving Systems, Biochemistry and Molecular Biology and the Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL, USA
| | - Peter Kast
- Laboratory of Organic Chemistry, ETH Zurich, Switzerland
| | - Donald Hilvert
- Laboratory of Organic Chemistry, ETH Zurich, Switzerland
| | - Remi Monasson
- Laboratoire de Physique de l'Ecole Normale Supérieure, PSL and CNRS, Paris, France
| | - Simona Cocco
- Laboratoire de Physique de l'Ecole Normale Supérieure, PSL and CNRS, Paris, France
| | - Martin Weigt
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Laboratoire de Biologie Computationnelle and Quantitative, Paris, France.
| | - Rama Ranganathan
- Center for Physics of Evolving Systems, Biochemistry and Molecular Biology and the Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL, USA.
| |
Collapse
|