1. Zhang Z, Wayment-Steele HK, Brixi G, Wang H, Kern D, Ovchinnikov S. Protein language models learn evolutionary statistics of interacting sequence motifs. Proc Natl Acad Sci U S A 2024; 121:e2406285121. [PMID: 39467119 DOI: 10.1073/pnas.2406285121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2024] [Accepted: 09/03/2024] [Indexed: 10/30/2024] Open
Abstract
Protein language models (pLMs) have emerged as potent tools for predicting and designing protein structure and function, and the degree to which these models fundamentally understand the inherent biophysics of protein structure stands as an open question. Motivated by a finding that pLM-based structure predictors erroneously predict nonphysical structures for protein isoforms, we investigated the nature of sequence context needed for contact predictions in the pLM Evolutionary Scale Modeling (ESM-2). We demonstrate by use of a "categorical Jacobian" calculation that ESM-2 stores statistics of coevolving residues, analogously to simpler modeling approaches like Markov Random Fields and Multivariate Gaussian models. We further investigated how ESM-2 "stores" information needed to predict contacts by comparing sequence masking strategies, and found that providing local windows of sequence information allowed ESM-2 to best recover predicted contacts. This suggests that pLMs predict contacts by storing motifs of pairwise contacts. Our investigation highlights the limitations of current pLMs and underscores the importance of understanding the underlying mechanisms of these models.
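The "categorical Jacobian" idea in the abstract above can be illustrated with a toy sketch: substitute every token at every position, record how the output logits at all other positions change, and summarize each position pair by the norm of that change. Everything here is an assumption made for illustration: `toy_logits` is a hypothetical stand-in for a language model (it is not ESM-2), the four-letter alphabet is arbitrary, and post-processing steps such as centering or an average-product correction are left out.

```python
import math

ALPHABET = "ACDE"  # tiny stand-in alphabet; real pLMs use 20 amino acids


def toy_logits(seq):
    """Hypothetical stand-in for a pLM: returns logits[j][b].

    Positions 0 and 3 are coupled: each one's logits favor the token
    currently found at the other. All other logits stay at zero.
    """
    logits = [[0.0] * len(ALPHABET) for _ in seq]
    for j, partner in ((0, 3), (3, 0)):
        logits[j][ALPHABET.index(seq[partner])] += 2.0
    return logits


def categorical_jacobian(seq, model):
    """J[i][a][j][b] = model(seq with position i set to token a)[j][b] - model(seq)[j][b]."""
    base = model(seq)
    n_pos, n_tok = len(seq), len(ALPHABET)
    jac = [[None] * n_tok for _ in range(n_pos)]
    for i in range(n_pos):
        for a, token in enumerate(ALPHABET):
            out = model(seq[:i] + token + seq[i + 1:])
            jac[i][a] = [[out[j][b] - base[j][b] for b in range(n_tok)]
                         for j in range(n_pos)]
    return jac


def coupling_scores(jac):
    """Frobenius norm over (a, b) for each pair (i, j), symmetrized, zero diagonal."""
    n_pos = len(jac)
    raw = [[math.sqrt(sum(v * v for block in jac[i] for v in block[j]))
            for j in range(n_pos)] for i in range(n_pos)]
    return [[0.0 if i == j else 0.5 * (raw[i][j] + raw[j][i])
             for j in range(n_pos)] for i in range(n_pos)]


scores = coupling_scores(categorical_jacobian("ACDEA", toy_logits))
best_pair = max(((i, j) for i in range(5) for j in range(i + 1, 5)),
                key=lambda p: scores[p[0]][p[1]])
print(best_pair)
```

In this toy setting the only nonzero score belongs to the coupled pair, which is the behavior one would expect from a model that stores pairwise coevolution statistics.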
Affiliation(s)
- Zhidian Zhang
  - Harvard University, Cambridge, MA 02138
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139
  - Institute of Bioengineering, School of Life Sciences, Ecole polytechnique fédérale de Lausanne, Lausanne VD 1015, Switzerland
- Hannah K Wayment-Steele
  - HHMI, Brandeis University, Waltham, MA 02453
  - Department of Biochemistry, Brandeis University, Waltham, MA 02453
- Garyk Brixi
  - Harvard College, Harvard University, Cambridge, MA 02138
- Dorothee Kern
  - HHMI, Brandeis University, Waltham, MA 02453
  - Department of Biochemistry, Brandeis University, Waltham, MA 02453
- Sergey Ovchinnikov
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139
  - John Harvard Distinguished Science Fellowship, Harvard University, Cambridge, MA 02138
2. Ghio D, Dandi Y, Krzakala F, Zdeborová L. Sampling with flows, diffusion, and autoregressive neural networks from a spin-glass perspective. Proc Natl Acad Sci U S A 2024; 121:e2311810121. [PMID: 38913892 PMCID: PMC11228464 DOI: 10.1073/pnas.2311810121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2023] [Accepted: 04/03/2024] [Indexed: 06/26/2024] Open
Abstract
Recent years witnessed the development of powerful generative models based on flows, diffusion, or autoregressive neural networks, achieving remarkable success in generating data from examples with applications in a broad range of areas. A theoretical analysis of the performance and understanding of the limitations of these methods remain, however, challenging. In this paper, we undertake a step in this direction by analyzing the efficiency of sampling by these methods on a class of problems with a known probability distribution and comparing it with the sampling performance of more traditional methods such as the Monte Carlo Markov chain and Langevin dynamics. We focus on a class of probability distribution widely studied in the statistical physics of disordered systems that relate to spin glasses, statistical inference, and constraint satisfaction problems. We leverage the fact that sampling via flow-based, diffusion-based, or autoregressive networks methods can be equivalently mapped to the analysis of a Bayes optimal denoising of a modified probability measure. Our findings demonstrate that these methods encounter difficulties in sampling stemming from the presence of a first-order phase transition along the algorithm's denoising path. Our conclusions go both ways: We identify regions of parameters where these methods are unable to sample efficiently, while that is possible using standard Monte Carlo or Langevin approaches. We also identify regions where the opposite happens: standard approaches are inefficient while the discussed generative methods work well.
Affiliation(s)
- Davide Ghio
  - Information, Learning and Physics Laboratory, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland
- Yatin Dandi
  - Information, Learning and Physics Laboratory, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland
  - Statistical Physics of Computation Laboratory, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland
- Florent Krzakala
  - Information, Learning and Physics Laboratory, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland
- Lenka Zdeborová
  - Statistical Physics of Computation Laboratory, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland
3. Cocco S, Posani L, Monasson R. Functional effects of mutations in proteins can be predicted and interpreted by guided selection of sequence covariation information. Proc Natl Acad Sci U S A 2024; 121:e2312335121. [PMID: 38889151 PMCID: PMC11214004 DOI: 10.1073/pnas.2312335121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2023] [Accepted: 04/21/2024] [Indexed: 06/20/2024] Open
Abstract
Predicting the effects of one or more mutations on the in vivo or in vitro properties of a wild-type protein is a major computational challenge, due to the presence of epistasis, that is, of interactions between amino acids in the sequence. We introduce a computationally efficient procedure to build minimal epistatic models to predict mutational effects by combining evolutionary (homologous sequence) data with a small amount of mutational-scan data. Mutagenesis measurements guide the selection of links in a sparse graphical model, while the parameters on the nodes and the edges are inferred from sequence data. We show, on 10 mutational scans, that our pipeline exhibits performance comparable to state-of-the-art deep networks trained on many more data, while requiring far fewer parameters and hence being more interpretable. In particular, the identified interactions adapt to the wild-type protein and to the fitness or biochemical property experimentally measured, mostly focus on key functional sites, and are not necessarily related to structural contacts. Therefore, our method is able to extract information relevant for one mutational experiment from homologous sequence data reflecting the multitude of structural and functional constraints acting on proteins throughout evolution.
Affiliation(s)
- Simona Cocco
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
- Lorenzo Posani
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
- Rémi Monasson
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
4. Calvanese F, Lambert CN, Nghe P, Zamponi F, Weigt M. Towards parsimonious generative modeling of RNA families. Nucleic Acids Res 2024; 52:5465-5477. [PMID: 38661206 PMCID: PMC11162787 DOI: 10.1093/nar/gkae289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2023] [Revised: 03/05/2024] [Accepted: 04/05/2024] [Indexed: 04/26/2024] Open
Abstract
Generative probabilistic models emerge as a new paradigm in data-driven, evolution-informed design of biomolecular sequences. This paper introduces a novel approach, called Edge Activation Direct Coupling Analysis (eaDCA), tailored to the characteristics of RNA sequences, with a strong emphasis on simplicity, efficiency, and interpretability. eaDCA explicitly constructs sparse coevolutionary models for RNA families, achieving performance levels comparable to more complex methods while utilizing a significantly lower number of parameters. Our approach demonstrates efficiency in generating artificial RNA sequences that closely resemble their natural counterparts in both statistical analyses and SHAPE-MaP experiments, and in predicting the effect of mutations. Notably, eaDCA provides a unique feature: estimating the number of potential functional sequences within a given RNA family. For example, in the case of cyclic di-AMP riboswitches (RF00379), our analysis suggests the existence of approximately 10^39 functional nucleotide sequences. While huge compared to the known <4000 natural sequences, this number represents only a tiny fraction of the vast pool of nearly 10^82 possible nucleotide sequences of the same length (136 nucleotides). These results underscore the promise of sparse and interpretable generative models, such as eaDCA, in enhancing our understanding of the expansive RNA sequence space.
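The sequence-space comparison in the abstract above can be checked with a few lines of arithmetic. This is only a sanity check under the assumption of a four-letter nucleotide alphabet and the quoted length of 136; the estimate of about 10^39 functional sequences is taken from the abstract, not recomputed.

```python
import math

LENGTH = 136    # sequence length quoted in the abstract
ALPHABET = 4    # A, C, G, U (assumed four-letter alphabet, no gaps)

# log10 of the number of possible sequences: 136 * log10(4)
log10_total = LENGTH * math.log10(ALPHABET)

# log10 of the estimated number of functional sequences (value from the abstract)
log10_functional = 39

# log10 of the fraction of sequence space estimated to be functional
log10_fraction = log10_functional - log10_total

print(f"possible sequences  ~ 10^{log10_total:.1f}")
print(f"functional fraction ~ 10^{log10_fraction:.1f}")
```

The total comes out just under 10^82, so the estimated functional set is roughly a 10^-43 fraction of all sequences of that length.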
Affiliation(s)
- Francesco Calvanese
  - Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative – LCQB, Paris, France
  - Laboratoire de Biophysique et Evolution, UMR CNRS-ESPCI 8231 Chimie Biologie Innovation, PSL University, Paris, France
- Camille N Lambert
  - Laboratoire de Biophysique et Evolution, UMR CNRS-ESPCI 8231 Chimie Biologie Innovation, PSL University, Paris, France
- Philippe Nghe
  - Laboratoire de Biophysique et Evolution, UMR CNRS-ESPCI 8231 Chimie Biologie Innovation, PSL University, Paris, France
- Francesco Zamponi
  - Dipartimento di Fisica, Sapienza Università di Roma, Rome, Italy
  - Laboratoire de Physique de l’Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, Paris, France
- Martin Weigt
  - Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative – LCQB, Paris, France
5. Johnson SR, Fu X, Viknander S, Goldin C, Monaco S, Zelezniak A, Yang KK. Computational scoring and experimental evaluation of enzymes generated by neural networks. Nat Biotechnol 2024:10.1038/s41587-024-02214-2. [PMID: 38653796 DOI: 10.1038/s41587-024-02214-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Accepted: 03/20/2024] [Indexed: 04/25/2024]
Abstract
In recent years, generative protein sequence models have been developed to sample novel sequences. However, predicting whether generated proteins will fold and function remains challenging. We evaluate a set of 20 diverse computational metrics to assess the quality of enzyme sequences produced by three contrasting generative models: ancestral sequence reconstruction, a generative adversarial network and a protein language model. Focusing on two enzyme families, we expressed and purified over 500 natural and generated sequences with 70-90% identity to the most similar natural sequences to benchmark computational metrics for predicting in vitro enzyme activity. Over three rounds of experiments, we developed a computational filter that improved the rate of experimental success by 50-150%. The proposed metrics and models will drive protein engineering research by serving as a benchmark for generative protein sequence models and helping to select active variants for experimental testing.
Affiliation(s)
- Xiaozhi Fu
  - Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden
- Sandra Viknander
  - Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden
- Clara Goldin
  - Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden
- Aleksej Zelezniak
  - Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden
  - Institute of Biotechnology, Life Sciences Centre, Vilnius University, Vilnius, Lithuania
  - Randall Centre for Cell & Molecular Biophysics, King's College London, Guy's Campus, London, UK
6. Shibata M, Lin X, Onuchic JN, Yura K, Cheng RR. Residue coevolution and mutational landscape for OmpR and NarL response regulator subfamilies. Biophys J 2024; 123:681-692. [PMID: 38291753 PMCID: PMC10995415 DOI: 10.1016/j.bpj.2024.01.028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2023] [Revised: 12/31/2023] [Accepted: 01/24/2024] [Indexed: 02/01/2024] Open
Abstract
DNA-binding response regulators (DBRRs) are a broad class of proteins that operate in tandem with their partner kinase proteins to form two-component signal transduction systems in bacteria. Typical DBRRs are composed of two domains where the conserved N-terminal domain accepts transduced signals and the evolutionarily diverse C-terminal domain binds to DNA. These domains are assumed to be functionally independent, and hence recombination of the two domains should yield novel DBRRs of arbitrary input/output response, which can be used as biosensors. This idea has been proved to be successful in some cases; yet, the error rate is not trivial. Improvement of the success rate of this technique requires a deeper understanding of the linker-domain and inter-domain residue interactions, which have not yet been thoroughly examined. Here, we studied residue coevolution of DBRRs of the two main subfamilies (OmpR and NarL) using large collections of bacterial amino acid sequences to extensively investigate the evolutionary signatures of linker-domain and inter-domain residue interactions. Coevolutionary analysis uncovered evolutionarily selected linker-domain and inter-domain residue interactions of known experimental structures, as well as previously unknown inter-domain residue interactions. We examined the possibility of these inter-domain residue interactions as contacts that stabilize an inactive conformation of the DBRR where DNA binding is inhibited for both subfamilies. The newly gained insights on linker-domain/inter-domain residue interactions and shared inactivation mechanisms improve the understanding of the functional mechanism of DBRRs, providing clues to efficiently create functional DBRR-based biosensors. Additionally, we show the feasibility of applying coevolutionary landscape models to predict the functionality of domain-swapped DBRR proteins. The presented result demonstrates that sequence information can be used to filter out bioengineered DBRR proteins that are predicted to be nonfunctional due to a high negative predictive value.
Affiliation(s)
- Mayu Shibata
  - Graduate School of Humanities and Sciences, Ochanomizu University, Bunkyo, Tokyo, Japan; Center for Theoretical Biological Physics, Rice University, Houston, Texas
- Xingcheng Lin
  - Department of Physics, North Carolina State University, Raleigh, North Carolina; Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina
- José N Onuchic
  - Center for Theoretical Biological Physics, Rice University, Houston, Texas; Department of Physics and Astronomy, Chemistry, and Biosciences, Rice University, Houston, Texas
- Kei Yura
  - Graduate School of Humanities and Sciences, Ochanomizu University, Bunkyo, Tokyo, Japan; Center for Interdisciplinary AI and Data Science, Ochanomizu University, Bunkyo, Tokyo, Japan; Graduate School of Advanced Science and Engineering, Waseda University, Shinjuku, Tokyo, Japan
- Ryan R Cheng
  - Department of Chemistry, University of Kentucky, Lexington, Kentucky
7. Sumi S, Hamada M, Saito H. Deep generative design of RNA family sequences. Nat Methods 2024; 21:435-443. [PMID: 38238559 DOI: 10.1038/s41592-023-02148-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2023] [Accepted: 12/07/2023] [Indexed: 03/13/2024]
Abstract
RNA engineering has immense potential to drive innovation in biotechnology and medicine. Despite its importance, a versatile platform for the automated design of functional RNA is still lacking. Here, we propose RNA family sequence generator (RfamGen), a deep generative model that designs RNA family sequences in a data-efficient manner by explicitly incorporating alignment and consensus secondary structure information. RfamGen can generate novel and functional RNA family sequences by sampling points from a semantically rich and continuous representation. We have experimentally demonstrated the versatility of RfamGen using diverse RNA families. Furthermore, we confirmed the high success rate of RfamGen in designing functional ribozymes through a quantitative massively parallel assay. Notably, RfamGen successfully generates artificial sequences with higher activity than natural sequences. Overall, RfamGen significantly improves our ability to design functional RNA and opens up new potential for generative RNA engineering in synthetic biology.
Affiliation(s)
- Shunsuke Sumi
  - Department of Life Science Frontiers, Center for iPS Cell Research and Application (CiRA), Kyoto University, Kyoto, Japan
  - Graduate School of Medicine, Kyoto University, Kyoto, Japan
  - Graduate School of Advanced Science and Engineering, Waseda University, Tokyo, Japan
- Michiaki Hamada
  - Graduate School of Advanced Science and Engineering, Waseda University, Tokyo, Japan
  - Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
  - Graduate School of Medicine, Nippon Medical School, Tokyo, Japan
- Hirohide Saito
  - Department of Life Science Frontiers, Center for iPS Cell Research and Application (CiRA), Kyoto University, Kyoto, Japan
  - Graduate School of Medicine, Kyoto University, Kyoto, Japan
8. Chu HY, Fong JHC, Thean DGL, Zhou P, Fung FKC, Huang Y, Wong ASL. Accurate top protein variant discovery via low-N pick-and-validate machine learning. Cell Syst 2024; 15:193-203.e6. [PMID: 38340729 DOI: 10.1016/j.cels.2024.01.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 03/21/2023] [Revised: 10/11/2023] [Accepted: 01/18/2024] [Indexed: 02/12/2024]
Abstract
A strategy to obtain the greatest number of best-performing variants with the least amount of experimental effort over the vast combinatorial mutational landscape would have enormous utility in boosting resource producibility for protein engineering. Toward this goal, we present a simple and effective machine learning-based strategy that outperforms other state-of-the-art methods. Our strategy integrates zero-shot prediction and multi-round sampling to direct active learning via experimenting with only a few predicted top variants. We find that four rounds of low-N pick-and-validate sampling of 12 variants for machine learning yielded the best accuracy of up to 92.6% in selecting the true top 1% variants in combinatorial mutant libraries, whereas two rounds of 24 variants can also be used. We demonstrate our strategy in successfully discovering high-performance protein variants from diverse families including the CRISPR-based genome editors, supporting its generalizable application for solving protein engineering tasks. A record of this paper's transparent peer review process is included in the supplemental information.
Affiliation(s)
- Hoi Yee Chu
  - Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
- John H C Fong
  - Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
- Dawn G L Thean
  - Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
- Peng Zhou
  - Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
- Frederic K C Fung
  - Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
- Yuanhua Huang
  - School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Department of Statistics and Actuarial Science, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
- Alan S L Wong
  - Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
9. Alvarez S, Nartey CM, Mercado N, de la Paz JA, Huseinbegovic T, Morcos F. In vivo functional phenotypes from a computational epistatic model of evolution. Proc Natl Acad Sci U S A 2024; 121:e2308895121. [PMID: 38285950 PMCID: PMC10861889 DOI: 10.1073/pnas.2308895121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 12/19/2023] [Indexed: 01/31/2024] Open
Abstract
Computational models of evolution are valuable for understanding the dynamics of sequence variation, to infer phylogenetic relationships or potential evolutionary pathways and for biomedical and industrial applications. Despite these benefits, few have validated their propensities to generate outputs with in vivo functionality, which would enhance their value as accurate and interpretable evolutionary algorithms. We demonstrate the power of epistasis inferred from natural protein families to evolve sequence variants in an algorithm we developed called sequence evolution with epistatic contributions (SEEC). Utilizing the Hamiltonian of the joint probability of sequences in the family as fitness metric, we sampled and experimentally tested for in vivo β-lactamase activity in Escherichia coli TEM-1 variants. These evolved proteins can have dozens of mutations dispersed across the structure while preserving sites essential for both catalysis and interactions. Remarkably, these variants retain family-like functionality while being more active than their wild-type predecessor. We found that depending on the inference method used to generate the epistatic constraints, different parameters simulate diverse selection strengths. Under weaker selection, local Hamiltonian fluctuations reliably predict relative changes to variant fitness, recapitulating neutral evolution. SEEC has the potential to explore the dynamics of neofunctionalization, characterize viral fitness landscapes, and facilitate vaccine development.
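As a rough illustration of what "the Hamiltonian of the joint probability of sequences" can mean, the sketch below evaluates a generic Potts-style energy of the kind commonly used in direct coupling analysis, H(s) = -Σ_i h_i(s_i) - Σ_{i<j} J_ij(s_i, s_j), with P(s) proportional to exp(-H(s)). The abstract does not spell out the SEEC parametrization, so this functional form, the function name `potts_hamiltonian`, and all parameter values are assumptions made for illustration only.

```python
from itertools import combinations


def potts_hamiltonian(seq, fields, couplings):
    """H(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j).

    Lower H corresponds to higher probability under P(s) ~ exp(-H(s)).
    Parameters missing from the dictionaries are treated as zero.
    """
    energy = 0.0
    for i, residue in enumerate(seq):
        energy -= fields[i].get(residue, 0.0)
    for i, j in combinations(range(len(seq)), 2):
        energy -= couplings.get((i, j), {}).get((seq[i], seq[j]), 0.0)
    return energy


# Toy, made-up parameters for a length-3 sequence (not inferred from any data)
fields = [{"A": 0.5, "G": 0.1}, {"K": 0.2}, {"D": 0.3, "E": 0.3}]
couplings = {(1, 2): {("K", "D"): 1.0, ("K", "E"): 0.4}}

wild_type = "AKD"
mutant = "AKE"

# A positive difference means the mutant is less favorable under the model
delta_h = (potts_hamiltonian(mutant, fields, couplings)
           - potts_hamiltonian(wild_type, fields, couplings))
print(f"delta H = {delta_h:.2f}")
```

Here the two sequences have identical single-site terms, so the whole energy difference comes from the pairwise coupling, which is the epistatic contribution such models are meant to capture.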
Affiliation(s)
- Sophia Alvarez
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Charisse M. Nartey
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Nicholas Mercado
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Tea Huseinbegovic
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Faruck Morcos
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
  - Department of Bioengineering, University of Texas at Dallas, Richardson, TX 75080
  - Center for Systems Biology, University of Texas at Dallas, Richardson, TX 75080
10. Wu KE, Yang KK, van den Berg R, Alamdari S, Zou JY, Lu AX, Amini AP. Protein structure generation via folding diffusion. Nat Commun 2024; 15:1059. [PMID: 38316764 PMCID: PMC10844308 DOI: 10.1038/s41467-024-45051-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 07/18/2023] [Accepted: 01/12/2024] [Indexed: 02/07/2024] Open
Abstract
The ability to computationally generate novel yet physically foldable protein structures could lead to new biological discoveries and new treatments targeting yet incurable diseases. Despite recent advances in protein structure prediction, directly generating diverse, novel protein structures from neural networks remains difficult. In this work, we present a diffusion-based generative model that generates protein backbone structures via a procedure inspired by the natural folding process. We describe a protein backbone structure as a sequence of angles capturing the relative orientation of the constituent backbone atoms, and generate structures by denoising from a random, unfolded state towards a stable folded structure. Not only does this mirror how proteins natively twist into energetically favorable conformations, the inherent shift and rotational invariance of this representation crucially alleviates the need for more complex equivariant networks. We train a denoising diffusion probabilistic model with a simple transformer backbone and demonstrate that our resulting model unconditionally generates highly realistic protein structures with complexity and structural patterns akin to those of naturally-occurring proteins. As a useful resource, we release an open-source codebase and trained models for protein structure diffusion.
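The representation described above, a backbone encoded as a sequence of angles that is progressively noised and then denoised, can be sketched for the forward (noising) direction only. This is a minimal illustration under stated assumptions: it uses the standard denoising-diffusion forward step x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, a linear β schedule with arbitrary endpoints, and wrapping of angles to [-π, π). The function names, the schedule, and the example angles are hypothetical and are not taken from the released codebase.

```python
import math
import random


def wrap_angle(theta):
    """Map an angle in radians onto the interval [-pi, pi)."""
    return (theta + math.pi) % (2 * math.pi) - math.pi


def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t under a linear beta schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def noise_angles(angles, t, num_steps, rng):
    """Sample noised angles at step t from clean angles, then wrap to [-pi, pi)."""
    a_bar = alpha_bar(t, num_steps)
    return [wrap_angle(math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0))
            for x in angles]


rng = random.Random(0)
clean = [-1.0, -0.8, 2.4, 3.0]   # made-up angles in radians
slightly_noised = noise_angles(clean, t=10, num_steps=1000, rng=rng)
heavily_noised = noise_angles(clean, t=1000, num_steps=1000, rng=rng)
print(slightly_noised)
print(heavily_noised)
```

Because angles are intrinsic to the chain, translating or rotating the whole structure leaves them unchanged, which is the invariance the abstract refers to; a trained network would learn to reverse the noising steps.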
Affiliation(s)
- Kevin E Wu
  - Department of Computer Science, Stanford University, Stanford, CA, USA
  - Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA, USA
  - Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
- James Y Zou
  - Department of Computer Science, Stanford University, Stanford, CA, USA
  - Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
- Alex X Lu
  - Microsoft Research, Cambridge, MA, USA
11. Pucci F, Zerihun MB, Rooman M, Schug A. pycofitness-Evaluating the fitness landscape of RNA and protein sequences. Bioinformatics 2024; 40:btae074. [PMID: 38335928 PMCID: PMC10881095 DOI: 10.1093/bioinformatics/btae074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2023] [Revised: 01/25/2024] [Accepted: 02/06/2024] [Indexed: 02/12/2024] Open
Abstract
MOTIVATION: The accurate prediction of how mutations change biophysical properties of proteins or RNA is a major goal in computational biology with tremendous impacts on protein design and genetic variant interpretation. Evolutionary approaches such as coevolution can help solve this issue. RESULTS: We present pycofitness, a standalone Python-based software package for the in silico mutagenesis of protein and RNA sequences. It is based on coevolution and, more specifically, on a popular inverse statistical approach, namely direct coupling analysis by pseudo-likelihood maximization. Its efficient implementation and user-friendly command line interface make it an easy-to-use tool even for researchers with no bioinformatics background. To illustrate its strengths, we present three applications in which pycofitness efficiently predicts the deleteriousness of genetic variants and the effect of mutations on protein fitness and thermodynamic stability. AVAILABILITY AND IMPLEMENTATION: https://github.com/KIT-MBS/pycofitness.
Affiliation(s)
- Fabrizio Pucci
  - Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050 Brussels, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, 1050 Brussels, Belgium
- Mehari B Zerihun
  - John von Neumann Institute for Computing, Jülich Supercomputer Centre, 52428 Jülich, Germany
- Marianne Rooman
  - Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050 Brussels, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, 1050 Brussels, Belgium
- Alexander Schug
  - John von Neumann Institute for Computing, Jülich Supercomputer Centre, 52428 Jülich, Germany
  - Department of Biology, University of Duisburg-Essen, D-45141 Essen, Germany
12.
Abstract
Machine learning-based design has gained traction in the sciences, most notably in the design of small molecules, materials, and proteins, with societal applications ranging from drug development and plastic degradation to carbon sequestration. When designing objects to achieve novel property values with machine learning, one faces a fundamental challenge: how to push past the frontier of current knowledge, distilled from the training data into the model, in a manner that rationally controls the risk of failure. If one trusts learned models too much in extrapolation, one is likely to design rubbish. In contrast, if one does not extrapolate, one cannot find novelty. Herein, we ponder how one might strike a useful balance between these two extremes. We focus in particular on designing proteins with novel property values, although much of our discussion is relevant to machine learning-based design more broadly.
Affiliation(s)
- Clara Fannjiang
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720, USA
- Jennifer Listgarten
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720, USA
13. Min X, Yang C, Xie J, Huang Y, Liu N, Jin X, Wang T, Kong Z, Lu X, Ge S, Zhang J, Xia N. Tpgen: a language model for stable protein design with a specific topology structure. BMC Bioinformatics 2024; 25:35. [PMID: 38254030 PMCID: PMC10804651 DOI: 10.1186/s12859-024-05637-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Accepted: 01/03/2024] [Indexed: 01/24/2024] Open
Abstract
BACKGROUND: Natural proteins occupy a small portion of the protein sequence space, whereas artificial proteins can explore a wider range of possibilities within the sequence space. However, specific requirements may not be met when generating sequences blindly. Research indicates that small proteins have notable advantages, including high stability, accurate resolution prediction, and facile specificity modification. RESULTS: This study involves the construction of a neural network model named TopoProGenerator (TPGen) using a transformer decoder. The model is trained with sequences consisting of a maximum of 65 amino acids. The training process of TopoProGenerator incorporates reinforcement learning and adversarial learning for fine-tuning. Additionally, it encompasses a stability predictive model trained with a dataset comprising over 200,000 sequences. The results demonstrate that TopoProGenerator is capable of designing stable small protein sequences with specified topology structures. CONCLUSION: TPGen has the ability to generate protein sequences that fold into the specified topology, and the pretraining and fine-tuning methods proposed in this study can serve as a framework for designing various types of proteins.
Affiliation(s)
- Xiaoping Min
  - School of Informatics, Institute of Artificial Intelligence, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
  - National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
  - State Key Laboratory of Vaccines for Infectious Diseases, Xiang An Biomedicine Laboratory, No. 422 Siming South Rd, Xiamen, 361005, China
- Chongzhou Yang
  - School of Informatics, Institute of Artificial Intelligence, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
  - National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
- Jun Xie
  - School of Informatics, Institute of Artificial Intelligence, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
  - National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
- Yang Huang
  - National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
  - School of Life Sciences, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
- Nan Liu
  - National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
  - School of Public Health, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
- Xiaocheng Jin
  - National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
  - School of Public Health, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
- Tianshu Wang
  - School of Informatics, Institute of Artificial Intelligence, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
  - National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
- Zhibo Kong
  - National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
  - School of Public Health, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
  - State Key Laboratory of Vaccines for Infectious Diseases, Xiang An Biomedicine Laboratory, No. 422 Siming South Rd, Xiamen, 361005, China
- Xiaoli Lu
  - Information and Networking Center, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
- Shengxiang Ge
  - National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
  - School of Public Health, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
  - State Key Laboratory of Vaccines for Infectious Diseases, Xiang An Biomedicine Laboratory, No. 422 Siming South Rd, Xiamen, 361005, China
- Jun Zhang
  - National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
  - School of Public Health, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
  - State Key Laboratory of Vaccines for Infectious Diseases, Xiang An Biomedicine Laboratory, No. 422 Siming South Rd, Xiamen, 361005, China
- Ningshao Xia
  - National Institute of Diagnostics and Vaccine Development in Infectious Diseases, State Key Laboratory of Molecular Vaccinology and Molecular Diagnostics, Collaborative Innovation Centers of Biologic Products, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
  - School of Public Health, Xiamen University, No. 422 Siming South Rd, Xiamen, 361005, China
  - State Key Laboratory of Vaccines for Infectious Diseases, Xiang An Biomedicine Laboratory, No. 422 Siming South Rd, Xiamen, 361005, China
14
Praljak N, Lian X, Ranganathan R, Ferguson AL. ProtWave-VAE: Integrating Autoregressive Sampling with Latent-Based Inference for Data-Driven Protein Design. ACS Synth Biol 2023; 12:3544-3561. [PMID: 37988083 PMCID: PMC10911954 DOI: 10.1021/acssynbio.3c00261] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 11/22/2023]
Abstract
Deep generative models (DGMs) have shown great success in the understanding and data-driven design of proteins. Variational autoencoders (VAEs) are a popular DGM approach that can learn the correlated patterns of amino acid mutations within a multiple sequence alignment (MSA) of protein sequences and distill this information into a low-dimensional latent space to expose phylogenetic and functional relationships and guide generative protein design. Autoregressive (AR) models are another popular DGM approach that typically lacks a low-dimensional latent embedding but does not require training sequences to be aligned into an MSA and enables the design of variable-length proteins. In this work, we propose ProtWave-VAE as a novel and lightweight DGM, employing an information-maximizing VAE with a dilated convolution encoder and an autoregressive WaveNet decoder. This architecture blends the strengths of the VAE and AR paradigms in enabling training over unaligned sequence data and the conditional generative design of variable-length sequences from an interpretable, low-dimensional learned latent space. We evaluated the model's ability to infer patterns and design rules within alignment-free homologous protein family sequences and to design novel synthetic proteins in four diverse protein families. We show that our model can infer meaningful functional and phylogenetic embeddings within latent spaces and make highly accurate predictions within semisupervised downstream fitness prediction tasks. In an application to the C-terminal SH3 domain in the Sho1 transmembrane osmosensing receptor in baker's yeast, we subject ProtWave-VAE-designed sequences to experimental gene synthesis and select-seq assays for the osmosensing function to show that the model enables synthetic protein design, conditional C-terminus diversification, and engineering of the osmosensing function into SH3 paralogues.
Affiliation(s)
- Nikša Praljak
  - Graduate Program in Biophysical Sciences, University of Chicago, Chicago, Illinois 60637, United States
- Xinran Lian
  - Department of Chemistry, University of Chicago, Chicago, Illinois 60637, United States
- Rama Ranganathan
  - Center for Physics of Evolving Systems and Department of Biochemistry and Molecular Biology, University of Chicago, Chicago, Illinois 60637, United States
  - Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
- Andrew L Ferguson
  - Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
15
Barghout RA, Xu Z, Betala S, Mahadevan R. Advances in generative modeling methods and datasets to design novel enzymes for renewable chemicals and fuels. Curr Opin Biotechnol 2023; 84:103007. [PMID: 37931573 DOI: 10.1016/j.copbio.2023.103007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2023] [Revised: 09/12/2023] [Accepted: 09/13/2023] [Indexed: 11/08/2023]
Abstract
Biotechnology has revolutionized the development of sustainable energy sources by harnessing biomass as a feedstock for energy production. However, challenges such as recalcitrant feedstocks and inefficient metabolic pathways hinder the large-scale integration of renewable energy systems. Enzyme engineering has emerged as a powerful tool to address these challenges by enhancing enzyme activity, specificity, and stability. Generative machine learning (ML) models have shown great promise in accelerating protein design, allowing for the generation of novel protein sequences with desired properties by navigating vast spaces. This review paper aims to summarize the state of the art in generative models for protein design and how they can be applied to bioenergy applications, including the underlying architectures and training strategies. Additionally, it highlights the importance of high-quality datasets for training and evaluating generative models, organizes available datasets for generative protein design, and discusses the potential of applying generative models to strain design for bioenergy production.
Affiliation(s)
- Rana A Barghout
  - Department of Chemical Engineering & Applied Chemistry, University of Toronto, 200 College St, Toronto, ON, Canada
- Zhiqing Xu
  - Department of Chemical Engineering & Applied Chemistry, University of Toronto, 200 College St, Toronto, ON, Canada
- Siddharth Betala
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, India
- Radhakrishnan Mahadevan
  - Department of Chemical Engineering & Applied Chemistry, University of Toronto, 200 College St, Toronto, ON, Canada
16
Abakarova M, Marquet C, Rera M, Rost B, Laine E. Alignment-based Protein Mutational Landscape Prediction: Doing More with Less. Genome Biol Evol 2023; 15:evad201. [PMID: 37936309 PMCID: PMC10653582 DOI: 10.1093/gbe/evad201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2023] [Revised: 10/27/2023] [Accepted: 11/01/2023] [Indexed: 11/09/2023] Open
Abstract
The wealth of genomic data has boosted the development of computational methods predicting the phenotypic outcomes of missense variants. The most accurate ones exploit multiple sequence alignments, which can be costly to generate. Recent efforts for democratizing protein structure prediction have overcome this bottleneck by leveraging the fast homology search of MMseqs2. Here, we show the usefulness of this strategy for mutational outcome prediction through a large-scale assessment of 1.5M missense variants across 72 protein families. Our study demonstrates the feasibility of producing alignment-based mutational landscape predictions that are both high-quality and compute-efficient for entire proteomes. We provide the community with the whole human proteome mutational landscape and simplified access to our predictive pipeline.
Affiliation(s)
- Marina Abakarova
  - CNRS, IBPS, Laboratory of Computational and Quantitative Biology (LCQB), Sorbonne Université, UMR 7238, Paris 75005, France
  - Université Paris Cité, INSERM UMR U1284, 75004 Paris, France
- Céline Marquet
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748 Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany
- Michael Rera
  - Université Paris Cité, INSERM UMR U1284, 75004 Paris, France
- Burkhard Rost
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748 Munich, Germany
  - Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, Garching, 85748 Munich, Germany
  - TUM School of Life Sciences Weihenstephan (TUM-WZW), Alte Akademie 8, Freising, Germany
- Elodie Laine
  - CNRS, IBPS, Laboratory of Computational and Quantitative Biology (LCQB), Sorbonne Université, UMR 7238, Paris 75005, France
  - Institut universitaire de France (IUF)
17
Akl H, Emison B, Zhao X, Mondal A, Perez A, Dixit PD. GENERALIST: A latent space based generative model for protein sequence families. PLoS Comput Biol 2023; 19:e1011655. [PMID: 38011273 PMCID: PMC10703406 DOI: 10.1371/journal.pcbi.1011655] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2023] [Revised: 12/07/2023] [Accepted: 11/03/2023] [Indexed: 11/29/2023] Open
Abstract
Generative models of protein sequence families are an important tool in the repertoire of protein scientists and engineers alike. However, state-of-the-art generative approaches face inference, accuracy, and overfitting-related obstacles when modeling moderately sized to large proteins and/or protein families with low sequence coverage. Here, we present a simple-to-learn, tunable, and accurate generative model, GENERALIST: GENERAtive nonLInear tenSor-factorizaTion for protein sequences. GENERALIST accurately captures several high-order summary statistics of amino acid covariation. GENERALIST also predicts conservative local optimal sequences which are likely to fold into stable 3D structures. Importantly, unlike current methods, the density of sequences in GENERALIST-modeled sequence ensembles closely resembles the corresponding natural ensembles. Finally, GENERALIST embeds protein sequences in an informative latent space. GENERALIST will be an important tool to study protein sequence variability.
Affiliation(s)
- Hoda Akl
  - Department of Physics, University of Florida, Gainesville, Florida, United States of America
- Brooke Emison
  - Department of Biomedical Engineering, Yale University, New Haven, Connecticut, United States of America
- Xiaochuan Zhao
  - Department of Physics, University of Florida, Gainesville, Florida, United States of America
- Arup Mondal
  - Department of Chemistry, University of Florida, Gainesville, Florida, United States of America
- Alberto Perez
  - Department of Chemistry, University of Florida, Gainesville, Florida, United States of America
- Purushottam D. Dixit
  - Department of Biomedical Engineering, Yale University, New Haven, Connecticut, United States of America
  - Systems Biology Institute, Yale University, West Haven, Connecticut, United States of America
18
Malbranke C, Rostain W, Depardieu F, Cocco S, Monasson R, Bikard D. Computational design of novel Cas9 PAM-interacting domains using evolution-based modelling and structural quality assessment. PLoS Comput Biol 2023; 19:e1011621. [PMID: 37976326 PMCID: PMC10729993 DOI: 10.1371/journal.pcbi.1011621] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 05/25/2023] [Revised: 12/19/2023] [Accepted: 10/19/2023] [Indexed: 11/19/2023] Open
Abstract
We present here an approach to protein design that combines (i) scarce functional information such as experimental data, (ii) evolutionary information learned from natural sequence variants, and (iii) physics-grounded modeling. Using a Restricted Boltzmann Machine (RBM), we learn a sequence model of a protein family. We use semi-supervision to leverage available functional information during the RBM training. We then propose a strategy to explore the protein representation space that can be informed by external models such as an empirical force-field method (FoldX). Our approach is applied to a domain of the Cas9 protein responsible for recognition of a short DNA motif. We experimentally assess the functionality of 71 variants generated to explore a range of RBM and FoldX energies. Sequences with as many as 50 differences (20% of the protein domain) to the wild-type retained functionality. Overall, 21/71 sequences designed with our method were functional. Interestingly, 6/71 sequences showed an improved activity in comparison with the original wild-type protein sequence. These results demonstrate the interest in further exploring the synergies between machine learning of protein sequence representations and physics-grounded modeling strategies informed by structural information.
Affiliation(s)
- Cyril Malbranke
  - Laboratory of Physics of the Ecole Normale Superieure, PSL Research, CNRS UMR 8023, Sorbonne Université, Paris, France
  - Institut Pasteur, Université Paris Cité, CNRS UMR 6047, Synthetic Biology, Paris, France
- William Rostain
  - Institut Pasteur, Université Paris Cité, CNRS UMR 6047, Synthetic Biology, Paris, France
- Florence Depardieu
  - Institut Pasteur, Université Paris Cité, CNRS UMR 6047, Synthetic Biology, Paris, France
- Simona Cocco
  - Laboratory of Physics of the Ecole Normale Superieure, PSL Research, CNRS UMR 8023, Sorbonne Université, Paris, France
- Rémi Monasson
  - Laboratory of Physics of the Ecole Normale Superieure, PSL Research, CNRS UMR 8023, Sorbonne Université, Paris, France
- David Bikard
  - Institut Pasteur, Université Paris Cité, CNRS UMR 6047, Synthetic Biology, Paris, France
19
Mardikoraem M, Wang Z, Pascual N, Woldring D. Generative models for protein sequence modeling: recent advances and future directions. Brief Bioinform 2023; 24:bbad358. [PMID: 37864295 PMCID: PMC10589401 DOI: 10.1093/bib/bbad358] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Revised: 09/08/2023] [Accepted: 09/12/2023] [Indexed: 10/22/2023] Open
Abstract
The widespread adoption of high-throughput omics technologies has exponentially increased the amount of protein sequence data involved in many salient disease pathways and their respective therapeutics and diagnostics. Despite the availability of large-scale sequence data, the lack of experimental fitness annotations underpins the need for self-supervised and unsupervised machine learning (ML) methods. These techniques leverage the meaningful features encoded in abundant unlabeled sequences to accomplish complex protein engineering tasks. Proficiency in the rapidly evolving fields of protein engineering and generative AI is required to realize the full potential of ML models as a tool for protein fitness landscape navigation. Here, we support this work by (i) providing an overview of the architecture and mathematical details of the most successful ML models applicable to sequence data (e.g. variational autoencoders, autoregressive models, generative adversarial neural networks, and diffusion models), (ii) guiding how to effectively implement these models on protein sequence data to predict fitness or generate high-fitness sequences, and (iii) highlighting several successful studies that implement these techniques in protein engineering (from paratope regions and subcellular localization prediction to high-fitness sequences and protein design rules generation). By providing a comprehensive survey of model details, novel architecture developments, comparisons of model applications, and current challenges, this study intends to provide structured guidance and a robust framework for delivering a prospective outlook in the ML-driven protein engineering field.
Affiliation(s)
- Mehrsa Mardikoraem
  - Michigan State University (MSU)’s Department of Chemical Engineering and Materials Science
- Zirui Wang
  - Regeneron Pharmaceuticals, Inc. Having received his B.S. in Chemical Engineering from MSU, he is currently pursuing an M.S. in Computer Science from Syracuse University
- Daniel Woldring
  - MSU’s Department of Chemical Engineering and Materials Science and a member of MSU’s Institute for Quantitative Health Sciences and Engineering
20
Mallik BB, Stanislaw J, Alawathurage TM, Khmelinskaia A. De Novo Design of Polyhedral Protein Assemblies: Before and After the AI Revolution. Chembiochem 2023; 24:e202300117. [PMID: 37014094 DOI: 10.1002/cbic.202300117] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 02/15/2023] [Revised: 04/03/2023] [Accepted: 04/03/2023] [Indexed: 04/05/2023]
Abstract
Self-assembling polyhedral protein biomaterials have gained attention as engineering targets owing to their naturally evolved sophisticated functions, ranging from protecting macromolecules from the environment to spatially controlling biochemical reactions. Precise computational design of de novo protein polyhedra is possible through two main types of approaches: methods from first principles, using physical and geometrical rules, and more recent data-driven methods based on artificial intelligence (AI), including deep learning (DL). Here, we review first-principles and AI-based approaches for designing finite polyhedral protein assemblies, as well as advances in the structure prediction of such assemblies. We further highlight the possible applications of these materials and explore how the presented approaches can be combined to overcome current challenges and to advance the design of functional protein-based biomaterials.
Affiliation(s)
- Bhoomika Basu Mallik
  - Transdisciplinary Research Area, "Building Blocks of Matter and Fundamental Interactions (TRA Matter)", University of Bonn, 53121, Bonn, Germany
  - Life and Medical Sciences Institute, University of Bonn, 53115, Bonn, Germany
- Jenna Stanislaw
  - Transdisciplinary Research Area, "Building Blocks of Matter and Fundamental Interactions (TRA Matter)", University of Bonn, 53121, Bonn, Germany
  - Life and Medical Sciences Institute, University of Bonn, 53115, Bonn, Germany
- Tharindu Madhusankha Alawathurage
  - Transdisciplinary Research Area, "Building Blocks of Matter and Fundamental Interactions (TRA Matter)", University of Bonn, 53121, Bonn, Germany
  - Life and Medical Sciences Institute, University of Bonn, 53115, Bonn, Germany
- Alena Khmelinskaia
  - Transdisciplinary Research Area, "Building Blocks of Matter and Fundamental Interactions (TRA Matter)", University of Bonn, 53121, Bonn, Germany
  - Life and Medical Sciences Institute, University of Bonn, 53115, Bonn, Germany
  - Current address: Department of Chemistry, Ludwig Maximillian University, 80539, Munich, Germany
21
Ziegler C, Martin J, Sinner C, Morcos F. Latent generative landscapes as maps of functional diversity in protein sequence space. Nat Commun 2023; 14:2222. [PMID: 37076519 PMCID: PMC10113739 DOI: 10.1038/s41467-023-37958-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 08/19/2022] [Accepted: 04/05/2023] [Indexed: 04/21/2023] Open
Abstract
Variational autoencoders are unsupervised learning models with generative capabilities; when applied to protein data, they classify sequences by phylogeny and generate de novo sequences that preserve statistical properties of protein composition. While previous studies focus on clustering and generative features, here, we evaluate the underlying latent manifold in which sequence information is embedded. To investigate properties of the latent manifold, we utilize direct coupling analysis and a Potts Hamiltonian model to construct a latent generative landscape. We showcase how this landscape captures phylogenetic groupings and functional and fitness properties of several systems, including globins, β-lactamases, ion channels, and transcription factors. We provide support on how the landscape helps us understand the effects of sequence variability observed in experimental data and provides insights on directed and natural protein evolution. We propose that combining generative properties and functional predictive power of variational autoencoders and coevolutionary analysis could be beneficial in applications for protein engineering and design.
Affiliation(s)
- Cheyenne Ziegler
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
- Jonathan Martin
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
- Claude Sinner
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
- Faruck Morcos
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
  - Department of Bioengineering, University of Texas at Dallas, Richardson, TX, 75080, USA
  - Center for Systems Biology, University of Texas at Dallas, Richardson, TX, 75080, USA
22
Budzynski L, Pagnani A. Small-coupling expansion for multiple sequence alignment. Phys Rev E 2023; 107:044125. [PMID: 37198812 DOI: 10.1103/physreve.107.044125] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2022] [Accepted: 03/27/2023] [Indexed: 05/19/2023]
Abstract
The alignment of biological sequences such as DNA, RNA, and proteins is one of the basic tools that allow the detection of evolutionary patterns, as well as functional or structural characterizations between homologous sequences in different organisms. Typically, state-of-the-art bioinformatics tools are based on profile models that assume the statistical independence of the different sites of the sequences. In recent years, it has become increasingly clear that homologous sequences show complex patterns of long-range correlations over the primary sequence as a consequence of the natural evolution process that selects genetic variants under the constraint of preserving the functional or structural determinants of the sequence. Here, we present an alignment algorithm based on message-passing techniques that overcomes the limitations of profile models. Our method is based on a perturbative small-coupling expansion of the free energy of the model that assumes a linear chain approximation as the zeroth order of the expansion. We test the potential of the algorithm against standard competing strategies on several biological sequences.
Affiliation(s)
- Louise Budzynski
  - DISAT, Politecnico di Torino, Corso Duca degli Abruzzi, 24, I-10129, Torino, Italy
  - Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060, Candiolo, Italy
- Andrea Pagnani
  - DISAT, Politecnico di Torino, Corso Duca degli Abruzzi, 24, I-10129, Torino, Italy
  - Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060, Candiolo, Italy
  - INFN, Sezione di Torino, Torino, Via Pietro Giuria, 1 10125 Torino Italy
23
Malbranke C, Bikard D, Cocco S, Monasson R, Tubiana J. Machine learning for evolutionary-based and physics-inspired protein design: Current and future synergies. Curr Opin Struct Biol 2023; 80:102571. [PMID: 36947951 DOI: 10.1016/j.sbi.2023.102571] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 08/01/2022] [Revised: 01/29/2023] [Accepted: 02/07/2023] [Indexed: 03/24/2023]
Abstract
Computational protein design facilitates the discovery of novel proteins with prescribed structure and functionality. Exciting designs were recently reported using novel data-driven methodologies that can be roughly divided into two categories: evolutionary-based and physics-inspired approaches. The former infer characteristic sequence features shared by sets of evolutionary-related proteins, such as conserved or coevolving positions, and recombine them to generate candidates with similar structure and function. The latter approaches estimate key biochemical properties, such as structure free energy, conformational entropy, or binding affinities using machine learning surrogates, and optimize them to yield improved designs. Here, we review recent progress along both tracks, discuss their strengths and weaknesses, and highlight opportunities for synergistic approaches.
Affiliation(s)
- Cyril Malbranke
  - Laboratory of Physics of the Ecole Normale Supérieure, PSL Research, CNRS UMR 8023, Sorbonne Université, Université de Paris, Paris, France
  - Institut Pasteur, Université Paris Cité, CNRS UMR 6047, Synthetic Biology, 75015 Paris, France
- David Bikard
  - Institut Pasteur, Université Paris Cité, CNRS UMR 6047, Synthetic Biology, 75015 Paris, France
- Simona Cocco
  - Laboratory of Physics of the Ecole Normale Supérieure, PSL Research, CNRS UMR 8023, Sorbonne Université, Université de Paris, Paris, France
- Rémi Monasson
  - Laboratory of Physics of the Ecole Normale Supérieure, PSL Research, CNRS UMR 8023, Sorbonne Université, Université de Paris, Paris, France
- Jérôme Tubiana
  - Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
24
Inferring protein fitness landscapes from laboratory evolution experiments. PLoS Comput Biol 2023; 19:e1010956. [PMID: 36857380 PMCID: PMC10010530 DOI: 10.1371/journal.pcbi.1010956] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 10/05/2022] [Revised: 03/13/2023] [Accepted: 02/16/2023] [Indexed: 03/02/2023] Open
Abstract
Directed laboratory evolution applies iterative rounds of mutation and selection to explore the protein fitness landscape and provides rich information regarding the underlying relationships between protein sequence, structure, and function. Laboratory evolution data consist of protein sequences sampled from evolving populations over multiple generations and this data type does not fit into established supervised and unsupervised machine learning approaches. We develop a statistical learning framework that models the evolutionary process and can infer the protein fitness landscape from multiple snapshots along an evolutionary trajectory. We apply our modeling approach to dihydrofolate reductase (DHFR) laboratory evolution data and the resulting landscape parameters capture important aspects of DHFR structure and function. We use the resulting model to understand the structure of the fitness landscape and find numerous examples of epistasis but an overall global peak that is evolutionarily accessible from most starting sequences. Finally, we use the model to perform an in silico extrapolation of the DHFR laboratory evolution trajectory and computationally design proteins from future evolutionary rounds.
25
Clifton BE, Kozome D, Laurino P. Efficient Exploration of Sequence Space by Sequence-Guided Protein Engineering and Design. Biochemistry 2023; 62:210-220. [PMID: 35245020 DOI: 10.1021/acs.biochem.1c00757] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 02/02/2023]
Abstract
The rapid growth of sequence databases over the past two decades means that protein engineers faced with optimizing a protein for any given task will often have immediate access to a vast number of related protein sequences. These sequences encode information about the evolutionary history of the protein and the underlying sequence requirements to produce folded, stable, and functional protein variants. Methods that can take advantage of this information are an increasingly important part of the protein engineering tool kit. In this Perspective, we discuss the utility of sequence data in protein engineering and design, focusing on recent advances in three main areas: the use of ancestral sequence reconstruction as an engineering tool to generate thermostable and multifunctional proteins, the use of sequence data to guide engineering of multipoint mutants by structure-based computational protein design, and the use of unlabeled sequence data for unsupervised and semisupervised machine learning, allowing the generation of diverse and functional protein sequences in unexplored regions of sequence space. Altogether, these methods enable the rapid exploration of sequence space within regions enriched with functional proteins and therefore have great potential for accelerating the engineering of stable, functional, and diverse proteins for industrial and biomedical applications.
Affiliation(s)
- Ben E Clifton
- Protein Engineering and Evolution Unit, Okinawa Institute of Science and Technology, 1919-1 Tancha, Onna, Okinawa 904-0495, Japan
- Dan Kozome
- Protein Engineering and Evolution Unit, Okinawa Institute of Science and Technology, 1919-1 Tancha, Onna, Okinawa 904-0495, Japan
- Paola Laurino
- Protein Engineering and Evolution Unit, Okinawa Institute of Science and Technology, 1919-1 Tancha, Onna, Okinawa 904-0495, Japan
26
Schmitt LT, Paszkowski-Rogacz M, Jug F, Buchholz F. Prediction of designer-recombinases for DNA editing with generative deep learning. Nat Commun 2022; 13:7966. [PMID: 36575171 PMCID: PMC9794738 DOI: 10.1038/s41467-022-35614-6] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 03/09/2022] [Accepted: 12/14/2022] [Indexed: 12/28/2022] Open
Abstract
Site-specific tyrosine-type recombinases are effective tools for genome engineering, with the first engineered variants having demonstrated therapeutic potential. So far, adaptation to new DNA target site selectivity of designer-recombinases has been achieved mostly through iterative cycles of directed molecular evolution. While effective, directed molecular evolution methods are laborious and time consuming. Here we present RecGen (Recombinase Generator), an algorithm for the intelligent generation of designer-recombinases. We gather the sequence information of over one million Cre-like recombinase sequences evolved for 89 different target sites with which we train Conditional Variational Autoencoders for recombinase generation. Experimental validation demonstrates that the algorithm can predict recombinase sequences with activity on novel target-sites, indicating that RecGen is useful to accelerate the development of future designer-recombinases.
Affiliation(s)
- Lukas Theo Schmitt
- Medical Systems Biology, Medical Faculty, TU Dresden, 01307, Dresden, Germany
- Florian Jug
- Fondazione Human Technopole, Milano, Italy
- Center for Systems Biology Dresden, Dresden, Germany
- Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- Frank Buchholz
- Medical Systems Biology, Medical Faculty, TU Dresden, 01307, Dresden, Germany.
27
A Bayesian generative neural network framework for epidemic inference problems. Sci Rep 2022; 12:19673. [PMID: 36385141 PMCID: PMC9667449 DOI: 10.1038/s41598-022-20898-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/28/2022] [Accepted: 09/20/2022] [Indexed: 11/17/2022] Open
Abstract
The reconstruction of missing information in epidemic spreading on contact networks can be essential in the prevention and containment strategies. The identification and warning of infectious but asymptomatic individuals (i.e., contact tracing), the well-known patient-zero problem, or the inference of the infectivity values in structured populations are examples of significant epidemic inference problems. As the number of possible epidemic cascades grows exponentially with the number of individuals involved and only an almost negligible subset of them is compatible with the observations (e.g., medical tests), epidemic inference in contact networks poses incredible computational challenges. We present a new generative neural networks framework that learns to generate the most probable infection cascades compatible with observations. The proposed method achieves better (in some cases, significantly better) or comparable results with existing methods in all problems considered both in synthetic and real contact networks. Given its generality, clear Bayesian and variational nature, the presented framework paves the way to solve fundamental inference epidemic problems with high precision in small and medium-sized real case scenarios such as the spread of infections in workplaces and hospitals.
28
Ravishankar K, Jiang X, Leddin EM, Morcos F, Cisneros GA. Computational compensatory mutation discovery approach: Predicting a PARP1 variant rescue mutation. Biophys J 2022; 121:3663-3673. [PMID: 35642254 PMCID: PMC9617126 DOI: 10.1016/j.bpj.2022.05.036] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/20/2021] [Revised: 05/20/2022] [Accepted: 05/23/2022] [Indexed: 11/02/2022] Open
Abstract
The prediction of protein mutations that affect function may be exploited for multiple uses. In the context of disease variants, the prediction of compensatory mutations that reestablish functional phenotypes could aid in the development of genetic therapies. In this work, we present an integrated approach that combines coevolutionary analysis and molecular dynamics (MD) simulations to discover functional compensatory mutations. This approach is employed to investigate possible rescue mutations of a poly(ADP-ribose) polymerase 1 (PARP1) variant, PARP1 V762A, associated with lung cancer and follicular lymphoma. MD simulations show PARP1 V762A exhibits noticeable changes in structural and dynamical behavior compared with wild-type (WT) PARP1. Our integrated approach predicts A755E as a possible compensatory mutation based on coevolutionary information, and molecular simulations indicate that the PARP1 A755E/V762A double mutant exhibits similar structural and dynamical behavior to WT PARP1. Our methodology can be broadly applied to a large number of systems where single-nucleotide polymorphisms have been identified as connected to disease and can shed light on the biophysical effects of such changes as well as provide a way to discover potential mutants that could restore WT-like functionality. This can, in turn, be further utilized in the design of molecular therapeutics that aim to mimic such compensatory effect.
Affiliation(s)
- Xianli Jiang
- Department of Biological Sciences, The University of Texas at Dallas, Richardson, Texas; Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, Texas
- Emmett M Leddin
- Department of Chemistry, University of North Texas, Denton, Texas
- Faruck Morcos
- Department of Biological Sciences, The University of Texas at Dallas, Richardson, Texas; Department of Bioengineering, The University of Texas at Dallas, Richardson, Texas; Center for Systems Biology, The University of Texas at Dallas, Richardson, Texas.
- G Andrés Cisneros
- Department of Chemistry, University of North Texas, Denton, Texas; Department of Physics, The University of Texas at Dallas, Richardson, Texas; Department of Chemistry, The University of Texas at Dallas, Richardson, Texas.
29
Feinauer C, Meynard-Piganeau B, Lucibello C. Interpretable pairwise distillations for generative protein sequence models. PLoS Comput Biol 2022; 18:e1010219. [PMID: 35737722 PMCID: PMC9258900 DOI: 10.1371/journal.pcbi.1010219] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2021] [Revised: 07/06/2022] [Accepted: 05/17/2022] [Indexed: 11/25/2022] Open
Abstract
Many different types of generative models for protein sequences have been proposed in the literature. Their uses include the prediction of mutational effects, protein design and the prediction of structural properties. Neural network (NN) architectures have shown strong performance, commonly attributed to the capacity to extract non-trivial higher-order interactions from the data. In this work, we analyze two different NN models and assess how close they are to simple pairwise distributions, which have been used in the past for similar problems. We present an approach for extracting pairwise models from more complex ones using an energy-based modeling framework. We show that for the tested models the extracted pairwise models can replicate the energies of the original models and are also close in performance in tasks like mutational effect prediction. In addition, we show that even simpler, factorized models often come close in performance to the original models. Complex neural networks trained on large biological datasets have recently shown powerful capabilities in tasks like the prediction of protein structure, assessing the effect of mutations on the fitness of proteins and even designing completely novel proteins with desired characteristics. The enthralling prospect of leveraging these advances in fields like medicine and synthetic biology has created a large amount of interest in academic research and industry. The connected question of what biological insights these methods actually gain during training has, however, received less attention. In this work, we systematically investigate to what extent neural networks capture information that could not be captured by simpler models. To this end, we develop a method to train simpler models to imitate more complex models, and compare their performance to the original neural network models.
Surprisingly, we find that the simpler models thus trained often perform on par with the neural networks, while having a considerably easier structure. This highlights the importance of finding ways to interpret the predictions of neural networks in these fields, which could inform the creation of better models, improve methods for their assessment and ultimately also increase our understanding of the underlying biology.
Affiliation(s)
- Christoph Feinauer
- Department of Computing Sciences, Bocconi University, Milan, Italy
- Bocconi Institute for Data Science and Analytics (BIDSA), Milan, Italy
- Barthelemy Meynard-Piganeau
- Laboratory of Computational and Quantitative Biology (LCQB) UMR 7238 CNRS, Sorbonne Université, Paris, France
- Department of Applied Science and Technologies (DISAT), Politecnico di Torino, Turin, Italy
- Carlo Lucibello
- Department of Computing Sciences, Bocconi University, Milan, Italy
- Bocconi Institute for Data Science and Analytics (BIDSA), Milan, Italy
30
Gerardos A, Dietler N, Bitbol AF. Correlations from structure and phylogeny combine constructively in the inference of protein partners from sequences. PLoS Comput Biol 2022; 18:e1010147. [PMID: 35576238 PMCID: PMC9135348 DOI: 10.1371/journal.pcbi.1010147] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/22/2021] [Revised: 05/26/2022] [Accepted: 04/27/2022] [Indexed: 11/19/2022] Open
Abstract
Inferring protein-protein interactions from sequences is an important task in computational biology. Recent methods based on Direct Coupling Analysis (DCA) or Mutual Information (MI) make it possible to find interaction partners among paralogs of two protein families. Does successful inference mainly rely on correlations from structural contacts or from phylogeny, or both? Do these two types of signal combine constructively or hinder each other? To address these questions, we generate and analyze synthetic data produced using a minimal model that allows us to control the amounts of structural constraints and phylogeny. We show that correlations from these two sources combine constructively to increase the performance of partner inference by DCA or MI. Furthermore, signal from phylogeny can rescue partner inference when signal from contacts becomes less informative, including in the realistic case where inter-protein contacts are restricted to a small subset of sites. We also demonstrate that DCA-inferred couplings between non-contact pairs of sites improve partner inference in the presence of strong phylogeny, while deteriorating it otherwise. Moreover, restricting to non-contact pairs of sites preserves inference performance in the presence of strong phylogeny. In a natural data set, as well as in realistic synthetic data based on it, we find that non-contact pairs of sites contribute positively to partner inference performance, and that restricting to them preserves performance, evidencing an important role of phylogeny.
Affiliation(s)
- Andonis Gerardos
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Nicola Dietler
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Anne-Florence Bitbol
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
31
Epistatic models predict mutable sites in SARS-CoV-2 proteins and epitopes. Proc Natl Acad Sci U S A 2022; 119:e2113118119. [PMID: 35022216 PMCID: PMC8795541 DOI: 10.1073/pnas.2113118119] [Citation(s) in RCA: 46] [Impact Index Per Article: 23.0] [Accepted: 12/13/2021] [Indexed: 12/21/2022] Open
Abstract
During the COVID pandemic, new severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants emerge and spread, some being of major concern due to their increased infectivity or capacity to reduce vaccine efficiency. Anticipating mutations, which might give rise to new variants, would be of great interest. We construct sequence models predicting how mutable SARS-CoV-2 positions are, using a single SARS-CoV-2 sequence and databases of other coronaviruses. Predictions are tested against available mutagenesis data and the observed variability of SARS-CoV-2 proteins. Interestingly, predictions agree increasingly with observations, as more SARS-CoV-2 sequences become available. Combining predictions with immunological data, we find an overrepresentation of mutations in current variants of concern. The approach may become relevant for potential outbreaks of future viral diseases. The emergence of new variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a major concern given their potential impact on the transmissibility and pathogenicity of the virus as well as the efficacy of therapeutic interventions. Here, we predict the mutability of all positions in SARS-CoV-2 protein domains to forecast the appearance of unseen variants. Using sequence data from other coronaviruses, preexisting to SARS-CoV-2, we build statistical models that not only capture amino acid conservation but also more complex patterns resulting from epistasis. We show that these models are notably superior to conservation profiles in estimating the already observable SARS-CoV-2 variability. In the receptor binding domain of the spike protein, we observe that the predicted mutability correlates well with experimental measures of protein stability and that both are reliable mutability predictors (receiver operating characteristic areas under the curve ∼0.8). 
Most interestingly, we observe an increasing agreement between our model and the observed variability as more data become available over time, proving the anticipatory capacity of our model. When combined with data concerning the immune response, our approach identifies positions where current variants of concern are highly overrepresented. These results could assist studies on viral evolution and future viral outbreaks and, in particular, guide the exploration and anticipation of potentially harmful future SARS-CoV-2 variants.
32
McGee F, Hauri S, Novinger Q, Vucetic S, Levy RM, Carnevale V, Haldane A. The generative capacity of probabilistic protein sequence models. Nat Commun 2021; 12:6302. [PMID: 34728624 PMCID: PMC8563988 DOI: 10.1038/s41467-021-26529-9] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 01/11/2021] [Accepted: 09/23/2021] [Indexed: 01/10/2023] Open
Abstract
Potts models and variational autoencoders (VAEs) have recently gained popularity as generative protein sequence models (GPSMs) to explore fitness landscapes and predict mutation effects. Despite encouraging results, current model evaluation metrics leave unclear whether GPSMs faithfully reproduce the complex multi-residue mutational patterns observed in natural sequences due to epistasis. Here, we develop a set of sequence statistics to assess the "generative capacity" of three current GPSMs: the pairwise Potts Hamiltonian, the VAE, and the site-independent model. We show that the Potts model's generative capacity is largest, as the higher-order mutational statistics generated by the model agree with those observed for natural sequences, while the VAE's lies between the Potts and site-independent models. Importantly, our work provides a new framework for evaluating and interpreting GPSM accuracy which emphasizes the role of higher-order covariation and epistasis, with broader implications for probabilistic sequence models in general.
Affiliation(s)
- Francisco McGee
- Center for Biophysics and Computational Biology, Temple University, Philadelphia, 19122, USA
- Institute for Computational Molecular Science, Temple University, Philadelphia, 19122, USA
- Department of Biology, Temple University, Philadelphia, 19122, USA
- Sandro Hauri
- Center for Hybrid Intelligence, Temple University, Philadelphia, 19122, USA
- Department of Computer & Information Sciences, Temple University, Philadelphia, 19122, USA
- Quentin Novinger
- Institute for Computational Molecular Science, Temple University, Philadelphia, 19122, USA
- Department of Computer & Information Sciences, Temple University, Philadelphia, 19122, USA
- Slobodan Vucetic
- Center for Hybrid Intelligence, Temple University, Philadelphia, 19122, USA
- Department of Computer & Information Sciences, Temple University, Philadelphia, 19122, USA
- Ronald M Levy
- Center for Biophysics and Computational Biology, Temple University, Philadelphia, 19122, USA
- Department of Biology, Temple University, Philadelphia, 19122, USA
- Department of Physics, Temple University, Philadelphia, 19122, USA
- Department of Chemistry, Temple University, Philadelphia, 19122, USA
- Vincenzo Carnevale
- Institute for Computational Molecular Science, Temple University, Philadelphia, 19122, USA
- Department of Biology, Temple University, Philadelphia, 19122, USA
- Allan Haldane
- Center for Biophysics and Computational Biology, Temple University, Philadelphia, 19122, USA
- Department of Chemistry, Temple University, Philadelphia, 19122, USA
33
Defresne M, Barbe S, Schiex T. Protein Design with Deep Learning. Int J Mol Sci 2021; 22:11741. [PMID: 34769173 PMCID: PMC8584038 DOI: 10.3390/ijms222111741] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 09/30/2021] [Revised: 10/23/2021] [Accepted: 10/26/2021] [Indexed: 12/21/2022] Open
Abstract
Computational Protein Design (CPD) has produced impressive results for engineering new proteins, resulting in a wide variety of applications. In the past few years, various efforts have aimed at replacing or improving existing design methods using Deep Learning technology to leverage the amount of publicly available protein data. Deep Learning (DL) is a very powerful tool to extract patterns from raw data, provided that data are formatted as mathematical objects and the architecture processing them is well suited to the targeted problem. In the case of protein data, specific representations are needed for both the amino acid sequence and the protein structure in order to capture respectively 1D and 3D information. As no consensus has been reached about the most suitable representations, this review describes the representations used so far, discusses their strengths and weaknesses, and details their associated DL architecture for design and related tasks.
Affiliation(s)
- Marianne Defresne
- Toulouse Biotechnology Institute, Université de Toulouse, CNRS, INRAE, INSA, ANITI, 31077 Toulouse, France
- Université Fédérale de Toulouse, ANITI, INRAE, UR 875, 31326 Toulouse, France
- Sophie Barbe
- Toulouse Biotechnology Institute, Université de Toulouse, CNRS, INRAE, INSA, ANITI, 31077 Toulouse, France
- Thomas Schiex
- Université Fédérale de Toulouse, ANITI, INRAE, UR 875, 31326 Toulouse, France