1
Rennie ML, Oliver MR. Emerging frontiers in protein structure prediction following the AlphaFold revolution. J R Soc Interface 2025; 22:20240886. [PMID: 40233800 PMCID: PMC11999738 DOI: 10.1098/rsif.2024.0886] [Received: 12/12/2024] [Revised: 02/04/2025] [Accepted: 03/10/2025] [Indexed: 04/17/2025]
Abstract
Models of protein structures enable molecular understanding of biological processes. Current protein structure prediction tools lie at the interface of biology, chemistry and computer science. Millions of protein structure models have been generated in a very short space of time through a revolution in protein structure prediction driven by deep learning, led by AlphaFold. This has provided a wealth of new structural information. Interpreting these predictions is critical to determining where and when this information is useful. But proteins are not static nor do they act alone, and structures of proteins interacting with other proteins and other biomolecules are critical to a complete understanding of their biological function at the molecular level. This review focuses on the application of state-of-the-art protein structure prediction to these advanced applications. We also suggest a set of guidelines for reporting AlphaFold predictions.

2
Posfai A, Zhou J, McCandlish DM, Kinney JB. Gauge fixing for sequence-function relationships. PLoS Comput Biol 2025; 21:e1012818. [PMID: 40111986 PMCID: PMC11957564 DOI: 10.1371/journal.pcbi.1012818] [Received: 09/16/2024] [Accepted: 01/22/2025] [Indexed: 03/22/2025]
Abstract
Quantitative models of sequence-function relationships are ubiquitous in computational biology, e.g., for modeling the DNA binding of transcription factors or the fitness landscapes of proteins. Interpreting these models, however, is complicated by the fact that the values of model parameters can often be changed without affecting model predictions. Before the values of model parameters can be meaningfully interpreted, one must remove these degrees of freedom (called "gauge freedoms" in physics) by imposing additional constraints (a process called "fixing the gauge"). However, strategies for fixing the gauge of sequence-function relationships have received little attention. Here we derive an analytically tractable family of gauges for a large class of sequence-function relationships. These gauges are derived in the context of models with all-order interactions, but an important subset of these gauges can be applied to diverse types of models, including additive models, pairwise-interaction models, and models with higher-order interactions. Many commonly used gauges are special cases of gauges within this family. We demonstrate the utility of this family of gauges by showing how different choices of gauge can be used both to explore complex activity landscapes and to reveal simplified models that are approximately correct within localized regions of sequence space. The results provide practical gauge-fixing strategies and demonstrate the utility of gauge-fixing for model exploration and interpretation.
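As a minimal sketch of what "fixing the gauge" means (an illustration under stated assumptions, not the authors' code), consider an additive model with a constant term and one parameter per position and character. Adding a constant to all parameters at one position and subtracting it from the constant term leaves every prediction unchanged. The zero-sum gauge used below is one commonly used choice; the function names `predict` and `fix_zero_sum_gauge`, the alphabet, and the sequence length are hypothetical.

```python
import itertools
import random

ALPHABET = "ACGT"  # assumed alphabet for this toy example
L = 3              # assumed sequence length


def predict(theta0, theta, seq):
    """Additive model: constant term plus one parameter per (position, character)."""
    return theta0 + sum(theta[l][ALPHABET.index(c)] for l, c in enumerate(seq))


def fix_zero_sum_gauge(theta0, theta):
    """Shift each position's parameters to zero mean and absorb the means into theta0."""
    new_theta0 = theta0
    new_theta = []
    for row in theta:
        mean = sum(row) / len(row)
        new_theta0 += mean
        new_theta.append([v - mean for v in row])
    return new_theta0, new_theta


random.seed(0)
theta0 = random.gauss(0, 1)
theta = [[random.gauss(0, 1) for _ in ALPHABET] for _ in range(L)]
fixed0, fixed = fix_zero_sum_gauge(theta0, theta)

# Largest change in prediction over all sequences (zero up to rounding error).
max_diff = max(
    abs(predict(theta0, theta, s) - predict(fixed0, fixed, s))
    for s in itertools.product(ALPHABET, repeat=L)
)
print(max_diff)
```

After gauge fixing, the parameters at each position sum to zero, so each value can be read as a deviation from the position's average effect, while the model's predictions are the same as before.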
Affiliation(s)
- Anna Posfai
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- Juannan Zhou
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
  - Department of Biology, University of Florida, Gainesville, Florida, United States of America
- David M. McCandlish
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- Justin B. Kinney
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America

3
Posfai A, McCandlish DM, Kinney JB. Symmetry, gauge freedoms, and the interpretability of sequence-function relationships. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2024.05.12.593774. [PMID: 38798625 PMCID: PMC11118426 DOI: 10.1101/2024.05.12.593774] [Indexed: 05/29/2024]
Abstract
Quantitative models that describe how biological sequences encode functional activities are ubiquitous in modern biology. One important aspect of these models is that they commonly exhibit gauge freedoms, i.e., directions in parameter space that do not affect model predictions. In physics, gauge freedoms arise when physical theories are formulated in ways that respect fundamental symmetries. However, the connections that gauge freedoms in models of sequence-function relationships have to the symmetries of sequence space have yet to be systematically studied. In this work we study the gauge freedoms of models that respect a specific symmetry of sequence space: the group of position-specific character permutations. We find that gauge freedoms arise when model parameters transform under redundant irreducible matrix representations of this group. Based on this finding, we describe an "embedding distillation" procedure that enables both analytic calculation of the number of independent gauge freedoms and efficient computation of a sparse basis for the space of gauge freedoms. We also study how parameter transformation behavior affects parameter interpretability. We find that in many (and possibly all) nontrivial models, the ability to interpret individual model parameters as quantifying intrinsic allelic effects requires that gauge freedoms be present. This finding establishes an incompatibility between two distinct notions of parameter interpretability. Our work thus advances the understanding of symmetries, gauge freedoms, and parameter interpretability in models of sequence-function relationships.
Affiliation(s)
- Anna Posfai
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory
- Justin B. Kinney
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory

4
Caredda F, Pagnani A. Direct coupling analysis and the attention mechanism. BMC Bioinformatics 2025; 26:41. [PMID: 39915710 PMCID: PMC11804077 DOI: 10.1186/s12859-025-06062-y] [Received: 11/05/2024] [Accepted: 01/22/2025] [Indexed: 02/09/2025]
Abstract
Proteins are involved in nearly all cellular functions, encompassing roles in transport, signaling, enzymatic activity, and more. Their functionalities crucially depend on their complex three-dimensional arrangement. For this reason, being able to predict their structure from the amino acid sequence has been and still is a phenomenal computational challenge that the introduction of AlphaFold solved with unprecedented accuracy. However, the inherent complexity of AlphaFold's architectures makes it challenging to understand the rules that ultimately shape the protein's predicted structure. This study investigates a single-layer unsupervised model based on the attention mechanism. More precisely, we explore a Direct Coupling Analysis (DCA) method that mimics the attention mechanism of several popular Transformer architectures, such as AlphaFold itself. The model's parameters, notably fewer than those in standard DCA-based algorithms, can be directly used for extracting structural determinants such as the contact map of the protein family under study. Additionally, the functional form of the energy function of the model enables us to deploy a multi-family learning strategy, allowing us to effectively integrate information across multiple protein families, whereas standard DCA algorithms are typically limited to single protein families. Finally, we implemented a generative version of the model using an autoregressive architecture, capable of efficiently generating new proteins in silico.
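The idea of an attention-style parameterisation of DCA couplings can be sketched roughly as follows. This is an illustrative assumption about the functional form, not the authors' implementation: each coupling block is written as a sum over heads of a position-position attention weight times a shared character-character matrix, and the blocks are then collapsed to contact scores with a Frobenius norm. All names (`attention_couplings`, `frobenius_scores`) and sizes are hypothetical, and the sketch neither masks the diagonal nor symmetrises the result.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_couplings(Q, K, V):
    """Build J[i][j][a][b] = sum_h softmax_j(Q_h[i] . K_h[j]) * V_h[a][b]."""
    H, L, q = len(Q), len(Q[0]), len(V[0])
    J = [[[[0.0] * q for _ in range(q)] for _ in range(L)] for _ in range(L)]
    for h in range(H):
        for i in range(L):
            scores = [sum(x * y for x, y in zip(Q[h][i], K[h][j])) for j in range(L)]
            attn = softmax(scores)
            for j in range(L):
                for a in range(q):
                    for b in range(q):
                        J[i][j][a][b] += attn[j] * V[h][a][b]
    return J


def frobenius_scores(J):
    """Collapse each q x q coupling block to one score per position pair."""
    n = len(J)
    return [[math.sqrt(sum(v * v for row in J[i][j] for v in row)) for j in range(n)]
            for i in range(n)]


random.seed(1)
H, L, d, q = 2, 6, 3, 4  # heads, positions, embedding size, alphabet size (toy values)
Q = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(L)] for _ in range(H)]
K = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(L)] for _ in range(H)]
V = [[[random.gauss(0, 1) for _ in range(q)] for _ in range(q)] for _ in range(H)]

J = attention_couplings(Q, K, V)
scores = frobenius_scores(J)

# Coupling parameters only (fields ignored): factored form vs. one q x q block per pair.
n_factored = H * (2 * L * d + q * q)
n_full = (L * (L - 1) // 2) * q * q
print(n_factored, n_full)
```

Because the character-character matrices are shared across positions, the number of coupling parameters in this factored form grows linearly with sequence length rather than quadratically, which is one way to read the abstract's remark about a smaller parameter count.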
Affiliation(s)
- Francesco Caredda
  - DISAT, Politecnico di Torino, Corso Duca degli Abruzzi, I-10129, Torino, Italy
- Andrea Pagnani
  - DISAT, Politecnico di Torino, Corso Duca degli Abruzzi, I-10129, Torino, Italy
  - Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060, Candiolo, Italy
  - INFN, Sezione di Torino, Via Pietro Giuria, I-10125, Torino, Italy

5
Plata G, Srinivasan K, Krishnamurthy M, Herron L, Dixit P. Designing host-associated microbiomes using the consumer/resource model. mSystems 2025; 10:e0106824. [PMID: 39651880 PMCID: PMC11748559 DOI: 10.1128/msystems.01068-24] [Received: 08/22/2024] [Accepted: 11/06/2024] [Indexed: 12/18/2024]
Abstract
A key step toward rational microbiome engineering is in silico sampling of realistic microbial communities that correspond to desired host phenotypes, and vice versa. This remains challenging due to a lack of generative models that simultaneously capture compositions of host-associated microbiomes and host phenotypes. To that end, we present a generative model based on the mechanistic consumer/resource (C/R) framework. In the model, variation in microbial ecosystem composition arises due to differences in the availability of effective resources (inferred latent variables), while species' resource preferences remain conserved. Simultaneously, the latent variables are used to model phenotypic states of hosts. In silico microbiomes generated by our model accurately reproduce universal and dataset-specific statistics of bacterial communities. The model allows us to address three salient questions in host-associated microbial ecologies: (i) which host phenotypes maximally constrain the composition of the host-associated microbiomes? (ii) how context-specific are phenotype/microbiome associations, and (iii) what are plausible microbiome compositions that correspond to desired host phenotypes? Our approach aids the analysis and design of microbial communities associated with host phenotypes of interest.
IMPORTANCE Generative models are extremely popular in modern biology. They have been used to model the variation of protein sequences, entire genomes, and RNA sequencing profiles. Importantly, generative models have been used to extrapolate and interpolate to unobserved regimes of data to design biological systems with desired properties. For example, there has been a boom in machine-learning models aiding in the design of proteins with user-specified structures or functions. Host-associated microbiomes play important roles in animal health and disease, as well as the productivity and environmental footprint of livestock species. However, there are no generative models of host-associated microbiomes. One chief reason is that off-the-shelf machine-learning models are data hungry, and microbiome studies usually deal with large variability and small sample sizes. Moreover, microbiome compositions are heavily context dependent, with characteristics of the host and the abiotic environment leading to distinct patterns in host-microbiome associations. Consequently, off-the-shelf generative modeling has not been successfully applied to microbiomes.
To address these challenges, we develop a generative model for host-associated microbiomes derived from the consumer/resource (C/R) framework. This derivation allows us to fit the model to readily available cross-sectional microbiome profile data. Using data from three animal hosts, we show that this mechanistic generative model has several salient features: the model identifies a latent space that represents variables that determine the growth and, therefore, relative abundances of microbial species. Probabilistic modeling of variation in this latent space allows us to generate realistic in silico microbial communities. The model can assign probabilities to microbiomes, thereby allowing us to discriminate between dissimilar ecosystems. Importantly, the model predictively captures host-associated microbiomes and the corresponding hosts' phenotypes, enabling the design of microbial communities associated with user-specified host characteristics.
Affiliation(s)
- Germán Plata
  - Computational Sciences, BiomEdit, LLC., Fishers, Indiana, USA
- Karthik Srinivasan
  - Department of Biomedical Engineering, Yale University, New Haven, Connecticut, USA
- Lukas Herron
  - Department of Physics, University of Florida, Gainesville, Florida, USA
- Purushottam Dixit
  - Department of Biomedical Engineering, Yale University, New Haven, Connecticut, USA
  - Systems Biology Institute, Yale University, West Haven, Connecticut, USA

6
Sternke M, Tripp KW, Barrick D. Protein stability is determined by single-site bias rather than pairwise covariance. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.01.09.632118. [PMID: 39868188 PMCID: PMC11760396 DOI: 10.1101/2025.01.09.632118] [Indexed: 01/28/2025]
Abstract
The biases revealed in protein sequence alignments have been shown to provide information related to protein structure, stability, and function. For example, sequence biases at individual positions can be used to design consensus proteins that are often more stable than naturally occurring counterparts. Likewise, correlations between pairs of residues can be used to predict protein structures. Recent work using Potts models shows that, together, single-site biases and pair correlations lead to improved predictions of protein fitness, activity, and stability. Here we use a Potts model to design groups of protein sequences with different amounts of single-site biases and pair correlations, and determine the thermodynamic stabilities of a representative set of sequences from each group. Surprisingly, sequences excluding pair correlations maximize stability, whereas sequences that maximize pair correlations are less stable, suggesting that pair correlations contribute to another aspect of protein fitness. Consistent with this interpretation, we find that for adenylate kinase, enzyme activity is greatly increased by maximizing pair correlations. The finding that elimination of covariant residue pairs increases protein stability suggests a route to enhance stability of designed proteins; indeed, this strategy produces hyperstable homeodomain and adenylate kinase proteins that retain significant activity.
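The distinction between single-site biases and pair correlations can be made concrete with a small sketch of how a Potts-style score splits into two sums. This is a generic illustration, not the authors' pipeline: the parameter values are made up, the function name `potts_terms` is hypothetical, and sign conventions (score versus energy) vary between implementations.

```python
def potts_terms(seq, h, J):
    """Return (single_site, pairwise) contributions for an integer-encoded sequence.

    h[i][a] is the single-site term for character a at position i;
    J[i][j][a][b] is the pair term for characters a, b at positions i < j.
    """
    single = sum(h[i][a] for i, a in enumerate(seq))
    pair = sum(
        J[i][j][seq[i]][seq[j]]
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
    )
    return single, pair


# Toy parameters: 3 positions, 2 characters (values are arbitrary).
h = [[0.5, -0.5], [1.0, 0.0], [0.0, 2.0]]
J = [[[[0.0, 0.0], [0.0, 0.0]] for _ in range(3)] for _ in range(3)]
J[0][2][1][1] = 0.25  # a single nonzero coupling between positions 0 and 2

single, pair = potts_terms([1, 0, 1], h, J)
print(single, pair)
```

Sequences can then be ranked by either term alone or by their sum, which is the kind of separation the abstract describes when it contrasts designs that emphasise single-site biases with designs that emphasise pair correlations.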
Affiliation(s)
- Matt Sternke
  - T.C. Jenkins Department of Biophysics, Johns Hopkins University, 3400 N. Charles St., Baltimore MD 21219 USA
  - Current address: Protein Design and Informatics, GSK, 1250 South Collegeville Rd, Collegeville, PA 19426 USA
- Katherine W. Tripp
  - T.C. Jenkins Department of Biophysics, Johns Hopkins University, 3400 N. Charles St., Baltimore MD 21219 USA
- Doug Barrick
  - T.C. Jenkins Department of Biophysics, Johns Hopkins University, 3400 N. Charles St., Baltimore MD 21219 USA

7
Hermans P, Tsishyn M, Schwersensky M, Rooman M, Pucci F. Exploring Evolution to Uncover Insights Into Protein Mutational Stability. Mol Biol Evol 2025; 42:msae267. [PMID: 39786559 PMCID: PMC11721782 DOI: 10.1093/molbev/msae267] [Received: 05/28/2024] [Revised: 11/27/2024] [Accepted: 11/28/2024] [Indexed: 01/12/2025]
Abstract
Determining the impact of mutations on the thermodynamic stability of proteins is essential for a wide range of applications such as rational protein design and genetic variant interpretation. Since protein stability is a major driver of evolution, evolutionary data are often used to guide stability predictions. Many state-of-the-art stability predictors extract evolutionary information from multiple sequence alignments of proteins homologous to a query protein, and leverage it to predict the effects of mutations on protein stability. To evaluate the power and the limitations of such methods, we used the massive amount of stability data recently obtained by deep mutational scanning to study how best to construct multiple sequence alignments and optimally extract evolutionary information from them. We tested different evolutionary models and found that, unexpectedly, independent-site models achieve similar accuracy to more complex epistatic models. A detailed analysis of the latter models suggests that their inference often results in noisy couplings, which do not appear to add predictive power over the independent-site contribution, at least in the context of stability prediction. Interestingly, by combining any of the evolutionary features with a simple structural feature, the relative solvent accessibility of the mutated residue, we achieved similar prediction accuracy to supervised, machine learning-based, protein stability change predictors. Our results provide new insights into the relationship between protein evolution and stability, and show how evolutionary information can be exploited to improve the performance of mutational stability prediction.
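An independent-site model of the kind mentioned above can be sketched in a few lines. The version below is a simplified assumption, not the authors' method: it estimates per-column amino acid frequencies from an alignment with a uniform pseudocount and scores a substitution by the log ratio of mutant to wild-type frequency. Sequence weighting, gap handling, and alignment construction are all omitted, and the names `site_frequencies` and `log_odds` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def site_frequencies(msa, pseudocount=1.0):
    """Per-column amino acid frequencies with a uniform pseudocount (gaps ignored)."""
    freqs = []
    for i in range(len(msa[0])):
        counts = Counter(seq[i] for seq in msa if seq[i] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        freqs.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return freqs


def log_odds(freqs, position, wild_type, mutant):
    """Independent-site score: log of mutant frequency over wild-type frequency."""
    return math.log(freqs[position][mutant] / freqs[position][wild_type])


# Toy alignment of four sequences, three columns.
msa = ["ACD", "ACE", "ACD", "GCD"]
freqs = site_frequencies(msa)
score = log_odds(freqs, 0, "A", "G")
print(score)
```

A negative score means the mutant residue is rarer than the wild type in that column. In practice such a score would be combined with other features, for example the relative solvent accessibility mentioned in the abstract, before being compared with measured stability changes.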
Affiliation(s)
- Pauline Hermans
  - Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels 1050, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, Brussels 1050, Belgium
- Matsvei Tsishyn
  - Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels 1050, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, Brussels 1050, Belgium
- Martin Schwersensky
  - Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels 1050, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, Brussels 1050, Belgium
- Marianne Rooman
  - Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels 1050, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, Brussels 1050, Belgium
- Fabrizio Pucci
  - Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels 1050, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, Brussels 1050, Belgium

8
Mazzocato Y, Frasson N, Sample M, Fregonese C, Pavan A, Caregnato A, Simeoni M, Scarso A, Cendron L, Šulc P, Angelini A. Combination of Coevolutionary Information and Supervised Learning Enables Generation of Cyclic Peptide Inhibitors with Enhanced Potency from a Small Data Set. ACS CENTRAL SCIENCE 2024; 10:2242-2252. [PMID: 39735311 PMCID: PMC11672547 DOI: 10.1021/acscentsci.4c01428] [Received: 09/03/2024] [Revised: 10/26/2024] [Accepted: 11/07/2024] [Indexed: 12/31/2024]
Abstract
Computational generation of cyclic peptide inhibitors using machine learning models requires large training data sets that are often difficult to generate experimentally. Here we demonstrated that sequential combination of Random Forest Regression with the pseudolikelihood maximization Direct Coupling Analysis method and Monte Carlo simulation can effectively enhance the design pipeline of cyclic peptide inhibitors of a tumor-associated protease even for small experimental data sets. Further in vitro studies showed that such in silico-evolved cyclic peptides are more potent than the best peptide inhibitors previously developed for this target. Crystal structures of the cyclic peptides in complex with the protease resembled those of protein complexes, with large interaction surfaces, constrained peptide backbones, and multiple inter- and intramolecular interactions, leading to good binding affinity and selectivity.
Affiliation(s)
- Ylenia Mazzocato
  - Department of Molecular Sciences and Nanosystems, Ca’ Foscari University of Venice, Via Torino 155, 30172 Mestre, Italy
- Nicola Frasson
  - Department of Molecular Sciences and Nanosystems, Ca’ Foscari University of Venice, Via Torino 155, 30172 Mestre, Italy
- Matthew Sample
  - School of Molecular Sciences and Centre for Molecular Design and Biomimetics, The Biodesign Institute, Arizona State University, 1001 South McAllister Avenue, Tempe, Arizona 85281, United States
  - School for Engineering of Matter, Transport, and Energy, Arizona State University, Tempe, Arizona 85287, United States
- Cristian Fregonese
  - Department of Molecular Sciences and Nanosystems, Ca’ Foscari University of Venice, Via Torino 155, 30172 Mestre, Italy
- Angela Pavan
  - Department of Biology, University of Padua, Viale G. Colombo 3, 35131 Padua, Italy
- Alberto Caregnato
  - Department of Molecular Sciences and Nanosystems, Ca’ Foscari University of Venice, Via Torino 155, 30172 Mestre, Italy
- Marta Simeoni
  - Department of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University of Venice, Via Torino 155, 30172 Mestre, Italy
  - European Centre for Living Technology (ECLT), Ca’ Bottacin, Dorsoduro 3911, Calle Crosera, 30123 Venice, Italy
- Alessandro Scarso
  - Department of Molecular Sciences and Nanosystems, Ca’ Foscari University of Venice, Via Torino 155, 30172 Mestre, Italy
- Laura Cendron
  - Department of Biology, University of Padua, Viale G. Colombo 3, 35131 Padua, Italy
- Petr Šulc
  - School of Molecular Sciences and Centre for Molecular Design and Biomimetics, The Biodesign Institute, Arizona State University, 1001 South McAllister Avenue, Tempe, Arizona 85281, United States
  - Department of Bioscience − School of Natural Sciences, Technical University of Munich (TUM), Boltzmannstraße 10, 85748 Garching, Germany
- Alessandro Angelini
  - Department of Molecular Sciences and Nanosystems, Ca’ Foscari University of Venice, Via Torino 155, 30172 Mestre, Italy
  - European Centre for Living Technology (ECLT), Ca’ Bottacin, Dorsoduro 3911, Calle Crosera, 30123 Venice, Italy

9
Liu H, Zhuo C, Gao J, Zeng C, Zhao Y. AI-integrated network for RNA complex structure and dynamic prediction. BIOPHYSICS REVIEWS 2024; 5:041304. [PMID: 39512332 PMCID: PMC11540444 DOI: 10.1063/5.0237319] [Received: 09/04/2024] [Accepted: 10/15/2024] [Indexed: 11/15/2024]
Abstract
RNA complexes are essential components in many cellular processes. The functions of these complexes are linked to their tertiary structures, which are shaped by detailed interface information, such as binding sites, interface contacts, and dynamic conformational changes. Network-based approaches have been widely used to analyze RNA complex structures. With their roots in graph theory, these methods have a long history of providing insight into the static and dynamic properties of RNA molecules. These approaches have been effective in identifying functional binding sites and analyzing the dynamic behavior of RNA complexes. Recently, the advent of artificial intelligence (AI) has brought transformative changes to the field. These technologies have been increasingly applied to studying RNA complex structures, providing new avenues for understanding the complex interactions within RNA complexes. By integrating AI with traditional network analysis methods, researchers can build more accurate models of RNA complex structures, predict their dynamic behaviors, and even design RNA-based inhibitors. In this review, we introduce the integration of network-based methodologies with AI techniques to enhance the understanding of RNA complex structures. We examine how these advanced computational tools can be used to model and analyze the detailed interface information and dynamic behaviors of RNA molecules. Additionally, we explore potential future directions for how AI-integrated networks can aid in modeling and analyzing RNA complex structures.
Affiliation(s)
- Haoquan Liu
  - Institute of Biophysics and Department of Physics, Central China Normal University, Wuhan 430079, China
- Chen Zhuo
  - Institute of Biophysics and Department of Physics, Central China Normal University, Wuhan 430079, China
- Jiaming Gao
  - Institute of Biophysics and Department of Physics, Central China Normal University, Wuhan 430079, China
- Chengwei Zeng
  - Institute of Biophysics and Department of Physics, Central China Normal University, Wuhan 430079, China
- Yunjie Zhao
  - Institute of Biophysics and Department of Physics, Central China Normal University, Wuhan 430079, China

10
Li Y, Zhou Y, Yuan J, Ye F, Gu Q. CryoSTAR: leveraging structural priors and constraints for cryo-EM heterogeneous reconstruction. Nat Methods 2024; 21:2318-2326. [PMID: 39472738 DOI: 10.1038/s41592-024-02486-1] [Received: 12/06/2023] [Accepted: 09/25/2024] [Indexed: 12/07/2024]
Abstract
Resolving conformational heterogeneity in cryogenic electron microscopy datasets remains an important challenge in structural biology. Previous methods have often been restricted to working exclusively on volumetric densities, neglecting the potential of incorporating any preexisting structural knowledge as prior or constraints. Here we present cryoSTAR, which harnesses atomic model information as structural regularization to elucidate such heterogeneity. Our method uniquely outputs both coarse-grained models and density maps, showcasing the molecular conformational changes at different levels. Validated against four diverse experimental datasets, spanning large complexes, a membrane protein and a small single-chain protein, our results consistently demonstrate an efficient and effective solution to conformational heterogeneity with minimal human bias. By integrating atomic model insights with cryogenic electron microscopy data, cryoSTAR represents a meaningful step forward, paving the way for a deeper understanding of dynamic biological processes.
Affiliation(s)
- Yilai Li
  - ByteDance Research, San Jose, CA, USA
- Yi Zhou
  - ByteDance Research, Shanghai, China
- Fei Ye
  - ByteDance Research, Shanghai, China

11
Zhang C, Tang D, Han C, Gou Y, Chen M, Huang X, Liu D, Zhao M, Xiao L, Xiao Q, Peng D, Xue Y. GPS-pPLM: A Language Model for Prediction of Prokaryotic Phosphorylation Sites. Cells 2024; 13:1854. [PMID: 39594603 PMCID: PMC11593113 DOI: 10.3390/cells13221854] [Received: 09/23/2024] [Revised: 11/06/2024] [Accepted: 11/07/2024] [Indexed: 11/28/2024]
Abstract
In the prokaryotic kingdom, protein phosphorylation serves as one of the most important posttranslational modifications (PTMs) and is involved in orchestrating a broad spectrum of biological processes. Here, we report an updated online server named the group-based prediction system for prokaryotic phosphorylation language model (GPS-pPLM), used for predicting phosphorylation sites (p-sites) in prokaryotes. For model training, two deep learning methods, a transformer and a deep neural network, were employed, and a total of 10 sequence features and contextual features were integrated. Using 44,839 nonredundant p-sites in 16,041 proteins from 95 prokaryotes, two general models for the prediction of O-phosphorylation and N-phosphorylation were first pretrained and then fine-tuned to construct 6 predictors specific for each phosphorylatable residue type as well as 134 species-specific predictors. Compared with other existing tools, the GPS-pPLM exhibits higher accuracy in predicting prokaryotic O-phosphorylation p-sites. Protein sequences in FASTA format or UniProt accession numbers can be submitted by users, and the predicted results are displayed in tabular form. In addition, we annotate the predicted p-sites with knowledge from 22 public resources, including experimental evidence, 3D structures, and disorder tendencies. The online service of the GPS-pPLM is freely accessible for academic research.
Collapse
Affiliation(s)
- Chi Zhang
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
| | - Dachao Tang
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
| | - Cheng Han
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
| | - Yujie Gou
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
| | - Miaomiao Chen
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
| | - Xinhe Huang
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
| | - Dan Liu
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
| | - Miaoying Zhao
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
- Leming Xiao
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
- Qiang Xiao
- School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China;
- Di Peng
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
- Yu Xue
- Department of Bioinformatics and Systems Biology, MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; (C.Z.); (D.T.); (C.H.); (Y.G.); (M.C.); (X.H.); (D.L.); (M.Z.); (L.X.)
|
12
|
Zhang Z, Wayment-Steele HK, Brixi G, Wang H, Kern D, Ovchinnikov S. Protein language models learn evolutionary statistics of interacting sequence motifs. Proc Natl Acad Sci U S A 2024; 121:e2406285121. [PMID: 39467119 PMCID: PMC11551344 DOI: 10.1073/pnas.2406285121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2024] [Accepted: 09/03/2024] [Indexed: 10/30/2024] Open
Abstract
Protein language models (pLMs) have emerged as potent tools for predicting and designing protein structure and function, and the degree to which these models fundamentally understand the inherent biophysics of protein structure stands as an open question. Motivated by a finding that pLM-based structure predictors erroneously predict nonphysical structures for protein isoforms, we investigated the nature of sequence context needed for contact predictions in the pLM Evolutionary Scale Modeling (ESM-2). We demonstrate by use of a "categorical Jacobian" calculation that ESM-2 stores statistics of coevolving residues, analogously to simpler modeling approaches like Markov Random Fields and Multivariate Gaussian models. We further investigated how ESM-2 "stores" information needed to predict contacts by comparing sequence masking strategies, and found that providing local windows of sequence information allowed ESM-2 to best recover predicted contacts. This suggests that pLMs predict contacts by storing motifs of pairwise contacts. Our investigation highlights the limitations of current pLMs and underscores the importance of understanding the underlying mechanisms of these models.
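The "categorical Jacobian" idea mentioned in the abstract can be illustrated with a small sketch: substitute every amino acid at every position, record how the model's output logits change at all positions, and collapse the resulting tensor into a position-by-position coupling map. The code below is a minimal illustration under stated assumptions, not the authors' implementation. `make_toy_model` is a hypothetical stand-in for a real protein language model such as ESM-2, and the Frobenius-norm reduction in `coupling_map` is one plausible choice; any further normalization is omitted.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids


def one_hot(seq):
    """Encode a sequence as an (L, 20) one-hot matrix."""
    x = np.zeros((len(seq), len(ALPHABET)))
    x[np.arange(len(seq)), [ALPHABET.index(c) for c in seq]] = 1.0
    return x


def categorical_jacobian(logits_fn, seq):
    """Return jac[i, a, j, b]: change in the logit for amino acid b at
    position j when position i is set to amino acid a."""
    x0 = one_hot(seq)
    base = logits_fn(x0)                      # (L, 20) logits for the input
    L, A = x0.shape
    jac = np.zeros((L, A, L, A))
    for i in range(L):
        for a in range(A):
            x = x0.copy()
            x[i] = 0.0
            x[i, a] = 1.0                     # substitute amino acid a at i
            jac[i, a] = logits_fn(x) - base
    return jac


def coupling_map(jac):
    """Collapse the two amino-acid axes with a Frobenius norm and
    symmetrize, giving an (L, L) map of position-position sensitivity."""
    m = np.sqrt((jac ** 2).sum(axis=(1, 3)))
    m = 0.5 * (m + m.T)
    np.fill_diagonal(m, 0.0)
    return m


def make_toy_model(L, pairs, seed=0):
    """Hypothetical stand-in for a language model: logits are a linear
    pairwise function of the input, coupled only at the given pairs."""
    rng = np.random.default_rng(seed)
    A = len(ALPHABET)
    W = np.zeros((L, A, L, A))
    for i, j in pairs:
        w = rng.normal(size=(A, A))
        W[i, :, j, :] = w
        W[j, :, i, :] = w.T

    def logits_fn(x):
        return np.einsum("ia,iajb->jb", x, W)

    return logits_fn


toy_logits = make_toy_model(L=5, pairs=[(0, 3)])
jac = categorical_jacobian(toy_logits, "ACDEF")
contacts = coupling_map(jac)
```

In this toy setting only positions 0 and 3 are coupled, so `contacts` is nonzero only at that pair. With a real model, `logits_fn` would wrap a forward pass, and the same loop would cost one forward pass per position and amino acid.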
Affiliation(s)
- Zhidian Zhang
- Harvard University, Cambridge, MA 02138
- Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139
- Institute of Bioengineering, School of Life Sciences, Ecole polytechnique fédérale de Lausanne, Lausanne VD 1015, Switzerland
- Hannah K. Wayment-Steele
- HHMI, Brandeis University, Waltham, MA 02453
- Department of Biochemistry, Brandeis University, Waltham, MA 02453
- Garyk Brixi
- Harvard College, Harvard University, Cambridge, MA 02138
- Dorothee Kern
- HHMI, Brandeis University, Waltham, MA 02453
- Department of Biochemistry, Brandeis University, Waltham, MA 02453
- Sergey Ovchinnikov
- Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139
- John Harvard Distinguished Science Fellowship, Harvard University, Cambridge, MA 02138
|
13
|
Karagöl A, Karagöl T, Li M, Zhang S. Inhibitory Potential of the Truncated Isoforms on Glutamate Transporter Oligomerization Identified by Computational Analysis of Gene-Centric Isoform Maps. Pharm Res 2024; 41:2173-2187. [PMID: 39487385 PMCID: PMC11599315 DOI: 10.1007/s11095-024-03786-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2024] [Accepted: 10/14/2024] [Indexed: 11/04/2024]
Abstract
OBJECTIVE: Glutamate transporters play a key role in central nervous system physiology by maintaining excitatory neurotransmitter homeostasis. Biological assemblies of the transporters, consisting of cyclic homotrimers, emerge as a crucial aspect of glutamate transporter modulation. Hence targeting heteromerization promises an effective approach for modulator design. On the other hand, the dynamic nature of transcription allows for the generation of transporter isoforms in structurally distinct manners.
METHODS: The potential isoforms were identified through the analysis of computationally generated gene-centric isoform maps. The conserved features of isoform sequences were revealed by computational chemistry methods and subsequent structural analysis of AlphaFold2 predictions. Truncated isoforms were further subjected to a wide range of docking analyses, 50 ns molecular dynamics simulations, and evolutionary coupling analyses.
RESULTS: Energetic landscapes of isoform-canonical transporter complexes suggested an inhibitory potential of truncated isoforms on glutamate transporter bio-assembly. Moreover, isoforms that mimic the trimerization domain (in particular, TM2 helices) exhibited stronger interactions with canonical transporters, underscoring the role of transmembrane helices in isoform interactions. Additionally, self-assembly dynamics observed in truncated isoforms mimicking canonical TM5 helices indicate a potential protective role against unwanted interactions with canonical transporters.
CONCLUSION: Our computational studies on glutamate transporters offer insights into the roles of alternative splicing on protein interactions and identify potential drug targets for physiological or pathological processes.
Affiliation(s)
- Alper Karagöl
- Istanbul University Istanbul Medical Faculty, Istanbul, Turkey
- Taner Karagöl
- Istanbul University Istanbul Medical Faculty, Istanbul, Turkey
- Mengke Li
- State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic and Developmental Sciences, and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
- Shuguang Zhang
- Laboratory of Molecular Architecture, Media Lab, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA, 02139, USA.
|
14
|
Shimagaki KS, Barton JP. Efficient epistasis inference via higher-order covariance matrix factorization. bioRxiv 2024:2024.10.14.618287. [PMID: 39464126 PMCID: PMC11507688 DOI: 10.1101/2024.10.14.618287] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/29/2024]
Abstract
Epistasis can profoundly influence evolutionary dynamics. Temporal genetic data, consisting of sequences sampled repeatedly from a population over time, provides a unique resource to understand how epistasis shapes evolution. However, detecting epistatic interactions from sequence data is technically challenging. Existing methods for identifying epistasis are computationally demanding, limiting their applicability to real-world data. Here, we present a novel computational method for inferring epistasis that significantly reduces computational costs without sacrificing accuracy. We validated our approach in simulations and applied it to study HIV-1 evolution over multiple years in a data set of 16 individuals. There we observed a strong excess of negative epistatic interactions between beneficial mutations, especially mutations involved in immune escape. Our method is general and could be used to characterize epistasis in other large data sets.
Affiliation(s)
- Kai S. Shimagaki
- Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, USA
- Department of Physics and Astronomy, University of Pittsburgh, USA
- John P. Barton
- Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, USA
- Department of Physics and Astronomy, University of Pittsburgh, USA
|
15
|
Chen WC, Zhou J, McCandlish DM. Density estimation for ordinal biological sequences and its applications. Phys Rev E 2024; 110:044408. [PMID: 39562961 PMCID: PMC11605730 DOI: 10.1103/physreve.110.044408] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2024] [Accepted: 10/03/2024] [Indexed: 11/21/2024]
Abstract
Biological sequences do not come at random. Instead, they appear with particular frequencies that reflect properties of the associated system or phenomenon. Knowing how biological sequences are distributed in sequence space is thus a natural first step toward understanding the underlying mechanisms. Here we propose a method for inferring the probability distribution from which a sample of biological sequences were drawn for the case where the sequences are composed of elements that admit a natural ordering. Our method is based on Bayesian field theory, a physics-based machine learning approach, and can be regarded as a nonparametric extension of the traditional maximum entropy estimate. As an example, we use it to analyze the aneuploidy data pertaining to gliomas from The Cancer Genome Atlas project. In addition, we demonstrate two follow-up analyses that can be performed with the resulting probability distribution. One of them is to investigate the associations among the sequence sites. This provides a way to infer the governing biological grammar. The other is to study the global geometry of the probability landscape, which allows us to look at the problem from an evolutionary point of view. It can be seen that this methodology enables us to learn from a sample of sequences about how a biological system or phenomenon in the real world works.
Affiliation(s)
- Wei-Chia Chen
- Department of Physics, National Chung Cheng University, Chiayi 62102, Taiwan, R.O.C
- Juannan Zhou
- Department of Biology, University of Florida, Gainesville, Florida 32611, U.S.A
- David M. McCandlish
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, U.S.A
|
16
|
Liu J, Guo Z, You H, Zhang C, Lai L. All-Atom Protein Sequence Design Based on Geometric Deep Learning. Angew Chem Int Ed Engl 2024:e202411461. [PMID: 39295564 DOI: 10.1002/anie.202411461] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2024] [Revised: 09/09/2024] [Accepted: 09/18/2024] [Indexed: 09/21/2024]
Abstract
Designing sequences for specific protein backbones is a key step in creating new functional proteins. Here, we introduce GeoSeqBuilder, a deep learning framework that integrates protein sequence generation with side chain conformation prediction to produce the complete all-atom structures for designed sequences. GeoSeqBuilder uses spatial geometric features from protein backbones and explicitly includes three-body interactions of neighboring residues. GeoSeqBuilder achieves native residue type recovery rate of 51.6 %, comparable to ProteinMPNN and other leading methods, while accurately predicting side chain conformations. We first used GeoSeqBuilder to design sequences for thioredoxin and a hallucinated three-helical bundle protein. All the 15 tested sequences expressed as soluble monomeric proteins with high thermal stability, and the 2 high-resolution crystal structures solved closely match the designed models. The generated protein sequences exhibit low similarity (minimum 23 %) to the original sequences, with significantly altered hydrophobic cores. We further redesigned the hydrophobic core of glutathione peroxidase 4, and 3 of the 5 designs showed improved enzyme activity. Although further testing is needed, the high experimental success rate in our testing demonstrates that GeoSeqBuilder is a powerful tool for designing novel sequences for predefined protein structures with atomic details. GeoSeqBuilder is available at https://github.com/PKUliujl/GeoSeqBuilder.
Affiliation(s)
- Jiale Liu
- Center for Life Sciences Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
- Zheng Guo
- Center for Life Sciences Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
- Hantian You
- BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing, 100871, China
- Changsheng Zhang
- BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing, 100871, China
- Luhua Lai
- Center for Life Sciences Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
- BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing, 100871, China
- Center for Quantitative Biology Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
- Chengdu Academy for Advanced Interdisciplinary Biotechnologies, Peking University, Chengdu, 510100, Sichuan, China
|
17
|
Gao J, Liu H, Zhuo C, Zeng C, Zhao Y. Predicting Small Molecule Binding Nucleotides in RNA Structures Using RNA Surface Topography. J Chem Inf Model 2024. [PMID: 39230508 DOI: 10.1021/acs.jcim.4c01264] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2024]
Abstract
RNA small molecule interactions play a crucial role in drug discovery and inhibitor design. Identifying RNA small molecule binding nucleotides is essential and requires methods that exhibit a high predictive ability to facilitate drug discovery and inhibitor design. Existing methods can predict the binding nucleotides of simple RNA structures, but it is hard to predict binding nucleotides in complex RNA structures with junctions. To address this limitation, we developed a new deep learning model based on spatial correlation, ZHmolReSTasite, which can accurately predict binding nucleotides of small and large RNA with junctions. We utilize RNA surface topography to consider the spatial correlation, characterizing nucleotides from sequence and tertiary structures to learn a high-level representation. Our method outperforms existing methods for benchmark test sets composed of simple RNA structures, achieving precision values of 72.9% on TE18 and 76.7% on RB9 test sets. For a challenging test set composed of RNA structures with junctions, our method outperforms the second best method by 11.6% in precision. Moreover, ZHmolReSTasite demonstrates robustness regarding the predicted RNA structures. In summary, ZHmolReSTasite successfully incorporates spatial correlation, outperforms previous methods on small and large RNA structures using RNA surface topography, and can provide valuable insights into RNA small molecule prediction and accelerate RNA inhibitor design.
Affiliation(s)
- Jiaming Gao
- Institute of Biophysics and Department of Physics, Central China Normal University, Wuhan 430079, China
- Haoquan Liu
- Institute of Biophysics and Department of Physics, Central China Normal University, Wuhan 430079, China
- Chen Zhuo
- Institute of Biophysics and Department of Physics, Central China Normal University, Wuhan 430079, China
- Chengwei Zeng
- Institute of Biophysics and Department of Physics, Central China Normal University, Wuhan 430079, China
- Yunjie Zhao
- Institute of Biophysics and Department of Physics, Central China Normal University, Wuhan 430079, China
|
18
|
Dietler N, Abbara A, Choudhury S, Bitbol AF. Impact of phylogeny on the inference of functional sectors from protein sequence data. PLoS Comput Biol 2024; 20:e1012091. [PMID: 39312591 PMCID: PMC11449291 DOI: 10.1371/journal.pcbi.1012091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2024] [Revised: 10/03/2024] [Accepted: 09/10/2024] [Indexed: 09/25/2024] Open
Abstract
Statistical analysis of multiple sequence alignments of homologous proteins has revealed groups of coevolving amino acids called sectors. These groups of amino-acid sites feature collective correlations in their amino-acid usage, and they are associated to functional properties. Modeling showed that nonlinear selection on an additive functional trait of a protein is generically expected to give rise to a functional sector. These modeling results motivated a principled method, called ICOD, which is designed to identify functional sectors, as well as mutational effects, from sequence data. However, a challenge for all methods aiming to identify sectors from multiple sequence alignments is that correlations in amino-acid usage can also arise from the mere fact that homologous sequences share common ancestry, i.e. from phylogeny. Here, we generate controlled synthetic data from a minimal model comprising both phylogeny and functional sectors. We use this data to dissect the impact of phylogeny on sector identification and on mutational effect inference by different methods. We find that ICOD is most robust to phylogeny, but that conservation is also quite robust. Next, we consider natural multiple sequence alignments of protein families for which deep mutational scan experimental data is available. We show that in this natural data, conservation and ICOD best identify sites with strong functional roles, in agreement with our results on synthetic data. Importantly, these two methods have different premises, since they respectively focus on conservation and on correlations. Thus, their joint use can reveal complementary information.
Affiliation(s)
- Nicola Dietler
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Alia Abbara
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Subham Choudhury
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Anne-Florence Bitbol
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
|
19
|
Illig AM, Siedhoff NE, Davari MD, Schwaneberg U. Evolutionary Probability and Stacked Regressions Enable Data-Driven Protein Engineering with Minimized Experimental Effort. J Chem Inf Model 2024; 64:6350-6360. [PMID: 39088689 DOI: 10.1021/acs.jcim.4c00704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/03/2024]
Abstract
Protein engineering through directed evolution and (semi)rational approaches is routinely applied to optimize protein properties for a broad range of applications in industry and academia. The multitude of possible variants, combined with limited screening throughput, hampers efficient protein engineering. Data-driven strategies have emerged as a powerful tool to model the protein fitness landscape that can be explored in silico, significantly accelerating protein engineering campaigns. However, such methods require a certain amount of data, which often cannot be provided, to generate a reliable model of the fitness landscape. Here, we introduce MERGE, a method that combines direct coupling analysis (DCA) and machine learning (ML). MERGE enables data-driven protein engineering when only limited data are available for training, typically ranging from 50 to 500 labeled sequences. Our method demonstrates remarkable performance in predicting a protein's fitness value and rank based on its sequence across diverse proteins and properties. Notably, MERGE outperforms state-of-the-art methods when only small data sets are available for modeling, requiring fewer computational resources, and proving particularly promising for protein engineers who have access to limited amounts of data.
Affiliation(s)
- Niklas E Siedhoff
- Institute of Biotechnology, RWTH Aachen University, Worringerweg 3, 52074 Aachen, Germany
- Mehdi D Davari
- Department of Bioorganic Chemistry, Leibniz Institute of Plant Biochemistry, Weinberg 3, 06120 Halle, Germany
- Ulrich Schwaneberg
- Institute of Biotechnology, RWTH Aachen University, Worringerweg 3, 52074 Aachen, Germany
|
20
|
Innocenti G, Obara M, Costa B, Jacobsen H, Katzmarzyk M, Cicin-Sain L, Kalinke U, Galardini M. Real-time identification of epistatic interactions in SARS-CoV-2 from large genome collections. Genome Biol 2024; 25:228. [PMID: 39175058 PMCID: PMC11342480 DOI: 10.1186/s13059-024-03355-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2024] [Accepted: 07/26/2024] [Indexed: 08/24/2024] Open
Abstract
BACKGROUND: The emergence of the SARS-CoV-2 virus has highlighted the importance of genomic epidemiology in understanding the evolution of pathogens and guiding public health interventions. The Omicron variant in particular has underscored the role of epistasis in the evolution of lineages with both higher infectivity and immune escape, and therefore the necessity to update surveillance pipelines to detect them early on.
RESULTS: In this study, we apply a method based on mutual information between positions in a multiple sequence alignment, which is capable of scaling up to millions of samples. We show how it can reliably predict known experimentally validated epistatic interactions, even when using as few as 10,000 sequences, which opens the possibility of making it a near real-time prediction system. We test this possibility by modifying the method to account for the sample collection date and apply it retrospectively to multiple sequence alignments for each month between March 2020 and March 2023. We detected a cornerstone epistatic interaction in the Spike protein between codons 498 and 501 as soon as seven samples with a double mutation were present in the dataset, thus demonstrating the method's sensitivity. We test the ability of the method to make inferences about emerging interactions by testing candidates predicted after March 2023, which we validate experimentally.
CONCLUSIONS: We show how known epistatic interactions in SARS-CoV-2 can be detected with high sensitivity, and how emerging ones can be quickly prioritized for experimental validation, an approach that could be implemented downstream of pandemic genome sequencing efforts.
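The core quantity named in the abstract, mutual information between two positions of a multiple sequence alignment, can be sketched in a few lines. This is a minimal illustration only: it computes plain empirical mutual information on a made-up toy alignment, and it does not reproduce the authors' pipeline, their handling of sample collection dates, or any sequence weighting. The names `column_mutual_information`, `rank_pairs`, and `toy_msa` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def column_mutual_information(msa, i, j):
    """Empirical mutual information (in bits) between columns i and j
    of an alignment given as a list of equal-length strings."""
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), c in count_ij.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_i[a] * count_j[b]))
    return mi


def rank_pairs(msa):
    """Return (mi, i, j) for every column pair, highest MI first."""
    length = len(msa[0])
    scored = [(column_mutual_information(msa, i, j), i, j)
              for i, j in combinations(range(length), 2)]
    return sorted(scored, key=lambda t: -t[0])


# Toy alignment: columns 0 and 1 always change together,
# column 2 is fully conserved, column 3 varies once.
toy_msa = ["ACDA", "ACDA", "GTDA", "GTDC"]
ranking = rank_pairs(toy_msa)
```

In the toy alignment, columns 0 and 1 co-vary perfectly and carry exactly 1 bit of mutual information, while any pair involving the conserved column 2 scores 0. Raw mutual information is also inflated by shared ancestry and small sample sizes, so candidate pairs ranked this way would still need the kind of validation the abstract describes.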
Affiliation(s)
- Gabriel Innocenti
- Institute for Molecular Bacteriology, TWINCORE Centre for Experimental and Clinical Infection Research, a joint venture between the Hannover Medical School (MHH) and the Helmholtz Centre for Infection Research (HZI), Hannover, Germany
- Cluster of Excellence RESIST (EXC 2155), Hannover Medical School (MHH), Hannover, Germany
- Center for Cancer Research, Medical University of Vienna, Vienna, Austria
- Maureen Obara
- Institute for Experimental Infection Research, TWINCORE Centre for Experimental and Clinical Infection Research, a joint venture between the Hannover Medical School (MHH) and the Helmholtz Centre for Infection Research (HZI), Hannover, Germany
- Bibiana Costa
- Institute for Experimental Infection Research, TWINCORE Centre for Experimental and Clinical Infection Research, a joint venture between the Hannover Medical School (MHH) and the Helmholtz Centre for Infection Research (HZI), Hannover, Germany
- Henning Jacobsen
- Helmholtz Centre for Infection Research, Department of Viral Immunology (VIRI), Brunswick, Germany
- Centre for Individualized Infection Medicine (CiiM) a Joint Venture of Helmholtz Centre for Infection Research and Hannover Medical School, Hannover, Germany
- Maeva Katzmarzyk
- Helmholtz Centre for Infection Research, Department of Viral Immunology (VIRI), Brunswick, Germany
- Centre for Individualized Infection Medicine (CiiM) a Joint Venture of Helmholtz Centre for Infection Research and Hannover Medical School, Hannover, Germany
- Luka Cicin-Sain
- Helmholtz Centre for Infection Research, Department of Viral Immunology (VIRI), Brunswick, Germany
- Centre for Individualized Infection Medicine (CiiM) a Joint Venture of Helmholtz Centre for Infection Research and Hannover Medical School, Hannover, Germany
- Ulrich Kalinke
- Cluster of Excellence RESIST (EXC 2155), Hannover Medical School (MHH), Hannover, Germany
- Institute for Experimental Infection Research, TWINCORE Centre for Experimental and Clinical Infection Research, a joint venture between the Hannover Medical School (MHH) and the Helmholtz Centre for Infection Research (HZI), Hannover, Germany
- Marco Galardini
- Institute for Molecular Bacteriology, TWINCORE Centre for Experimental and Clinical Infection Research, a joint venture between the Hannover Medical School (MHH) and the Helmholtz Centre for Infection Research (HZI), Hannover, Germany.
- Cluster of Excellence RESIST (EXC 2155), Hannover Medical School (MHH), Hannover, Germany.
|
21
|
Ohnuki J, Okazaki KI. Integration of AlphaFold with Molecular Dynamics for Efficient Conformational Sampling of Transporter Protein NarK. J Phys Chem B 2024. [PMID: 39066727 DOI: 10.1021/acs.jpcb.4c02726] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/30/2024]
Abstract
Transporter proteins carry their substrate across the cell membrane by changing their conformation. Thus, conformational dynamics are crucial for transport function. However, clarifying the complete transport cycle is challenging even with the current structural biology approach. Molecular dynamics (MD) simulation is a computational approach that can provide the time-resolved conformational dynamics of transporter proteins in atomic details but suffers from a high computational cost. Here, we integrate state-of-the-art protein structure prediction AI, AlphaFold2 (AF2), with MD simulation to reduce the computational cost. Focusing on the transporter protein NarK, we first show that AF2 sampled broad conformations of NarK, including the inward-open, occluded, and outward-open states. We also applied the coevolution-informed mutation in AF2, identifying state-shifting mutations. Then, we show that MD simulations from AF2-generated outward-open conformation, which is experimentally unresolved, captured the essence of the conformational state. We also found that MD simulations from AF2-generated intermediates showed transient dynamics like a transition state connecting two conformational states. This study paves the way for efficient conformational sampling of transporter proteins.
Affiliation(s)
- Jun Ohnuki
- Research Center for Computational Science, Institute for Molecular Science, National Institutes of Natural Sciences, Okazaki, Aichi 444-8585, Japan
- Graduate Institute for Advanced Studies, SOKENDAI, Okazaki, Aichi 444-8585, Japan
- Kei-Ichi Okazaki
- Research Center for Computational Science, Institute for Molecular Science, National Institutes of Natural Sciences, Okazaki, Aichi 444-8585, Japan
- Graduate Institute for Advanced Studies, SOKENDAI, Okazaki, Aichi 444-8585, Japan
|
22
|
Martí JM, Hsu C, Rochereau C, Xu C, Blazejewski T, Nisonoff H, Leonard SP, Kang-Yun CS, Chlebek J, Ricci DP, Park D, Wang H, Listgarten J, Jiao Y, Allen JE. GENTANGLE: integrated computational design of gene entanglements. Bioinformatics 2024; 40:btae380. [PMID: 38905502 PMCID: PMC11251573 DOI: 10.1093/bioinformatics/btae380] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Revised: 06/01/2024] [Accepted: 06/14/2024] [Indexed: 06/23/2024] Open
Abstract
SUMMARY: The design of two overlapping genes in a microbial genome is an emerging technique for adding more reliable control mechanisms in engineered organisms for increased stability. The design of functional overlapping gene pairs is a challenging procedure, and computational design tools are used to improve the efficiency to deploy successful designs in genetically engineered systems. GENTANGLE (Gene Tuples ArraNGed in overLapping Elements) is a high-performance containerized pipeline for the computational design of two overlapping genes translated in different reading frames of the genome. This new software package can be used to design and test gene entanglements for microbial engineering projects using arbitrary sets of user-specified gene pairs.
AVAILABILITY AND IMPLEMENTATION: The GENTANGLE source code and its submodules are freely available on GitHub at https://github.com/BiosecSFA/gentangle. The DATANGLE (DATA for genTANGLE) repository contains related data and results and is freely available on GitHub at https://github.com/BiosecSFA/datangle. The GENTANGLE container is freely available on Singularity Cloud Library at https://cloud.sylabs.io/library/khyox/gentangle/gentangle.sif. The GENTANGLE repository wiki (https://github.com/BiosecSFA/gentangle/wiki), website (https://biosecsfa.github.io/gentangle/), and user manual contain detailed instructions on how to use the different components of software and data, including examples and reproducing the results. The code is licensed under the GNU Affero General Public License version 3 (https://www.gnu.org/licenses/agpl.html).
Affiliation(s)
- Jose Manuel Martí
- Global Security Computing Applications Division, Lawrence Livermore National Laboratory, Livermore, CA 94550, United States
- Chloe Hsu
- Center for Computational Biology, University of California Berkeley, Berkeley, CA 94720, USA
- Charlotte Rochereau
- Department of Systems Biology, Columbia University, New York, NY 10023, United States
- Chenling Xu
- Biosciences & Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, CA 94550, United States
- Tomasz Blazejewski
- Department of Systems Biology, Columbia University, New York, NY 10023, United States
- Hunter Nisonoff
- Center for Computational Biology, University of California Berkeley, Berkeley, CA 94720, USA
- Sean P Leonard
- Biosciences & Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, CA 94550, United States
- Christina S Kang-Yun
- Biosciences & Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, CA 94550, United States
- Jennifer Chlebek
- Biosciences & Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, CA 94550, United States
- Dante P Ricci
- Biosciences & Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, CA 94550, United States
- Dan Park
- Biosciences & Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, CA 94550, United States
- Harris Wang
- Department of Systems Biology, Columbia University, New York, NY 10023, United States
- Jennifer Listgarten
- Center for Computational Biology, University of California Berkeley, Berkeley, CA 94720, USA
- Yongqin Jiao
- Biosciences & Biotechnology Division, Lawrence Livermore National Laboratory, Livermore, CA 94550, United States
- Jonathan E Allen
- Global Security Computing Applications Division, Lawrence Livermore National Laboratory, Livermore, CA 94550, United States
| |
Collapse
|
23
Cocco S, Posani L, Monasson R. Functional effects of mutations in proteins can be predicted and interpreted by guided selection of sequence covariation information. Proc Natl Acad Sci U S A 2024; 121:e2312335121. [PMID: 38889151 PMCID: PMC11214004 DOI: 10.1073/pnas.2312335121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2023] [Accepted: 04/21/2024] [Indexed: 06/20/2024] Open
Abstract
Predicting the effects of one or more mutations on the in vivo or in vitro properties of a wild-type protein is a major computational challenge, due to the presence of epistasis, that is, of interactions between amino acids in the sequence. We introduce a computationally efficient procedure to build minimal epistatic models that predict mutational effects by combining evolutionary (homologous sequence) data with a small amount of mutational-scan data. Mutagenesis measurements guide the selection of links in a sparse graphical model, while the parameters on the nodes and the edges are inferred from sequence data. We show, on 10 mutational scans, that our pipeline performs comparably to state-of-the-art deep networks trained on many more data, while requiring far fewer parameters and hence being more interpretable. In particular, the identified interactions adapt to the wild-type protein and to the fitness or biochemical property experimentally measured, mostly focus on key functional sites, and are not necessarily related to structural contacts. Therefore, our method is able to extract information relevant for one mutational experiment from homologous sequence data reflecting the multitude of structural and functional constraints acting on proteins throughout evolution.
Affiliation(s)
- Simona Cocco
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
- Lorenzo Posani
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
- Rémi Monasson
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
24
Posfai A, Zhou J, McCandlish DM, Kinney JB. Gauge fixing for sequence-function relationships. bioRxiv 2024:2024.05.12.593772. [PMID: 38798671 PMCID: PMC11118547 DOI: 10.1101/2024.05.12.593772] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/29/2024]
Abstract
Quantitative models of sequence-function relationships are ubiquitous in computational biology, e.g., for modeling the DNA binding of transcription factors or the fitness landscapes of proteins. Interpreting these models, however, is complicated by the fact that the values of model parameters can often be changed without affecting model predictions. Before the values of model parameters can be meaningfully interpreted, one must remove these degrees of freedom (called "gauge freedoms" in physics) by imposing additional constraints (a process called "fixing the gauge"). However, strategies for fixing the gauge of sequence-function relationships have received little attention. Here we derive an analytically tractable family of gauges for a large class of sequence-function relationships. These gauges are derived in the context of models with all-order interactions, but an important subset of these gauges can be applied to diverse types of models, including additive models, pairwise-interaction models, and models with higher-order interactions. Many commonly used gauges are special cases of gauges within this family. We demonstrate the utility of this family of gauges by showing how different choices of gauge can be used both to explore complex activity landscapes and to reveal simplified models that are approximately correct within localized regions of sequence space. The results provide practical gauge-fixing strategies and demonstrate the utility of gauge-fixing for model exploration and interpretation.
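The gauge freedom described in this abstract can be illustrated on the simplest case, an additive one-hot model. The sketch below is an assumption-laden toy, not the authors' code: the sequence length, alphabet size, parameter values, and the helper names `predict` and `zero_sum_gauge` are all hypothetical. It shows that shifting per-position parameters while compensating in the constant term leaves every prediction unchanged, and that imposing a zero-sum constraint (one commonly used gauge) maps both parameter sets to the same values.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
L, A = 3, 4  # illustrative sequence length and alphabet size (assumed values)

# Additive model: f(s) = theta0 + sum_l theta[l, s_l]
theta0 = 0.5
theta = rng.normal(size=(L, A))


def predict(theta0, theta, seq):
    """Evaluate the additive model on one sequence of integer-coded characters."""
    return theta0 + sum(theta[l, a] for l, a in enumerate(seq))


def zero_sum_gauge(theta0, theta):
    """Fix the gauge by moving each position's mean effect into the constant term."""
    means = theta.mean(axis=1, keepdims=True)
    return theta0 + means.sum(), theta - means


# A gauge transformation: add c[l] to every parameter at position l
# and subtract the total shift from the constant term.
c = rng.normal(size=(L, 1))
theta0_shifted, theta_shifted = theta0 - c.sum(), theta + c

# Predictions over all A**L sequences are identical for both parameter sets.
seqs = list(itertools.product(range(A), repeat=L))
f_orig = np.array([predict(theta0, theta, s) for s in seqs])
f_shift = np.array([predict(theta0_shifted, theta_shifted, s) for s in seqs])

# After gauge fixing, the two parameter sets coincide.
g0, g = zero_sum_gauge(theta0, theta)
g0_s, g_s = zero_sum_gauge(theta0_shifted, theta_shifted)

print("max prediction difference:", np.abs(f_orig - f_shift).max())
print("max parameter difference after gauge fixing:", np.abs(g - g_s).max())
```

Only the gauge-fixed parameters can be compared across fits; the raw values of `theta` and `theta_shifted` differ even though the two models are the same function.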
Affiliation(s)
- Anna Posfai
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
- Juannan Zhou
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
- Department of Biology, University of Florida, Gainesville, FL, 32611
- David M. McCandlish
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
- Justin B. Kinney
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
25
Hiraizumi M, Perry NT, Durrant MG, Soma T, Nagahata N, Okazaki S, Athukoralage JS, Isayama Y, Pai JJ, Pawluk A, Konermann S, Yamashita K, Hsu PD, Nishimasu H. Structural mechanism of bridge RNA-guided recombination. Nature 2024; 630:994-1002. [PMID: 38926616 PMCID: PMC11208158 DOI: 10.1038/s41586-024-07570-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/27/2023] [Accepted: 05/15/2024] [Indexed: 06/28/2024]
Abstract
Insertion sequence (IS) elements are the simplest autonomous transposable elements found in prokaryotic genomes. We recently discovered that IS110 family elements encode a recombinase and a non-coding bridge RNA (bRNA) that confers modular specificity for target DNA and donor DNA through two programmable loops. Here we report the cryo-electron microscopy structures of the IS110 recombinase in complex with its bRNA, target DNA and donor DNA in three different stages of the recombination reaction cycle. The IS110 synaptic complex comprises two recombinase dimers, one of which houses the target-binding loop of the bRNA and binds to target DNA, whereas the other coordinates the bRNA donor-binding loop and donor DNA. We uncovered the formation of a composite RuvC-Tnp active site that spans the two dimers, positioning the catalytic serine residues adjacent to the recombination sites in both target and donor DNA. A comparison of the three structures revealed that (1) the top strands of target and donor DNA are cleaved at the composite active sites to form covalent 5'-phosphoserine intermediates, (2) the cleaved DNA strands are exchanged and religated to create a Holliday junction intermediate, and (3) this intermediate is subsequently resolved by cleavage of the bottom strands. Overall, this study reveals the mechanism by which a bispecific RNA confers target and donor DNA specificity to IS110 recombinases for programmable DNA recombination.
Affiliation(s)
- Masahiro Hiraizumi
- Department of Chemistry and Biotechnology, Graduate School of Engineering, The University of Tokyo, Tokyo, Japan
- Nicholas T Perry
- Arc Institute, Palo Alto, CA, USA
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
- San Francisco Graduate Program in Bioengineering, University of California, Berkeley, Berkeley, CA, USA
- Teppei Soma
- Department of Chemistry and Biotechnology, Graduate School of Engineering, The University of Tokyo, Tokyo, Japan
- Naoto Nagahata
- Department of Chemistry and Biotechnology, Graduate School of Engineering, The University of Tokyo, Tokyo, Japan
- Sae Okazaki
- Structural Biology Division, Research Center for Advanced Science and Technology, The University of Tokyo, Tokyo, Japan
- Yukari Isayama
- Structural Biology Division, Research Center for Advanced Science and Technology, The University of Tokyo, Tokyo, Japan
- Silvana Konermann
- Arc Institute, Palo Alto, CA, USA
- Department of Biochemistry, Stanford University School of Medicine, Stanford, CA, USA
- Keitaro Yamashita
- Structural Biology Division, Research Center for Advanced Science and Technology, The University of Tokyo, Tokyo, Japan
- Patrick D Hsu
- Arc Institute, Palo Alto, CA, USA
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
- Hiroshi Nishimasu
- Department of Chemistry and Biotechnology, Graduate School of Engineering, The University of Tokyo, Tokyo, Japan
- Structural Biology Division, Research Center for Advanced Science and Technology, The University of Tokyo, Tokyo, Japan
- Inamori Research Institute for Science, Kyoto, Japan
26
Mallawaarachchi S, Tonkin-Hill G, Pöntinen A, Calland J, Gladstone R, Arredondo-Alonso S, MacAlasdair N, Thorpe H, Top J, Sheppard S, Balding D, Croucher N, Corander J. Detecting co-selection through excess linkage disequilibrium in bacterial genomes. NAR Genom Bioinform 2024; 6:lqae061. [PMID: 38846349 PMCID: PMC11155488 DOI: 10.1093/nargab/lqae061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/11/2024] [Revised: 04/15/2024] [Accepted: 05/14/2024] [Indexed: 06/09/2024] Open
Abstract
Population genomics has revolutionized our ability to study bacterial evolution by enabling data-driven discovery of the genetic architecture of trait variation. Genome-wide association studies (GWAS) have more recently become accompanied by genome-wide epistasis and co-selection (GWES) analysis, which offers a phenotype-free approach to generating hypotheses about selective processes that simultaneously impact multiple loci across the genome. However, existing GWES methods only consider associations between distant pairs of loci within the genome due to the strong impact of linkage disequilibrium (LD) over short distances. Based on the general functional organisation of genomes, it is nevertheless expected that the majority of co-selection and epistasis will act within relatively close genomic proximity, on co-variation occurring within genes and their promoter regions, and within operons. Here, we introduce LDWeaver, which enables an exhaustive GWES across both short- and long-range LD, to disentangle likely neutral co-variation from selection. We demonstrate the ability of LDWeaver to efficiently generate hypotheses about co-selection using large genomic surveys of multiple major human bacterial pathogen species and validate several findings using functional annotation and phenotypic measurements. Our approach will facilitate the study of bacterial evolution in the light of rapidly expanding population genomic data.
Affiliation(s)
- Anna K Pöntinen
- Department of Biostatistics, University of Oslo, Oslo, Norway
- Norwegian National Advisory Unit on Detection of Antimicrobial Resistance, Department of Microbiology and Infection Control, University Hospital of North Norway, Tromsø, Norway
- Jessica K Calland
- Oslo Centre for Biostatistics and Epidemiology, Oslo University Hospital, Oslo, Norway
- Harry A Thorpe
- Department of Biostatistics, University of Oslo, Oslo, Norway
- Janetta Top
- Department of Medical Microbiology, UMC Utrecht, Utrecht, The Netherlands
- Samuel K Sheppard
- Ineos Oxford Institute of Antimicrobial Research, Department of Biology, University of Oxford, Oxford, United Kingdom
- David Balding
- Melbourne Integrative Genomics, School of BioSciences and School of Mathematics & Statistics, University of Melbourne, Parkville, Victoria, Australia
- Nicholas J Croucher
- Department of Infectious Disease Epidemiology, School of Public Health, Imperial College London, United Kingdom
- MRC Centre for Global Infectious Disease Analysis, School of Public Health, Imperial College London, United Kingdom
- Jukka Corander
- Department of Biostatistics, University of Oslo, Oslo, Norway
- Parasites and Microbes, Wellcome Sanger Institute, Cambridge, UK
- Helsinki Institute of Information Technology, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland
27
Durrant MG, Perry NT, Pai JJ, Jangid AR, Athukoralage JS, Hiraizumi M, McSpedon JP, Pawluk A, Nishimasu H, Konermann S, Hsu PD. Bridge RNAs direct programmable recombination of target and donor DNA. Nature 2024; 630:984-993. [PMID: 38926615 PMCID: PMC11208160 DOI: 10.1038/s41586-024-07552-4] [Citation(s) in RCA: 42] [Impact Index Per Article: 42.0] [Received: 09/07/2023] [Accepted: 05/09/2024] [Indexed: 06/28/2024]
Abstract
Genomic rearrangements, encompassing mutational changes in the genome such as insertions, deletions or inversions, are essential for genetic diversity. These rearrangements are typically orchestrated by enzymes that are involved in fundamental DNA repair processes, such as homologous recombination, or in the transposition of foreign genetic material by viruses and mobile genetic elements. Here we report that IS110 insertion sequences, a family of minimal and autonomous mobile genetic elements, express a structured non-coding RNA that binds specifically to their encoded recombinase. This bridge RNA contains two internal loops encoding nucleotide stretches that base-pair with the target DNA and the donor DNA, which is the IS110 element itself. We demonstrate that the target-binding and donor-binding loops can be independently reprogrammed to direct sequence-specific recombination between two DNA molecules. This modularity enables the insertion of DNA into genomic target sites, as well as programmable DNA excision and inversion. The IS110 bridge recombination system expands the diversity of nucleic-acid-guided systems beyond CRISPR and RNA interference, offering a unified mechanism for the three fundamental DNA rearrangements (insertion, excision and inversion) that are required for genome design.
Affiliation(s)
- Matthew G Durrant
- Arc Institute, Palo Alto, CA, USA
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
- Nicholas T Perry
- Arc Institute, Palo Alto, CA, USA
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
- University of California, Berkeley-University of California, San Francisco Graduate Program in Bioengineering, Berkeley, CA, USA
- Aditya R Jangid
- Arc Institute, Palo Alto, CA, USA
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
- Masahiro Hiraizumi
- Department of Chemistry and Biotechnology, Graduate School of Engineering, University of Tokyo, Tokyo, Japan
- Hiroshi Nishimasu
- Department of Chemistry and Biotechnology, Graduate School of Engineering, University of Tokyo, Tokyo, Japan
- Structural Biology Division, Research Center for Advanced Science and Technology, University of Tokyo, Tokyo, Japan
- Department of Biological Sciences, Graduate School of Science, University of Tokyo, Tokyo, Japan
- Inamori Research Institute for Science, Kyoto, Japan
- Japan Science and Technology Agency, Core Research for Evolutional Science and Technology, Saitama, Japan
- Silvana Konermann
- Arc Institute, Palo Alto, CA, USA
- Department of Biochemistry, Stanford University School of Medicine, Stanford, CA, USA
- Patrick D Hsu
- Arc Institute, Palo Alto, CA, USA
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
28
Wang D, Frechette LB, Best RB. On the role of native contact cooperativity in protein folding. Proc Natl Acad Sci U S A 2024; 121:e2319249121. [PMID: 38776371 PMCID: PMC11145220 DOI: 10.1073/pnas.2319249121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2023] [Accepted: 04/11/2024] [Indexed: 05/25/2024] Open
Abstract
The consistency of energy landscape theory predictions with available experimental data, as well as direct evidence from molecular simulations, have shown that protein folding mechanisms are largely determined by the contacts present in the native structure. As expected, native contacts are generally energetically favorable. However, there are usually at least as many energetically favorable nonnative pairs owing to the greater number of possible nonnative interactions. This apparent frustration must therefore be reduced by the greater cooperativity of native interactions. In this work, we analyze the statistics of contacts in the unbiased all-atom folding trajectories obtained by Shaw and coworkers, focusing on the unfolded state. By computing mutual cooperativities between contacts formed in the unfolded state, we show that native contacts form the most cooperative pairs, while cooperativities among nonnative or between native and nonnative contacts are typically much less favorable or even anticooperative. Furthermore, we show that the largest network of cooperative interactions observed in the unfolded state consists mainly of native contacts, suggesting that this set of mutually reinforcing interactions has evolved to stabilize the native state.
Affiliation(s)
- David Wang
- Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, NIH, Bethesda, MD 20892-0520
- Department of Biology, Johns Hopkins University, Baltimore, MD 21218
- Layne B. Frechette
- Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, NIH, Bethesda, MD 20892-0520
- Martin A. Fisher School of Physics, Brandeis University, Waltham, MA 02453
- Robert B. Best
- Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, NIH, Bethesda, MD 20892-0520
29
Chen K, Litfin T, Singh J, Zhan J, Zhou Y. MARS and RNAcmap3: The Master Database of All Possible RNA Sequences Integrated with RNAcmap for RNA Homology Search. Genomics Proteomics Bioinformatics 2024; 22:qzae018. [PMID: 38872612 PMCID: PMC12053375 DOI: 10.1093/gpbjnl/qzae018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2023] [Revised: 09/24/2023] [Accepted: 10/31/2023] [Indexed: 06/15/2024]
Abstract
The recent success of AlphaFold2 in protein structure prediction relied heavily on co-evolutionary information derived from homologous protein sequences found in the huge, integrated database of protein sequences (Big Fantastic Database). In contrast, the existing nucleotide databases were not consolidated to facilitate wider and deeper homology search. Here, we built a comprehensive database by incorporating the non-coding RNA (ncRNA) sequences from RNAcentral, the transcriptome assembly and metagenome assembly from metagenomics RAST (MG-RAST), the genomic sequences from Genome Warehouse (GWH), and the genomic sequences from MGnify, in addition to the nucleotide (nt) database and its subsets in the National Center for Biotechnology Information (NCBI). The resulting Master database of All possible RNA sequences (MARS) is 20-fold larger than NCBI's nt database or 60-fold larger than RNAcentral. The new dataset along with a new split-search strategy allows a substantial improvement in homology search over existing state-of-the-art techniques. It also yields more accurate and more sensitive multiple sequence alignments (MSAs) than manually curated MSAs from Rfam for the majority of structured RNAs mapped to Rfam. The results indicate that MARS coupled with the fully automatic homology search tool RNAcmap will be useful for improved structural and functional inference of ncRNAs and RNA language models based on MSAs. MARS is accessible at https://ngdc.cncb.ac.cn/omix/release/OMIX003037, and RNAcmap3 is accessible at http://zhouyq-lab.szbl.ac.cn/download/.
Affiliation(s)
- Ke Chen
- Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
- Peking University Shenzhen Graduate School, Shenzhen 518055, China
- University of Science and Technology of China, Hefei 230026, China
- Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou 215123, China
- Thomas Litfin
- Institute for Glycomics, Griffith University, Southport, QLD 4222, Australia
- Jaswinder Singh
- Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
- Jian Zhan
- Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
- Yaoqi Zhou
- Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
- Peking University Shenzhen Graduate School, Shenzhen 518055, China
- Institute for Glycomics, Griffith University, Southport, QLD 4222, Australia
30
Ose NJ, Campitelli P, Modi T, Kazan IC, Kumar S, Ozkan SB. Some mechanistic underpinnings of molecular adaptations of SARS-COV-2 spike protein by integrating candidate adaptive polymorphisms with protein dynamics. eLife 2024; 12:RP92063. [PMID: 38713502 PMCID: PMC11076047 DOI: 10.7554/elife.92063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2024] Open
Abstract
We integrate evolutionary predictions based on the neutral theory of molecular evolution with protein dynamics to generate mechanistic insight into the molecular adaptations of the SARS-COV-2 spike (S) protein. With this approach, we first identified candidate adaptive polymorphisms (CAPs) of the SARS-CoV-2 S protein and assessed the impact of these CAPs through dynamics analysis. Not only have we found that CAPs frequently overlap with well-known functional sites, but also, using several different dynamics-based metrics, we reveal the critical allosteric interplay between SARS-CoV-2 CAPs and the S protein binding sites with the human ACE2 (hACE2) protein. CAPs interact far differently with the hACE2 binding site residues in the open conformation of the S protein compared to the closed form. In particular, the CAP sites control the dynamics of binding residues in the open state, suggesting an allosteric control of hACE2 binding. We also explored the characteristic mutations of different SARS-CoV-2 strains to find dynamic hallmarks and potential effects of future mutations. Our analyses reveal that Delta strain-specific variants have non-additive (i.e., epistatic) interactions with CAP sites, whereas the less pathogenic Omicron strains have mostly additive mutations. Finally, our dynamics-based analysis suggests that the novel mutations observed in the Omicron strain epistatically interact with the CAP sites to help escape antibody binding.
Affiliation(s)
- Nicholas James Ose
- Department of Physics and Center for Biological Physics, Arizona State University, Tempe, United States
- Paul Campitelli
- Department of Physics and Center for Biological Physics, Arizona State University, Tempe, United States
- Tushar Modi
- Department of Physics and Center for Biological Physics, Arizona State University, Tempe, United States
- I Can Kazan
- Department of Physics and Center for Biological Physics, Arizona State University, Tempe, United States
- Sudhir Kumar
- Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, United States
- Department of Biology, Temple University, Philadelphia, United States
- Center for Genomic Medicine Research, King Abdulaziz University, Jeddah, Saudi Arabia
- Sefika Banu Ozkan
- Department of Physics and Center for Biological Physics, Arizona State University, Tempe, United States
31
Sood A, Schuette G, Zhang B. Dynamical phase transition in models that couple chromatin folding with histone modifications. Phys Rev E 2024; 109:054411. [PMID: 38907407 DOI: 10.1103/physreve.109.054411] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/01/2022] [Accepted: 04/25/2024] [Indexed: 06/24/2024]
Abstract
Genomic regions can acquire heritable epigenetic states through unique histone modifications, which lead to stable gene expression patterns without altering the underlying DNA sequence. However, the relationship between chromatin conformational dynamics and epigenetic stability is poorly understood. In this paper, we propose kinetic models to investigate the dynamic fluctuations of histone modifications and the spatial interactions between nucleosomes. Our model explicitly incorporates the influence of chemical modifications on the structural stability of chromatin and the contribution of chromatin contacts to the cooperative nature of chemical reactions. Through stochastic simulations and analytical theory, we have discovered distinct steady-state outcomes in different kinetic regimes, resembling a dynamical phase transition. Importantly, we have validated that the emergence of this transition, which occurs on biologically relevant timescales, is robust against variations in model design and parameters. Our findings suggest that the viscoelastic properties of chromatin and the timescale at which it transitions from a gel-like to a liquidlike state significantly impact dynamic processes that occur along the one-dimensional DNA sequence.
32
Olsen VK, Whitlock JR, Roudi Y. The quality and complexity of pairwise maximum entropy models for large cortical populations. PLoS Comput Biol 2024; 20:e1012074. [PMID: 38696532 DOI: 10.1371/journal.pcbi.1012074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2023] [Revised: 05/14/2024] [Accepted: 04/10/2024] [Indexed: 05/04/2024] Open
Abstract
We investigate the ability of the pairwise maximum entropy (PME) model to describe the spiking activity of large populations of neurons recorded from the visual, auditory, motor, and somatosensory cortices. To quantify this performance, we use (1) Kullback-Leibler (KL) divergences, (2) the extent to which the pairwise model predicts third-order correlations, and (3) its ability to predict the probability that multiple neurons are simultaneously active. We compare these with the performance of a model with independent neurons and study the relationship between the different performance measures, while varying the population size, mean firing rate of the chosen population, and the bin size used for binarizing the data. We confirm the previously reported excellent performance of the PME model for small population sizes N < 20. But we also find that larger mean firing rates and bin sizes generally decrease performance. Performance for larger populations was generally not as good. For large populations, pairwise models may be good in terms of predicting third-order correlations and the probability of multiple neurons being active, but still significantly worse than small populations in terms of their improvement over the independent model in KL-divergence. We show that these results are independent of the cortical area and of whether approximate methods or Boltzmann learning are used for inferring the pairwise couplings. We compare the scaling of the inferred couplings with N and find it to be well explained by the Sherrington-Kirkpatrick (SK) model, whose strong coupling regime shows a complex phase with many metastable states. We find that, up to the maximum population size studied here, the fitted PME model remains outside its complex phase. However, the standard deviation of the couplings compared to their mean increases, and the model gets closer to the boundary of the complex phase as the population size grows.
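The KL-based comparison between a pairwise model and an independent model can be made concrete on a toy population small enough to enumerate. The sketch below is an illustration under stated assumptions, not the authors' pipeline: the three-neuron "data" distribution is randomly generated, the pairwise model is fitted by plain moment-matching gradient ascent with exact averages (a stand-in for Boltzmann learning), and the name `fraction_explained` is a hypothetical label for the relative KL improvement over the independent model.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N = 3  # toy population size (assumed); all 2**N patterns are enumerated exactly
states = np.array(list(itertools.product([0, 1], repeat=N)), dtype=float)

# Hypothetical "data" distribution over the binary activity patterns.
p_data = rng.dirichlet(np.ones(len(states)))


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for strictly positive p and q."""
    return float(np.sum(p * np.log(p / q)))


# Independent model: product of per-neuron firing probabilities.
rates = p_data @ states
p_ind = np.prod(np.where(states == 1, rates, 1 - rates), axis=1)


def pairwise_dist(h, J):
    """P(s) proportional to exp(h.s + 0.5 * s^T J s), J symmetric, zero diagonal."""
    log_w = states @ h + 0.5 * np.einsum("ki,ij,kj->k", states, J, states)
    w = np.exp(log_w - log_w.max())
    return w / w.sum()


# Target moments: mean activities and pairwise coincidence probabilities.
corr_data = states.T @ (states * p_data[:, None])

# Start from the independent model and ascend the log-likelihood.
h = np.log(rates / (1 - rates))
J = np.zeros((N, N))
for _ in range(5000):
    p = pairwise_dist(h, J)
    h += 0.1 * (rates - p @ states)
    dJ = 0.1 * (corr_data - states.T @ (states * p[:, None]))
    np.fill_diagonal(dJ, 0.0)
    J += dJ

p_pair = pairwise_dist(h, J)
kl_ind, kl_pair = kl(p_data, p_ind), kl(p_data, p_pair)
fraction_explained = (kl_ind - kl_pair) / kl_ind

print(f"KL(data || independent) = {kl_ind:.4f}")
print(f"KL(data || pairwise)    = {kl_pair:.4f}")
print(f"fraction of independent-model KL removed = {fraction_explained:.3f}")
```

Exact enumeration is only feasible for very small N; for populations of the size discussed in the abstract, the model averages would have to be estimated by sampling or approximate methods.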
Affiliation(s)
- Valdemar Kargård Olsen
- Kavli Institute for Systems Neuroscience, Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology, Trondheim, Norway
- Jonathan R Whitlock
- Kavli Institute for Systems Neuroscience, Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology, Trondheim, Norway
- Yasser Roudi
- Kavli Institute for Systems Neuroscience, Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology, Trondheim, Norway
- Department of Mathematics, King's College London, London, United Kingdom
33
Xie T, Huang J. Can Protein Structure Prediction Methods Capture Alternative Conformations of Membrane Transporters? J Chem Inf Model 2024; 64:3524-3536. [PMID: 38564295 DOI: 10.1021/acs.jcim.3c01936] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Understanding the conformational dynamics of proteins, such as the inward-facing (IF) and outward-facing (OF) transition observed in transporters, is vital for elucidating their functional mechanisms. Despite significant advances in protein structure prediction (PSP) over the past three decades, most efforts have been focused on single-state prediction, leaving multistate or alternative conformation prediction (ACP) relatively unexplored. This discrepancy has led to the development of highly accurate PSP methods such as AlphaFold, yet their capabilities for ACP remain limited. To investigate the performance of current PSP methods in ACP, we curated a data set, named IOMemP, consisting of 32 experimentally determined high-resolution IF and OF structures of 16 membrane proteins with substantial conformational changes. We benchmarked 12 representative PSP methods, along with two recent multistate methods based on AlphaFold, against this data set. Our findings reveal a remarkably consistent preference for specific states across various PSP methods. We elucidated how coevolution information in MSAs influences state preference. Moreover, we showed that AlphaFold, when excluding coevolution information, estimated similar energies between the experimental IF and OF conformations, indicating that the energy model learned by AlphaFold is not biased toward any particular state. Our IOMemP data set and benchmark results are anticipated to advance the development of robust ACP methods.
Affiliation(s)
- Tengyu Xie
- College of Life Science, Zhejiang University, Hangzhou, Zhejiang 310058, China
- Key Laboratory of Structural Biology of Zhejiang Province, School of Life Sciences, Westlake University, Hangzhou, Zhejiang 310024, China
- Westlake AI Therapeutics Lab, Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang 310024, China
- Jing Huang
- College of Life Science, Zhejiang University, Hangzhou, Zhejiang 310058, China
- Key Laboratory of Structural Biology of Zhejiang Province, School of Life Sciences, Westlake University, Hangzhou, Zhejiang 310024, China
- Westlake AI Therapeutics Lab, Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang 310024, China
34
|
Fang T, Szklarczyk D, Hachilif R, von Mering C. Enhancing coevolutionary signals in protein-protein interaction prediction through clade-wise alignment integration. Sci Rep 2024; 14:6009. [PMID: 38472223 PMCID: PMC10933411 DOI: 10.1038/s41598-024-55655-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2023] [Accepted: 02/26/2024] [Indexed: 03/14/2024] Open
Abstract
Protein-protein interactions (PPIs) play essential roles in most biological processes. The binding interfaces between interacting proteins impose evolutionary constraints that have successfully been employed to predict PPIs from multiple sequence alignments (MSAs). To construct MSAs, critical choices have to be made: how to ensure the reliable identification of orthologs, and how to optimally balance the need for large alignments versus sufficient alignment quality. Here, we propose a divide-and-conquer strategy for MSA generation: instead of building a single, large alignment for each protein, multiple distinct alignments are constructed under distinct clades in the tree of life. Coevolutionary signals are searched separately within these clades, and are only subsequently integrated using machine learning techniques. We find that this strategy markedly improves overall prediction performance, concomitant with better alignment quality. Using the popular DCA algorithm to systematically search pairs of such alignments, a genome-wide all-against-all interaction scan in a bacterial genome is demonstrated. Given the recent successes of AlphaFold in predicting direct PPIs at atomic detail, a discover-and-refine approach is proposed: our method could provide a fast and accurate strategy for pre-screening the entire genome, submitting to AlphaFold only promising interaction candidates, thus reducing false positives as well as computation time.
Affiliation(s)
- Tao Fang
  - Department of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Damian Szklarczyk
  - Department of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Radja Hachilif
  - Department of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Christian von Mering
  - Department of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
|
35
|
Alvarez S, Nartey CM, Mercado N, de la Paz JA, Huseinbegovic T, Morcos F. In vivo functional phenotypes from a computational epistatic model of evolution. Proc Natl Acad Sci U S A 2024; 121:e2308895121. [PMID: 38285950 PMCID: PMC10861889 DOI: 10.1073/pnas.2308895121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 12/19/2023] [Indexed: 01/31/2024] Open
Abstract
Computational models of evolution are valuable for understanding the dynamics of sequence variation, to infer phylogenetic relationships or potential evolutionary pathways and for biomedical and industrial applications. Despite these benefits, few have validated their propensities to generate outputs with in vivo functionality, which would enhance their value as accurate and interpretable evolutionary algorithms. We demonstrate the power of epistasis inferred from natural protein families to evolve sequence variants in an algorithm we developed called sequence evolution with epistatic contributions (SEEC). Utilizing the Hamiltonian of the joint probability of sequences in the family as fitness metric, we sampled and experimentally tested for in vivo β-lactamase activity in Escherichia coli TEM-1 variants. These evolved proteins can have dozens of mutations dispersed across the structure while preserving sites essential for both catalysis and interactions. Remarkably, these variants retain family-like functionality while being more active than their wild-type predecessor. We found that depending on the inference method used to generate the epistatic constraints, different parameters simulate diverse selection strengths. Under weaker selection, local Hamiltonian fluctuations reliably predict relative changes to variant fitness, recapitulating neutral evolution. SEEC has the potential to explore the dynamics of neofunctionalization, characterize viral fitness landscapes, and facilitate vaccine development.
Affiliation(s)
- Sophia Alvarez
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Charisse M. Nartey
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Nicholas Mercado
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Tea Huseinbegovic
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Faruck Morcos
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
  - Department of Bioengineering, University of Texas at Dallas, Richardson, TX 75080
  - Center for Systems Biology, University of Texas at Dallas, Richardson, TX 75080
|
36
|
Sesta L, Pagnani A, Fernandez-de-Cossio-Diaz J, Uguzzoni G. Inference of annealed protein fitness landscapes with AnnealDCA. PLoS Comput Biol 2024; 20:e1011812. [PMID: 38377054 PMCID: PMC10878520 DOI: 10.1371/journal.pcbi.1011812] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/06/2023] [Accepted: 01/08/2024] [Indexed: 02/22/2024] Open
Abstract
The design of proteins with specific tasks is a major challenge in molecular biology with important diagnostic and therapeutic applications. High-throughput screening methods have been developed to systematically evaluate protein activity, but only a small fraction of possible protein variants can be tested using these techniques. Computational models that explore the sequence space in-silico to identify the fittest molecules for a given function are needed to overcome this limitation. In this article, we propose AnnealDCA, a machine-learning framework to learn the protein fitness landscape from sequencing data derived from a broad range of experiments that use selection and sequencing to quantify protein activity. We demonstrate the effectiveness of our method by applying it to antibody Rep-Seq data of immunized mice and screening experiments, assessing the quality of the fitness landscape reconstructions. Our method can be applied to several experimental cases where a population of protein variants undergoes various rounds of selection and sequencing, without relying on the computation of variants enrichment ratios, and thus can be used even in cases of disjoint sequence samples.
Affiliation(s)
- Luca Sesta
  - Department of Applied Science and Technology, Politecnico di Torino, Torino, Italy
- Andrea Pagnani
  - Department of Applied Science and Technology, Politecnico di Torino, Torino, Italy
  - Italian Institute for Genomic Medicine, Torino, Italy
  - INFN, Sezione di Torino, Torino, Italy
|
37
|
Pucci F, Zerihun MB, Rooman M, Schug A. pycofitness-Evaluating the fitness landscape of RNA and protein sequences. Bioinformatics 2024; 40:btae074. [PMID: 38335928 PMCID: PMC10881095 DOI: 10.1093/bioinformatics/btae074] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 05/29/2023] [Revised: 01/25/2024] [Accepted: 02/06/2024] [Indexed: 02/12/2024] Open
Abstract
MOTIVATION: The accurate prediction of how mutations change biophysical properties of proteins or RNA is a major goal in computational biology with tremendous impacts on protein design and genetic variant interpretation. Evolutionary approaches such as coevolution can help solve this issue. RESULTS: We present pycofitness, a standalone Python-based software package for the in silico mutagenesis of protein and RNA sequences. It is based on coevolution and, more specifically, on a popular inverse statistical approach, namely direct coupling analysis by pseudo-likelihood maximization. Its efficient implementation and user-friendly command line interface make it an easy-to-use tool even for researchers with no bioinformatics background. To illustrate its strengths, we present three applications in which pycofitness efficiently predicts the deleteriousness of genetic variants and the effect of mutations on protein fitness and thermodynamic stability. AVAILABILITY AND IMPLEMENTATION: https://github.com/KIT-MBS/pycofitness.
Affiliation(s)
- Fabrizio Pucci
  - Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050 Brussels, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, 1050 Brussels, Belgium
- Mehari B Zerihun
  - John von Neumann Institute for Computing, Jülich Supercomputer Centre, 52428 Jülich, Germany
- Marianne Rooman
  - Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050 Brussels, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, 1050 Brussels, Belgium
- Alexander Schug
  - John von Neumann Institute for Computing, Jülich Supercomputer Centre, 52428 Jülich, Germany
  - Department of Biology, University of Duisburg-Essen, D-45141 Essen, Germany
|
38
|
Zhao C, Wang S. AttCON: With better MSAs and attention mechanism for accurate protein contact map prediction. Comput Biol Med 2024; 169:107822. [PMID: 38091726 DOI: 10.1016/j.compbiomed.2023.107822] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2023] [Revised: 11/19/2023] [Accepted: 12/04/2023] [Indexed: 02/08/2024]
Abstract
Protein contact map prediction is a critical step in protein structure prediction, and its accuracy is highly contingent upon the feature representations of protein sequence information and the efficacy of deep learning models. In this paper, we propose an algorithm, DeepMSA+, to generate protein multiple sequence alignments (MSAs) and to construct feature representations based on co-evolutionary information and sequence information derived from MSAs. We also propose an improved deep learning model, AttCON, that is trained on these input features to predict protein contact maps. The model incorporates an attention module, and by comparing different attention modules, we find a parameter-free attention module suitable for contact map prediction. Additionally, we use the Focal Loss function to better address the data imbalance issue in protein contact maps. We also developed a weighted evaluation index (W score) for model evaluation, which takes into account a wide range of metrics. W score is comprehensive in its scope, with a particular focus on the precision of predictions for medium-range and long-range contacts. Experimental results show that AttCON achieves good precision results on datasets from CASP11 to CASP15. Compared to some state-of-the-art methods, it achieves an average improvement of over 5% in both medium-range and long-range predictions, and W score is improved by an average of 2 points.
Affiliation(s)
- Che Zhao
  - Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, Yunnan, China
- Shunfang Wang
  - Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming, 650504, Yunnan, China
  - Yunnan Key Laboratory of Intelligent Systems and Computing, Yunnan University, Kunming, 650504, Yunnan, China
|
39
|
Durrant MG, Perry NT, Pai JJ, Jangid AR, Athukoralage JS, Hiraizumi M, McSpedon JP, Pawluk A, Nishimasu H, Konermann S, Hsu PD. Bridge RNAs direct modular and programmable recombination of target and donor DNA. bioRxiv 2024:2024.01.24.577089. [PMID: 38328150 PMCID: PMC10849738 DOI: 10.1101/2024.01.24.577089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/09/2024]
Abstract
Genomic rearrangements, encompassing mutational changes in the genome such as insertions, deletions, or inversions, are essential for genetic diversity. These rearrangements are typically orchestrated by enzymes involved in fundamental DNA repair processes such as homologous recombination or in the transposition of foreign genetic material by viruses and mobile genetic elements (MGEs). We report that IS110 insertion sequences, a family of minimal and autonomous MGEs, express a structured non-coding RNA that binds specifically to their encoded recombinase. This bridge RNA contains two internal loops encoding nucleotide stretches that base-pair with the target DNA and donor DNA, which is the IS110 element itself. We demonstrate that the target-binding and donor-binding loops can be independently reprogrammed to direct sequence-specific recombination between two DNA molecules. This modularity enables DNA insertion into genomic target sites as well as programmable DNA excision and inversion. The IS110 bridge system expands the diversity of nucleic acid-guided systems beyond CRISPR and RNA interference, offering a unified mechanism for the three fundamental DNA rearrangements required for genome design.
Affiliation(s)
- Matthew G. Durrant
  - Arc Institute, 3181 Porter Drive, Palo Alto, CA 94304, USA
  - Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
- Nicholas T. Perry
  - Arc Institute, 3181 Porter Drive, Palo Alto, CA 94304, USA
  - Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
  - University of California, Berkeley - University of California, San Francisco Graduate Program in Bioengineering, Berkeley, CA, USA
- James J. Pai
  - Arc Institute, 3181 Porter Drive, Palo Alto, CA 94304, USA
- Aditya R. Jangid
  - Arc Institute, 3181 Porter Drive, Palo Alto, CA 94304, USA
  - Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
- Masahiro Hiraizumi
  - Department of Chemistry and Biotechnology, Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
- April Pawluk
  - Arc Institute, 3181 Porter Drive, Palo Alto, CA 94304, USA
- Hiroshi Nishimasu
  - Department of Chemistry and Biotechnology, Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
  - Structural Biology Division, Research Center for Advanced Science and Technology, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8904, Japan
  - Department of Biological Sciences, Graduate School of Science, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
  - Inamori Research Institute for Science, 620 Suiginya-cho, Shimogyo-ku, Kyoto 600-8411, Japan
  - Japan Science and Technology Agency, Core Research for Evolutional Science and Technology, 4-1-8, Honcho, Kawaguchi-shi, Saitama 332-0012, Japan
- Silvana Konermann
  - Arc Institute, 3181 Porter Drive, Palo Alto, CA 94304, USA
  - Department of Biochemistry, Stanford University School of Medicine, Stanford, CA, USA
- Patrick D. Hsu
  - Arc Institute, 3181 Porter Drive, Palo Alto, CA 94304, USA
  - Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
|
40
|
Ohnuki J, Jaunet-Lahary T, Yamashita A, Okazaki KI. Accelerated Molecular Dynamics and AlphaFold Uncover a Missing Conformational State of Transporter Protein OxlT. J Phys Chem Lett 2024; 15:725-732. [PMID: 38215403 DOI: 10.1021/acs.jpclett.3c03052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/14/2024]
Abstract
Transporter proteins change their conformations to carry their substrate across the cell membrane. The conformational dynamics is vital to understanding the transport function. We have studied the oxalate transporter (OxlT), an oxalate:formate antiporter from Oxalobacter formigenes, significant in avoiding kidney stone formation. The atomic structure of OxlT has been recently solved in the outward-open and occluded states. However, the inward-open conformation is still missing, hindering a complete understanding of the transporter. Here, we performed a Gaussian accelerated molecular dynamics simulation to sample the extensive conformational space of OxlT and successfully predicted the inward-open conformation where cytoplasmic substrate formate binding was preferred over oxalate binding. We also identified critical interactions for the inward-open conformation. The results were complemented by an AlphaFold2 structure prediction. Although AlphaFold2 solely predicted OxlT in the outward-open conformation, mutation of the identified critical residues made it partly predict the inward-open conformation, identifying possible state-shifting mutations.
Affiliation(s)
- Jun Ohnuki
  - Research Center for Computational Science, Institute for Molecular Science, National Institutes of Natural Sciences, Okazaki 444-8585, Japan
  - Graduate Institute for Advanced Studies, SOKENDAI, Okazaki, Aichi 444-8585, Japan
- Titouan Jaunet-Lahary
  - Research Center for Computational Science, Institute for Molecular Science, National Institutes of Natural Sciences, Okazaki 444-8585, Japan
- Atsuko Yamashita
  - Graduate School of Medicine, Dentistry and Pharmaceutical Sciences, Okayama University, Okayama 700-8530, Japan
- Kei-Ichi Okazaki
  - Research Center for Computational Science, Institute for Molecular Science, National Institutes of Natural Sciences, Okazaki 444-8585, Japan
  - Graduate Institute for Advanced Studies, SOKENDAI, Okazaki, Aichi 444-8585, Japan
|
41
|
Hayes RL, Nixon CF, Marqusee S, Brooks CL. Selection pressures on evolution of ribonuclease H explored with rigorous free-energy-based design. Proc Natl Acad Sci U S A 2024; 121:e2312029121. [PMID: 38194446 PMCID: PMC10801872 DOI: 10.1073/pnas.2312029121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2023] [Accepted: 11/22/2023] [Indexed: 01/11/2024] Open
Abstract
Understanding natural protein evolution and designing novel proteins are motivating interest in development of high-throughput methods to explore large sequence spaces. In this work, we demonstrate the application of multisite λ dynamics (MSλD), a rigorous free energy simulation method, and chemical denaturation experiments to quantify evolutionary selection pressure from sequence-stability relationships and to address questions of design. This study examines a mesophilic phylogenetic clade of ribonuclease H (RNase H), furthering its extensive characterization in earlier studies, focusing on E. coli RNase H (ecRNH) and a more stable consensus sequence (AncCcons) differing at 15 positions. The stabilities of 32,768 chimeras between these two sequences were computed using the MSλD framework. The most stable and least stable chimeras were predicted and tested along with several other sequences, revealing a designed chimera with approximately the same stability increase as AncCcons, but requiring only half the mutations. Comparing the computed stabilities with experiment for 12 sequences reveals a Pearson correlation of 0.86 and root mean squared error of 1.18 kcal/mol, an unprecedented level of accuracy well beyond less rigorous computational design methods. We then quantified selection pressure using a simple evolutionary model in which sequences are selected according to the Boltzmann factor of their stability. Selection temperatures from 110 to 168 K are estimated in three ways by comparing experimental and computational results to evolutionary models. These estimates indicate selection pressure is high, which has implications for evolutionary dynamics and for the accuracy required for design, and suggests accurate high-throughput computational methods like MSλD may enable more effective protein design.
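The chimera count quoted in this abstract follows from simple combinatorics: two parent sequences that differ at 15 positions give 2^15 = 32,768 combinations when each position independently takes the residue of either parent. A minimal Python sketch of that enumeration is given here. The function name `enumerate_chimeras` and the placeholder residues are illustrative assumptions; they are not taken from the paper or from the MSλD software.

```python
from itertools import product

# Number of positions at which the two parent sequences differ (from the abstract).
N_POSITIONS = 15


def enumerate_chimeras(parent_a: str, parent_b: str):
    """Yield every chimera that takes, at each position, the residue of parent A or parent B."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    # Each tuple in the product is one binary choice per position: 0 = parent A, 1 = parent B.
    for choice in product((0, 1), repeat=len(parent_a)):
        yield "".join(b if pick else a for a, b, pick in zip(parent_a, parent_b, choice))


# Placeholder residues only; these are NOT the real ecRNH or AncCcons residues.
parent_a = "A" * N_POSITIONS
parent_b = "G" * N_POSITIONS

chimeras = list(enumerate_chimeras(parent_a, parent_b))
n_chimeras = len(chimeras)
print(n_chimeras)
```

Only the variable positions are modeled; the residues shared by both parents would be identical in every chimera and do not change the count.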
Affiliation(s)
- Ryan L. Hayes
  - Department of Chemical and Biomolecular Engineering, University of California, Irvine, CA 92697
  - Department of Chemistry, University of Michigan, Ann Arbor, MI 48109
- Charlotte F. Nixon
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720
- Susan Marqusee
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720
  - California Institute for Quantitative Biosciences, University of California, Berkeley, CA 94720
  - Department of Chemistry, University of California, Berkeley, CA 94720
- Charles L. Brooks
  - Department of Chemistry, University of Michigan, Ann Arbor, MI 48109
  - Biophysics Program, University of Michigan, Ann Arbor, MI 48109
|
42
|
Marone R, Landmann E, Devaux A, Lepore R, Seyres D, Zuin J, Burgold T, Engdahl C, Capoferri G, Dell’Aglio A, Larrue C, Simonetta F, Rositzka J, Rhiel M, Andrieux G, Gallagher DN, Schröder MS, Wiederkehr A, Sinopoli A, Do Sacramento V, Haydn A, Garcia-Prat L, Divsalar C, Camus A, Xu L, Bordoli L, Schwede T, Porteus M, Tamburini J, Corn JE, Cathomen T, Cornu TI, Urlinger S, Jeker LT. Epitope-engineered human hematopoietic stem cells are shielded from CD123-targeted immunotherapy. J Exp Med 2023; 220:e20231235. [PMID: 37773046 PMCID: PMC10541312 DOI: 10.1084/jem.20231235] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 07/17/2023] [Revised: 09/01/2023] [Accepted: 09/08/2023] [Indexed: 09/30/2023] Open
Abstract
Targeted eradication of transformed or otherwise dysregulated cells using monoclonal antibodies (mAb), antibody-drug conjugates (ADC), T cell engagers (TCE), or chimeric antigen receptor (CAR) cells is very effective for hematologic diseases. Unlike the breakthrough progress achieved for B cell malignancies, there is a pressing need to find suitable antigens for myeloid malignancies. CD123, the interleukin-3 (IL-3) receptor alpha-chain, is highly expressed in various hematological malignancies, including acute myeloid leukemia (AML). However, shared CD123 expression on healthy hematopoietic stem and progenitor cells (HSPCs) bears the risk for myelotoxicity. We demonstrate that epitope-engineered HSPCs were shielded from CD123-targeted immunotherapy but remained functional, while CD123-deficient HSPCs displayed a competitive disadvantage. Transplantation of genome-edited HSPCs could enable tumor-selective targeted immunotherapy while rebuilding a fully functional hematopoietic system. We envision that this approach is broadly applicable to other targets and cells, could render hitherto undruggable targets accessible to immunotherapy, and will allow continued posttransplant therapy, for instance, to treat minimal residual disease (MRD).
Affiliation(s)
- Romina Marone
  - Department of Biomedicine, Basel University Hospital and University of Basel, Basel, Switzerland
  - Transplantation Immunology and Nephrology, Basel University Hospital, Basel, Switzerland
- Emmanuelle Landmann
  - Department of Biomedicine, Basel University Hospital and University of Basel, Basel, Switzerland
  - Transplantation Immunology and Nephrology, Basel University Hospital, Basel, Switzerland
- Anna Devaux
  - Department of Biomedicine, Basel University Hospital and University of Basel, Basel, Switzerland
  - Transplantation Immunology and Nephrology, Basel University Hospital, Basel, Switzerland
- Rosalba Lepore
  - Department of Biomedicine, Basel University Hospital and University of Basel, Basel, Switzerland
  - Transplantation Immunology and Nephrology, Basel University Hospital, Basel, Switzerland
  - Cimeio Therapeutics AG, Basel, Switzerland
  - Ridgeline Discovery GmbH, Basel, Switzerland
- Denis Seyres
  - Department of Biomedicine, Basel University Hospital and University of Basel, Basel, Switzerland
  - Transplantation Immunology and Nephrology, Basel University Hospital, Basel, Switzerland
- Jessica Zuin
  - Department of Biomedicine, Basel University Hospital and University of Basel, Basel, Switzerland
  - Transplantation Immunology and Nephrology, Basel University Hospital, Basel, Switzerland
- Thomas Burgold
  - Department of Biomedicine, Basel University Hospital and University of Basel, Basel, Switzerland
  - Transplantation Immunology and Nephrology, Basel University Hospital, Basel, Switzerland
- Corinne Engdahl
  - Department of Biomedicine, Basel University Hospital and University of Basel, Basel, Switzerland
  - Transplantation Immunology and Nephrology, Basel University Hospital, Basel, Switzerland
- Giuseppina Capoferri
  - Department of Biomedicine, Basel University Hospital and University of Basel, Basel, Switzerland
  - Transplantation Immunology and Nephrology, Basel University Hospital, Basel, Switzerland
- Alessandro Dell’Aglio
  - Department of Biomedicine, Basel University Hospital and University of Basel, Basel, Switzerland
  - Transplantation Immunology and Nephrology, Basel University Hospital, Basel, Switzerland
- Clément Larrue
  - Translational Research Centre in Onco-Hematology, Faculty of Medicine, University of Geneva, and Swiss Cancer Center Leman, Geneva, Switzerland
- Federico Simonetta
  - Division of Hematology, Department of Oncology, Geneva University Hospitals, Geneva, Switzerland
  - Department of Medicine, Translational Research Center for Onco-Hematology, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Julia Rositzka
  - Institute for Transfusion Medicine and Gene Therapy, Medical Center - University of Freiburg, Freiburg, Germany
  - Center for Chronic Immunodeficiency, Faculty of Medicine, University of Freiburg, Freiburg, Germany
- Manuel Rhiel
  - Institute for Transfusion Medicine and Gene Therapy, Medical Center - University of Freiburg, Freiburg, Germany
  - Center for Chronic Immunodeficiency, Faculty of Medicine, University of Freiburg, Freiburg, Germany
- Geoffroy Andrieux
  - Institute of Medical Bioinformatics and Systems Medicine, Medical Center-University of Freiburg, Freiburg, Germany
- Danielle N. Gallagher
  - Department of Biology, Institute of Molecular Health Sciences, ETH Zürich, Zürich, Switzerland
- Markus S. Schröder
  - Department of Biology, Institute of Molecular Health Sciences, ETH Zürich, Zürich, Switzerland
- Anna Haydn
  - Ridgeline Discovery GmbH, Basel, Switzerland
- Anna Camus
  - Cimeio Therapeutics AG, Basel, Switzerland
- Liwen Xu
  - Department of Pediatrics, School of Medicine, Stanford University, Stanford, CA, USA
- Lorenza Bordoli
  - Biozentrum, University of Basel, Basel, Switzerland
  - SIB Swiss Institute of Bioinformatics, Basel, Switzerland
- Torsten Schwede
  - Biozentrum, University of Basel, Basel, Switzerland
  - SIB Swiss Institute of Bioinformatics, Basel, Switzerland
- Matthew Porteus
  - Department of Pediatrics, School of Medicine, Stanford University, Stanford, CA, USA
- Jérôme Tamburini
  - Department of Medicine, Translational Research Center for Onco-Hematology, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Jacob E. Corn
  - Department of Biology, Institute of Molecular Health Sciences, ETH Zürich, Zürich, Switzerland
- Toni Cathomen
  - Institute for Transfusion Medicine and Gene Therapy, Medical Center - University of Freiburg, Freiburg, Germany
  - Center for Chronic Immunodeficiency, Faculty of Medicine, University of Freiburg, Freiburg, Germany
- Tatjana I. Cornu
  - Institute for Transfusion Medicine and Gene Therapy, Medical Center - University of Freiburg, Freiburg, Germany
  - Center for Chronic Immunodeficiency, Faculty of Medicine, University of Freiburg, Freiburg, Germany
- Stefanie Urlinger
  - Cimeio Therapeutics AG, Basel, Switzerland
  - Ridgeline Discovery GmbH, Basel, Switzerland
- Lukas T. Jeker
  - Department of Biomedicine, Basel University Hospital and University of Basel, Basel, Switzerland
  - Transplantation Immunology and Nephrology, Basel University Hospital, Basel, Switzerland
|
43
|
Lee JW, Won JH, Jeon S, Choo Y, Yeon Y, Oh JS, Kim M, Kim S, Joung I, Jang C, Lee SJ, Kim TH, Jin KH, Song G, Kim ES, Yoo J, Paek E, Noh YK, Joo K. DeepFold: enhancing protein structure prediction through optimized loss functions, improved template features, and re-optimized energy function. Bioinformatics 2023; 39:btad712. [PMID: 37995286 PMCID: PMC10699847 DOI: 10.1093/bioinformatics/btad712] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/14/2023] [Revised: 11/17/2023] [Accepted: 11/22/2023] [Indexed: 11/25/2023] Open
Abstract
MOTIVATION: Predicting protein structures with high accuracy is a critical challenge for the broad community of life sciences and industry. Despite progress made by deep neural networks like AlphaFold2, there is a need for further improvements in the quality of detailed structures, such as side-chains, along with protein backbone structures. RESULTS: Building upon the successes of AlphaFold2, the modifications we made include changing the losses of side-chain torsion angles and frame aligned point error, adding loss functions for side chain confidence and secondary structure prediction, and replacing template feature generation with a new alignment method based on conditional random fields. We also performed re-optimization by conformational space annealing using a molecular mechanics energy function which integrates the potential energies obtained from distogram and side-chain prediction. In the CASP15 blind test for single protein and domain modeling (109 domains), DeepFold ranked fourth among 132 groups with improvements in the details of the structure in terms of backbone, side-chain, and Molprobity. In terms of protein backbone accuracy, DeepFold achieved a median GDT-TS score of 88.64 compared with 85.88 of AlphaFold2. For TBM-easy/hard targets, DeepFold ranked at the top based on Z-scores for GDT-TS. This shows its practical value to the structural biology community, which demands highly accurate structures. In addition, a thorough analysis of 55 domains from 39 targets with publicly available structures indicates that DeepFold shows superior side-chain accuracy and Molprobity scores among the top-performing groups. AVAILABILITY AND IMPLEMENTATION: DeepFold tools are open-source software available at https://github.com/newtonjoo/deepfold.
Affiliation(s)
- Jae-Won Lee
  - Department of Computer Science, Hanyang University, Seoul 04763, Korea
  - Center for Advanced Computation, Korea Institute for Advanced Study, Seoul 02455, Korea
- Jong-Hyun Won
  - Department of Computer Science, Hanyang University, Seoul 04763, Korea
  - Center for Advanced Computation, Korea Institute for Advanced Study, Seoul 02455, Korea
- Seonggwang Jeon
  - Department of Computer Science, Hanyang University, Seoul 04763, Korea
  - Center for Advanced Computation, Korea Institute for Advanced Study, Seoul 02455, Korea
- Yujin Choo
  - Center for Advanced Computation, Korea Institute for Advanced Study, Seoul 02455, Korea
  - Department of Artificial Intelligence, Hanyang University, Seoul 04763, Korea
- Yubin Yeon
  - Department of Computer Science, Hanyang University, Seoul 04763, Korea
  - Center for Advanced Computation, Korea Institute for Advanced Study, Seoul 02455, Korea
- Jin-Seon Oh
  - Center for Advanced Computation, Korea Institute for Advanced Study, Seoul 02455, Korea
  - Department of Artificial Intelligence, Hanyang University, Seoul 04763, Korea
- Minsoo Kim
  - Department of Physics, Sungkyunkwan University, Suwon 16419, Korea
- SeonHwa Kim
  - School of Electrical Engineering, Korea University, Seoul 02841, Korea
- Cheongjae Jang
  - Artificial Intelligence Institute, Hanyang University, Seoul 04763, Korea
- Sung Jong Lee
  - Basic Science Research Institute, Changwon National University, Changwon 51140, Korea
- Tae Hyun Kim
  - Department of Computer Science, Hanyang University, Seoul 04763, Korea
- Kyong Hwan Jin
  - School of Electrical Engineering, Korea University, Seoul 02841, Korea
- Giltae Song
  - School of Computer Science and Engineering, Pusan National University, Busan 46241, Korea
- Eun-Sol Kim
  - Department of Computer Science, Hanyang University, Seoul 04763, Korea
- Jejoong Yoo
  - Department of Physics, Sungkyunkwan University, Suwon 16419, Korea
- Eunok Paek
  - Department of Computer Science, Hanyang University, Seoul 04763, Korea
- Yung-Kyun Noh
  - Department of Computer Science, Hanyang University, Seoul 04763, Korea
  - School of Computational Sciences, Korea Institute for Advanced Study, Seoul 02455, Korea
- Keehyoung Joo
  - Center for Advanced Computation, Korea Institute for Advanced Study, Seoul 02455, Korea
|
44
|
Abakarova M, Marquet C, Rera M, Rost B, Laine E. Alignment-based Protein Mutational Landscape Prediction: Doing More with Less. Genome Biol Evol 2023; 15:evad201. [PMID: 37936309 PMCID: PMC10653582 DOI: 10.1093/gbe/evad201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2023] [Revised: 10/27/2023] [Accepted: 11/01/2023] [Indexed: 11/09/2023] Open
Abstract
The wealth of genomic data has boosted the development of computational methods predicting the phenotypic outcomes of missense variants. The most accurate ones exploit multiple sequence alignments, which can be costly to generate. Recent efforts for democratizing protein structure prediction have overcome this bottleneck by leveraging the fast homology search of MMseqs2. Here, we show the usefulness of this strategy for mutational outcome prediction through a large-scale assessment of 1.5M missense variants across 72 protein families. Our study demonstrates the feasibility of producing alignment-based mutational landscape predictions that are both high-quality and compute-efficient for entire proteomes. We provide the community with the whole human proteome mutational landscape and simplified access to our predictive pipeline.
Affiliation(s)
- Marina Abakarova
  - CNRS, IBPS, Laboratory of Computational and Quantitative Biology (LCQB), Sorbonne Université, UMR 7238, Paris 75005, France
  - Université Paris Cité, INSERM UMR U1284, 75004 Paris, France
- Céline Marquet
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748 Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany
- Michael Rera
  - Université Paris Cité, INSERM UMR U1284, 75004 Paris, France
- Burkhard Rost
  - Department of Informatics, Bioinformatics and Computational Biology - i12, TUM-Technical University of Munich, Boltzmannstr. 3, Garching, 85748 Munich, Germany
  - Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, Garching, 85748 Munich, Germany
  - TUM School of Life Sciences Weihenstephan (TUM-WZW), Alte Akademie 8, Freising, Germany
- Elodie Laine
  - CNRS, IBPS, Laboratory of Computational and Quantitative Biology (LCQB), Sorbonne Université, UMR 7238, Paris 75005, France
  - Institut universitaire de France (IUF)
|
45
|
Kilian M, Bischofs IB. Co-evolution at protein-protein interfaces guides inference of stoichiometry of oligomeric protein complexes by de novo structure prediction. Mol Microbiol 2023; 120:763-782. [PMID: 37777474 DOI: 10.1111/mmi.15169] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/31/2023] [Revised: 09/10/2023] [Accepted: 09/11/2023] [Indexed: 10/02/2023]
Abstract
The quaternary structure with specific stoichiometry is pivotal to the specific function of protein complexes. However, determining the structure of many protein complexes experimentally remains a major bottleneck. Structural bioinformatics approaches, such as the deep learning algorithm AlphaFold2-multimer (AF2-multimer), leverage the co-evolution of amino acids and sequence-structure relationships for accurate de novo structure and contact prediction. Pseudo-likelihood maximization direct coupling analysis (plmDCA) has been used to detect co-evolving residue pairs by statistical modeling. Here, we provide evidence that combining both methods can be used for de novo prediction of the quaternary structure and stoichiometry of a protein complex. We achieve this by augmenting the existing AF2-multimer confidence metrics with an interpretable score to identify the complex with an optimal fraction of native contacts of co-evolving residue pairs at intermolecular interfaces. We use this strategy to predict the quaternary structure and non-trivial stoichiometries of Bacillus subtilis spore germination protein complexes with unknown structures. Co-evolution at intermolecular interfaces may therefore synergize with AI-based de novo quaternary structure prediction of structurally uncharacterized bacterial protein complexes.
Affiliation(s)
- Max Kilian
  - Max-Planck-Institute for Terrestrial Microbiology, Marburg, Germany
  - BioQuant Center for Quantitative Analysis of Molecular and Cellular Biosystems, Heidelberg University, Heidelberg, Germany
  - Center for Molecular Biology of Heidelberg University (ZMBH), Heidelberg, Germany
- Ilka B Bischofs
  - Max-Planck-Institute for Terrestrial Microbiology, Marburg, Germany
  - BioQuant Center for Quantitative Analysis of Molecular and Cellular Biosystems, Heidelberg University, Heidelberg, Germany
  - Center for Molecular Biology of Heidelberg University (ZMBH), Heidelberg, Germany
|
46
|
Sawa T, Moriwaki Y, Jiang H, Murase K, Takayama S, Shimizu K, Terada T. Comprehensive computational analysis of the SRK-SP11 molecular interaction underlying self-incompatibility in Brassicaceae using improved structure prediction for cysteine-rich proteins. Comput Struct Biotechnol J 2023; 21:5228-5239. [PMID: 37928947 PMCID: PMC10624595 DOI: 10.1016/j.csbj.2023.10.026] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/11/2023] [Revised: 10/03/2023] [Accepted: 10/16/2023] [Indexed: 11/07/2023] Open
Abstract
Plants employ self-incompatibility (SI) to promote cross-fertilization. In Brassicaceae, this process is regulated by the formation of a complex between the pistil determinant S receptor kinase (SRK) and the pollen determinant S-locus protein 11 (SP11, also known as S-locus cysteine-rich protein, SCR). In our previous study, we used the crystal structures of two eSRK-SP11 complexes in Brassica rapa S8 and S9 haplotypes and nine computationally predicted complex models to demonstrate that only the SRK ectodomain (eSRK) and SP11 pairs derived from the same S haplotype exhibit high binding free energy. However, predicting the eSRK-SP11 complex structures for the other 100+ S haplotypes and genera remains difficult because of SP11 polymorphism in sequence and structure. Although protein structure prediction using AlphaFold2 exhibits considerably high accuracy for most protein monomers and complexes, 46% of the predicted SP11 structures that we tested showed a mean per-residue confidence score (pLDDT) below 75. Here, we demonstrate that the use of curated multiple sequence alignment (MSA) for cysteine-rich proteins significantly improved model accuracy for SP11 and eSRK-SP11 complexes. Additionally, we calculated the binding free energies of the predicted eSRK-SP11 complexes using molecular dynamics (MD) simulations and observed that some Arabidopsis haplotypes formed a binding mode that was critically different from that of B. rapa S8 and S9. Thus, our computational results provide insights into the haplotype-specific eSRK-SP11 binding modes in Brassicaceae at the residue level. The predicted models are freely available at Zenodo, https://doi.org/10.5281/zenodo.8047768.
Affiliation(s)
- Tomoki Sawa
  - Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan
- Yoshitaka Moriwaki
  - Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan
  - Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan
- Hanting Jiang
  - Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan
- Kohji Murase
  - Department of Applied Biological Chemistry, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo 113-8657, Japan
- Seiji Takayama
  - Department of Applied Biological Chemistry, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo 113-8657, Japan
- Kentaro Shimizu
  - Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan
  - Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan
- Tohru Terada
  - Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan
  - Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan
|
47
|
Wang H, Zang Y, Kang Y, Zhang J, Zhang L, Zhang S. ETLD: an encoder-transformation layer-decoder architecture for protein contact and mutation effects prediction. Brief Bioinform 2023; 24:bbad290. [PMID: 37598423 DOI: 10.1093/bib/bbad290] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2023] [Revised: 06/21/2023] [Accepted: 07/26/2023] [Indexed: 08/22/2023] Open
Abstract
The latent features extracted from the multiple sequence alignments (MSAs) of homologous protein families are useful for identifying residue-residue contacts, predicting mutation effects, shaping protein evolution, etc. Over the past three decades, a growing body of supervised and unsupervised machine learning methods have been applied to this field, yielding fruitful results. Here, we propose a novel self-supervised model, called encoder-transformation layer-decoder (ETLD) architecture, capable of capturing protein sequence latent features directly from MSAs. Compared to the typical autoencoder model, ETLD introduces a transformation layer with the ability to learn inter-site couplings, which can be used to parse out the two-dimensional residue-residue contacts map after a simple mathematical derivation or an additional supervised neural network. ETLD retains the process of encoding and decoding sequences, and the predicted probabilities of amino acids at each site can be further used to construct the mutation landscapes for mutation effects prediction, outperforming advanced models such as GEMME, DeepSequence and EVmutation in general. Overall, ETLD is a highly interpretable unsupervised model with great potential for improvement and can be further combined with supervised methods for more extensive and accurate predictions.
Affiliation(s)
- He Wang
  - MOE Key Laboratory for Nonequilibrium Synthesis and Modulation of Condensed Matter, School of Physics, Xi'an Jiaotong University, Xi'an 710049, China
- Yongjian Zang
  - MOE Key Laboratory for Nonequilibrium Synthesis and Modulation of Condensed Matter, School of Physics, Xi'an Jiaotong University, Xi'an 710049, China
- Ying Kang
  - MOE Key Laboratory for Nonequilibrium Synthesis and Modulation of Condensed Matter, School of Physics, Xi'an Jiaotong University, Xi'an 710049, China
- Jianwen Zhang
  - MOE Key Laboratory for Nonequilibrium Synthesis and Modulation of Condensed Matter, School of Physics, Xi'an Jiaotong University, Xi'an 710049, China
- Lei Zhang
  - MOE Key Laboratory for Nonequilibrium Synthesis and Modulation of Condensed Matter, School of Physics, Xi'an Jiaotong University, Xi'an 710049, China
- Shengli Zhang
  - MOE Key Laboratory for Nonequilibrium Synthesis and Modulation of Condensed Matter, School of Physics, Xi'an Jiaotong University, Xi'an 710049, China
|
48
|
Ghoreyshi ZS, George JT. Quantitative approaches for decoding the specificity of the human T cell repertoire. Front Immunol 2023; 14:1228873. [PMID: 37781387 PMCID: PMC10539903 DOI: 10.3389/fimmu.2023.1228873] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/25/2023] [Accepted: 08/17/2023] [Indexed: 10/03/2023] Open
Abstract
T cell receptor (TCR)-peptide-major histocompatibility complex (pMHC) interactions play a vital role in initiating immune responses against pathogens, and the specificity of TCR-pMHC interactions is crucial for developing optimized therapeutic strategies. The advent of high-throughput immunological and structural evaluation of TCR and pMHC has provided an abundance of data for computational approaches that aim to predict favorable TCR-pMHC interactions. Current models are constructed using information on protein sequence, structures, or a combination of both, and utilize a variety of statistical learning-based approaches for identifying the rules governing specificity. This review examines the current theoretical, computational, and deep learning approaches for identifying TCR-pMHC recognition pairs, placing emphasis on each method's mathematical approach, predictive performance, and limitations.
Affiliation(s)
- Zahra S. Ghoreyshi
  - Department of Biomedical Engineering, Texas A&M University, College Station, TX, United States
- Jason T. George
  - Department of Biomedical Engineering, Texas A&M University, College Station, TX, United States
  - Engineering Medicine Program, Texas A&M University, Houston, TX, United States
  - Center for Theoretical Biological Physics, Rice University, Houston, TX, United States
|
49
|
Niu Y, Ni Y, Pati D, Mallick BK. Covariate-Assisted Bayesian Graph Learning for Heterogeneous Data. J Am Stat Assoc 2023; 119:1985-1999. [PMID: 39507103 PMCID: PMC11536292 DOI: 10.1080/01621459.2023.2233744] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/04/2021] [Revised: 06/01/2023] [Accepted: 06/25/2023] [Indexed: 11/08/2024]
Abstract
In a traditional Gaussian graphical model, data homogeneity is routinely assumed with no extra variables affecting the conditional independence. In modern genomic datasets, there is an abundance of auxiliary information, which often gets under-utilized in determining the joint dependency structure. In this article, we consider a Bayesian approach to model undirected graphs underlying heterogeneous multivariate observations with additional assistance from covariates. Building on product partition models, we propose a novel covariate-dependent Gaussian graphical model that allows graphs to vary with covariates so that observations whose covariates are similar share a similar undirected graph. To efficiently embed Gaussian graphical models into our proposed framework, we explore both Gaussian likelihood and pseudo-likelihood functions. For Gaussian likelihood, a G-Wishart distribution is used as a natural conjugate prior, and for the pseudo-likelihood, a product of Gaussian conditionals is used. Moreover, the proposed model has large prior support and is flexible to approximate any v-Hölder conditional variance-covariance matrices with v ∈ (0, 1]. We further show that based on the theory of fractional likelihood, the rate of posterior contraction is minimax optimal assuming the true density to be a Gaussian mixture with a known number of components. The efficacy of the approach is demonstrated via simulation studies and an analysis of a protein network for a breast cancer dataset assisted by mRNA gene expression as covariates.
Affiliation(s)
- Yabo Niu
  - Department of Mathematics, University of Houston
- Yang Ni
  - Department of Statistics, Texas A&M University
|
50
|
Taubert O, von der Lehr F, Bazarova A, Faber C, Knechtges P, Weiel M, Debus C, Coquelin D, Basermann A, Streit A, Kesselheim S, Götz M, Schug A. RNA contact prediction by data efficient deep learning. Commun Biol 2023; 6:913. [PMID: 37674020 PMCID: PMC10482910 DOI: 10.1038/s42003-023-05244-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2023] [Accepted: 08/14/2023] [Indexed: 09/08/2023] Open
Abstract
On the path to full understanding of the structure-function relationship or even design of RNA, structure prediction would offer an intriguing complement to experimental efforts. Any deep learning on RNA structure, however, is hampered by the sparsity of labeled training data. Utilizing the limited data available, we here focus on predicting spatial adjacencies ("contact maps") as a proxy for 3D structure. Our model, BARNACLE, combines the utilization of unlabeled data through self-supervised pre-training and efficient use of the sparse labeled data through an XGBoost classifier. BARNACLE shows a considerable improvement over both the established classical baseline and a deep neural network. In order to demonstrate that our approach can be applied to tasks with similar data constraints, we show that our findings generalize to the related setting of accessible surface area prediction.
Affiliation(s)
- Oskar Taubert
  - Steinbuch Centre for Computing (SCC), Karlsruhe Institute of Technology, 76344, Eggenstein-Leopoldshafen, Germany
- Fabrice von der Lehr
  - Institute for Software Technology (SC), German Aerospace Centre (DLR), 51147, Köln, Germany
- Alina Bazarova
  - Jülich Supercomputing Centre, Forschungszentrum Jülich, 52428, Jülich, Germany
  - Helmholtz AI, 81675, Munich, Germany
- Christian Faber
  - Jülich Supercomputing Centre, Forschungszentrum Jülich, 52428, Jülich, Germany
- Philipp Knechtges
  - Institute for Software Technology (SC), German Aerospace Centre (DLR), 51147, Köln, Germany
  - Helmholtz AI, 81675, Munich, Germany
- Marie Weiel
  - Steinbuch Centre for Computing (SCC), Karlsruhe Institute of Technology, 76344, Eggenstein-Leopoldshafen, Germany
  - Helmholtz AI, 81675, Munich, Germany
- Charlotte Debus
  - Steinbuch Centre for Computing (SCC), Karlsruhe Institute of Technology, 76344, Eggenstein-Leopoldshafen, Germany
  - Helmholtz AI, 81675, Munich, Germany
- Daniel Coquelin
  - Steinbuch Centre for Computing (SCC), Karlsruhe Institute of Technology, 76344, Eggenstein-Leopoldshafen, Germany
  - Helmholtz AI, 81675, Munich, Germany
- Achim Basermann
  - Institute for Software Technology (SC), German Aerospace Centre (DLR), 51147, Köln, Germany
- Achim Streit
  - Steinbuch Centre for Computing (SCC), Karlsruhe Institute of Technology, 76344, Eggenstein-Leopoldshafen, Germany
- Stefan Kesselheim
  - Jülich Supercomputing Centre, Forschungszentrum Jülich, 52428, Jülich, Germany
  - Helmholtz AI, 81675, Munich, Germany
- Markus Götz
  - Steinbuch Centre for Computing (SCC), Karlsruhe Institute of Technology, 76344, Eggenstein-Leopoldshafen, Germany
  - Helmholtz AI, 81675, Munich, Germany
- Alexander Schug
  - Jülich Supercomputing Centre, Forschungszentrum Jülich, 52428, Jülich, Germany
  - Faculty of Biology, University of Duisburg-Essen, 45117, Essen, Germany
|