1
Illig AM, Siedhoff NE, Davari MD, Schwaneberg U. Evolutionary Probability and Stacked Regressions Enable Data-Driven Protein Engineering with Minimized Experimental Effort. J Chem Inf Model 2024. [PMID: 39088689 DOI: 10.1021/acs.jcim.4c00704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/03/2024]
Abstract
Protein engineering through directed evolution and (semi)rational approaches is routinely applied to optimize protein properties for a broad range of applications in industry and academia. The multitude of possible variants, combined with limited screening throughput, hampers efficient protein engineering. Data-driven strategies have emerged as a powerful tool to model the protein fitness landscape that can be explored in silico, significantly accelerating protein engineering campaigns. However, such methods require a certain amount of data, which often cannot be provided, to generate a reliable model of the fitness landscape. Here, we introduce MERGE, a method that combines direct coupling analysis (DCA) and machine learning (ML). MERGE enables data-driven protein engineering when only limited data are available for training, typically ranging from 50 to 500 labeled sequences. Our method demonstrates remarkable performance in predicting a protein's fitness value and rank based on its sequence across diverse proteins and properties. Notably, MERGE outperforms state-of-the-art methods when only small data sets are available for modeling, requiring fewer computational resources, and proving particularly promising for protein engineers who have access to limited amounts of data.
Affiliation(s)
- Niklas E Siedhoff
- Institute of Biotechnology, RWTH Aachen University, Worringerweg 3, 52074 Aachen, Germany
- Mehdi D Davari
- Department of Bioorganic Chemistry, Leibniz Institute of Plant Biochemistry, Weinberg 3, 06120 Halle, Germany
- Ulrich Schwaneberg
- Institute of Biotechnology, RWTH Aachen University, Worringerweg 3, 52074 Aachen, Germany
2
Pawnikar S, Magenheimer BS, Joshi K, Munoz EN, Haldane A, Maser RL, Miao Y. Activation of Polycystin-1 Signaling by Binding of Stalk-derived Peptide Agonists. bioRxiv 2024:2024.01.06.574465. [PMID: 38260358 PMCID: PMC10802338 DOI: 10.1101/2024.01.06.574465] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2024]
Abstract
Polycystin-1 (PC1) is the membrane protein product of the PKD1 gene whose mutation is responsible for 85% of the cases of autosomal dominant polycystic kidney disease (ADPKD). ADPKD is primarily characterized by the formation of renal cysts and potential kidney failure. PC1 is an atypical G protein-coupled receptor (GPCR) consisting of 11 transmembrane helices and an autocatalytic GAIN domain that cleaves PC1 into extracellular N-terminal (NTF) and membrane-embedded C-terminal (CTF) fragments. Recently, signaling activation of the PC1 CTF was shown to be regulated by a stalk tethered agonist (TA), a distinct mechanism observed in the adhesion GPCR family. A novel allosteric activation pathway was elucidated for the PC1 CTF through a combination of Gaussian accelerated molecular dynamics (GaMD), mutagenesis and cellular signaling experiments. Here, we show that synthetic, soluble peptides with 7 to 21 residues derived from the stalk TA, in particular, peptides including the first 9 residues (p9), 17 residues (p17) and 21 residues (p21) exhibited the ability to re-activate signaling by a stalkless PC1 CTF mutant in cellular assays. To reveal molecular mechanisms of stalk peptide-mediated signaling activation, we have applied a novel Peptide GaMD (Pep-GaMD) algorithm to elucidate binding conformations of selected stalk peptide agonists p9, p17 and p21 to the stalkless PC1 CTF. The simulations revealed multiple specific binding regions of the stalk peptide agonists to the PC1 protein including an "intermediate" bound yet inactive state. Our Pep-GaMD simulation findings were consistent with the cellular assay experimental data. Binding of peptide agonists to the TOP domain of PC1 induced close TOP-putative pore loop interactions, a characteristic feature of the PC1 CTF signaling activation mechanism. 
Using sequence covariation analysis of PC1 homologs, we further showed that the peptide binding regions were consistent with covarying residue pairs identified between the TOP domain and the stalk TA. Therefore, structural dynamic insights into the mechanisms of PC1 activation by stalk-derived peptide agonists have enabled an in-depth understanding of PC1 signaling. They will form a foundation for development of PC1 as a therapeutic target for the treatment of ADPKD.
Affiliation(s)
- Shristi Pawnikar
- Center for Computational Biology and Department of Molecular Biosciences, University of Kansas, Lawrence, KS 66047
- Brenda S. Magenheimer
- Clinical Laboratory Sciences, University of Kansas Medical Center, Kansas City, KS 66160
- The Jared Grantham Kidney Institute, University of Kansas Medical Center, Kansas City, KS 66160
- Keya Joshi
- Department of Pharmacology and Computational Medicine Program, University of North Carolina – Chapel Hill, Chapel Hill, NC 27599
- Ericka Nevarez Munoz
- Clinical Laboratory Sciences, University of Kansas Medical Center, Kansas City, KS 66160
- Allan Haldane
- Dept of Physics, and Center for Biophysics and Computational Biology, Temple University, Philadelphia, PA 19122
- Robin L. Maser
- Departments of Biochemistry and Molecular Biology, University of Kansas Medical Center, Kansas City, KS 66160
- Clinical Laboratory Sciences, University of Kansas Medical Center, Kansas City, KS 66160
- The Jared Grantham Kidney Institute, University of Kansas Medical Center, Kansas City, KS 66160
- Yinglong Miao
- Department of Pharmacology and Computational Medicine Program, University of North Carolina – Chapel Hill, Chapel Hill, NC 27599
3
Martin J, Lequerica Mateos M, Onuchic JN, Coluzza I, Morcos F. Machine learning in biological physics: From biomolecular prediction to design. Proc Natl Acad Sci U S A 2024; 121:e2311807121. [PMID: 38913893 PMCID: PMC11228481 DOI: 10.1073/pnas.2311807121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/26/2024] Open
Abstract
Machine learning has been proposed as an alternative to theoretical modeling when dealing with complex problems in biological physics. However, in this perspective, we argue that a more successful approach is a proper combination of these two methodologies. We discuss how ideas coming from physical modeling of neuronal processing led to early formulations of computational neural networks, e.g., Hopfield networks. We then show how modern learning approaches like Potts models, Boltzmann machines, and the transformer architecture are related to each other, specifically, through a shared energy representation. We summarize recent efforts to establish these connections and provide examples of how each of these formulations, integrating physical modeling and machine learning, has been successful in tackling recent problems in biomolecular structure, dynamics, function, evolution, and design. Instances include protein structure prediction; improvement in computational complexity and accuracy of molecular dynamics simulations; better inference of the effects of mutations in proteins leading to improved evolutionary modeling; and, finally, how machine learning is revolutionizing protein engineering and design. Going beyond naturally existing protein sequences, a connection to protein design is discussed where synthetic sequences are able to fold to naturally occurring motifs driven by a model rooted in physical principles. We show that this model is "learnable" and propose its future use in the generation of unique sequences that can fold into a target structure.
Affiliation(s)
- Jonathan Martin
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Marcos Lequerica Mateos
- BCMaterials, Basque Center for Materials, Applications and Nanostructures, Universidad del País Vasco/Euskal Herriko Unibertsitatea Science Park, Leioa 48940, Spain
- José N. Onuchic
- Center for Theoretical Biological Physics, Rice University, Houston, TX 77005
- Department of Physics and Astronomy, Rice University, Houston, TX 77005
- Department of Chemistry, Rice University, Houston, TX 77005
- Department of BioSciences, Rice University, Houston, TX 77005
- Ivan Coluzza
- BCMaterials, Basque Center for Materials, Applications and Nanostructures, Universidad del País Vasco/Euskal Herriko Unibertsitatea Science Park, Leioa 48940, Spain
- Basque Foundation for Science, Ikerbasque, Bilbao 48940, Spain
- Faruck Morcos
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Department of Bioengineering, Center for Systems Biology, University of Texas at Dallas, Richardson, TX 75080
4
Lupo U, Sgarbossa D, Bitbol AF. Pairing interacting protein sequences using masked language modeling. Proc Natl Acad Sci U S A 2024; 121:e2311887121. [PMID: 38913900 PMCID: PMC11228504 DOI: 10.1073/pnas.2311887121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2023] [Accepted: 12/18/2023] [Indexed: 06/26/2024] Open
Abstract
Predicting which proteins interact together from amino acid sequences is an important task. We develop a method to pair interacting protein sequences which leverages the power of protein language models trained on multiple sequence alignments (MSAs), such as MSA Transformer and the EvoFormer module of AlphaFold. We formulate the problem of pairing interacting partners among the paralogs of two protein families in a differentiable way. We introduce a method called Differentiable Pairing using Alignment-based Language Models (DiffPALM) that solves it by exploiting the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context. MSA Transformer encodes coevolution between functionally or structurally coupled amino acids within protein chains. It also captures inter-chain coevolution, despite being trained on single-chain data. Relying on MSA Transformer without fine-tuning, DiffPALM outperforms existing coevolution-based pairing methods on difficult benchmarks of shallow multiple sequence alignments extracted from ubiquitous prokaryotic protein datasets. It also outperforms an alternative method based on a state-of-the-art protein language model trained on single sequences. Paired alignments of interacting protein sequences are a crucial ingredient of supervised deep learning methods to predict the three-dimensional structure of protein complexes. Starting from sequences paired by DiffPALM substantially improves the structure prediction of some eukaryotic protein complexes by AlphaFold-Multimer. It also achieves performance competitive with that of orthology-based pairing.
Affiliation(s)
- Umberto Lupo
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne CH-1015, Switzerland
- Damiano Sgarbossa
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne CH-1015, Switzerland
- Anne-Florence Bitbol
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne CH-1015, Switzerland
5
Jisna VA, Ajay AP, Jayaraj PB. Using Attention-UNet Models to Predict Protein Contact Maps. J Comput Biol 2024; 31:691-702. [PMID: 38979621 DOI: 10.1089/cmb.2023.0102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/10/2024] Open
Abstract
Proteins are essential to life, and understanding their intrinsic roles requires determining their structure. The field of proteomics has opened up new opportunities by applying deep learning algorithms to large databases of solved protein structures. With the availability of large data sets and advanced machine learning methods, the prediction of protein residue interactions has greatly improved. Protein contact maps provide empirical evidence of the interacting residue pairs within a protein sequence. Template-free protein structure prediction systems rely heavily on this information. This article proposes UNet-CON, an attention-integrated UNet architecture, trained to predict residue-residue contacts in protein sequences. With the predicted contacts being more accurate than state-of-the-art methods on the PDB25 test set, the model paves the way for the development of more powerful deep learning algorithms for predicting protein residue interactions.
Affiliation(s)
- V A Jisna
- Department of Computer Science and Engineering, Indian Institute of Information Technology Design and Manufacturing, Kurnool, India
- P B Jayaraj
- Department of Computer Science and Engineering, NIT Calicut, Calicut, India
6
Cocco S, Posani L, Monasson R. Functional effects of mutations in proteins can be predicted and interpreted by guided selection of sequence covariation information. Proc Natl Acad Sci U S A 2024; 121:e2312335121. [PMID: 38889151 PMCID: PMC11214004 DOI: 10.1073/pnas.2312335121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2023] [Accepted: 04/21/2024] [Indexed: 06/20/2024] Open
Abstract
Predicting the effects of one or more mutations on the in vivo or in vitro properties of a wild-type protein is a major computational challenge, due to the presence of epistasis, that is, of interactions between amino acids in the sequence. We introduce a computationally efficient procedure to build minimal epistatic models to predict mutational effects by combining evolutionary (homologous sequence) data with a small amount of mutational-scan data. Mutagenesis measurements guide the selection of links in a sparse graphical model, while the parameters on the nodes and the edges are inferred from sequence data. We show, on 10 mutational scans, that our pipeline achieves performance comparable to state-of-the-art deep networks trained on much more data, while requiring far fewer parameters and hence being more interpretable. In particular, the identified interactions adapt to the wild-type protein and to the fitness or biochemical property experimentally measured, mostly focus on key functional sites, and are not necessarily related to structural contacts. Therefore, our method is able to extract information relevant for one mutational experiment from homologous sequence data reflecting the multitude of structural and functional constraints acting on proteins throughout evolution.
Affiliation(s)
- Simona Cocco
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
- Lorenzo Posani
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
- Rémi Monasson
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
7
Posfai A, Zhou J, McCandlish DM, Kinney JB. Gauge fixing for sequence-function relationships. bioRxiv 2024:2024.05.12.593772. [PMID: 38798671 PMCID: PMC11118547 DOI: 10.1101/2024.05.12.593772] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/29/2024]
Abstract
Quantitative models of sequence-function relationships are ubiquitous in computational biology, e.g., for modeling the DNA binding of transcription factors or the fitness landscapes of proteins. Interpreting these models, however, is complicated by the fact that the values of model parameters can often be changed without affecting model predictions. Before the values of model parameters can be meaningfully interpreted, one must remove these degrees of freedom (called "gauge freedoms" in physics) by imposing additional constraints (a process called "fixing the gauge"). However, strategies for fixing the gauge of sequence-function relationships have received little attention. Here we derive an analytically tractable family of gauges for a large class of sequence-function relationships. These gauges are derived in the context of models with all-order interactions, but an important subset of these gauges can be applied to diverse types of models, including additive models, pairwise-interaction models, and models with higher-order interactions. Many commonly used gauges are special cases of gauges within this family. We demonstrate the utility of this family of gauges by showing how different choices of gauge can be used both to explore complex activity landscapes and to reveal simplified models that are approximately correct within localized regions of sequence space. The results provide practical gauge-fixing strategies and demonstrate the utility of gauge-fixing for model exploration and interpretation.
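The idea of a gauge freedom in the abstract above can be illustrated on the simplest case it mentions, an additive model. The sketch below is only an illustration under stated assumptions: the model size, the random parameters, and the helper names `predict` and `fix_zero_sum_gauge` are hypothetical, and the zero-sum gauge is used as one familiar choice, not as the specific family of gauges derived in the paper.

```python
import itertools
import numpy as np

def predict(theta0, theta, seq):
    """Additive model: a constant term plus one term per (position, character)."""
    return theta0 + sum(theta[pos, char] for pos, char in enumerate(seq))

def fix_zero_sum_gauge(theta0, theta):
    """Shift each position's parameters to zero mean and absorb the shift into theta0."""
    means = theta.mean(axis=1, keepdims=True)
    return theta0 + means.sum(), theta - means

rng = np.random.default_rng(0)
L, A = 3, 4  # hypothetical sequence length and alphabet size
theta0, theta = 1.5, rng.normal(size=(L, A))

g_theta0, g_theta = fix_zero_sum_gauge(theta0, theta)

# Compare predictions on every possible sequence before and after gauge fixing.
all_seqs = list(itertools.product(range(A), repeat=L))
max_diff = max(
    abs(predict(theta0, theta, s) - predict(g_theta0, g_theta, s)) for s in all_seqs
)
```

The parameter values change, yet `max_diff` stays at floating-point noise level, which is why parameters can only be compared or interpreted after a gauge has been chosen.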
Affiliation(s)
- Anna Posfai
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
- Juannan Zhou
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
- Department of Biology, University of Florida, Gainesville, FL, 32611
- David M McCandlish
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
- Justin B Kinney
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
8
Posfai A, McCandlish DM, Kinney JB. Symmetry, gauge freedoms, and the interpretability of sequence-function relationships. bioRxiv 2024:2024.05.12.593774. [PMID: 38798625 PMCID: PMC11118426 DOI: 10.1101/2024.05.12.593774] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2024]
Abstract
Quantitative models that describe how biological sequences encode functional activities are ubiquitous in modern biology. One important aspect of these models is that they commonly exhibit gauge freedoms, i.e., directions in parameter space that do not affect model predictions. In physics, gauge freedoms arise when physical theories are formulated in ways that respect fundamental symmetries. However, the connections that gauge freedoms in models of sequence-function relationships have to the symmetries of sequence space have yet to be systematically studied. Here we study the gauge freedoms of models that respect a specific symmetry of sequence space: the group of position-specific character permutations. We find that gauge freedoms arise when model parameters transform under redundant irreducible matrix representations of this group. Based on this finding, we describe an "embedding distillation" procedure that enables analytic calculation of the number of independent gauge freedoms, as well as efficient computation of a sparse basis for the space of gauge freedoms. We also study how parameter transformation behavior affects parameter interpretability. We find that in many (and possibly all) nontrivial models, the ability to interpret individual model parameters as quantifying intrinsic allelic effects requires that gauge freedoms be present. This finding establishes an incompatibility between two distinct notions of parameter interpretability. Our work thus advances the understanding of symmetries, gauge freedoms, and parameter interpretability in sequence-function relationships.
Affiliation(s)
- Anna Posfai
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
- David M. McCandlish
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
- Justin B. Kinney
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
9
Nechushtai R, Rowland L, Karmi O, Marjault HB, Nguyen TT, Mittal S, Ahmed RS, Grant D, Manrique-Acevedo C, Morcos F, Onuchic JN, Mittler R. CISD3/MiNT is required for complex I function, mitochondrial integrity, and skeletal muscle maintenance. Proc Natl Acad Sci U S A 2024; 121:e2405123121. [PMID: 38781208 PMCID: PMC11145280 DOI: 10.1073/pnas.2405123121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/12/2024] [Accepted: 04/23/2024] [Indexed: 05/25/2024] Open
Abstract
Mitochondria play a central role in muscle metabolism and function. A unique family of iron-sulfur proteins, termed CDGSH Iron Sulfur Domain-containing (CISD/NEET) proteins, support mitochondrial function in skeletal muscles. The abundance of these proteins declines during aging, leading to muscle degeneration. Although the function of the outer mitochondrial CISD/NEET proteins, CISD1/mitoNEET and CISD2/NAF-1, has been defined in skeletal muscle cells, the role of the inner mitochondrial CISD protein, CISD3/MiNT, is currently unknown. Here, we show that CISD3 deficiency in mice results in muscle atrophy that shares proteomic features with Duchenne muscular dystrophy. We further reveal that CISD3 deficiency impairs the function and structure of skeletal muscles, as well as their mitochondria, and that CISD3 interacts with, and donates its [2Fe-2S] clusters to, complex I respiratory chain subunit NADH Ubiquinone Oxidoreductase Core Subunit V2 (NDUFV2). Using coevolutionary and structural computational tools, we model a CISD3-NDUFV2 complex with proximal coevolving residue interactions conducive to [2Fe-2S] cluster transfer reactions, placing the clusters of the two proteins 10 to 16 Å apart. Taken together, our findings reveal that CISD3/MiNT is important for supporting the biogenesis and function of complex I, essential for muscle maintenance and function. Interventions that target CISD3 could therefore impact different muscle degeneration syndromes, aging, and related conditions.
Affiliation(s)
- Rachel Nechushtai
- Plant & Environmental Sciences, The Alexander Silberman Institute of Life Science and The Wolfson Centre for Applied Structural Biology, Faculty of Science and Mathematics, The Edmond J. Safra Campus at Givat Ram, The Hebrew University of Jerusalem, Jerusalem 91904, Israel
- Linda Rowland
- Department of Surgery, University of Missouri School of Medicine, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65201
- Ola Karmi
- Plant & Environmental Sciences, The Alexander Silberman Institute of Life Science and The Wolfson Centre for Applied Structural Biology, Faculty of Science and Mathematics, The Edmond J. Safra Campus at Givat Ram, The Hebrew University of Jerusalem, Jerusalem 91904, Israel
- Henri-Baptiste Marjault
- Department of Surgery, University of Missouri School of Medicine, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65201
- Thi Thao Nguyen
- Gehrke Proteomics Center, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65211
- Shubham Mittal
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Raheel S. Ahmed
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- DeAna Grant
- Electron Microscopy Core Facility, University of Missouri, NextGen Precision Health Institute, Columbia, MO 65211
- Camila Manrique-Acevedo
- Division of Endocrinology and Metabolism, Department of Medicine, University of Missouri, Columbia, MO 65201
- NextGen Precision Health, University of Missouri, Columbia, MO 65201
- Harry S. Truman Memorial Veterans’ Hospital, Columbia, MO 65201
- Faruck Morcos
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Department of Bioengineering, University of Texas at Dallas, Richardson, TX 75080
- Department of Physics, University of Texas at Dallas, Richardson, TX 75080
- Center for Systems Biology, University of Texas at Dallas, Richardson, TX 75080
- José N. Onuchic
- Center for Theoretical Biological Physics, Rice University, Houston, TX 77005
- Department of Physics and Astronomy, Rice University, Houston, TX 77005
- Department of Chemistry, Rice University, Houston, TX 77005
- Department of Biosciences, Rice University, Houston, TX 77005
- Ron Mittler
- Department of Surgery, University of Missouri School of Medicine, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO 65201
10
Zhao H, Petrey D, Murray D, Honig B. ZEPPI: Proteome-scale sequence-based evaluation of protein-protein interaction models. Proc Natl Acad Sci U S A 2024; 121:e2400260121. [PMID: 38743624 PMCID: PMC11127014 DOI: 10.1073/pnas.2400260121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Accepted: 04/18/2024] [Indexed: 05/16/2024] Open
Abstract
We introduce ZEPPI (Z-score Evaluation of Protein-Protein Interfaces), a framework to evaluate structural models of a complex based on sequence coevolution and conservation involving residues in protein-protein interfaces. The ZEPPI score is calculated by comparing metrics for an interface to those obtained from randomly chosen residues. Since contacting residues are defined by the structural model, this obviates the need to account for indirect interactions. Further, although ZEPPI relies on species-paired multiple sequence alignments, its focus on interfacial residues allows it to leverage quite shallow alignments. ZEPPI can be implemented on a proteome-wide scale and is applied here to millions of structural models of dimeric complexes in the Escherichia coli and human interactomes found in the PrePPI database. PrePPI's scoring function is based primarily on the evaluation of protein-protein interfaces, and ZEPPI adds a new feature to this analysis through the incorporation of evolutionary information. ZEPPI performance is evaluated through applications to experimentally determined complexes and to decoys from the CASP-CAPRI experiment. As we discuss, the standard CAPRI scores used to evaluate docking models are based on model quality and not on the ability to give yes/no answers as to whether two proteins interact. ZEPPI is able to detect weak signals from PPI models that the CAPRI scores define as incorrect and, similarly, to identify potential PPIs defined as low confidence by the current PrePPI scoring function. A number of examples that illustrate how the combination of PrePPI and ZEPPI can yield functional hypotheses are provided.
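The scoring idea stated in the abstract above, comparing an interface metric to the same metric on randomly chosen residues, can be sketched as a plain Z-score. Everything specific here is an assumption made for illustration: the function name `interface_z_score`, the use of a mean over a precomputed inter-protein score matrix, the sampling of random residue pairs, and the toy data. It is not ZEPPI's actual metric or sampling scheme.

```python
import numpy as np

def interface_z_score(scores, interface_pairs, n_random=1000, seed=0):
    """Z-score of the mean score over interface pairs against random pair sets.

    scores[i, j] is a hypothetical coevolution (or conservation) score between
    residue i of protein A and residue j of protein B.
    """
    rng = np.random.default_rng(seed)
    rows, cols = zip(*interface_pairs)
    observed = scores[list(rows), list(cols)].mean()

    # Background: means over random sets of residue pairs of the same size.
    k = len(interface_pairs)
    flat = scores.ravel()
    random_means = np.array(
        [rng.choice(flat, size=k, replace=False).mean() for _ in range(n_random)]
    )
    return (observed - random_means.mean()) / random_means.std()

# Toy data: weak background scores with a strong signal at five interface pairs.
rng = np.random.default_rng(1)
scores = rng.random((20, 25)) * 0.1
interface = [(2, 3), (2, 4), (5, 7), (6, 7), (9, 11)]
for i, j in interface:
    scores[i, j] = 1.0

z = interface_z_score(scores, interface)
```

Because the contacting residues are taken from the structural model, a sketch like this only has to score a given set of pairs; it never has to separate direct from indirect couplings.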
Affiliation(s)
- Haiqing Zhao
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
- Donald Petrey
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
- Diana Murray
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
- Barry Honig
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
- Department of Biochemistry and Molecular Biophysics, Columbia University Irving Medical Center, New York, NY 10032
- Department of Medicine, Columbia University, New York, NY 10032
- Zuckerman Institute, Columbia University, New York, NY 10027
11
Jaafari H, Bueno C, Schafer NP, Martin J, Morcos F, Wolynes PG. The physical and evolutionary energy landscapes of devolved protein sequences corresponding to pseudogenes. Proc Natl Acad Sci U S A 2024; 121:e2322428121. [PMID: 38739795 PMCID: PMC11127006 DOI: 10.1073/pnas.2322428121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2023] [Accepted: 03/26/2024] [Indexed: 05/16/2024] Open
Abstract
Protein evolution is guided by structural, functional, and dynamical constraints ensuring organismal viability. Pseudogenes are genomic sequences identified in many eukaryotes that lack translational activity due to sequence degradation and thus over time have undergone "devolution." Previously pseudogenized genes sometimes regain their protein-coding function, suggesting they may still encode robust folding energy landscapes despite multiple mutations. We study both the physical folding landscapes of protein sequences corresponding to human pseudogenes using the Associative Memory, Water Mediated, Structure and Energy Model, and the evolutionary energy landscapes obtained using direct coupling analysis (DCA) on their parent protein families. We found that generally mutations that have occurred in pseudogene sequences have disrupted their native global network of stabilizing residue interactions, making it harder for them to fold if they were translated. In some cases, however, energetic frustration has apparently decreased when the functional constraints were removed. We analyzed this unexpected situation for Cyclophilin A, Profilin-1, and Small Ubiquitin-like Modifier 2 Protein. Our analysis reveals that when such mutations in the pseudogene ultimately stabilize folding, at the same time, they likely alter the pseudogenes' former biological activity, as estimated by DCA. We localize most of these stabilizing mutations generally to normally frustrated regions required for binding to other partners.
Affiliation(s)
- Hana Jaafari
- Center for Theoretical Biophysics, Rice University, Houston, TX 77005
- Applied Physics Graduate Program, Smalley-Curl Institute, Rice University, Houston, TX 77005
- Department of Chemistry, Rice University, Houston, TX 77005
- Carlos Bueno
- Center for Theoretical Biophysics, Rice University, Houston, TX 77005
- Jonathan Martin
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Faruck Morcos
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Department of Bioengineering, University of Texas at Dallas, Richardson, TX 75080
- Center for Systems Biology, University of Texas at Dallas, Richardson, TX 75080
- Peter G. Wolynes
- Center for Theoretical Biophysics, Rice University, Houston, TX 77005
- Department of Chemistry, Rice University, Houston, TX 77005
- Department of Physics and Astronomy, Rice University, Houston, TX 77005
- Department of Biochemistry and Cell Biology, Rice University, Houston, TX 77005
12
|
Doga H, Raubenolt B, Cumbo F, Joshi J, DiFilippo FP, Qin J, Blankenberg D, Shehab O. A Perspective on Protein Structure Prediction Using Quantum Computers. J Chem Theory Comput 2024; 20:3359-3378. [PMID: 38703105 PMCID: PMC11099973 DOI: 10.1021/acs.jctc.4c00067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/22/2024] [Revised: 04/19/2024] [Accepted: 04/22/2024] [Indexed: 05/06/2024]
Abstract
Despite recent advances by deep learning methods such as AlphaFold2, in silico protein structure prediction remains a challenging problem in biomedical research. With the rapid evolution of quantum computing, it is natural to ask whether quantum computers can offer meaningful benefits for approaching this problem. Yet identifying specific problem instances amenable to quantum advantage and estimating the quantum resources required are equally challenging tasks. Here, we share our perspective on how to create a framework for systematically selecting protein structure prediction problems that are amenable to quantum advantage, and we estimate quantum resources for such problems on a utility-scale quantum computer. As a proof of concept, we validate our problem selection framework by accurately predicting the structure of a catalytic loop of the Zika Virus NS3 Helicase on quantum hardware.
Affiliation(s)
- Hakan Doga
  - IBM Quantum, Almaden Research Center, San Jose, California 95120, United States
- Bryan Raubenolt
  - Center for Computational Life Sciences, Lerner Research Institute, The Cleveland Clinic, Cleveland, Ohio 44106, United States
- Fabio Cumbo
  - Center for Computational Life Sciences, Lerner Research Institute, The Cleveland Clinic, Cleveland, Ohio 44106, United States
- Jayadev Joshi
  - Center for Computational Life Sciences, Lerner Research Institute, The Cleveland Clinic, Cleveland, Ohio 44106, United States
- Frank P. DiFilippo
  - Center for Computational Life Sciences, Lerner Research Institute, The Cleveland Clinic, Cleveland, Ohio 44106, United States
- Jun Qin
  - Center for Computational Life Sciences, Lerner Research Institute, The Cleveland Clinic, Cleveland, Ohio 44106, United States
- Daniel Blankenberg
  - Center for Computational Life Sciences, Lerner Research Institute, The Cleveland Clinic, Cleveland, Ohio 44106, United States
- Omar Shehab
  - IBM Quantum, IBM Thomas J Watson Research Center, Yorktown Heights, New York 10598, United States
13
Hacisuleyman A, Erman B. Synergy and anti-cooperativity in allostery: Molecular dynamics study of WT and oncogenic KRAS-RGL1. Proteins 2024; 92:665-678. [PMID: 38153169 DOI: 10.1002/prot.26657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2023] [Revised: 11/03/2023] [Accepted: 12/15/2023] [Indexed: 12/29/2023]
Abstract
This study investigates the effects of an oncogenic mutation (G12V) on the stability and interactions within the KRAS-RGL1 protein complex. The KRAS-RGL1 complex is of particular interest due to its relevance to KRAS-associated cancers and the potential for developing targeted drugs against the KRAS system. The stability of the complex and the allosteric effects of specific residues are examined to understand their roles as modulators of complex stability and function. Using molecular dynamics simulations, we calculate the mutual information, MI, between two neighboring residues at the interface of the KRAS-RGL1 complex, and employ the concept of interaction information, II, to measure the contribution of a third residue to the interaction between interface residue pairs. Negative II indicates synergy, where the presence of the third residue strengthens the interaction, while positive II suggests anti-cooperativity. Our findings reveal that MI serves as a dominant factor in determining the results, with the G12V mutation increasing the MI between interface residues, indicating enhanced correlations due to the formation of a more compact structure in the complex. Interestingly, although II plays a role in understanding three-body interactions and the impact of distant residues, it is not significant enough to outweigh the influence of MI in determining the overall stability of the complex. Nevertheless, II may be a relevant factor to consider in future drug design efforts. This study provides valuable insights into the mechanisms of complex stability and function, highlighting the significance of three-body interactions and the impact of distant residues on the binding stability of the complex.
Additionally, our findings demonstrate that constraining the fluctuations of a third residue consistently increases the stability of the G12V variant, making it challenging to weaken complex formation of the mutated species through allosteric manipulation. The novel perspective offered by this approach on protein dynamics, function, and allostery has potential implications for understanding and targeting other protein complexes involved in vital cellular processes. The results contribute to our understanding of the effects of oncogenic mutations on protein-protein interactions and provide a foundation for future therapeutic interventions in the context of KRAS-associated cancers and beyond.
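To illustrate the two quantities named in this abstract, the following minimal sketch computes mutual information and interaction information for three discrete variables. It assumes the sign convention II(X;Y;Z) = I(X;Y) - I(X;Y|Z), which is consistent with the abstract's statement that negative II indicates synergy. The function names and the two toy distributions are hypothetical; the paper itself estimates these quantities from molecular dynamics trajectories, which is not reproduced here.

```python
from collections import defaultdict
from math import log2


def entropy(joint, axes):
    """Shannon entropy (bits) of the marginal over the given variable indices."""
    marginal = defaultdict(float)
    for state, p in joint.items():
        marginal[tuple(state[a] for a in axes)] += p
    return -sum(p * log2(p) for p in marginal.values() if p > 0)


def mutual_information(joint, x=0, y=1):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(joint, (x,)) + entropy(joint, (y,)) - entropy(joint, (x, y))


def conditional_mutual_information(joint, x=0, y=1, z=2):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (entropy(joint, (x, z)) + entropy(joint, (y, z))
            - entropy(joint, (x, y, z)) - entropy(joint, (z,)))


def interaction_information(joint):
    """Assumed convention: II = I(X;Y) - I(X;Y|Z); negative means synergy."""
    return mutual_information(joint) - conditional_mutual_information(joint)


# Toy joint distributions over (X, Y, Z); these are illustrative, not MD data.
# XOR: X and Y are independent, but knowing Z = X xor Y couples them.
xor = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}
# Copies: all three variables are identical, so Z is fully redundant.
copies = {(a, a, a): 0.5 for a in (0, 1)}

ii_xor = interaction_information(xor)
ii_copies = interaction_information(copies)
```

Under the assumed convention, the XOR case yields a negative value (the third variable creates a dependence between the first two), while the redundant case yields a positive value.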
Affiliation(s)
- Aysima Hacisuleyman
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
- Burak Erman
  - Department of Chemical and Biological Engineering, Koc University, Istanbul, Turkey
14
Humphreys IR, Zhang J, Baek M, Wang Y, Krishnakumar A, Pei J, Anishchenko I, Tower CA, Jackson BA, Warrier T, Hung DT, Peterson SB, Mougous JD, Cong Q, Baker D. Essential and virulence-related protein interactions of pathogens revealed through deep learning. bioRxiv 2024:2024.04.12.589144. [PMID: 38645026 PMCID: PMC11030334 DOI: 10.1101/2024.04.12.589144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/23/2024]
Abstract
Identifying bacterial protein-protein interactions and predicting the structures of the resulting complexes could aid in understanding pathogenicity mechanisms and developing treatments for infectious diseases. Here, we developed a deep learning-based pipeline that leverages residue-residue coevolution and protein structure prediction to systematically identify and structurally characterize protein-protein interactions at the proteome-wide scale. Using this pipeline, we searched through 78 million pairs of proteins across 19 human bacterial pathogens and identified 1923 confidently predicted complexes involving essential genes and 256 involving virulence factors. Many of these complexes were not previously known; we experimentally tested 12 such predictions, and half of them were validated. The predicted interactions span core metabolic and virulence pathways ranging from post-transcriptional modification to acid neutralization to outer membrane machinery and should contribute to our understanding of the biology of these important pathogens and the design of drugs to combat them.
15
Grassmann G, Miotto M, Desantis F, Di Rienzo L, Tartaglia GG, Pastore A, Ruocco G, Monti M, Milanetti E. Computational Approaches to Predict Protein-Protein Interactions in Crowded Cellular Environments. Chem Rev 2024; 124:3932-3977. [PMID: 38535831 PMCID: PMC11009965 DOI: 10.1021/acs.chemrev.3c00550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2023] [Revised: 02/20/2024] [Accepted: 02/21/2024] [Indexed: 04/11/2024]
Abstract
Investigating protein-protein interactions is crucial for understanding cellular biological processes because proteins often function within molecular complexes rather than in isolation. While experimental and computational methods have provided valuable insights into these interactions, they often overlook a critical factor: the crowded cellular environment. This environment significantly impacts protein behavior, including structural stability, diffusion, and ultimately the nature of binding. In this review, we discuss theoretical and computational approaches that allow the modeling of biological systems to guide and complement experiments and can thus significantly advance the investigation, and possibly the predictions, of protein-protein interactions in the crowded environment of cell cytoplasm. We explore topics such as statistical mechanics for lattice simulations, hydrodynamic interactions, diffusion processes in high-viscosity environments, and several methods based on molecular dynamics simulations. By synergistically leveraging methods from biophysics and computational biology, we review the state of the art of computational methods to study the impact of molecular crowding on protein-protein interactions and discuss its potential revolutionizing effects on the characterization of the human interactome.
Affiliation(s)
- Greta Grassmann
  - Department of Biochemical Sciences “Alessandro Rossi Fanelli”, Sapienza University of Rome, Rome 00185, Italy
  - Center for Life Nano & Neuro Science, Istituto Italiano di Tecnologia, Rome 00161, Italy
- Mattia Miotto
  - Center for Life Nano & Neuro Science, Istituto Italiano di Tecnologia, Rome 00161, Italy
- Fausta Desantis
  - Center for Life Nano & Neuro Science, Istituto Italiano di Tecnologia, Rome 00161, Italy
  - The Open University Affiliated Research Centre at Istituto Italiano di Tecnologia, Genoa 16163, Italy
- Lorenzo Di Rienzo
  - Center for Life Nano & Neuro Science, Istituto Italiano di Tecnologia, Rome 00161, Italy
- Gian Gaetano Tartaglia
  - Center for Life Nano & Neuro Science, Istituto Italiano di Tecnologia, Rome 00161, Italy
  - Department of Neuroscience and Brain Technologies, Istituto Italiano di Tecnologia, Genoa 16163, Italy
  - Center for Human Technologies, Genoa 16152, Italy
- Annalisa Pastore
  - Experiment Division, European Synchrotron Radiation Facility, Grenoble 38043, France
- Giancarlo Ruocco
  - Center for Life Nano & Neuro Science, Istituto Italiano di Tecnologia, Rome 00161, Italy
  - Department of Physics, Sapienza University, Rome 00185, Italy
- Michele Monti
  - RNA System Biology Lab, Department of Neuroscience and Brain Technologies, Istituto Italiano di Tecnologia, Genoa 16163, Italy
- Edoardo Milanetti
  - Center for Life Nano & Neuro Science, Istituto Italiano di Tecnologia, Rome 00161, Italy
  - Department of Physics, Sapienza University, Rome 00185, Italy
16
Biswas A, Choudhuri I, Arnold E, Lyumkis D, Haldane A, Levy RM. Kinetic coevolutionary models predict the temporal emergence of HIV-1 resistance mutations under drug selection pressure. Proc Natl Acad Sci U S A 2024; 121:e2316662121. [PMID: 38557187 PMCID: PMC11009627 DOI: 10.1073/pnas.2316662121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/25/2023] [Accepted: 02/23/2024] [Indexed: 04/04/2024] Open
Abstract
Drug resistance in HIV type 1 (HIV-1) is a pervasive problem that affects the lives of millions of people worldwide. Although records of drug-resistant mutations (DRMs) have been extensively tabulated within public repositories, our understanding of the evolutionary kinetics of DRMs and how they evolve together remains limited. Epistasis, the interaction between a DRM and other residues in HIV-1 protein sequences, is key to the temporal evolution of drug resistance. We use a Potts sequence-covariation statistical-energy model of HIV-1 protein fitness under drug selection pressure, which captures epistatic interactions between all positions, combined with kinetic Monte-Carlo simulations of sequence evolutionary trajectories, to explore the acquisition of DRMs as they arise in an ensemble of drug-naive patient protein sequences. We follow the time course of 52 DRMs in the enzymes protease, RT, and integrase, the primary targets of antiretroviral therapy. The rates at which DRMs emerge are highly correlated with their observed acquisition rates reported in the literature when drug pressure is applied. This result highlights the central role of epistasis in determining the kinetics governing DRM emergence. Whereas rapidly acquired DRMs begin to accumulate as soon as drug pressure is applied, slowly acquired DRMs are contingent on accessory mutations that appear only after prolonged drug pressure. We provide a foundation for using computational methods to determine the temporal evolution of drug resistance using Potts statistical potentials, which can be used to gain mechanistic insights into drug resistance pathways in HIV-1 and other infectious agents.
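The combination of a Potts statistical energy with stochastic sequence trajectories can be sketched in a few lines. The example below is only a toy: it assumes the sign convention E(S) = -sum_i h_i(S_i) - sum_{i<j} J_ij(S_i, S_j) with lower energy being more favorable, and it uses a plain Metropolis single-site update as a stand-in for the kinetic Monte Carlo scheme, whose details the abstract does not give. All names, parameters, and values are hypothetical and are not the authors' model.

```python
import math
import random


def potts_energy(seq, h, J):
    """Potts statistical energy under the assumed sign convention.

    seq: tuple of residue states (integers in range(q)), one per position
    h:   h[i][a] is the field for state a at position i
    J:   J[(i, j)][a][b] is the coupling for states a, b at positions i < j
    """
    energy = -sum(h[i][a] for i, a in enumerate(seq))
    for (i, j), coupling in J.items():
        energy -= coupling[seq[i]][seq[j]]
    return energy


def metropolis_step(seq, h, J, q, rng, beta=1.0):
    """Propose one single-site substitution and accept it with Metropolis odds."""
    i = rng.randrange(len(seq))
    a = rng.randrange(q)
    trial = seq[:i] + (a,) + seq[i + 1:]
    delta = potts_energy(trial, h, J) - potts_energy(seq, h, J)
    if delta <= 0 or rng.random() < math.exp(-beta * delta):
        return trial
    return seq


# Toy model (hypothetical values): two positions, two states each.
# State 1 at both positions is jointly favorable, mimicking a primary
# mutation that is only tolerated alongside an accessory mutation.
h = [[0.0, 0.0], [0.0, 0.0]]
J = {(0, 1): [[0.0, 0.0], [0.0, 2.0]]}

rng = random.Random(0)
seq = (0, 0)
n_steps = 5000
visits = 0
for _ in range(n_steps):
    seq = metropolis_step(seq, h, J, q=2, rng=rng)
    visits += seq == (1, 1)
fraction_coupled = visits / n_steps
```

Because the favorable term sits in the coupling rather than in the fields, the double-substituted state is stabilized only when both sites change, which is the kind of epistatic contingency the abstract describes.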
Affiliation(s)
- Avik Biswas
  - Center for Biophysics and Computational Biology, College of Science and Technology, Temple University, Philadelphia, PA 19122
  - Laboratory of Genetics, The Salk Institute for Biological Studies, La Jolla, CA 92037
  - Department of Physics, University of California San Diego, La Jolla, CA 92093
- Indrani Choudhuri
  - Center for Biophysics and Computational Biology, College of Science and Technology, Temple University, Philadelphia, PA 19122
  - Department of Chemistry, Temple University, Philadelphia, PA 19122
- Eddy Arnold
  - Department of Chemistry and Chemical Biology, Center for Advanced Biotechnology and Medicine, Rutgers University, Piscataway, NJ 08854
- Dmitry Lyumkis
  - Laboratory of Genetics, The Salk Institute for Biological Studies, La Jolla, CA 92037
  - Graduate School of Biological Sciences, Department of Molecular Biology, University of California San Diego, La Jolla, CA 92093
- Allan Haldane
  - Center for Biophysics and Computational Biology, College of Science and Technology, Temple University, Philadelphia, PA 19122
  - Department of Physics, Temple University, Philadelphia, PA 19122
- Ronald M. Levy
  - Center for Biophysics and Computational Biology, College of Science and Technology, Temple University, Philadelphia, PA 19122
  - Department of Chemistry, Temple University, Philadelphia, PA 19122
17
Si Y, Yan C. Protein language model-embedded geometric graphs power inter-protein contact prediction. eLife 2024; 12:RP92184. [PMID: 38564241 PMCID: PMC10987090 DOI: 10.7554/elife.92184] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024] Open
Abstract
Accurate prediction of contacting residue pairs between interacting proteins is very useful for structural characterization of protein-protein interactions. Although inter-protein contact prediction has improved significantly in recent years, there is still considerable room for improving prediction accuracy. Here we present a new deep learning method, referred to as PLMGraph-Inter, for inter-protein contact prediction. Specifically, we employ rotationally and translationally invariant geometric graphs obtained from structures of interacting proteins to integrate multiple protein language models, which are successively transformed by graph encoders formed by geometric vector perceptrons and residual networks formed by dimensional hybrid residual blocks to predict inter-protein contacts. Extensive evaluation on multiple test sets illustrates that PLMGraph-Inter outperforms five top inter-protein contact prediction methods, including DeepHomo, GLINTER, CDPred, DeepHomo2, and DRN-1D2D_Inter, by large margins. In addition, we show that the prediction of PLMGraph-Inter can complement the result of AlphaFold-Multimer. Finally, we show that leveraging the contacts predicted by PLMGraph-Inter as constraints for protein-protein docking can dramatically improve its performance for protein complex structure prediction.
Affiliation(s)
- Yunda Si
  - School of Physics, Huazhong University of Science and Technology, Wuhan, China
- Chengfei Yan
  - School of Physics, Huazhong University of Science and Technology, Wuhan, China
18
Sampaio Filho CIN, de Arcangelis L, Herrmann HJ, Plenz D, Kells P, Ribeiro TL, Andrade JS. Ising-like model replicating time-averaged spiking behaviour of in vitro neuronal networks. Sci Rep 2024; 14:7002. [PMID: 38523136 DOI: 10.1038/s41598-024-55922-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2023] [Accepted: 02/28/2024] [Indexed: 03/26/2024] Open
Abstract
We analyze time-averaged experimental data from in vitro activities of neuronal networks. Using a pairwise maximum-entropy method, we infer, via an inverse binary Ising-like model, the local fields and interaction couplings that best reproduce the average activity of each neuron as well as the statistical correlations between the activities of each pair of neurons in the system. The specific information about the type of neurons is mainly stored in the local fields, while a symmetric distribution of interaction constants seems generic. Our findings demonstrate that, despite not being directly incorporated into the inference approach, the experimentally observed correlations among groups of three neurons are accurately captured by the derived Ising-like model. Within the context of the thermodynamic analogy inherent to the Ising-like models developed in this study, our findings additionally indicate that these models demonstrate characteristics of second-order phase transitions between ferromagnetic and paramagnetic states at temperatures above, but close to, unity. Considering that the operating temperature utilized in the maximum-entropy method is $T_o = 1$, this observation further expands the thermodynamic conceptual parallelism postulated in this work for the manifestation of criticality in neuronal network behavior.
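The inverse problem described in this abstract (finding local fields and couplings that reproduce measured means and pairwise correlations) can be illustrated with a small, exactly solvable sketch. It assumes spins in {-1, +1}, an energy E(s) = -sum_i h_i s_i - sum_{i<j} J_ij s_i s_j, exact enumeration of all states (feasible only for a handful of units), and plain gradient ascent on the likelihood. The function names, learning rate, and "true" parameters are hypothetical, and the authors' actual inference procedure may differ.

```python
from itertools import product
from math import exp


def model_moments(h, J, T=1.0):
    """Exact means <s_i> and pair correlations <s_i s_j> by enumerating all states."""
    n = len(h)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    mean = [0.0] * n
    corr = {p: 0.0 for p in pairs}
    Z = 0.0
    for s in product((-1, 1), repeat=n):
        energy = -sum(h[i] * s[i] for i in range(n))
        energy -= sum(J[(i, j)] * s[i] * s[j] for (i, j) in pairs)
        w = exp(-energy / T)  # Boltzmann weight at temperature T
        Z += w
        for i in range(n):
            mean[i] += w * s[i]
        for (i, j) in pairs:
            corr[(i, j)] += w * s[i] * s[j]
    return [m / Z for m in mean], {p: c / Z for p, c in corr.items()}


def fit_ising(target_mean, target_corr, steps=2000, lr=0.1):
    """Adjust h and J until model moments match the targets (at T = 1).

    target_corr must contain an entry for every pair (i, j) with i < j.
    """
    n = len(target_mean)
    h = [0.0] * n
    J = {p: 0.0 for p in target_corr}
    for _ in range(steps):
        mean, corr = model_moments(h, J)
        for i in range(n):
            h[i] += lr * (target_mean[i] - mean[i])
        for p in J:
            J[p] += lr * (target_corr[p] - corr[p])
    return h, J


# Synthetic "data": moments generated from known (hypothetical) parameters.
true_h = [0.2, -0.1, 0.3]
true_J = {(0, 1): 0.5, (0, 2): -0.3, (1, 2): 0.1}
target_mean, target_corr = model_moments(true_h, true_J)

fit_h, fit_J = fit_ising(target_mean, target_corr)
fit_mean, fit_corr = model_moments(fit_h, fit_J)
```

For real recordings, the targets would be the measured average activities and pairwise correlations, and exact enumeration would have to be replaced by sampling or an approximation once the number of units grows beyond a few dozen.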
Affiliation(s)
- Lucilla de Arcangelis
  - Department of Mathematics and Physics, University of Campania "Luigi Vanvitelli", 81100, Caserta, Italy
- Hans J Herrmann
  - Departamento de Física, Universidade Federal do Ceará, Fortaleza, 60451-970, Brazil
  - PMMH, ESPCI, CNRS UMR 7636, 7 Quai St. Bernard, 75005, Paris, France
- Dietmar Plenz
  - Section on Critical Brain Dynamics, NIMH, Bethesda, MD, 20892, USA
- Patrick Kells
  - Section on Critical Brain Dynamics, NIMH, Bethesda, MD, 20892, USA
- José S Andrade
  - Departamento de Física, Universidade Federal do Ceará, Fortaleza, 60451-970, Brazil
19
Schug A. Residue coevolution and mutational landscape for OmpR and NarL: You can teach old dogs new tricks. Biophys J 2024; 123:653-654. [PMID: 38379283 PMCID: PMC10995386 DOI: 10.1016/j.bpj.2024.02.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/28/2024] [Revised: 02/19/2024] [Accepted: 02/19/2024] [Indexed: 02/22/2024] Open
Affiliation(s)
- Alexander Schug
  - Jülich Supercomputing Centre, Forschungszentrum Jülich, Jülich, Germany
  - Faculty of Biology, University of Duisburg-Essen, Essen, Germany
20
Judge A, Sankaran B, Hu L, Palaniappan M, Birgy A, Prasad BVV, Palzkill T. Network of epistatic interactions in an enzyme active site revealed by large-scale deep mutational scanning. Proc Natl Acad Sci U S A 2024; 121:e2313513121. [PMID: 38483989 PMCID: PMC10962969 DOI: 10.1073/pnas.2313513121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2023] [Accepted: 02/14/2024] [Indexed: 03/19/2024] Open
Abstract
Cooperative interactions between amino acids are critical for protein function. A genetic reflection of cooperativity is epistasis, which is when a change in the amino acid at one position changes the sequence requirements at another position. To assess epistasis within an enzyme active site, we utilized CTX-M β-lactamase as a model system. CTX-M hydrolyzes β-lactam antibiotics to provide antibiotic resistance, allowing a simple functional selection for rapid sorting of modified enzymes. We created all pairwise mutations across 17 active site positions in the β-lactamase enzyme and quantitated the function of variants against two β-lactam antibiotics using next-generation sequencing. Context-dependent sequence requirements were determined by comparing the antibiotic resistance function of double mutations across the CTX-M active site to their predicted function based on the constituent single mutations, revealing both positive epistasis (synergistic interactions) and negative epistasis (antagonistic interactions) between amino acid substitutions. The resulting trends demonstrate that positive epistasis is present throughout the active site, that epistasis between residues is mediated through substrate interactions, and that residues more tolerant to substitutions serve as generic compensators which are responsible for many cases of positive epistasis. Additionally, we show that a key catalytic residue (Glu166) is amenable to compensatory mutations, and we characterize one such double mutant (E166Y/N170G) that acts by an altered catalytic mechanism. These findings shed light on the unique biochemical factors that drive epistasis within an enzyme active site and will inform enzyme engineering efforts by bridging the gap between amino acid sequence and catalytic function.
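The comparison of a double mutant with the expectation from its constituent single mutants can be written as a short calculation. The sketch below assumes an additive null model on a log-scale fitness score, which is a common choice but is not stated in the abstract; the function names, the tolerance parameter, and the numerical values are hypothetical and do not come from the CTX-M data.

```python
def epistasis(double, single_a, single_b, wild_type=0.0):
    """Deviation of the double mutant from the additive expectation.

    All inputs are log-scale fitness scores. Under the assumed additive null
    model, the effects of the two single mutants (each relative to wild type)
    simply add up in the double mutant.
    """
    expected = wild_type + (single_a - wild_type) + (single_b - wild_type)
    return double - expected


def classify(eps, tol=0.0):
    """Label the sign of epistasis, treating |eps| <= tol as no epistasis."""
    if eps > tol:
        return "positive"
    if eps < -tol:
        return "negative"
    return "none"


# Hypothetical log-fitness values relative to wild type (wild type = 0).
# Two deleterious single mutants whose combination is less harmful than
# expected, as for a compensatory pair:
eps_compensatory = epistasis(double=-0.5, single_a=-1.0, single_b=-1.5)
# The same singles combined into a double mutant worse than expected:
eps_antagonistic = epistasis(double=-3.0, single_a=-1.0, single_b=-1.5)

labels = (classify(eps_compensatory), classify(eps_antagonistic))
```

In practice the tolerance would be set from the measurement noise of the screen, so that only deviations larger than the experimental error are called epistatic.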
Affiliation(s)
- Allison Judge
  - Verna and Marrs McLean Department of Biochemistry and Molecular Pharmacology, Baylor College of Medicine, Houston, TX 77030
- Banumathi Sankaran
  - Department of Molecular Biophysics and Integrated Bioimaging, Berkeley Center for Structural Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
- Liya Hu
  - Verna and Marrs McLean Department of Biochemistry and Molecular Pharmacology, Baylor College of Medicine, Houston, TX 77030
- Murugesan Palaniappan
  - Department of Pathology and Immunology, Center for Drug Discovery, Baylor College of Medicine, Houston, TX 77030
- André Birgy
  - Verna and Marrs McLean Department of Biochemistry and Molecular Pharmacology, Baylor College of Medicine, Houston, TX 77030
  - Infections, Antimicrobials, Modelling, Evolution, UMR 1137, French Institute for Medical Research (INSERM), Faculty of Health, Université Paris Cité, Paris 75006, France
- B. V. Venkataram Prasad
  - Verna and Marrs McLean Department of Biochemistry and Molecular Pharmacology, Baylor College of Medicine, Houston, TX 77030
- Timothy Palzkill
  - Verna and Marrs McLean Department of Biochemistry and Molecular Pharmacology, Baylor College of Medicine, Houston, TX 77030
21
Shibata M, Lin X, Onuchic JN, Yura K, Cheng RR. Residue coevolution and mutational landscape for OmpR and NarL response regulator subfamilies. Biophys J 2024; 123:681-692. [PMID: 38291753 PMCID: PMC10995415 DOI: 10.1016/j.bpj.2024.01.028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2023] [Revised: 12/31/2023] [Accepted: 01/24/2024] [Indexed: 02/01/2024] Open
Abstract
DNA-binding response regulators (DBRRs) are a broad class of proteins that operate in tandem with their partner kinase proteins to form two-component signal transduction systems in bacteria. Typical DBRRs are composed of two domains where the conserved N-terminal domain accepts transduced signals and the evolutionarily diverse C-terminal domain binds to DNA. These domains are assumed to be functionally independent, and hence recombination of the two domains should yield novel DBRRs of arbitrary input/output response, which can be used as biosensors. This idea has been proved to be successful in some cases; yet, the error rate is not trivial. Improvement of the success rate of this technique requires a deeper understanding of the linker-domain and inter-domain residue interactions, which have not yet been thoroughly examined. Here, we studied residue coevolution of DBRRs of the two main subfamilies (OmpR and NarL) using large collections of bacterial amino acid sequences to extensively investigate the evolutionary signatures of linker-domain and inter-domain residue interactions. Coevolutionary analysis uncovered evolutionarily selected linker-domain and inter-domain residue interactions of known experimental structures, as well as previously unknown inter-domain residue interactions. We examined the possibility of these inter-domain residue interactions as contacts that stabilize an inactive conformation of the DBRR where DNA binding is inhibited for both subfamilies. The newly gained insights on linker-domain/inter-domain residue interactions and shared inactivation mechanisms improve the understanding of the functional mechanism of DBRRs, providing clues to efficiently create functional DBRR-based biosensors. Additionally, we show the feasibility of applying coevolutionary landscape models to predict the functionality of domain-swapped DBRR proteins. 
The presented results demonstrate that, owing to a high negative predictive value, sequence information can be used to filter out bioengineered DBRR proteins that are predicted to be nonfunctional.
Affiliation(s)
- Mayu Shibata
  - Graduate School of Humanities and Sciences, Ochanomizu University, Bunkyo, Tokyo, Japan
  - Center for Theoretical Biological Physics, Rice University, Houston, Texas
- Xingcheng Lin
  - Department of Physics, North Carolina State University, Raleigh, North Carolina
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina
- José N Onuchic
  - Center for Theoretical Biological Physics, Rice University, Houston, Texas
  - Department of Physics and Astronomy, Chemistry, and Biosciences, Rice University, Houston, Texas
- Kei Yura
  - Graduate School of Humanities and Sciences, Ochanomizu University, Bunkyo, Tokyo, Japan
  - Center for Interdisciplinary AI and Data Science, Ochanomizu University, Bunkyo, Tokyo, Japan
  - Graduate School of Advanced Science and Engineering, Waseda University, Shinjuku, Tokyo, Japan
- Ryan R Cheng
  - Department of Chemistry, University of Kentucky, Lexington, Kentucky
22
Wang X, Li A, Li X, Cui H. Empowering Protein Engineering through Recombination of Beneficial Substitutions. Chemistry 2024; 30:e202303889. [PMID: 38288640 DOI: 10.1002/chem.202303889] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Indexed: 02/24/2024]
Abstract
Directed evolution stands as a seminal technology for generating novel protein functionalities, a cornerstone in biocatalysis, metabolic engineering, and synthetic biology. Today, with the development of various mutagenesis methods and advanced analytical instruments, the challenges of diversity generation and high-throughput screening are largely solved, and one of the remaining challenges is how to unlock the potential of single beneficial substitutions through recombination to achieve epistatic effects. This review surveys experimental and computer-assisted recombination methods in protein engineering campaigns. In addition, integrated and machine learning-guided strategies are highlighted to discuss how these recombination approaches contribute to generating screening libraries with better diversity, coverage, and size. Finally, a decision tree is provided to guide the selection of suitable recombination strategies in practice, which should help accelerate protein engineering.
Affiliation(s)
- Xinyue Wang
  - School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, No. 2 Xuelin Road, Nanjing, 210097, China
- Anni Li
  - School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, No. 2 Xuelin Road, Nanjing, 210097, China
- Xiujuan Li
  - School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, No. 2 Xuelin Road, Nanjing, 210097, China
- Haiyang Cui
  - School of Life Sciences, Nanjing Normal University, No. 2 Xuelin Road, Nanjing, 210097, China
23
Fang T, Szklarczyk D, Hachilif R, von Mering C. Enhancing coevolutionary signals in protein-protein interaction prediction through clade-wise alignment integration. Sci Rep 2024; 14:6009. [PMID: 38472223 PMCID: PMC10933411 DOI: 10.1038/s41598-024-55655-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2023] [Accepted: 02/26/2024] [Indexed: 03/14/2024] Open
Abstract
Protein-protein interactions (PPIs) play essential roles in most biological processes. The binding interfaces between interacting proteins impose evolutionary constraints that have successfully been employed to predict PPIs from multiple sequence alignments (MSAs). To construct MSAs, critical choices have to be made: how to ensure the reliable identification of orthologs, and how to optimally balance the need for large alignments against sufficient alignment quality. Here, we propose a divide-and-conquer strategy for MSA generation: instead of building a single, large alignment for each protein, multiple distinct alignments are constructed under distinct clades in the tree of life. Coevolutionary signals are searched separately within these clades, and are only subsequently integrated using machine learning techniques. We find that this strategy markedly improves overall prediction performance, concomitant with better alignment quality. Using the popular DCA algorithm to systematically search pairs of such alignments, a genome-wide all-against-all interaction scan in a bacterial genome is demonstrated. Given the recent successes of AlphaFold in predicting direct PPIs at atomic detail, a discover-and-refine approach is proposed: our method could provide a fast and accurate strategy for pre-screening the entire genome, submitting to AlphaFold only promising interaction candidates, thus reducing false positives as well as computation time.
Affiliation(s)
- Tao Fang
  - Department of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Damian Szklarczyk
  - Department of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Radja Hachilif
  - Department of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Christian von Mering
  - Department of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
24
Jänes J, Beltrao P. Deep learning for protein structure prediction and design-progress and applications. Mol Syst Biol 2024; 20:162-169. [PMID: 38291232 PMCID: PMC10912668 DOI: 10.1038/s44320-024-00016-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2023] [Revised: 12/21/2023] [Accepted: 01/11/2024] [Indexed: 02/01/2024] Open
Abstract
Proteins are the key molecular machines that orchestrate all biological processes of the cell. Most proteins fold into three-dimensional shapes that are critical for their function. Studying the 3D shape of proteins can inform us of the mechanisms that underlie biological processes in living cells and can have practical applications in the study of disease mutations or the discovery of novel drug treatments. Here, we review the progress made in sequence-based prediction of protein structures with a focus on applications that go beyond the prediction of single monomer structures. This includes the application of deep learning methods for the prediction of structures of protein complexes, different conformations, the evolution of protein structures and the application of these methods to protein design. These developments create new opportunities for research that will have impact across many areas of biomedical research.
Affiliation(s)
- Jürgen Jänes
- Institute of Molecular Systems Biology, ETH Zürich, 8093, Zürich, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Pedro Beltrao
- Institute of Molecular Systems Biology, ETH Zürich, 8093, Zürich, Switzerland.
- Swiss Institute of Bioinformatics, Lausanne, Switzerland.
| |
|
25
|
Barretto LAF, Van PKT, Fowler CC. Conserved patterns of sequence diversification provide insight into the evolution of two-component systems in Enterobacteriaceae. Microb Genom 2024; 10:001215. [PMID: 38502064 PMCID: PMC11004495 DOI: 10.1099/mgen.0.001215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2023] [Accepted: 02/29/2024] [Indexed: 03/20/2024] Open
Abstract
Two-component regulatory systems (TCSs) are a major mechanism used by bacteria to sense and respond to their environments. Many of the same TCSs are used by biologically diverse organisms with different regulatory needs, suggesting that the functions of TCS must evolve. To explore this topic, we analysed the amino acid sequence divergence patterns of a large set of broadly conserved TCS across different branches of Enterobacteriaceae, a family of Gram-negative bacteria that includes biomedically important genera such as Salmonella, Escherichia, Klebsiella and others. Our analysis revealed trends in how TCS sequences change across different proteins or functional domains of the TCS, and across different lineages. Based on these trends, we identified individual TCS that exhibit atypical evolutionary patterns. We observed that the relative extent to which the sequence of a given TCS varies across different lineages is generally well conserved, unveiling a hierarchy of TCS sequence conservation with EnvZ/OmpR as the most conserved TCS. We provide evidence that, for the most divergent of the TCS analysed, PmrA/PmrB, different alleles were horizontally acquired by different branches of this family, and that different PmrA/PmrB sequence variants have highly divergent signal-sensing domains. Collectively, this study sheds light on how TCS evolve, and serves as a compendium for how the sequences of the TCS in this family have diverged over the course of evolution.
Affiliation(s)
- Luke A. F. Barretto
- Department of Biological Sciences, University of Alberta, Edmonton, AB, T6G2E9, Canada
| | - Patryc-Khang T. Van
- Department of Biological Sciences, University of Alberta, Edmonton, AB, T6G2E9, Canada
| | - Casey C. Fowler
- Department of Biological Sciences, University of Alberta, Edmonton, AB, T6G2E9, Canada
| |
|
26
|
Alvarez S, Nartey CM, Mercado N, de la Paz JA, Huseinbegovic T, Morcos F. In vivo functional phenotypes from a computational epistatic model of evolution. Proc Natl Acad Sci U S A 2024; 121:e2308895121. [PMID: 38285950 PMCID: PMC10861889 DOI: 10.1073/pnas.2308895121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 12/19/2023] [Indexed: 01/31/2024] Open
Abstract
Computational models of evolution are valuable for understanding the dynamics of sequence variation, to infer phylogenetic relationships or potential evolutionary pathways and for biomedical and industrial applications. Despite these benefits, few have validated their propensities to generate outputs with in vivo functionality, which would enhance their value as accurate and interpretable evolutionary algorithms. We demonstrate the power of epistasis inferred from natural protein families to evolve sequence variants in an algorithm we developed called sequence evolution with epistatic contributions (SEEC). Utilizing the Hamiltonian of the joint probability of sequences in the family as fitness metric, we sampled and experimentally tested for in vivo β-lactamase activity in Escherichia coli TEM-1 variants. These evolved proteins can have dozens of mutations dispersed across the structure while preserving sites essential for both catalysis and interactions. Remarkably, these variants retain family-like functionality while being more active than their wild-type predecessor. We found that depending on the inference method used to generate the epistatic constraints, different parameters simulate diverse selection strengths. Under weaker selection, local Hamiltonian fluctuations reliably predict relative changes to variant fitness, recapitulating neutral evolution. SEEC has the potential to explore the dynamics of neofunctionalization, characterize viral fitness landscapes, and facilitate vaccine development.
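As a reading aid for the phrase "Hamiltonian of the joint probability", the following is a minimal sketch of how a Potts-type Hamiltonian is commonly evaluated in DCA-style models, H(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j), where a lower H corresponds to a higher model probability. It is not the SEEC implementation. The function names (`potts_hamiltonian`, `mutation_delta`), the data layout, and all parameter values are assumptions chosen for illustration only.

```python
from itertools import combinations


def potts_hamiltonian(seq, h, J):
    """Evaluate H(seq) = -sum_i h[i][s_i] - sum_{i<j} J[(i, j)][s_i][s_j].

    seq: list of integer-encoded residues (one state index per position).
    h:   per-position field values, h[i][a].
    J:   dict mapping position pairs (i, j) with i < j to a coupling table.
    """
    field = sum(h[i][a] for i, a in enumerate(seq))
    coupling = sum(
        J[(i, j)][seq[i]][seq[j]]
        for i, j in combinations(range(len(seq)), 2)
    )
    return -(field + coupling)


def mutation_delta(seq, pos, new_state, h, J):
    """Return H(mutant) - H(seq) for a single substitution at `pos`."""
    mutant = list(seq)
    mutant[pos] = new_state
    return potts_hamiltonian(mutant, h, J) - potts_hamiltonian(seq, h, J)


# Toy model (hypothetical values): 3 positions, 2 states per position.
h = [[0.5, -0.5], [0.0, 1.0], [0.2, 0.1]]
J = {
    (0, 1): [[0.3, 0.0], [0.0, 0.3]],
    (0, 2): [[0.0, 0.0], [0.0, 0.0]],
    (1, 2): [[0.0, -0.4], [0.1, 0.0]],
}
wild_type = [0, 1, 0]

if __name__ == "__main__":
    print("H(wild type):", potts_hamiltonian(wild_type, h, J))
    print("delta for position 1 -> state 0:", mutation_delta(wild_type, 1, 0, h, J))
```

In this convention a positive delta means the substitution makes the sequence less probable under the model. In practice the fields and couplings would be inferred from a multiple sequence alignment of the family rather than written by hand.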
Affiliation(s)
- Sophia Alvarez
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
| | - Charisse M. Nartey
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
| | - Nicholas Mercado
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
| | | | - Tea Huseinbegovic
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
| | - Faruck Morcos
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Department of Bioengineering, University of Texas at Dallas, Richardson, TX 75080
- Center for Systems Biology, University of Texas at Dallas, Richardson, TX 75080
| |
|
27
|
Kalogeropoulos K, Bohn MF, Jenkins DE, Ledergerber J, Sørensen CV, Hofmann N, Wade J, Fryer T, Thi Tuyet Nguyen G, Auf dem Keller U, Laustsen AH, Jenkins TP. A comparative study of protein structure prediction tools for challenging targets: Snake venom toxins. Toxicon 2024; 238:107559. [PMID: 38113945 DOI: 10.1016/j.toxicon.2023.107559] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Revised: 12/06/2023] [Accepted: 12/08/2023] [Indexed: 12/21/2023]
Abstract
Protein structure determination is a critical aspect of biological research, enabling us to understand protein function and potential applications. Recent advances in deep learning and artificial intelligence have led to the development of several protein structure prediction tools, such as AlphaFold2 and ColabFold. However, their performance has primarily been evaluated on well-characterised proteins and their ability to predict structures of proteins lacking experimental structures, such as many snake venom toxins, has been less scrutinised. In this study, we evaluated three modelling tools on their prediction of over 1000 snake venom toxin structures for which no experimental structures exist. Our findings show that AlphaFold2 (AF2) performed the best across all assessed parameters. We also observed that ColabFold (CF) only scored slightly worse than AF2, while being computationally less intensive. All tools struggled with regions of intrinsic disorder, such as loops and propeptide regions, and performed well in predicting the structure of functional domains. Overall, our study highlights the importance of exercising caution when working with proteins with no experimental structures available, particularly those that are large and contain flexible regions. Nonetheless, leveraging computational structure prediction tools can provide valuable insights into the modelling of protein interactions with different targets and reveal potential binding sites, active sites, and conformational changes, as well as into the design of potential molecular binders for reagent, diagnostic, or therapeutic purposes.
Affiliation(s)
| | - Markus-Frederik Bohn
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
| | | | - Jann Ledergerber
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark; Department of Chemistry and Applied Bioscience, ETH Zurich, Zurich, Switzerland
| | - Christoffer V Sørensen
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Nils Hofmann
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Jack Wade
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Thomas Fryer
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Giang Thi Tuyet Nguyen
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Ulrich Auf dem Keller
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Andreas H Laustsen
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Timothy P Jenkins
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark.
| |
|
28
|
Sesta L, Pagnani A, Fernandez-de-Cossio-Diaz J, Uguzzoni G. Inference of annealed protein fitness landscapes with AnnealDCA. PLoS Comput Biol 2024; 20:e1011812. [PMID: 38377054 PMCID: PMC10878520 DOI: 10.1371/journal.pcbi.1011812] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2023] [Accepted: 01/08/2024] [Indexed: 02/22/2024] Open
Abstract
The design of proteins with specific tasks is a major challenge in molecular biology with important diagnostic and therapeutic applications. High-throughput screening methods have been developed to systematically evaluate protein activity, but only a small fraction of possible protein variants can be tested using these techniques. Computational models that explore the sequence space in silico to identify the fittest molecules for a given function are needed to overcome this limitation. In this article, we propose AnnealDCA, a machine-learning framework to learn the protein fitness landscape from sequencing data derived from a broad range of experiments that use selection and sequencing to quantify protein activity. We demonstrate the effectiveness of our method by applying it to antibody Rep-Seq data of immunized mice and screening experiments, assessing the quality of the fitness landscape reconstructions. Our method can be applied to several experimental cases where a population of protein variants undergoes various rounds of selection and sequencing, without relying on the computation of variant enrichment ratios, and thus can be used even in cases of disjoint sequence samples.
Affiliation(s)
- Luca Sesta
- Department of Applied Science and Technology, Politecnico di Torino, Torino, Italy
| | - Andrea Pagnani
- Department of Applied Science and Technology, Politecnico di Torino, Torino, Italy
- Italian Institute for Genomic Medicine, Torino, Italy
- INFN, Sezione di Torino, Torino, Italy
| | | | | |
|
29
|
Pucci F, Zerihun MB, Rooman M, Schug A. pycofitness-Evaluating the fitness landscape of RNA and protein sequences. Bioinformatics 2024; 40:btae074. [PMID: 38335928 PMCID: PMC10881095 DOI: 10.1093/bioinformatics/btae074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2023] [Revised: 01/25/2024] [Accepted: 02/06/2024] [Indexed: 02/12/2024] Open
Abstract
MOTIVATION: The accurate prediction of how mutations change biophysical properties of proteins or RNA is a major goal in computational biology with tremendous impacts on protein design and genetic variant interpretation. Evolutionary approaches such as coevolution can help solve this issue. RESULTS: We present pycofitness, a standalone Python-based software package for the in silico mutagenesis of protein and RNA sequences. It is based on coevolution and, more specifically, on a popular inverse statistical approach, namely direct coupling analysis by pseudo-likelihood maximization. Its efficient implementation and user-friendly command line interface make it an easy-to-use tool even for researchers with no bioinformatics background. To illustrate its strengths, we present three applications in which pycofitness efficiently predicts the deleteriousness of genetic variants and the effect of mutations on protein fitness and thermodynamic stability. AVAILABILITY AND IMPLEMENTATION: https://github.com/KIT-MBS/pycofitness.
Affiliation(s)
- Fabrizio Pucci
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050 Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, 1050 Brussels, Belgium
| | - Mehari B Zerihun
- John von Neumann Institute for Computing, Jülich Supercomputer Centre, 52428 Jülich, Germany
| | - Marianne Rooman
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050 Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, 1050 Brussels, Belgium
| | - Alexander Schug
- John von Neumann Institute for Computing, Jülich Supercomputer Centre, 52428 Jülich, Germany
- Department of Biology, University of Duisburg-Essen, D-45141 Essen, Germany
| |
|
30
|
Yan Z, Wang J. Evolution shapes interaction patterns for epistasis and specific protein binding in a two-component signaling system. Commun Chem 2024; 7:13. [PMID: 38233668 PMCID: PMC10794238 DOI: 10.1038/s42004-024-01098-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Accepted: 01/05/2024] [Indexed: 01/19/2024] Open
Abstract
The elegant design of protein sequence/structure/function relationships arises from the interaction patterns between amino acid positions. A central question is how evolutionary forces shape the interaction patterns that encode long-range epistasis and binding specificity. Here, we combined family-wide evolutionary analysis of natural homologous sequences and structure-oriented evolution simulation for a two-component signaling (TCS) system. The magnitude-frequency relationship of coupling conservation between positions manifests a power-law-like distribution, and the positions with high coupling conservation are sparse but densely distributed on the binding surfaces and hydrophobic core. The structure-specific interaction pattern involves further optimization of local frustrations at or near the binding surface to adapt to the binding partner. The construction of family-wide conserved interaction patterns and structure-specific ones demonstrates that binding specificity is modulated by both direct intermolecular interactions and long-range epistasis across the binding complex. Evolution sculpts the interaction patterns via sequence variations at both family-wide and structure-specific levels for the TCS system.
Affiliation(s)
- Zhiqiang Yan
- Center for Theoretical Interdisciplinary Sciences, Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou, Zhejiang, 325001, PR China
| | - Jin Wang
- Department of Chemistry and Physics, State University of New York at Stony Brook, Stony Brook, NY, 11790, USA.
| |
|
31
|
Wei Y, Wei H, Tian C, Wu Q, Li D, Huang C, Zhang G, Chen R, Wang N, Li Y, Li B, Chu XM. The Transcriptome Analysis of Circular RNAs Between the Doxorubicin-Induced Cardiomyocytes and Bone Marrow Mesenchymal Stem Cells-Derived Exosomes Treated Ones. Comb Chem High Throughput Screen 2024; 27:1056-1070. [PMID: 38305398 DOI: 10.2174/0113862073261891231115072310] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2023] [Revised: 09/10/2023] [Accepted: 09/21/2023] [Indexed: 02/03/2024]
Abstract
AIM: To analyze the sequencing results of circular RNAs (circRNAs) in cardiomyocytes between the doxorubicin (DOX)-injured group and the exosome-treated group, and to identify candidate circRNAs, possibly secreted by exosomes, that mediate the therapeutic effect on DOX-induced cardiotoxicity for further study. METHODS: The DOX-injured group (DOX group) of cardiomyocytes was treated with DOX, while an exosome-treated group of injured cardiomyocytes was cocultured with bone marrow mesenchymal stem cell (BMSC)-derived exosomes (BEC group). High-throughput sequencing of circRNAs was conducted after the extraction of RNA from cardiomyocytes. The differential expression of circRNAs was analyzed after identifying the number, expression, and conservation of circRNAs. Then, the target genes of differentially expressed circRNAs were predicted based on the TargetScan and miRanda databases. Next, GO and KEGG enrichment analyses of the circRNA target genes were performed, and the crucial signaling pathways participating in the therapeutic process were identified. Finally, a real-time quantitative polymerase chain reaction experiment was conducted to verify the results obtained by sequencing. RESULTS: Thirty-two circRNAs were differentially expressed between the two groups, of which twenty-three were elevated in the exosome-treated group (BEC group). The GO analysis shows that target genes of differentially expressed circRNAs are mainly enriched in intracellular signal activity, regulation of nucleic acid-templated transcription, Golgi-related activity, and GTPase activator activity. The KEGG analysis shows that they are involved in the autophagy biological process and the NOD-like receptor signaling pathway. The verification experiment suggested that mmu_circ_0000425 (ID: 116324210) was decreased in the DOX group and elevated in the BEC group, which was consistent with the sequencing result.
CONCLUSION: mmu_circ_0000425 in exosomes derived from bone marrow mesenchymal stem cells (BMSC) may have a therapeutic role in alleviating doxorubicin-induced cardiotoxicity (DIC).
Affiliation(s)
- Yanhuan Wei
- Department of Cardiology, The Affiliated Hospital of Qingdao University, Qingdao, China
- Department of Emergency Medicine, Rizhao People's Hospital, Rizhao, China
| | - Haixia Wei
- Qingdao Chengyang People's Hospital, Qingdao, China
| | - Chao Tian
- Hepatopancreatobiliary Center, Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua University, Beijing, China
| | - Qinchao Wu
- Department of Cardiology, The Affiliated Hospital of Qingdao University, Qingdao, China
| | - Daisong Li
- Department of Cardiology, The Affiliated Hospital of Qingdao University, Qingdao, China
| | - Chao Huang
- Department of Cardiology, The Affiliated Hospital of Qingdao University, Qingdao, China
| | - Guoliang Zhang
- Department of Cardiology, The Affiliated Hospital of Qingdao University, Qingdao, China
| | - Ruolan Chen
- Department of Cardiology, The Affiliated Hospital of Qingdao University, Qingdao, China
| | - Ni Wang
- Department of Cardiology, The Affiliated Hospital of Qingdao University, Qingdao, China
| | - Yonghong Li
- Department of Cardiology, The Affiliated Hospital of Qingdao University, Qingdao, China
| | - Bing Li
- Department of Genetics, Basic Medicine School, Qingdao University, Qingdao, China
- Department of Hematology, The Affiliated Hospital of Qingdao University, Qingdao, China
| | - Xian-Ming Chu
- Department of Cardiology, The Affiliated Hospital of Qingdao University, Qingdao, China
- Department of Cardiology, The Affiliated Cardiovascular Hospital of Qingdao University, Qingdao, China
| |
|
32
|
Zhang J, Liu S, Chen M, Chu H, Wang M, Wang Z, Yu J, Ni N, Yu F, Chen D, Yang YI, Xue B, Yang L, Liu Y, Gao YQ. Unsupervisedly Prompting AlphaFold2 for Accurate Few-Shot Protein Structure Prediction. J Chem Theory Comput 2023; 19:8460-8471. [PMID: 37947474 DOI: 10.1021/acs.jctc.3c00528] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2023]
Abstract
Data-driven predictive methods that can efficiently and accurately transform protein sequences into biologically active structures are highly valuable for scientific research and medical development. Determining an accurate folding landscape using coevolutionary information is fundamental to the success of modern protein structure prediction methods. As the state of the art, AlphaFold2 has dramatically raised the accuracy without performing explicit coevolutionary analysis. Nevertheless, its performance still shows strong dependence on available sequence homologues. Based on the interrogation on the cause of such dependence, we presented EvoGen, a meta generative model, to remedy the underperformance of AlphaFold2 for poor MSA targets. By prompting the model with calibrated or virtually generated homologue sequences, EvoGen helps AlphaFold2 fold accurately in the low-data regime and even achieve encouraging performance with single-sequence predictions. Being able to make accurate predictions with few-shot MSA not only generalizes AlphaFold2 better for orphan sequences but also democratizes its use for high-throughput applications. Besides, EvoGen combined with AlphaFold2 yields a probabilistic structure generation method that could explore alternative conformations of protein sequences, and the task-aware differentiable algorithm for sequence generation will benefit other related tasks including protein design.
Affiliation(s)
- Jun Zhang
- Changping Laboratory, Beijing 102200, China
| | - Sirui Liu
- Changping Laboratory, Beijing 102200, China
| | - Mengyun Chen
- Huawei Hangzhou Research Institute, Huawei Technologies Co. Ltd., Hangzhou 310051, China
| | - Haotian Chu
- Huawei Hangzhou Research Institute, Huawei Technologies Co. Ltd., Hangzhou 310051, China
| | - Min Wang
- Huawei Hangzhou Research Institute, Huawei Technologies Co. Ltd., Hangzhou 310051, China
| | - Zidong Wang
- Huawei Hangzhou Research Institute, Huawei Technologies Co. Ltd., Hangzhou 310051, China
| | - Jialiang Yu
- Huawei Hangzhou Research Institute, Huawei Technologies Co. Ltd., Hangzhou 310051, China
| | - Ningxi Ni
- Huawei Hangzhou Research Institute, Huawei Technologies Co. Ltd., Hangzhou 310051, China
| | - Fan Yu
- Huawei Hangzhou Research Institute, Huawei Technologies Co. Ltd., Hangzhou 310051, China
| | - Dechin Chen
- Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
| | - Yi Isaac Yang
- Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
| | - Boxin Xue
- Beijing National Laboratory for Molecular Sciences, College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China
| | - Lijiang Yang
- Beijing National Laboratory for Molecular Sciences, College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China
| | - Yuan Liu
- Department of Chemical Biology, College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China
| | - Yi Qin Gao
- Changping Laboratory, Beijing 102200, China
- Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
- Beijing National Laboratory for Molecular Sciences, College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China
- Biomedical Pioneering Innovation Center, Peking University, Beijing 100871, China
| |
|
33
|
Musil M, Jezik A, Horackova J, Borko S, Kabourek P, Damborsky J, Bednar D. FireProt 2.0: web-based platform for the fully automated design of thermostable proteins. Brief Bioinform 2023; 25:bbad425. [PMID: 38018911 PMCID: PMC10685400 DOI: 10.1093/bib/bbad425] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/31/2023] [Revised: 10/25/2023] [Accepted: 11/01/2023] [Indexed: 11/30/2023] Open
Abstract
Thermostable proteins find their use in numerous biomedical and biotechnological applications. However, the computational design of stable proteins often results in single-point mutations with a limited effect on protein stability. Moreover, the construction of stable multiple-point mutants can prove difficult due to the possibility of antagonistic effects between individual mutations. The FireProt protocol enables the automated computational design of highly stable multiple-point mutants. FireProt 2.0 builds on top of the previously published FireProt web, retaining the original functionality and expanding it with several new stabilization strategies. FireProt 2.0 integrates the AlphaFold database and homology modeling for structure prediction, enabling calculations starting from a sequence. Multiple-point designs are constructed using the Bron-Kerbosch algorithm, minimizing the antagonistic effect between the individual mutations. Users can now limit the FireProt calculation to a set of user-defined mutations, run a saturation mutagenesis of the whole protein or select rigidifying mutations based on B-factors. The evolution-based back-to-consensus strategy is complemented by ancestral sequence reconstruction. FireProt 2.0 is significantly faster, and a reworked graphical user interface broadens the tool's availability even to users with older hardware. FireProt 2.0 is freely available at http://loschmidt.chemi.muni.cz/fireprotweb.
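To illustrate the graph idea behind the Bron-Kerbosch step mentioned in this abstract, here is a minimal sketch that enumerates maximal sets of mutually compatible mutations. It assumes, as a simplification not confirmed by the abstract, that mutations are nodes and that an edge joins two mutations judged non-antagonistic, so that each maximal clique is a candidate multiple-point design. The mutation labels and the compatibility list are hypothetical, and this is not FireProt's code.

```python
def bron_kerbosch(r, p, x, adj, out):
    """Bron-Kerbosch with pivoting: append every maximal clique to `out`.

    r: current clique, p: candidate nodes, x: already-processed nodes.
    """
    if not p and not x:
        out.append(set(r))
        return
    # Pick the pivot with the most neighbours in p to limit branching.
    pivot = max(p | x, key=lambda v: len(adj[v] & p))
    for v in list(p - adj[pivot]):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, out)
        p = p - {v}
        x = x | {v}


def maximal_compatible_sets(mutations, compatible_pairs):
    """Return all maximal sets of pairwise-compatible mutations."""
    adj = {m: set() for m in mutations}
    for a, b in compatible_pairs:
        adj[a].add(b)
        adj[b].add(a)
    cliques = []
    bron_kerbosch(set(), set(adj), set(), adj, cliques)
    return cliques


# Hypothetical single-point mutations and compatibility judgements.
mutations = ["A12V", "L45F", "T78S", "V90I"]
compatible_pairs = [
    ("A12V", "L45F"),
    ("A12V", "T78S"),
    ("L45F", "T78S"),
    ("T78S", "V90I"),
]

if __name__ == "__main__":
    for design in maximal_compatible_sets(mutations, compatible_pairs):
        print(sorted(design))
```

A real pipeline would still need a rule for ranking the resulting sets, for example by a predicted stability change summed over the mutations in each set; that scoring step is left out here.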
Affiliation(s)
- Milos Musil
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Masaryk University, Brno, Czech Republic
- Department of Information Systems, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
- International Clinical Research Centre, St. Anne’s University Hospital Brno, Brno, Czech Republic
| | - Andrej Jezik
- Department of Information Systems, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
| | - Jana Horackova
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Masaryk University, Brno, Czech Republic
| | - Simeon Borko
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Masaryk University, Brno, Czech Republic
- Department of Information Systems, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic
- International Clinical Research Centre, St. Anne’s University Hospital Brno, Brno, Czech Republic
| | - Petr Kabourek
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Masaryk University, Brno, Czech Republic
- International Clinical Research Centre, St. Anne’s University Hospital Brno, Brno, Czech Republic
| | - Jiri Damborsky
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Masaryk University, Brno, Czech Republic
- International Clinical Research Centre, St. Anne’s University Hospital Brno, Brno, Czech Republic
| | - David Bednar
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Masaryk University, Brno, Czech Republic
- International Clinical Research Centre, St. Anne’s University Hospital Brno, Brno, Czech Republic
| |
|
34
|
Zhang H, Quadeer AA, McKay MR. Direct-acting antiviral resistance of Hepatitis C virus is promoted by epistasis. Nat Commun 2023; 14:7457. [PMID: 37978179 PMCID: PMC10656532 DOI: 10.1038/s41467-023-42550-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/17/2023] [Accepted: 10/13/2023] [Indexed: 11/19/2023] Open
Abstract
Direct-acting antiviral agents (DAAs) provide efficacious therapeutic treatments for chronic Hepatitis C virus (HCV) infection. However, emergence of drug resistance mutations (DRMs) can greatly affect treatment outcomes and impede virological cure. While multiple DRMs have been observed for all currently used DAAs, the evolutionary determinants of such mutations are not currently well understood. Here, by considering DAAs targeting the nonstructural 3 (NS3) protein of HCV, we present results suggesting that epistasis plays an important role in the evolution of DRMs. Employing a sequence-based fitness landscape model whose predictions correlate highly with experimental data, we identify specific DRMs that are associated with strong epistatic interactions, and these are found to be enriched in multiple NS3-specific DAAs. Evolutionary modelling further supports that the identified DRMs involve compensatory mutational interactions that facilitate relatively easy escape from drug-induced selection pressures. Our results indicate that accounting for epistasis is important for designing future HCV NS3-targeting DAAs.
Affiliation(s)
- Hang Zhang
- Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong SAR, China
| | - Ahmed Abdul Quadeer
- Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong SAR, China.
| | - Matthew R McKay
- Department of Electrical and Electronic Engineering, University of Melbourne, Melbourne, VIC, Australia.
- Department of Microbiology and Immunology, University of Melbourne, at The Peter Doherty Institute for Infection and Immunity, Melbourne, VIC, Australia.
| |
|
35
|
Malbranke C, Rostain W, Depardieu F, Cocco S, Monasson R, Bikard D. Computational design of novel Cas9 PAM-interacting domains using evolution-based modelling and structural quality assessment. PLoS Comput Biol 2023; 19:e1011621. [PMID: 37976326 PMCID: PMC10729993 DOI: 10.1371/journal.pcbi.1011621] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 05/25/2023] [Revised: 12/19/2023] [Accepted: 10/19/2023] [Indexed: 11/19/2023] Open
Abstract
We present here an approach to protein design that combines (i) scarce functional information such as experimental data, (ii) evolutionary information learned from natural sequence variants, and (iii) physics-grounded modeling. Using a Restricted Boltzmann Machine (RBM), we learn a sequence model of a protein family. We use semi-supervision to leverage available functional information during the RBM training. We then propose a strategy to explore the protein representation space that can be informed by external models such as an empirical force-field method (FoldX). Our approach is applied to a domain of the Cas9 protein responsible for recognition of a short DNA motif. We experimentally assess the functionality of 71 variants generated to explore a range of RBM and FoldX energies. Sequences with as many as 50 differences (20% of the protein domain) from the wild-type retained functionality. Overall, 21/71 sequences designed with our method were functional. Interestingly, 6/71 sequences showed improved activity in comparison with the original wild-type protein sequence. These results demonstrate the interest in further exploring the synergies between machine learning of protein sequence representations and physics-grounded modeling strategies informed by structural information.
Affiliation(s)
- Cyril Malbranke
- Laboratory of Physics of the Ecole Normale Superieure, PSL Research, CNRS UMR 8023, Sorbonne Université, Paris, France
- Institut Pasteur, Université Paris Cité, CNRS UMR 6047, Synthetic Biology, Paris, France
| | - William Rostain
- Institut Pasteur, Université Paris Cité, CNRS UMR 6047, Synthetic Biology, Paris, France
| | - Florence Depardieu
- Institut Pasteur, Université Paris Cité, CNRS UMR 6047, Synthetic Biology, Paris, France
| | - Simona Cocco
- Laboratory of Physics of the Ecole Normale Superieure, PSL Research, CNRS UMR 8023, Sorbonne Université, Paris, France
| | - Rémi Monasson
- Laboratory of Physics of the Ecole Normale Superieure, PSL Research, CNRS UMR 8023, Sorbonne Université, Paris, France
| | - David Bikard
- Institut Pasteur, Université Paris Cité, CNRS UMR 6047, Synthetic Biology, Paris, France
|
36
|
Bastolla U, Abia D, Piette O. PC_ali: a tool for improved multiple alignments and evolutionary inference based on a hybrid protein sequence and structure similarity score. Bioinformatics 2023; 39:btad630. [PMID: 37847775 PMCID: PMC10628387 DOI: 10.1093/bioinformatics/btad630] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2022] [Revised: 08/01/2023] [Accepted: 10/17/2023] [Indexed: 10/19/2023] Open
Abstract
MOTIVATION: Evolutionary inference depends crucially on the quality of multiple sequence alignments (MSAs), which is problematic for distantly related proteins. Since protein structure is more conserved than sequence, it seems natural to use structure alignments for distant homologs. However, structure alignments may not be suitable for inferring evolutionary relationships.
RESULTS: Here we examined four protein similarity measures that depend on sequence and structure (fraction of aligned residues, sequence identity, fraction of superimposed residues, and contact overlap), finding that they are intimately correlated but none of them provides a complete and unbiased picture of conservation in proteins. Therefore, we propose the new hybrid protein sequence and structure similarity score PC_sim based on their main principal component. The corresponding divergence measure PC_div shows the strongest correlation with divergences obtained from individual similarities, suggesting that it infers accurate evolutionary divergences. We developed the program PC_ali, which constructs protein MSAs either de novo or by modifying an input MSA, using a similarity matrix based on PC_sim. The program constructs a starting MSA based on the maximal cliques of the graph of pairwise alignments (PAs) and refines it through progressive alignments along the tree reconstructed with PC_div. Compared with eight state-of-the-art multiple structure or sequence alignment tools, PC_ali achieves higher or equal aligned fraction and structural scores, sequence identity higher than structure aligners although lower than sequence aligners, the highest PC_sim score, and the highest similarity with the MSAs produced by other tools and with the reference MSA Balibase.
AVAILABILITY AND IMPLEMENTATION: https://github.com/ugobas/PC_ali.
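As a rough illustration of how a hybrid score can be built from a main principal component, the sketch below projects four standardized similarity measures onto their first principal component. The function name `first_pc_score`, the z-score standardization, the sign convention, and the synthetic data are all assumptions made for illustration; this is not the PC_ali implementation.

```python
import numpy as np

def first_pc_score(measures):
    """Project rows of `measures` (n_pairs x n_measures) onto the first principal component.

    Hypothetical helper: columns are z-scored first, which is an assumption.
    """
    z = (measures - measures.mean(axis=0)) / measures.std(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    loadings = vt[0]
    # The sign of a principal axis is arbitrary; orient it so that
    # higher similarity measures give a higher combined score.
    if loadings.sum() < 0:
        loadings = -loadings
    return z @ loadings, loadings

# Synthetic example: four noisy readouts of one underlying similarity.
rng = np.random.default_rng(0)
true_similarity = rng.random(200)
measures = true_similarity[:, None] + 0.05 * rng.normal(size=(200, 4))
score, loadings = first_pc_score(measures)
```

When the four measures are strongly correlated, as the abstract reports, the first component carries most of the shared signal, which is one plausible reason to prefer it over any single measure.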
Affiliation(s)
- Ugo Bastolla
- Centro de Biologia Molecular “Severo Ochoa” (CBMSO), CSIC-UAM Cantoblanco, 28049 Madrid, Spain
| | - David Abia
- Bioinformatics Facility CBMSO, CSIC-UAM Cantoblanco, 28049 Madrid, Spain
| | - Oscar Piette
- Centro de Biologia Molecular “Severo Ochoa” (CBMSO), CSIC-UAM Cantoblanco, 28049 Madrid, Spain
|
37
|
Sawa T, Moriwaki Y, Jiang H, Murase K, Takayama S, Shimizu K, Terada T. Comprehensive computational analysis of the SRK-SP11 molecular interaction underlying self-incompatibility in Brassicaceae using improved structure prediction for cysteine-rich proteins. Comput Struct Biotechnol J 2023; 21:5228-5239. [PMID: 37928947 PMCID: PMC10624595 DOI: 10.1016/j.csbj.2023.10.026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Revised: 10/03/2023] [Accepted: 10/16/2023] [Indexed: 11/07/2023] Open
Abstract
Plants employ self-incompatibility (SI) to promote cross-fertilization. In Brassicaceae, this process is regulated by the formation of a complex between the pistil determinant S receptor kinase (SRK) and the pollen determinant S-locus protein 11 (SP11, also known as S-locus cysteine-rich protein, SCR). In our previous study, we used the crystal structures of two eSRK-SP11 complexes in Brassica rapa S8 and S9 haplotypes and nine computationally predicted complex models to demonstrate that only the SRK ectodomain (eSRK) and SP11 pairs derived from the same S haplotype exhibit high binding free energy. However, predicting the eSRK-SP11 complex structures for the other 100+ S haplotypes and genera remains difficult because of SP11 polymorphism in sequence and structure. Although protein structure prediction using AlphaFold2 exhibits considerably high accuracy for most protein monomers and complexes, 46% of the predicted SP11 structures that we tested had a mean per-residue confidence score (pLDDT) below 75. Here, we demonstrate that the use of curated multiple sequence alignments (MSAs) for cysteine-rich proteins significantly improved model accuracy for SP11 and eSRK-SP11 complexes. Additionally, we calculated the binding free energies of the predicted eSRK-SP11 complexes using molecular dynamics (MD) simulations and observed that some Arabidopsis haplotypes formed a binding mode that was critically different from that of B. rapa S8 and S9. Thus, our computational results provide insights into the haplotype-specific eSRK-SP11 binding modes in Brassicaceae at the residue level. The predicted models are freely available at Zenodo, https://doi.org/10.5281/zenodo.8047768.
Affiliation(s)
- Tomoki Sawa
- Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan
| | - Yoshitaka Moriwaki
- Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan
- Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan
| | - Hanting Jiang
- Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan
| | - Kohji Murase
- Department of Applied Biological Chemistry, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo 113-8657, Japan
| | - Seiji Takayama
- Department of Applied Biological Chemistry, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo 113-8657, Japan
| | - Kentaro Shimizu
- Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan
- Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan
| | - Tohru Terada
- Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan
- Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan
|
38
|
Lynn CW, Yu Q, Pang R, Bialek W, Palmer SE. Exactly solvable statistical physics models for large neuronal populations. arXiv 2023:arXiv:2310.10860v1. [PMID: 37904743 PMCID: PMC10614989] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2023]
Abstract
Maximum entropy methods provide a principled path connecting measurements of neural activity directly to statistical physics models, and this approach has been successful for populations of N ~ 100 neurons. As N increases in new experiments, we enter an undersampled regime where we have to choose which observables should be constrained in the maximum entropy construction. The best choice is the one that provides the greatest reduction in entropy, defining a "minimax entropy" principle. This principle becomes tractable if we restrict attention to correlations among pairs of neurons that link together into a tree; we can find the best tree efficiently, and the underlying statistical physics models are exactly solved. We use this approach to analyze experiments on N ~ 1500 neurons in the mouse hippocampus, and show that the resulting model captures the distribution of synchronous activity in the network.
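For tree-structured pairwise models, the entropy reduction relative to an independent model is the sum of the mutual information along the tree's edges, so the best tree can be found as a maximum spanning tree (the classic Chow-Liu construction). The sketch below illustrates that idea on synthetic binary data. The function names, the plug-in mutual information estimate, and the toy chain of three "neurons" are assumptions for illustration, not the authors' code.

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of the mutual information (in bits) between two 0/1 arrays."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((x == a) & (y == b))
            p_a = np.mean(x == a)
            p_b = np.mean(y == b)
            if p_ab > 0:
                mi += p_ab * np.log2(p_ab / (p_a * p_b))
    return mi

def best_tree(spikes):
    """Maximum spanning tree over pairwise mutual information (Prim's algorithm).

    spikes: array of shape (n_samples, n_neurons) with 0/1 entries.
    Returns a list of edges (i, j) with i < j.
    """
    n = spikes.shape[1]
    mi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mi[i, j] = mi[j, i] = mutual_information(spikes[:, i], spikes[:, j])
    in_tree = [0]
    edges = []
    while len(in_tree) < n:
        best = None
        # Pick the highest-information edge leaving the current tree.
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and (best is None or mi[i, j] > best[0]):
                    best = (mi[i, j], i, j)
        edges.append((min(best[1], best[2]), max(best[1], best[2])))
        in_tree.append(best[2])
    return edges

# Toy chain: neuron 1 is a noisy copy of neuron 0, neuron 2 a noisy copy of neuron 1.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, size=5000)
x1 = x0 ^ (rng.random(5000) < 0.1).astype(int)
x2 = x1 ^ (rng.random(5000) < 0.1).astype(int)
edges = best_tree(np.column_stack([x0, x1, x2]))
```

This version scans all candidate edges at every step, which is fine for a toy example; populations of the size discussed in the abstract would call for a more efficient spanning-tree routine.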
Affiliation(s)
- Christopher W. Lynn
- Initiative for the Theoretical Sciences, The Graduate Center, City University of New York, New York, NY 10016, USA
- Joseph Henry Laboratories of Physics, Princeton University, Princeton, NJ 08544, USA
- Department of Physics, Quantitative Biology Institute, and Wu Tsai Institute, Yale University, New Haven, CT 06520, USA
| | - Qiwei Yu
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Rich Pang
- Princeton Neuroscience Institute, Princeton University, Princeton, NJ 08544, USA
| | - William Bialek
- Joseph Henry Laboratories of Physics, Princeton University, Princeton, NJ 08544, USA
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
- Center for Studies in Physics and Biology, Rockefeller University, New York, NY 10065, USA
| | - Stephanie E. Palmer
- Department of Organismal Biology and Anatomy, University of Chicago, Chicago, IL 60637, USA
- Department of Physics, University of Chicago, Chicago, IL 60637, USA
|
39
|
Posani L, Rizzato F, Monasson R, Cocco S. Infer global, predict local: Quantity-relevance trade-off in protein fitness predictions from sequence data. PLoS Comput Biol 2023; 19:e1011521. [PMID: 37883593 PMCID: PMC10645369 DOI: 10.1371/journal.pcbi.1011521] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/04/2023] [Revised: 11/14/2023] [Accepted: 09/15/2023] [Indexed: 10/28/2023] Open
Abstract
Predicting the effects of mutations on protein function is an important issue in evolutionary biology and biomedical applications. Computational approaches, ranging from graphical models to deep-learning architectures, can capture the statistical properties of sequence data and predict the outcome of high-throughput mutagenesis experiments probing the fitness landscape around some wild-type protein. However, how the complexity of the models and the characteristics of the data combine to determine the predictive performance remains unclear. Here, based on a theoretical analysis of the prediction error, we propose descriptors of the sequence data, characterizing their quantity and relevance relative to the model. Our theoretical framework identifies a trade-off between these two quantities, and determines the optimal subset of data for the prediction task, showing that simple models can outperform complex ones when inferred from adequately-selected sequences. We also show how repeated subsampling of the sequence data is informative about how much epistasis in the fitness landscape is not captured by the computational model. Our approach is illustrated on several protein families, as well as on in silico solvable protein models.
Affiliation(s)
- Lorenzo Posani
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 & PSL Research, Sorbonne Université, Paris, France
| | - Francesca Rizzato
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 & PSL Research, Sorbonne Université, Paris, France
| | - Rémi Monasson
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 & PSL Research, Sorbonne Université, Paris, France
| | - Simona Cocco
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 & PSL Research, Sorbonne Université, Paris, France
|
40
|
Zhao H, Murray D, Petrey D, Honig B. ZEPPI: proteome-scale sequence-based evaluation of protein-protein interaction models. Research Square 2023:rs.3.rs-3289791. [PMID: 37790387 PMCID: PMC10543297 DOI: 10.21203/rs.3.rs-3289791/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/05/2023]
Abstract
We introduce ZEPPI (Z-score Evaluation of Protein-Protein Interfaces), a framework to evaluate structural models of a complex based on sequence co-evolution and conservation involving residues in protein-protein interfaces. The ZEPPI score is calculated by comparing metrics for an interface to those obtained from randomly chosen residues. Since contacting residues are defined by the structural model, this obviates the need to account for indirect interactions. Further, although ZEPPI relies on species-paired multiple sequence alignments, its focus on interfacial residues allows it to leverage quite shallow alignments. ZEPPI performance is evaluated through applications to experimentally determined complexes and to decoys from the CASP-CAPRI experiment. ZEPPI can be implemented on a proteome-wide scale as evidenced by calculations on millions of structural models of dimeric complexes in the E. coli and human interactomes found in the PrePPI database. A number of examples that illustrate how these tools can yield novel functional hypotheses are provided.
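The comparison of an interface metric against randomly chosen residues can be sketched as a Z-score over a sampled null distribution. In the minimal example below, the function name `interface_z_score`, the score matrix layout (rows for residues of one protein, columns for the other), uniform random sampling of residue pairs, and the number of random draws are all assumptions for illustration; the actual ZEPPI metrics and sampling scheme may differ.

```python
import numpy as np

def interface_z_score(scores, interface_pairs, n_random=1000, seed=0):
    """Z-score of the mean score over interface pairs versus random residue pairs.

    scores: 2D array of per-pair scores (e.g. a co-evolution measure), hypothetical input.
    interface_pairs: list of (row, col) index pairs taken from a structural model.
    """
    rng = np.random.default_rng(seed)
    rows, cols = zip(*interface_pairs)
    observed = scores[list(rows), list(cols)].mean()
    k = len(interface_pairs)
    null = np.empty(n_random)
    for t in range(n_random):
        # Draw the same number of residue pairs uniformly at random.
        r = rng.integers(0, scores.shape[0], size=k)
        c = rng.integers(0, scores.shape[1], size=k)
        null[t] = scores[r, c].mean()
    return (observed - null.mean()) / null.std()

# Synthetic example: interface pairs carry an artificially elevated score.
rng = np.random.default_rng(2)
scores = rng.random((30, 40))
pairs = [(1, 2), (5, 7), (10, 11), (20, 30), (25, 35)]
for i, j in pairs:
    scores[i, j] += 1.0
z = interface_z_score(scores, pairs)
```

Because the null distribution is built from the same alignment-derived scores, a Z-score of this kind is relative to the background of that particular complex rather than an absolute threshold.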
Affiliation(s)
- Haiqing Zhao
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032, USA
| | - Diana Murray
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032, USA
| | - Donald Petrey
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032, USA
| | - Barry Honig
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032, USA
- Department of Biochemistry and Molecular Biophysics, Columbia University Irving Medical Center, New York, NY 10032, USA
- Department of Medicine, Columbia University, New York, NY 10032, USA
- Zuckerman Mind Brain and Behavior Institute, Columbia University, New York, NY 10027, USA
|
41
|
Braghetto A, Orlandini E, Baiesi M. Interpretable Machine Learning of Amino Acid Patterns in Proteins: A Statistical Ensemble Approach. J Chem Theory Comput 2023; 19:6011-6022. [PMID: 37552831 PMCID: PMC10500975 DOI: 10.1021/acs.jctc.3c00383] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2023] [Indexed: 08/10/2023]
Abstract
Explainable and interpretable unsupervised machine learning helps one to understand the underlying structure of data. We introduce an ensemble analysis of machine learning models to consolidate their interpretation. Its application shows that restricted Boltzmann machines consistently compress into a few bits the information stored in a sequence of five amino acids at the start or end of α-helices or β-sheets. The weights learned by the machines reveal unexpected properties of the amino acids and the secondary structure of proteins: (i) His and Thr have a negligible contribution to the amphiphilic pattern of α-helices; (ii) there is a class of α-helices particularly rich in Ala at their end; (iii) Pro most often occupies slots otherwise occupied by polar or charged amino acids, and its presence at the start of helices is relevant; (iv) Glu and especially Asp on one side and Val, Leu, Ile, and Phe on the other display the strongest tendency to mark amphiphilic patterns, i.e., extreme values of an effective hydrophobicity, though they are not the most powerful (non)hydrophobic amino acids.
Affiliation(s)
- Anna Braghetto
- Department of Physics and Astronomy, University of Padova, Via Marzolo 8, 35131 Padua, Italy
- INFN, Sezione di Padova, Via Marzolo 8, 35131 Padua, Italy
| | - Enzo Orlandini
- Department of Physics and Astronomy, University of Padova, Via Marzolo 8, 35131 Padua, Italy
- INFN, Sezione di Padova, Via Marzolo 8, 35131 Padua, Italy
| | - Marco Baiesi
- Department of Physics and Astronomy, University of Padova, Via Marzolo 8, 35131 Padua, Italy
- INFN, Sezione di Padova, Via Marzolo 8, 35131 Padua, Italy
|
42
|
Ghoreyshi ZS, George JT. Quantitative approaches for decoding the specificity of the human T cell repertoire. Front Immunol 2023; 14:1228873. [PMID: 37781387 PMCID: PMC10539903 DOI: 10.3389/fimmu.2023.1228873] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/25/2023] [Accepted: 08/17/2023] [Indexed: 10/03/2023] Open
Abstract
T cell receptor (TCR)-peptide-major histocompatibility complex (pMHC) interactions play a vital role in initiating immune responses against pathogens, and the specificity of TCR-pMHC interactions is crucial for developing optimized therapeutic strategies. The advent of high-throughput immunological and structural evaluation of TCR and pMHC has provided an abundance of data for computational approaches that aim to predict favorable TCR-pMHC interactions. Current models are constructed using information on protein sequences, structures, or a combination of both, and utilize a variety of statistical learning-based approaches for identifying the rules governing specificity. This review examines the current theoretical, computational, and deep learning approaches for identifying TCR-pMHC recognition pairs, placing emphasis on each method's mathematical approach, predictive performance, and limitations.
Affiliation(s)
- Zahra S. Ghoreyshi
- Department of Biomedical Engineering, Texas A&M University, College Station, TX, United States
| | - Jason T. George
- Department of Biomedical Engineering, Texas A&M University, College Station, TX, United States
- Engineering Medicine Program, Texas A&M University, Houston, TX, United States
- Center for Theoretical Biological Physics, Rice University, Houston, TX, United States
|
43
|
Taubert O, von der Lehr F, Bazarova A, Faber C, Knechtges P, Weiel M, Debus C, Coquelin D, Basermann A, Streit A, Kesselheim S, Götz M, Schug A. RNA contact prediction by data efficient deep learning. Commun Biol 2023; 6:913. [PMID: 37674020 PMCID: PMC10482910 DOI: 10.1038/s42003-023-05244-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2023] [Accepted: 08/14/2023] [Indexed: 09/08/2023] Open
Abstract
On the path to full understanding of the structure-function relationship or even design of RNA, structure prediction would offer an intriguing complement to experimental efforts. Any deep learning on RNA structure, however, is hampered by the sparsity of labeled training data. Utilizing the limited data available, we here focus on predicting spatial adjacencies ("contact maps") as a proxy for 3D structure. Our model, BARNACLE, combines the utilization of unlabeled data through self-supervised pre-training and efficient use of the sparse labeled data through an XGBoost classifier. BARNACLE shows a considerable improvement over both the established classical baseline and a deep neural network. In order to demonstrate that our approach can be applied to tasks with similar data constraints, we show that our findings generalize to the related setting of accessible surface area prediction.
Affiliation(s)
- Oskar Taubert
- Steinbuch Centre for Computing (SCC), Karlsruhe Institute of Technology, 76344, Eggenstein-Leopoldshafen, Germany
| | - Fabrice von der Lehr
- Institute for Software Technology (SC), German Aerospace Centre (DLR), 51147, Köln, Germany
| | - Alina Bazarova
- Jülich Supercomputing Centre, Forschungszentrum Jülich, 52428, Jülich, Germany
- Helmholtz AI, 81675, Munich, Germany
| | - Christian Faber
- Jülich Supercomputing Centre, Forschungszentrum Jülich, 52428, Jülich, Germany
| | - Philipp Knechtges
- Institute for Software Technology (SC), German Aerospace Centre (DLR), 51147, Köln, Germany
- Helmholtz AI, 81675, Munich, Germany
| | - Marie Weiel
- Steinbuch Centre for Computing (SCC), Karlsruhe Institute of Technology, 76344, Eggenstein-Leopoldshafen, Germany
- Helmholtz AI, 81675, Munich, Germany
| | - Charlotte Debus
- Steinbuch Centre for Computing (SCC), Karlsruhe Institute of Technology, 76344, Eggenstein-Leopoldshafen, Germany
- Helmholtz AI, 81675, Munich, Germany
| | - Daniel Coquelin
- Steinbuch Centre for Computing (SCC), Karlsruhe Institute of Technology, 76344, Eggenstein-Leopoldshafen, Germany
- Helmholtz AI, 81675, Munich, Germany
| | - Achim Basermann
- Institute for Software Technology (SC), German Aerospace Centre (DLR), 51147, Köln, Germany
| | - Achim Streit
- Steinbuch Centre for Computing (SCC), Karlsruhe Institute of Technology, 76344, Eggenstein-Leopoldshafen, Germany
| | - Stefan Kesselheim
- Jülich Supercomputing Centre, Forschungszentrum Jülich, 52428, Jülich, Germany
- Helmholtz AI, 81675, Munich, Germany
| | - Markus Götz
- Steinbuch Centre for Computing (SCC), Karlsruhe Institute of Technology, 76344, Eggenstein-Leopoldshafen, Germany
- Helmholtz AI, 81675, Munich, Germany
| | - Alexander Schug
- Jülich Supercomputing Centre, Forschungszentrum Jülich, 52428, Jülich, Germany
- Faculty of Biology, University of Duisburg-Essen, 45117, Essen, Germany
|
44
|
Ho W, Huang H, Huang J. IFF: Identifying key residues in intrinsically disordered regions of proteins using machine learning. Protein Sci 2023; 32:e4739. [PMID: 37498545 PMCID: PMC10443345 DOI: 10.1002/pro.4739] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2023] [Revised: 06/21/2023] [Accepted: 07/25/2023] [Indexed: 07/28/2023]
Abstract
Conserved residues in protein homolog sequence alignments are structurally or functionally important. For intrinsically disordered proteins or proteins with intrinsically disordered regions (IDRs), however, alignment often fails because they lack a steric structure to constrain evolution. Although sequences vary, the physicochemical features of IDRs may be preserved to maintain function. Therefore, a method to retrieve common IDR features may help identify functionally important residues. We applied unsupervised contrastive learning to train a model with self-attention neural networks on human IDR orthologs. Parameters in the model were trained to match sequences in ortholog pairs but not in other IDRs. The trained model successfully identifies previously reported critical residues from experimental studies, especially those with an overall pattern (e.g., multiple aromatic residues or charged blocks) rather than short motifs. This predictive model can be used to identify potentially important residues in other proteins, improving our understanding of their functions. The trained model can be run directly from the Jupyter Notebook in the GitHub repository using Binder (mybinder.org). The only required input is the primary sequence. The training scripts are available on GitHub (https://github.com/allmwh/IFF). The training datasets have been deposited in an Open Science Framework repository (https://osf.io/jk29b).
Affiliation(s)
- Wen‐Lin Ho
- Institute of Biochemistry and Molecular Biology, National Yang Ming Chiao Tung University, Taipei, Taiwan
| | - Hsuan‐Cheng Huang
- Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan
| | - Jie‐rong Huang
- Institute of Biochemistry and Molecular Biology, National Yang Ming Chiao Tung University, Taipei, Taiwan
- Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan
- Department of Life Sciences and Institute of Genome Sciences, National Yang Ming Chiao Tung University, Taipei, Taiwan
|
45
|
Hoshal BD, Holmes CM, Bojanek K, Salisbury J, Berry MJ, Marre O, Palmer SE. Stimulus invariant aspects of the retinal code drive discriminability of natural scenes. bioRxiv 2023:2023.08.08.552526. [PMID: 37609259 PMCID: PMC10441377 DOI: 10.1101/2023.08.08.552526] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/24/2023]
Abstract
Everything that the brain sees must first be encoded by the retina, which maintains a reliable representation of the visual world in many different, complex natural scenes while also adapting to stimulus changes. Decomposing the population code into independent and cell-cell interactions reveals how broad scene structure is encoded in the adapted retinal output. By recording from the same retina while presenting many different natural movies, we see that the population structure, characterized by strong interactions, is consistent across both natural and synthetic stimuli. We show that these interactions contribute to encoding scene identity. We also demonstrate that this structure likely arises in part from shared bipolar cell input as well as from gap junctions between retinal ganglion cells and amacrine cells.
Affiliation(s)
- Benjamin D Hoshal
- Committee on Computational Neuroscience, University of Chicago, Chicago, Illinois 60637, USA
| | - Caroline M Holmes
- Department of Physics, Princeton University, Princeton, New Jersey, 08540
| | - Kyle Bojanek
- Department of Organismal Biology and Anatomy, University of Chicago, Chicago, Illinois 60637, USA
| | - Jared Salisbury
- Department of Organismal Biology and Anatomy, University of Chicago, Chicago, Illinois 60637, USA
| | - Michael J Berry
- Princeton Neuroscience Institute, Princeton University, Princeton, NJ 08544
| | - Olivier Marre
- Institut de la Vision, Sorbonne Université, INSERM, CNRS, Paris, France
| | - Stephanie E Palmer
- Department of Organismal Biology and Anatomy and Department of Physics, University of Chicago, Chicago, Illinois 60637, USA
- Center for the Physics of Biological Function, Princeton University, Princeton, New Jersey 08544, USA
|
46
|
Miotto M, Rosito M, Paoluzzi M, de Turris V, Folli V, Leonetti M, Ruocco G, Rosa A, Gosti G. Collective behavior and self-organization in neural rosette morphogenesis. Front Cell Dev Biol 2023; 11:1134091. [PMID: 37635866 PMCID: PMC10448396 DOI: 10.3389/fcell.2023.1134091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2022] [Accepted: 07/26/2023] [Indexed: 08/29/2023] Open
Abstract
Neural rosettes develop from the self-organization of differentiating human pluripotent stem cells. This process mimics the emergence of the embryonic central nervous system primordium, i.e., the neural tube, whose formation is under close investigation as errors during such process result in severe diseases like spina bifida and anencephaly. While neural tube formation is recognized as an example of self-organization, we still do not understand the fundamental mechanisms guiding the process. Here, we discuss the different theoretical frameworks that have been proposed to explain self-organization in morphogenesis. We show that an explanation based exclusively on stem cell differentiation cannot describe the emergence of spatial organization, and an explanation based on patterning models cannot explain how different groups of cells can collectively migrate and produce the mechanical transformations required to generate the neural tube. We conclude that neural rosette development is a relevant experimental 2D in-vitro model of morphogenesis because it is a multi-scale self-organization process that involves both cell differentiation and tissue development. Ultimately, to understand rosette formation, we first need to fully understand the complex interplay between growth, migration, cytoarchitecture organization, and cell type evolution.
Affiliation(s)
- Mattia Miotto
- Center for Life Nano and Neuro Science, Istituto Italiano di Tecnologia, Rome, Italy
- Department of Physics, Sapienza University of Rome, Rome, Italy
| | - Maria Rosito
- Center for Life Nano and Neuro Science, Istituto Italiano di Tecnologia, Rome, Italy
- Department of Physiology and Pharmacology V. Erspamer, Sapienza University of Rome, Rome, Italy
| | - Matteo Paoluzzi
- Departament de Física de la Matèria Condensada, Universitat de Barcelona, Barcelona, Spain
| | - Valeria de Turris
- Center for Life Nano and Neuro Science, Istituto Italiano di Tecnologia, Rome, Italy
| | - Viola Folli
- Center for Life Nano and Neuro Science, Istituto Italiano di Tecnologia, Rome, Italy
- D-TAILS srl, Rome, Italy
| | - Marco Leonetti
- Center for Life Nano and Neuro Science, Istituto Italiano di Tecnologia, Rome, Italy
- D-TAILS srl, Rome, Italy
- Soft and Living Matter Laboratory, Institute of Nanotechnology, Consiglio Nazionale delle Ricerche, Rome, Italy
| | - Giancarlo Ruocco
- Center for Life Nano and Neuro Science, Istituto Italiano di Tecnologia, Rome, Italy
- Department of Physics, Sapienza University of Rome, Rome, Italy
| | - Alessandro Rosa
- Center for Life Nano and Neuro Science, Istituto Italiano di Tecnologia, Rome, Italy
- Department of Biology and Biotechnologies Charles Darwin, Sapienza University of Rome, Rome, Italy
| | - Giorgio Gosti
- Center for Life Nano and Neuro Science, Istituto Italiano di Tecnologia, Rome, Italy
- Soft and Living Matter Laboratory, Institute of Nanotechnology, Consiglio Nazionale delle Ricerche, Rome, Italy
|
47
|
Ahdritz G, Bouatta N, Kadyan S, Jarosch L, Berenberg D, Fisk I, Watkins AM, Ra S, Bonneau R, AlQuraishi M. OpenProteinSet: Training data for structural biology at scale. arXiv 2023:arXiv:2308.05326v1. [PMID: 37608940 PMCID: PMC10441447] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/24/2023]
Abstract
Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.
Affiliation(s)
- Nazim Bouatta
  - Laboratory of Systems Pharmacology, Harvard Medical School
- Daniel Berenberg
  - Prescient Design, Genentech & Department of Computer Science, New York University

48
Shome S, Jia K, Sivasankar S, Jernigan RL. Characterizing interactions in E-cadherin assemblages. Biophys J 2023; 122:3069-3077. [PMID: 37345249 PMCID: PMC10432173 DOI: 10.1016/j.bpj.2023.06.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2022] [Revised: 09/26/2022] [Accepted: 06/14/2023] [Indexed: 06/23/2023] Open
Abstract
Cadherin intermolecular interactions are critical for cell-cell adhesion and play essential roles in tissue formation and the maintenance of tissue structures. In this study, we focus on E-cadherin, a classical cadherin that connects epithelial cells, to understand how they interact in cis and trans conformations when attached to the same cell or opposing cells. We employ coevolutionary sequence analysis and molecular dynamics simulations to confirm previously known interaction sites as well as to identify new interaction sites. The sequence coevolutionary results yield a surprising result indicating that there are no strongly favored intermolecular interaction sites, which is unusual and suggests that many interaction sites may be possible, with none being strongly preferred over others. By using molecular dynamics, we test the persistence of these interactions and how they facilitate adhesion. We build several types of cadherin assemblages, with different numbers and combinations of cis and trans interfaces to understand how these conformations act to facilitate adhesion. Our results suggest that, in addition to the established interaction sites on the EC1 and EC2 domains, an additional plausible cis interface at the EC3-EC5 domain exists. Furthermore, we identify specific mutations at cis/trans binding sites that impair adhesion within E-cadherin assemblages.
Affiliation(s)
- Sayane Shome
  - Roy J. Carver Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, Iowa
- Kejue Jia
  - Roy J. Carver Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, Iowa
- Sanjeevi Sivasankar
  - Department of Biomedical Engineering, University of California, Davis, Davis, California
- Robert L Jernigan
  - Roy J. Carver Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, Iowa

49
Madani A, Krause B, Greene ER, Subramanian S, Mohr BP, Holton JM, Olmos JL, Xiong C, Sun ZZ, Socher R, Fraser JS, Naik N. Large language models generate functional protein sequences across diverse families. Nat Biotechnol 2023; 41:1099-1106. [PMID: 36702895 PMCID: PMC10400306 DOI: 10.1038/s41587-022-01618-2] [Citation(s) in RCA: 177] [Impact Index Per Article: 177.0] [Received: 07/12/2022] [Accepted: 11/17/2022] [Indexed: 01/27/2023]
Abstract
Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed similar catalytic efficiencies as natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
Affiliation(s)
- Ali Madani
  - Salesforce Research, Palo Alto, CA, USA
  - Profluent Bio, San Francisco, CA, USA
- Eric R Greene
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA
- Subu Subramanian
  - Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA, USA
  - Howard Hughes Medical Institute, University of California, Berkeley, Berkeley, CA, USA
- James M Holton
  - Molecular Biophysics and Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
  - Stanford Synchrotron Radiation Lightsource, SLAC National Accelerator Laboratory, Menlo Park, CA, USA
  - Department of Biochemistry and Biophysics, University of California, San Francisco, San Francisco, CA, USA
- Jose Luis Olmos
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA
- James S Fraser
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA

50
Kloucek MB, Machon T, Kajimura S, Royall CP, Masuda N, Turci F. Biases in inverse Ising estimates of near-critical behavior. Phys Rev E 2023; 108:014109. [PMID: 37583208 DOI: 10.1103/physreve.108.014109] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/19/2023] [Accepted: 04/27/2023] [Indexed: 08/17/2023]
Abstract
Inverse Ising inference allows pairwise interactions of complex binary systems to be reconstructed from empirical correlations. Typical estimators used for this inference, such as pseudo-likelihood maximization (PLM), are biased. Using the Sherrington-Kirkpatrick model as a benchmark, we show that these biases are large in critical regimes close to phase boundaries, and they may alter the qualitative interpretation of the inferred model. In particular, we show that the small-sample bias causes models inferred through PLM to appear closer to criticality than one would expect from the data. Data-driven methods to correct this bias are explored and applied to a functional magnetic resonance imaging data set from neuroscience. Our results indicate that additional care should be taken when attributing criticality to real-world data sets.
Affiliation(s)
- Maximilian B Kloucek
  - School of Physics, HH Wills Physics Laboratory, University of Bristol, Tyndall Avenue, Bristol BS8 1TL, United Kingdom
  - Bristol Centre for Functional Nanomaterials, HH Wills Physics Laboratory, University of Bristol, Tyndall Avenue, Bristol BS8 1TL, United Kingdom
- Thomas Machon
  - School of Physics, HH Wills Physics Laboratory, University of Bristol, Tyndall Avenue, Bristol BS8 1TL, United Kingdom
- Shogo Kajimura
  - Faculty of Information and Human Sciences, Kyoto Institute of Technology, Kyoto 606-8585, Japan
- C Patrick Royall
  - Gulliver UMR CNRS 7083, ESPCI Paris, Université PSL, 75005 Paris, France
- Naoki Masuda
  - Department of Mathematics, State University of New York at Buffalo, Buffalo, New York 14260-2900, USA
  - Computational and Data-Enabled Science and Engineering Program, State University of New York at Buffalo, Buffalo, New York 14260-5030, USA
- Francesco Turci
  - School of Physics, HH Wills Physics Laboratory, University of Bristol, Tyndall Avenue, Bristol BS8 1TL, United Kingdom