1
|
Martin J, Lequerica Mateos M, Onuchic JN, Coluzza I, Morcos F. Machine learning in biological physics: From biomolecular prediction to design. Proc Natl Acad Sci U S A 2024; 121:e2311807121. [PMID: 38913893 DOI: 10.1073/pnas.2311807121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/26/2024] Open
Abstract
Machine learning has been proposed as an alternative to theoretical modeling when dealing with complex problems in biological physics. However, in this perspective, we argue that a more successful approach is a proper combination of these two methodologies. We discuss how ideas coming from physical modeling neuronal processing led to early formulations of computational neural networks, e.g., Hopfield networks. We then show how modern learning approaches like Potts models, Boltzmann machines, and the transformer architecture are related to each other, specifically, through a shared energy representation. We summarize recent efforts to establish these connections and provide examples on how each of these formulations integrating physical modeling and machine learning have been successful in tackling recent problems in biomolecular structure, dynamics, function, evolution, and design. Instances include protein structure prediction; improvement in computational complexity and accuracy of molecular dynamics simulations; better inference of the effects of mutations in proteins leading to improved evolutionary modeling and finally how machine learning is revolutionizing protein engineering and design. Going beyond naturally existing protein sequences, a connection to protein design is discussed where synthetic sequences are able to fold to naturally occurring motifs driven by a model rooted in physical principles. We show that this model is "learnable" and propose its future use in the generation of unique sequences that can fold into a target structure.
Collapse
Affiliation(s)
- Jonathan Martin
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
| | - Marcos Lequerica Mateos
- BCMaterials, Basque Center for Materials, Applications and Nanostructures, Universidad del País Vasco/Euskal Herriko Unibertsitatea Science Park, Leioa 48940, Spain
| | - José N Onuchic
- Center for Theoretical Biological Physics, Rice University, Houston, TX 77005
- Department of Physics and Astronomy, Rice University, Houston, TX 77005
- Department of Chemistry, Rice University, Houston, TX 77005
- Department of BioSciences, Rice University, Houston, TX 77005
| | - Ivan Coluzza
- BCMaterials, Basque Center for Materials, Applications and Nanostructures, Universidad del País Vasco/Euskal Herriko Unibertsitatea Science Park, Leioa 48940, Spain
- Basque Foundation for Science, Ikerbasque, Bilbao 48940, Spain
| | - Faruck Morcos
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX 75080
- Department of Bioengineering, Center for Systems Biology, University of Texas at Dallas, Richardson, TX 75080
| |
Collapse
|
2
|
Lupo U, Sgarbossa D, Bitbol AF. Pairing interacting protein sequences using masked language modeling. Proc Natl Acad Sci U S A 2024; 121:e2311887121. [PMID: 38913900 DOI: 10.1073/pnas.2311887121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/12/2023] [Accepted: 12/18/2023] [Indexed: 06/26/2024] Open
Abstract
Predicting which proteins interact together from amino acid sequences is an important task. We develop a method to pair interacting protein sequences which leverages the power of protein language models trained on multiple sequence alignments (MSAs), such as MSA Transformer and the EvoFormer module of AlphaFold. We formulate the problem of pairing interacting partners among the paralogs of two protein families in a differentiable way. We introduce a method called Differentiable Pairing using Alignment-based Language Models (DiffPALM) that solves it by exploiting the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context. MSA Transformer encodes coevolution between functionally or structurally coupled amino acids within protein chains. It also captures inter-chain coevolution, despite being trained on single-chain data. Relying on MSA Transformer without fine-tuning, DiffPALM outperforms existing coevolution-based pairing methods on difficult benchmarks of shallow multiple sequence alignments extracted from ubiquitous prokaryotic protein datasets. It also outperforms an alternative method based on a state-of-the-art protein language model trained on single sequences. Paired alignments of interacting protein sequences are a crucial ingredient of supervised deep learning methods to predict the three-dimensional structure of protein complexes. Starting from sequences paired by DiffPALM substantially improves the structure prediction of some eukaryotic protein complexes by AlphaFold-Multimer. It also achieves competitive performance with using orthology-based pairing.
Collapse
Affiliation(s)
- Umberto Lupo
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne CH-1015, Switzerland
| | - Damiano Sgarbossa
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne CH-1015, Switzerland
| | - Anne-Florence Bitbol
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne CH-1015, Switzerland
| |
Collapse
|
3
|
Basu S, Subedi U, Tonelli M, Afshinpour M, Tiwari N, Fuentes EJ, Chakravarty S. Assessing the functional roles of coevolving PHD finger residues. Protein Sci 2024; 33:e5065. [PMID: 38923615 PMCID: PMC11201814 DOI: 10.1002/pro.5065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2023] [Revised: 04/21/2024] [Accepted: 05/16/2024] [Indexed: 06/28/2024]
Abstract
Although in silico folding based on coevolving residue constraints in the deep-learning era has transformed protein structure prediction, the contributions of coevolving residues to protein folding, stability, and other functions in physical contexts remain to be clarified and experimentally validated. Herein, the PHD finger module, a well-known histone reader with distinct subtypes containing subtype-specific coevolving residues, was used as a model to experimentally assess the contributions of coevolving residues and to clarify their specific roles. The results of the assessment, including proteolysis and thermal unfolding of wildtype and mutant proteins, suggested that coevolving residues have varying contributions, despite their large in silico constraints. Residue positions with large constraints were found to contribute to stability in one subtype but not others. Computational sequence design and generative model-based energy estimates of individual structures were also implemented to complement the experimental assessment. Sequence design and energy estimates distinguish coevolving residues that contribute to folding from those that do not. The results of proteolytic analysis of mutations at positions contributing to folding were consistent with those suggested by sequence design and energy estimation. Thus, we report a comprehensive assessment of the contributions of coevolving residues, as well as a strategy based on a combination of approaches that should enable detailed understanding of the residue contributions in other large protein families.
Collapse
Affiliation(s)
- Shraddha Basu
- Department of Chemistry & BiochemistrySouth Dakota State UniversityBrookingsSouth DakotaUSA
| | - Ujwal Subedi
- Department of Chemistry & BiochemistrySouth Dakota State UniversityBrookingsSouth DakotaUSA
| | - Marco Tonelli
- National Magnetic Resonance Facility at Madison (NMRFAM), University of Wisconsin‐MadisonMadisonWisconsinUSA
| | - Maral Afshinpour
- Department of Chemistry & BiochemistrySouth Dakota State UniversityBrookingsSouth DakotaUSA
| | - Nitija Tiwari
- Department of Biochemistry & Molecular BiologyUniversity of IowaIowa CityIowaUSA
| | - Ernesto J. Fuentes
- Department of Biochemistry & Molecular BiologyUniversity of IowaIowa CityIowaUSA
| | - Suvobrata Chakravarty
- Department of Chemistry & BiochemistrySouth Dakota State UniversityBrookingsSouth DakotaUSA
| |
Collapse
|
4
|
Sela M, Church JR, Schapiro I, Schneidman-Duhovny D. RhoMax: Computational Prediction of Rhodopsin Absorption Maxima Using Geometric Deep Learning. J Chem Inf Model 2024; 64:4630-4639. [PMID: 38829021 PMCID: PMC11200256 DOI: 10.1021/acs.jcim.4c00467] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2024] [Revised: 05/15/2024] [Accepted: 05/17/2024] [Indexed: 06/05/2024]
Abstract
Microbial rhodopsins (MRs) are a diverse and abundant family of photoactive membrane proteins that serve as model systems for biophysical techniques. Optogenetics utilizes genetic engineering to insert specialized proteins into specific neurons or brain regions, allowing for manipulation of their activity through light and enabling the mapping and control of specific brain areas in living organisms. The obstacle of optogenetics lies in the fact that light has a limited ability to penetrate biological tissues, particularly blue light in the visible spectrum. Despite this challenge, most optogenetic systems rely on blue light due to the scarcity of red-shifted opsins. Finding additional red-shifted rhodopsins would represent a major breakthrough in overcoming the challenge of limited light penetration in optogenetics. However, determining the wavelength absorption maxima for rhodopsins based on their protein sequence is a significant hurdle. Current experimental methods are time-consuming, while computational methods lack accuracy. The paper introduces a new computational approach called RhoMax that utilizes structure-based geometric deep learning to predict the absorption wavelength of rhodopsins solely based on their sequences. The method takes advantage of AlphaFold2 for accurate modeling of rhodopsin structures. Once trained on a balanced train set, RhoMax rapidly and precisely predicted the maximum absorption wavelength of more than half of the sequences in our test set with an accuracy of 0.03 eV. By leveraging computational methods for absorption maxima determination, we can drastically reduce the time needed for designing new red-shifted microbial rhodopsins, thereby facilitating advances in the field of optogenetics.
Collapse
Affiliation(s)
- Meitar Sela
- The
Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel
| | - Jonathan R. Church
- Fritz
Haber Center for Molecular Dynamics Research, Institute of Chemistry, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel
| | - Igor Schapiro
- Fritz
Haber Center for Molecular Dynamics Research, Institute of Chemistry, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel
| | - Dina Schneidman-Duhovny
- The
Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel
| |
Collapse
|
5
|
Fram B, Su Y, Truebridge I, Riesselman AJ, Ingraham JB, Passera A, Napier E, Thadani NN, Lim S, Roberts K, Kaur G, Stiffler MA, Marks DS, Bahl CD, Khan AR, Sander C, Gauthier NP. Simultaneous enhancement of multiple functional properties using evolution-informed protein design. Nat Commun 2024; 15:5141. [PMID: 38902262 PMCID: PMC11190266 DOI: 10.1038/s41467-024-49119-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2023] [Accepted: 05/24/2024] [Indexed: 06/22/2024] Open
Abstract
A major challenge in protein design is to augment existing functional proteins with multiple property enhancements. Altering several properties likely necessitates numerous primary sequence changes, and novel methods are needed to accurately predict combinations of mutations that maintain or enhance function. Models of sequence co-variation (e.g., EVcouplings), which leverage extensive information about various protein properties and activities from homologous protein sequences, have proven effective for many applications including structure determination and mutation effect prediction. We apply EVcouplings to computationally design variants of the model protein TEM-1 β-lactamase. Nearly all the 14 experimentally characterized designs were functional, including one with 84 mutations from the nearest natural homolog. The designs also had large increases in thermostability, increased activity on multiple substrates, and nearly identical structure to the wild type enzyme. This study highlights the efficacy of evolutionary models in guiding large sequence alterations to generate functional diversity for protein design applications.
Collapse
Affiliation(s)
- Benjamin Fram
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA.
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA.
| | - Yang Su
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
| | - Ian Truebridge
- Institute for Protein Innovation, Boston, MA, USA
- Division of Hematology/Oncology, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
- AI Proteins, Boston, MA, USA
| | - Adam J Riesselman
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Program in Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - John B Ingraham
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
| | - Alessandro Passera
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Research Institute of Molecular Pathology (IMP), Vienna BioCenter (VBC), Campus-Vienna-Biocenter 1, 1030, Vienna, Austria
| | - Eve Napier
- School of Biochemistry and Immunology, Trinity College Dublin, Dublin 2, Ireland
| | - Nicole N Thadani
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Apriori Bio, Cambridge, MA, USA
| | - Samuel Lim
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
| | - Kristen Roberts
- Selux Diagnostics Inc., 56 Roland Street, Charlestown, MA, USA
| | - Gurleen Kaur
- Selux Diagnostics Inc., 56 Roland Street, Charlestown, MA, USA
| | - Michael A Stiffler
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Dyno Therapeutics, 343 Arsenal Street, Watertown, MA, USA
| | - Debora S Marks
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Christopher D Bahl
- Institute for Protein Innovation, Boston, MA, USA
- Division of Hematology/Oncology, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
- AI Proteins, Boston, MA, USA
| | - Amir R Khan
- School of Biochemistry and Immunology, Trinity College Dublin, Dublin 2, Ireland
- Division of Newborn Medicine, Boston Children's Hospital, Boston, MA, USA
| | - Chris Sander
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Nicholas P Gauthier
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA.
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA.
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.
| |
Collapse
|
6
|
Calvanese F, Lambert CN, Nghe P, Zamponi F, Weigt M. Towards parsimonious generative modeling of RNA families. Nucleic Acids Res 2024; 52:5465-5477. [PMID: 38661206 PMCID: PMC11162787 DOI: 10.1093/nar/gkae289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2023] [Revised: 03/05/2024] [Accepted: 04/05/2024] [Indexed: 04/26/2024] Open
Abstract
Generative probabilistic models emerge as a new paradigm in data-driven, evolution-informed design of biomolecular sequences. This paper introduces a novel approach, called Edge Activation Direct Coupling Analysis (eaDCA), tailored to the characteristics of RNA sequences, with a strong emphasis on simplicity, efficiency, and interpretability. eaDCA explicitly constructs sparse coevolutionary models for RNA families, achieving performance levels comparable to more complex methods while utilizing a significantly lower number of parameters. Our approach demonstrates efficiency in generating artificial RNA sequences that closely resemble their natural counterparts in both statistical analyses and SHAPE-MaP experiments, and in predicting the effect of mutations. Notably, eaDCA provides a unique feature: estimating the number of potential functional sequences within a given RNA family. For example, in the case of cyclic di-AMP riboswitches (RF00379), our analysis suggests the existence of approximately 1039 functional nucleotide sequences. While huge compared to the known <4000 natural sequences, this number represents only a tiny fraction of the vast pool of nearly 1082 possible nucleotide sequences of the same length (136 nucleotides). These results underscore the promise of sparse and interpretable generative models, such as eaDCA, in enhancing our understanding of the expansive RNA sequence space.
Collapse
Affiliation(s)
- Francesco Calvanese
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative – LCQB, Paris, France
- Laboratoire de Biophysique et Evolution, UMR CNRS-ESPCI 8231 Chimie Biologie Innovation, PSL University, Paris, France
| | - Camille N Lambert
- Laboratoire de Biophysique et Evolution, UMR CNRS-ESPCI 8231 Chimie Biologie Innovation, PSL University, Paris, France
| | - Philippe Nghe
- Laboratoire de Biophysique et Evolution, UMR CNRS-ESPCI 8231 Chimie Biologie Innovation, PSL University, Paris, France
| | - Francesco Zamponi
- Dipartimento di Fisica, Sapienza Università di Roma, Rome, Italy
- Laboratoire de Physique de l’Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, Paris, France
| | - Martin Weigt
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative – LCQB, Paris, France
| |
Collapse
|
7
|
Zhao H, Petrey D, Murray D, Honig B. ZEPPI: Proteome-scale sequence-based evaluation of protein-protein interaction models. Proc Natl Acad Sci U S A 2024; 121:e2400260121. [PMID: 38743624 PMCID: PMC11127014 DOI: 10.1073/pnas.2400260121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2024] [Accepted: 04/18/2024] [Indexed: 05/16/2024] Open
Abstract
We introduce ZEPPI (Z-score Evaluation of Protein-Protein Interfaces), a framework to evaluate structural models of a complex based on sequence coevolution and conservation involving residues in protein-protein interfaces. The ZEPPI score is calculated by comparing metrics for an interface to those obtained from randomly chosen residues. Since contacting residues are defined by the structural model, this obviates the need to account for indirect interactions. Further, although ZEPPI relies on species-paired multiple sequence alignments, its focus on interfacial residues allows it to leverage quite shallow alignments. ZEPPI can be implemented on a proteome-wide scale and is applied here to millions of structural models of dimeric complexes in the Escherichia coli and human interactomes found in the PrePPI database. PrePPI's scoring function is based primarily on the evaluation of protein-protein interfaces, and ZEPPI adds a new feature to this analysis through the incorporation of evolutionary information. ZEPPI performance is evaluated through applications to experimentally determined complexes and to decoys from the CASP-CAPRI experiment. As we discuss, the standard CAPRI scores used to evaluate docking models are based on model quality and not on the ability to give yes/no answers as to whether two proteins interact. ZEPPI is able to detect weak signals from PPI models that the CAPRI scores define as incorrect and, similarly, to identify potential PPIs defined as low confidence by the current PrePPI scoring function. A number of examples that illustrate how the combination of PrePPI and ZEPPI can yield functional hypotheses are provided.
Collapse
Affiliation(s)
- Haiqing Zhao
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY10032
| | - Donald Petrey
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY10032
| | - Diana Murray
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY10032
| | - Barry Honig
- Department of Systems Biology, Columbia University Irving Medical Center, New York, NY10032
- Department of Biochemistry and Molecular Biophysics, Columbia University Irving Medical Center, New York, NY10032
- Department of Medicine, Columbia University, New York, NY10032
- Zuckerman Institute, Columbia University, New York, NY10027
| |
Collapse
|
8
|
Ahdritz G, Bouatta N, Floristean C, Kadyan S, Xia Q, Gerecke W, O'Donnell TJ, Berenberg D, Fisk I, Zanichelli N, Zhang B, Nowaczynski A, Wang B, Stepniewska-Dziubinska MM, Zhang S, Ojewole A, Guney ME, Biderman S, Watkins AM, Ra S, Lorenzo PR, Nivon L, Weitzner B, Ban YEA, Chen S, Zhang M, Li C, Song SL, He Y, Sorger PK, Mostaque E, Zhang Z, Bonneau R, AlQuraishi M. OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Nat Methods 2024:10.1038/s41592-024-02272-z. [PMID: 38744917 DOI: 10.1038/s41592-024-02272-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2023] [Accepted: 04/03/2024] [Indexed: 05/16/2024]
Abstract
AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (1) tackle new tasks, like protein-ligand complex structure prediction, (2) investigate the process by which the model learns and (3) assess the model's capacity to generalize to unseen regions of fold space. Here we report OpenFold, a fast, memory efficient and trainable implementation of AlphaFold2. We train OpenFold from scratch, matching the accuracy of AlphaFold2. Having established parity, we find that OpenFold is remarkably robust at generalizing even when the size and diversity of its training set is deliberately limited, including near-complete elisions of classes of secondary structure elements. By analyzing intermediate structures produced during training, we also gain insights into the hierarchical manner in which OpenFold learns to fold. In sum, our studies demonstrate the power and utility of OpenFold, which we believe will prove to be a crucial resource for the protein modeling community.
Collapse
Affiliation(s)
- Gustaf Ahdritz
- Department of Systems Biology, Columbia University, New York, NY, USA
- Harvard University, Cambridge, MA, USA
| | - Nazim Bouatta
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA.
| | | | - Sachin Kadyan
- Department of Systems Biology, Columbia University, New York, NY, USA
| | - Qinghui Xia
- Department of Systems Biology, Columbia University, New York, NY, USA
| | - William Gerecke
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA
| | | | - Daniel Berenberg
- Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
| | - Ian Fisk
- Flatiron Institute, New York, NY, USA
| | | | - Bo Zhang
- Scientific Computing and Imaging Institute, University of Utah, Salt Lake City, UT, USA
| | | | | | | | | | | | | | - Stella Biderman
- EleutherAI, New York, NY, USA
- Booz Allen Hamilton, McLean, VA, USA
| | | | - Stephen Ra
- Prescient Design, Genentech, New York, NY, USA
| | | | | | | | | | | | - Minjia Zhang
- University of Illinois at Urbana-Champaign, Champaign, IL, USA
| | | | | | | | - Peter K Sorger
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA
| | | | - Zhao Zhang
- Rutgers University, New Brunswick, NJ, USA
| | | | | |
Collapse
|
9
|
Zheng W. Predicting hotspots for disease-causing single nucleotide variants using sequences-based coevolution, network analysis, and machine learning. PLoS One 2024; 19:e0302504. [PMID: 38743747 PMCID: PMC11093321 DOI: 10.1371/journal.pone.0302504] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2023] [Accepted: 04/05/2024] [Indexed: 05/16/2024] Open
Abstract
To enable personalized medicine, it is important yet highly challenging to accurately predict disease-causing mutations in target proteins at high throughput. Previous computational methods have been developed using evolutionary information in combination with various biochemical and structural features of protein residues to discriminate neutral vs. deleterious mutations. However, the power of these methods is often limited because they either assume known protein structures or treat residues independently without fully considering their interactions. To address the above limitations, we build upon recent progress in machine learning, network analysis, and protein language models, and develop a sequences-based variant site prediction workflow based on the protein residue contact networks: 1. We employ and integrate various methods of building protein residue networks using state-of-the-art coevolution analysis tools (RaptorX, DeepMetaPSICOV, and SPOT-Contact) powered by deep learning. 2. We use machine learning algorithms (Random Forest, Gradient Boosting, and Extreme Gradient Boosting) to optimally combine 20 network centrality scores to jointly predict key residues as hot spots for disease mutations. 3. Using a dataset of 107 proteins rich in disease mutations, we rigorously evaluate the network scores individually and collectively (via machine learning). This work supports a promising strategy of combining an ensemble of network scores based on different coevolution analysis methods (and optionally predictive scores from other methods) via machine learning to predict hotspot sites of disease mutations, which will inform downstream applications of disease diagnosis and targeted drug design.
Collapse
Affiliation(s)
- Wenjun Zheng
- Department of Physics, State University of New York at Buffalo, Buffalo, NY, United States of America
| |
Collapse
|
10
|
Posfai A, Zhou J, McCandlish DM, Kinney JB. Gauge fixing for sequence-function relationships. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.05.12.593772. [PMID: 38798671 PMCID: PMC11118547 DOI: 10.1101/2024.05.12.593772] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/29/2024]
Abstract
Quantitative models of sequence-function relationships are ubiquitous in computational biology, e.g., for modeling the DNA binding of transcription factors or the fitness landscapes of proteins. Interpreting these models, however, is complicated by the fact that the values of model parameters can often be changed without affecting model predictions. Before the values of model parameters can be meaningfully interpreted, one must remove these degrees of freedom (called "gauge freedoms" in physics) by imposing additional constraints (a process called "fixing the gauge"). However, strategies for fixing the gauge of sequence-function relationships have received little attention. Here we derive an analytically tractable family of gauges for a large class of sequence-function relationships. These gauges are derived in the context of models with all-order interactions, but an important subset of these gauges can be applied to diverse types of models, including additive models, pairwise-interaction models, and models with higher-order interactions. Many commonly used gauges are special cases of gauges within this family. We demonstrate the utility of this family of gauges by showing how different choices of gauge can be used both to explore complex activity landscapes and to reveal simplified models that are approximately correct within localized regions of sequence space. The results provide practical gauge-fixing strategies and demonstrate the utility of gauge-fixing for model exploration and interpretation.
Collapse
Affiliation(s)
- Anna Posfai
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
| | - Juannan Zhou
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
- Department of Biology, University of Florida, Gainesville, FL, 32611
| | - David M. McCandlish
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
| | - Justin B. Kinney
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
| |
Collapse
|
11
|
Posfai A, McCandlish DM, Kinney JB. Symmetry, gauge freedoms, and the interpretability of sequence-function relationships. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.05.12.593774. [PMID: 38798625 PMCID: PMC11118426 DOI: 10.1101/2024.05.12.593774] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/29/2024]
Abstract
Quantitative models of sequence-function relationships, which describe how biological sequences encode functional activities, are ubiquitous in modern biology. One important aspect of these models is that they commonly exhibit gauge freedoms, i.e., directions in parameter space that do not affect model predictions. In physics, gauge freedoms arise when physical theories are formulated in ways that respect fundamental symmetries. However, the connections that gauge freedoms in models of sequence-function relationships have to the symmetries of sequence space have yet to be systematically studied. Here we study the gauge freedoms of models that respect a specific symmetry of sequence space: the group of position-specific character permutations. We find that gauge freedoms arise when the transformations of model parameters that compensate for these symmetry transformations are described by redundant irreducible matrix representations. Based on this finding, we describe an "embedding distillation" procedure that enables analytic calculation of the dimension of the space of gauge freedoms, as well as efficient computation of a sparse basis for this space. Finally, we show that the ability to interpret model parameters as quantifying allelic effects places strong constraints on the form that models can take, and in particular show that all nontrivial equivariant models of allelic effects must exhibit gauge freedoms. Our work thus advances the understanding of the relationship between symmetries and gauge freedoms in quantitative models of sequence-function relationships.
Collapse
Affiliation(s)
- Anna Posfai
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
| | - David M. McCandlish
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
| | - Justin B. Kinney
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
| |
Collapse
|
12
|
Pan Z, Theesfeld CL. Deciphering missense coding variants with AlphaMissense. Kidney Int 2024:S0085-2538(24)00190-X. [PMID: 38647510 DOI: 10.1016/j.kint.2024.02.022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/02/2024] [Accepted: 02/12/2024] [Indexed: 04/25/2024]
Affiliation(s)
- Zhicheng Pan
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, New York, USA
| | - Chandra L Theesfeld
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA.
| |
Collapse
|
13
|
Zhang J, Durham J, Qian Cong. Revolutionizing protein-protein interaction prediction with deep learning. Curr Opin Struct Biol 2024; 85:102775. [PMID: 38330793 DOI: 10.1016/j.sbi.2024.102775] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2023] [Revised: 12/31/2023] [Accepted: 01/05/2024] [Indexed: 02/10/2024]
Abstract
Protein-protein interactions (PPIs) are pivotal for driving diverse biological processes, and any disturbance in these interactions can lead to disease. Thus, the study of PPIs has been a central focus in biology. Recent developments in deep learning methods, coupled with the vast genomic sequence data, have significantly boosted the accuracy of predicting protein structures and modeling protein complexes, approaching levels comparable to experimental techniques. Herein, we review the latest advances in the computational methods for modeling 3D protein complexes and the prediction of protein interaction partners, emphasizing the application of deep learning methods deriving from coevolution analysis. The review also highlights biomedical applications of PPI prediction and outlines challenges in the field.
Collapse
Affiliation(s)
- Jing Zhang
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; HaroldC.Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA. https://twitter.com/jzhang_genome
| | - Jesse Durham
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; HaroldC.Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Qian Cong
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; HaroldC.Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA.
| |
Collapse
|
14
|
Kalogeropoulos K, Bohn MF, Jenkins DE, Ledergerber J, Sørensen CV, Hofmann N, Wade J, Fryer T, Thi Tuyet Nguyen G, Auf dem Keller U, Laustsen AH, Jenkins TP. A comparative study of protein structure prediction tools for challenging targets: Snake venom toxins. Toxicon 2024; 238:107559. [PMID: 38113945 DOI: 10.1016/j.toxicon.2023.107559] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2023] [Revised: 12/06/2023] [Accepted: 12/08/2023] [Indexed: 12/21/2023]
Abstract
Protein structure determination is a critical aspect of biological research, enabling us to understand protein function and potential applications. Recent advances in deep learning and artificial intelligence have led to the development of several protein structure prediction tools, such as AlphaFold2 and ColabFold. However, their performance has primarily been evaluated on well-characterised proteins and their ability to predict sturtctures of proteins lacking experimental structures, such as many snake venom toxins, has been less scrutinised. In this study, we evaluated three modelling tools on their prediction of over 1000 snake venom toxin structures for which no experimental structures exist. Our findings show that AlphaFold2 (AF2) performed the best across all assessed parameters. We also observed that ColabFold (CF) only scored slightly worse than AF2, while being computationally less intensive. All tools struggled with regions of intrinsic disorder, such as loops and propeptide regions, and performed well in predicting the structure of functional domains. Overall, our study highlights the importance of exercising caution when working with proteins with no experimental structures available, particularly those that are large and contain flexible regions. Nonetheless, leveraging computational structure prediction tools can provide valuable insights into the modelling of protein interactions with different targets and reveal potential binding sites, active sites, and conformational changes, as well as into the design of potential molecular binders for reagent, diagnostic, or therapeutic purposes.
Collapse
Affiliation(s)
| | - Markus-Frederik Bohn
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
| | | | - Jann Ledergerber
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark; Department of Chemistry and Applied Bioscience, ETH Zurich, Zurich, Switzerland
| | - Christoffer V Sørensen
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Nils Hofmann
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Jack Wade
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Thomas Fryer
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Giang Thi Tuyet Nguyen
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Ulrich Auf dem Keller
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Andreas H Laustsen
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Timothy P Jenkins
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark.
| |
Collapse
|
15
|
Pucci F, Zerihun MB, Rooman M, Schug A. pycofitness-Evaluating the fitness landscape of RNA and protein sequences. Bioinformatics 2024; 40:btae074. [PMID: 38335928 PMCID: PMC10881095 DOI: 10.1093/bioinformatics/btae074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2023] [Revised: 01/25/2024] [Accepted: 02/06/2024] [Indexed: 02/12/2024] Open
Abstract
MOTIVATION The accurate prediction of how mutations change biophysical properties of proteins or RNA is a major goal in computational biology with tremendous impacts on protein design and genetic variant interpretation. Evolutionary approaches such as coevolution can help solving this issue. RESULTS We present pycofitness, a standalone Python-based software package for the in silico mutagenesis of protein and RNA sequences. It is based on coevolution and, more specifically, on a popular inverse statistical approach, namely direct coupling analysis by pseudo-likelihood maximization. Its efficient implementation and user-friendly command line interface make it an easy-to-use tool even for researchers with no bioinformatics background. To illustrate its strengths, we present three applications in which pycofitness efficiently predicts the deleteriousness of genetic variants and the effect of mutations on protein fitness and thermodynamic stability. AVAILABILITY AND IMPLEMENTATION https://github.com/KIT-MBS/pycofitness.
Collapse
Affiliation(s)
- Fabrizio Pucci
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050 Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, 1050 Brussels, Belgium
| | - Mehari B Zerihun
- John von Neumann Institute for Computing, Jülich Supercomputer Centre, 52428 Jülich, Germany
| | - Marianne Rooman
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050 Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, 1050 Brussels, Belgium
| | - Alexander Schug
- John von Neumann Institute for Computing, Jülich Supercomputer Centre, 52428 Jülich, Germany
- Department of Biology, University of Duisburg-Essen, D-45141 Essen, Germany
| |
Collapse
|
16
|
Peng CX, Liang F, Xia YH, Zhao KL, Hou MH, Zhang GJ. Recent Advances and Challenges in Protein Structure Prediction. J Chem Inf Model 2024; 64:76-95. [PMID: 38109487 DOI: 10.1021/acs.jcim.3c01324] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2023]
Abstract
Artificial intelligence has made significant advances in the field of protein structure prediction in recent years. In particular, DeepMind's end-to-end model, AlphaFold2, has demonstrated the capability to predict three-dimensional structures of numerous unknown proteins with accuracy levels comparable to those of experimental methods. This breakthrough has opened up new possibilities for understanding protein structure and function as well as accelerating drug discovery and other applications in the field of biology and medicine. Despite the remarkable achievements of artificial intelligence in the field, there are still some challenges and limitations. In this Review, we discuss the recent progress and some of the challenges in protein structure prediction. These challenges include predicting multidomain protein structures, protein complex structures, multiple conformational states of proteins, and protein folding pathways. Furthermore, we highlight directions in which further improvements can be conducted.
Collapse
Affiliation(s)
- Chun-Xiang Peng
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Fang Liang
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Yu-Hao Xia
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Kai-Long Zhao
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Ming-Hua Hou
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Gui-Jun Zhang
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| |
Collapse
|
17
|
Wayment-Steele HK, Ojoawo A, Otten R, Apitz JM, Pitsawong W, Hömberger M, Ovchinnikov S, Colwell L, Kern D. Predicting multiple conformations via sequence clustering and AlphaFold2. Nature 2024; 625:832-839. [PMID: 37956700 PMCID: PMC10808063 DOI: 10.1038/s41586-023-06832-9] [Citation(s) in RCA: 37] [Impact Index Per Article: 37.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2023] [Accepted: 11/03/2023] [Indexed: 11/15/2023]
Abstract
AlphaFold2 (ref. 1) has revolutionized structural biology by accurately predicting single structures of proteins. However, a protein's biological function often depends on multiple conformational substates2, and disease-causing point mutations often cause population changes within these substates3,4. We demonstrate that clustering a multiple-sequence alignment by sequence similarity enables AlphaFold2 to sample alternative states of known metamorphic proteins with high confidence. Using this method, named AF-Cluster, we investigated the evolutionary distribution of predicted structures for the metamorphic protein KaiB5 and found that predictions of both conformations were distributed in clusters across the KaiB family. We used nuclear magnetic resonance spectroscopy to confirm an AF-Cluster prediction: a cyanobacteria KaiB variant is stabilized in the opposite state compared with the more widely studied variant. To test AF-Cluster's sensitivity to point mutations, we designed and experimentally verified a set of three mutations predicted to flip KaiB from Rhodobacter sphaeroides from the ground to the fold-switched state. Finally, screening for alternative states in protein families without known fold switching identified a putative alternative state for the oxidoreductase Mpt53 in Mycobacterium tuberculosis. Further development of such bioinformatic methods in tandem with experiments will probably have a considerable impact on predicting protein energy landscapes, essential for illuminating biological function.
Collapse
Affiliation(s)
- Hannah K Wayment-Steele
- Department of Biochemistry, Brandeis University and Howard Hughes Medical Institute, Waltham, MA, USA
| | - Adedolapo Ojoawo
- Department of Biochemistry, Brandeis University and Howard Hughes Medical Institute, Waltham, MA, USA
| | - Renee Otten
- Department of Biochemistry, Brandeis University and Howard Hughes Medical Institute, Waltham, MA, USA
- Treeline Biosciences, Watertown, MA, USA
| | - Julia M Apitz
- Department of Biochemistry, Brandeis University and Howard Hughes Medical Institute, Waltham, MA, USA
| | - Warintra Pitsawong
- Department of Biochemistry, Brandeis University and Howard Hughes Medical Institute, Waltham, MA, USA
- Biomolecular Discovery, Relay Therapeutics, Cambridge, MA, USA
| | - Marc Hömberger
- Department of Biochemistry, Brandeis University and Howard Hughes Medical Institute, Waltham, MA, USA
- Treeline Biosciences, Watertown, MA, USA
| | | | - Lucy Colwell
- Google Research, Cambridge, MA, USA
- Cambridge University, Cambridge, UK
| | - Dorothee Kern
- Department of Biochemistry, Brandeis University and Howard Hughes Medical Institute, Waltham, MA, USA.
| |
Collapse
|
18
|
Wang J, Ye H, Li X, Lv X, Lou J, Chen Y, Yu S, Zhang L. Genome-Wide Analysis of the MADS-Box Gene Family in Hibiscus syriacus and Their Role in Floral Organ Development. Int J Mol Sci 2023; 25:406. [PMID: 38203576 PMCID: PMC10779063 DOI: 10.3390/ijms25010406] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2023] [Revised: 12/16/2023] [Accepted: 12/21/2023] [Indexed: 01/12/2024] Open
Abstract
Hibiscus syriacus belongs to the Malvaceae family, and is a plant with medicinal, edible, and greening values. MADS-box transcription factor is a large family of regulatory factors involved in a variety of biological processes in plants. Here, we performed a genome-wide characterization of MADS-box proteins in H. syriacus and investigated gene structure, phylogenetics, cis-acting elements, three-dimensional structure, gene expression, and protein interaction to identify candidate MADS-box genes that mediate petal developmental regulation in H. syriacus. A total of 163 candidate MADS-box genes were found and classified into type I (Mα, Mβ, and Mγ) and type II (MIKC and Mδ). Analysis of cis-acting elements in the promoter region showed that most elements were correlated to plant hormones. The analysis of nine HsMADS expressions of two different H. syriacus cultivars showed that they were differentially expressed between two type flowers. The analysis of protein interaction networks also indicated that MADS proteins played a crucial role in floral organ identification, inflorescence and fruit development, and flowering time. This research is the first to analyze the MADS-box family of H. syriacus and provides an important reference for further study of the biological functions of the MADS-box, especially in flower organ development.
Collapse
Affiliation(s)
- Jie Wang
- College of Landscape Architecture, Zhejiang A&F University, Hangzhou 311300, China; (J.W.); (H.Y.); (X.L.); (J.L.); (Y.C.)
| | - Heng Ye
- College of Landscape Architecture, Zhejiang A&F University, Hangzhou 311300, China; (J.W.); (H.Y.); (X.L.); (J.L.); (Y.C.)
| | - Xiaolong Li
- College of Horticulture Science, Zhejiang A&F University, Hangzhou 311300, China;
| | - Xue Lv
- College of Landscape Architecture, Zhejiang A&F University, Hangzhou 311300, China; (J.W.); (H.Y.); (X.L.); (J.L.); (Y.C.)
| | - Jiaqi Lou
- College of Landscape Architecture, Zhejiang A&F University, Hangzhou 311300, China; (J.W.); (H.Y.); (X.L.); (J.L.); (Y.C.)
| | - Yulu Chen
- College of Landscape Architecture, Zhejiang A&F University, Hangzhou 311300, China; (J.W.); (H.Y.); (X.L.); (J.L.); (Y.C.)
| | - Shuhan Yu
- College of Landscape Architecture, Zhejiang A&F University, Hangzhou 311300, China; (J.W.); (H.Y.); (X.L.); (J.L.); (Y.C.)
| | - Lu Zhang
- College of Landscape Architecture, Zhejiang A&F University, Hangzhou 311300, China; (J.W.); (H.Y.); (X.L.); (J.L.); (Y.C.)
| |
Collapse
|
19
|
Xie WJ, Warshel A. Harnessing generative AI to decode enzyme catalysis and evolution for enhanced engineering. Natl Sci Rev 2023; 10:nwad331. [PMID: 38299119 PMCID: PMC10829072 DOI: 10.1093/nsr/nwad331] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2023] [Revised: 09/27/2023] [Accepted: 10/13/2023] [Indexed: 02/02/2024] Open
Abstract
Enzymes, as paramount protein catalysts, occupy a central role in fostering remarkable progress across numerous fields. However, the intricacy of sequence-function relationships continues to obscure our grasp of enzyme behaviors and curtails our capabilities in rational enzyme engineering. Generative artificial intelligence (AI), known for its proficiency in handling intricate data distributions, holds the potential to offer novel perspectives in enzyme research. Generative models could discern elusive patterns within the vast sequence space and uncover new functional enzyme sequences. This review highlights the recent advancements in employing generative AI for enzyme sequence analysis. We delve into the impact of generative AI in predicting mutation effects on enzyme fitness, catalytic activity and stability, rationalizing the laboratory evolution of de novo enzymes, and decoding protein sequence semantics and their application in enzyme engineering. Notably, the prediction of catalytic activity and stability of enzymes using natural protein sequences serves as a vital link, indicating how enzyme catalysis shapes enzyme evolution. Overall, we foresee that the integration of generative AI into enzyme studies will remarkably enhance our knowledge of enzymes and expedite the creation of superior biocatalysts.
Collapse
Affiliation(s)
- Wen Jun Xie
- Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development, Genetics Institute, University of Florida, Gainesville, FL 32610, USA
| | - Arieh Warshel
- Department of Chemistry, University of Southern California, Los Angeles, CA 90089, USA
| |
Collapse
|
20
|
Xie WJ, Liu D, Wang X, Zhang A, Wei Q, Nandi A, Dong S, Warshel A. Enhancing luciferase activity and stability through generative modeling of natural enzyme sequences. Proc Natl Acad Sci U S A 2023; 120:e2312848120. [PMID: 37983512 PMCID: PMC10691223 DOI: 10.1073/pnas.2312848120] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2023] [Accepted: 10/09/2023] [Indexed: 11/22/2023] Open
Abstract
The availability of natural protein sequences synergized with generative AI provides new paradigms to engineer enzymes. Although active enzyme variants with numerous mutations have been designed using generative models, their performance often falls short of their wild type counterparts. Additionally, in practical applications, choosing fewer mutations that can rival the efficacy of extensive sequence alterations is usually more advantageous. Pinpointing beneficial single mutations continues to be a formidable task. In this study, using the generative maximum entropy model to analyze Renilla luciferase (RLuc) homologs, and in conjunction with biochemistry experiments, we demonstrated that natural evolutionary information could be used to predictively improve enzyme activity and stability by engineering the active center and protein scaffold, respectively. The success rate to improve either luciferase activity or stability of designed single mutants is ~50%. This finding highlights nature's ingenious approach to evolving proficient enzymes, wherein diverse evolutionary pressures are preferentially applied to distinct regions of the enzyme, ultimately culminating in an overall high performance. We also reveal an evolutionary preference in RLuc toward emitting blue light that holds advantages in terms of water penetration compared to other light spectra. Taken together, our approach facilitates navigation through enzyme sequence space and offers effective strategies for computer-aided rational enzyme engineering.
Collapse
Affiliation(s)
- Wen Jun Xie
- Department of Chemistry, University of Southern California, Los Angeles, CA90089
- Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development, Genetics Institute, University of Florida, Gainesville, FL32610
| | - Dangliang Liu
- State Key Laboratory of Natural and Biomimetic Drugs, Chemical Biology Center, School of Pharmaceutical Sciences, Peking University, Beijing100191, China
| | - Xiaoya Wang
- State Key Laboratory of Natural and Biomimetic Drugs, Chemical Biology Center, School of Pharmaceutical Sciences, Peking University, Beijing100191, China
| | - Aoxuan Zhang
- Department of Chemistry, University of Southern California, Los Angeles, CA90089
| | - Qijia Wei
- State Key Laboratory of Natural and Biomimetic Drugs, Chemical Biology Center, School of Pharmaceutical Sciences, Peking University, Beijing100191, China
| | - Ashim Nandi
- Department of Chemistry, University of Southern California, Los Angeles, CA90089
| | - Suwei Dong
- State Key Laboratory of Natural and Biomimetic Drugs, Chemical Biology Center, School of Pharmaceutical Sciences, Peking University, Beijing100191, China
| | - Arieh Warshel
- Department of Chemistry, University of Southern California, Los Angeles, CA90089
| |
Collapse
|
21
|
Feng T, Liu J, Zhang X, Fan D, Bai Y. Protein engineering of multi-enzyme virus-like particle nanoreactors for enhanced chiral alcohol synthesis. NANOSCALE ADVANCES 2023; 5:6606-6616. [PMID: 38024302 PMCID: PMC10662152 DOI: 10.1039/d3na00515a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/12/2023] [Accepted: 10/17/2023] [Indexed: 12/01/2023]
Abstract
In the past decade, virus-like particles (VLPs) that can encapsulate single or multiple enzymes have been studied extensively as typical nanoreactors for biocatalysis in vitro, yet their catalytic efficiencies are usually inadequate for real applications. These biocatalytic nanoreactors should be engineered like their free-enzyme counterparts to improve their catalytic performance for potential applications. Herein we engineer biocatalytic VLPs for the enhanced synthesis of chiral alcohols. Different methods including directed evolution were applied to the entire bacteriophage P22 VLPs (except the coat protein), which encapsulated a carbonyl reductase from Scheffersomyces stipitis (SsCR) and a glucose dehydrogenase from Bacillus megaterium (BmGDH) in their capsids. The best variant, namely M5, showed an enhanced turnover frequency (TOF, min-1) up to 15-fold toward the majority of tested aromatic prochiral ketones, and gave up to 99% enantiomeric excess in the synthesis of chiral alcohol pharmaceutical intermediates. A comparison with the mutations of the free-enzyme counterparts showed that the same amino acid mutations led to different changes in the catalytic efficiencies of free and confined enzymes. Finally, the engineered M5 nanoreactor showed improved efficiency in the scale-up synthesis of chiral alcohols. The conversions of three substrates catalyzed by M5 were all higher than those catalyzed by the wild-type nanoreactor, demonstrating that enzyme-encapsulating VLPs can evolve to enhance their catalytic performance for potential applications.
Collapse
Affiliation(s)
- Taotao Feng
- State Key Laboratory of Bioreactor Engineering, Shanghai Collaborative Innovation Center for Biomanufacturing, East China University of Science and Technology Shanghai 200237 China
| | - Jiaxu Liu
- State Key Laboratory of Bioreactor Engineering, Shanghai Collaborative Innovation Center for Biomanufacturing, East China University of Science and Technology Shanghai 200237 China
| | - Xiaoyan Zhang
- State Key Laboratory of Bioreactor Engineering, Shanghai Collaborative Innovation Center for Biomanufacturing, East China University of Science and Technology Shanghai 200237 China
| | - Daidi Fan
- Shaanxi R&D Centre of Biomaterials and Fermentation Engineering, School of Chemical Engineering, Northwest University Xi'an Shaanxi 710069 China
| | - Yunpeng Bai
- State Key Laboratory of Bioreactor Engineering, Shanghai Collaborative Innovation Center for Biomanufacturing, East China University of Science and Technology Shanghai 200237 China
| |
Collapse
|
22
|
Fongang B, Wadop YN, Zhu Y, Wagner EJ, Kudlicki A, Rowicka M. Coevolution combined with molecular dynamics simulations provides structural and mechanistic insights into the interactions between the integrator complex subunits. Comput Struct Biotechnol J 2023; 21:5686-5697. [PMID: 38074468 PMCID: PMC10700540 DOI: 10.1016/j.csbj.2023.11.022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2023] [Revised: 11/10/2023] [Accepted: 11/10/2023] [Indexed: 01/18/2024] Open
Abstract
Finding the 3D structure of large, multi-subunit complexes is difficult, despite recent advances in cryo-EM technology, due to remaining challenges to expressing and purifying subunits. Computational approaches that predict protein-protein interactions, including Direct Coupling Analysis (DCA), represent an attractive alternative for dissecting interactions within protein complexes. However, they are readily applicable only to small proteins due to high computational complexity and a high number of false positives. To solve this problem, we proposed a modified DCA approach, a powerful tool to predict the most likely interfaces of protein complexes. Since our modified approach cannot provide structural and mechanistic details of interacting peptides, we combine it with Molecular Dynamics (MD) simulations. To illustrate this novel approach, we predict interacting domains and structural details of interactions of two Integrator complex subunits, INTS9 and INTS11. Our predictions of interacting residues of INTS9/INTS11 are highly consistent with crystallographic structure. We then expand our procedure to two complexes whose structures are not well-studied: 1) The heterodimer formed by the Cleavage and Polyadenylation Specificity Factor 100-kD (CPSF100) and 73-kD (CPSF73); 2) The heterotrimer formed by INTS4/INTS9/INTS11. Experimental data supports our predictions of interactions within these two complexes, demonstrating that combining DCA and MD simulations is a powerful approach to revealing structural insights of large protein complexes.
Collapse
Affiliation(s)
- Bernard Fongang
- Glenn Biggs Institute for Alzheimer's & Neurodegenerative Diseases, The University of Texas Health Science Center at San Antonio, San Antonio, TX, United States
- Department of Biochemistry and Structural Biology, The University of Texas Health Science Center at San Antonio, San Antonio, TX, United States
- Department of Population Health Sciences, The University of Texas Health Science Center at San Antonio, San Antonio, TX, United States
- Institute for Translational Sciences, The University of Texas Medical Branch, Galveston, TX, United States
| | - Yannick N. Wadop
- Glenn Biggs Institute for Alzheimer's & Neurodegenerative Diseases, The University of Texas Health Science Center at San Antonio, San Antonio, TX, United States
- Institute for Translational Sciences, The University of Texas Medical Branch, Galveston, TX, United States
| | - Yingjie Zhu
- Department of Biochemistry and Molecular Biology, The University of Texas Medical Branch, Galveston, TX, United States
- Institute for Translational Sciences, The University of Texas Medical Branch, Galveston, TX, United States
| | - Eric J. Wagner
- Department of Biochemistry and Molecular Biology, The University of Texas Medical Branch, Galveston, TX, United States
- Department of Biochemistry and Biophysics, The University of Rochester Medical Center, Rochester, NY, United States
- Institute for Translational Sciences, The University of Texas Medical Branch, Galveston, TX, United States
| | - Andrzej Kudlicki
- Department of Biochemistry and Molecular Biology, The University of Texas Medical Branch, Galveston, TX, United States
- Institute for Translational Sciences, The University of Texas Medical Branch, Galveston, TX, United States
- Informatics Service Center, The University of Texas Medical Branch, Galveston, TX, United States
| | - Maga Rowicka
- Department of Biochemistry and Molecular Biology, The University of Texas Medical Branch, Galveston, TX, United States
- Institute for Translational Sciences, The University of Texas Medical Branch, Galveston, TX, United States
| |
Collapse
|
23
|
Zhang H, Quadeer AA, McKay MR. Direct-acting antiviral resistance of Hepatitis C virus is promoted by epistasis. Nat Commun 2023; 14:7457. [PMID: 37978179 PMCID: PMC10656532 DOI: 10.1038/s41467-023-42550-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2023] [Accepted: 10/13/2023] [Indexed: 11/19/2023] Open
Abstract
Direct-acting antiviral agents (DAAs) provide efficacious therapeutic treatments for chronic Hepatitis C virus (HCV) infection. However, emergence of drug resistance mutations (DRMs) can greatly affect treatment outcomes and impede virological cure. While multiple DRMs have been observed for all currently used DAAs, the evolutionary determinants of such mutations are not currently well understood. Here, by considering DAAs targeting the nonstructural 3 (NS3) protein of HCV, we present results suggesting that epistasis plays an important role in the evolution of DRMs. Employing a sequence-based fitness landscape model whose predictions correlate highly with experimental data, we identify specific DRMs that are associated with strong epistatic interactions, and these are found to be enriched in multiple NS3-specific DAAs. Evolutionary modelling further supports that the identified DRMs involve compensatory mutational interactions that facilitate relatively easy escape from drug-induced selection pressures. Our results indicate that accounting for epistasis is important for designing future HCV NS3-targeting DAAs.
Collapse
Affiliation(s)
- Hang Zhang
- Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong SAR, China
| | - Ahmed Abdul Quadeer
- Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong SAR, China.
| | - Matthew R McKay
- Department of Electrical and Electronic Engineering, University of Melbourne, Melbourne, VIC, Australia.
- Department of Microbiology and Immunology, University of Melbourne, at The Peter Doherty Institute for Infection and Immunity, Melbourne, VIC, Australia.
| |
Collapse
|
24
|
Khakzad H, Igashov I, Schneuing A, Goverde C, Bronstein M, Correia B. A new age in protein design empowered by deep learning. Cell Syst 2023; 14:925-939. [PMID: 37972559 DOI: 10.1016/j.cels.2023.10.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2023] [Revised: 06/22/2023] [Accepted: 10/11/2023] [Indexed: 11/19/2023]
Abstract
The rapid progress in the field of deep learning has had a significant impact on protein design. Deep learning methods have recently produced a breakthrough in protein structure prediction, leading to the availability of high-quality models for millions of proteins. Along with novel architectures for generative modeling and sequence analysis, they have revolutionized the protein design field in the past few years remarkably by improving the accuracy and ability to identify novel protein sequences and structures. Deep neural networks can now learn and extract the fundamental features of protein structures, predict how they interact with other biomolecules, and have the potential to create new effective drugs for treating disease. As their applicability in protein design is rapidly growing, we review the recent developments and technology in deep learning methods and provide examples of their performance to generate novel functional proteins.
Collapse
Affiliation(s)
- Hamed Khakzad
- Université de Lorraine, CNRS, Inria, LORIA, 54000 Nancy, France; École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| | - Ilia Igashov
- École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| | - Arne Schneuing
- École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| | - Casper Goverde
- École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| | | | - Bruno Correia
- École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland.
| |
Collapse
|
25
|
Rix G, Williams RL, Spinner H, Hu VJ, Marks DS, Liu CC. Continuous evolution of user-defined genes at 1-million-times the genomic mutation rate. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.11.13.566922. [PMID: 38014077 PMCID: PMC10680746 DOI: 10.1101/2023.11.13.566922] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/29/2023]
Abstract
When nature maintains or evolves a gene's function over millions of years at scale, it produces a diversity of homologous sequences whose patterns of conservation and change contain rich structural, functional, and historical information about the gene. However, natural gene diversity likely excludes vast regions of functional sequence space and includes phylogenetic and evolutionary eccentricities, limiting what information we can extract. We introduce an accessible experimental approach for compressing long-term gene evolution to laboratory timescales, allowing for the direct observation of extensive adaptation and divergence followed by inference of structural, functional, and environmental constraints for any selectable gene. To enable this approach, we developed a new orthogonal DNA replication (OrthoRep) system that durably hypermutates chosen genes at a rate of >10 -4 substitutions per base in vivo . When OrthoRep was used to evolve a conditionally essential maladapted enzyme, we obtained thousands of unique multi-mutation sequences with many pairs >60 amino acids apart (>15% divergence), revealing known and new factors influencing enzyme adaptation. The fitness of evolved sequences was not predictable by advanced machine learning models trained on natural variation. We suggest that OrthoRep supports the prospective and systematic discovery of constraints shaping gene evolution, uncovering of new regions in fitness landscapes, and general applications in biomolecular engineering.
Collapse
|
26
|
Kewalramani N, Emili A, Crovella M. State-of-the-art computational methods to predict protein-protein interactions with high accuracy and coverage. Proteomics 2023; 23:e2200292. [PMID: 37401192 DOI: 10.1002/pmic.202200292] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2023] [Revised: 05/24/2023] [Accepted: 06/09/2023] [Indexed: 07/05/2023]
Abstract
Prediction of protein-protein interactions (PPIs) commonly involves a significant computational component. Rapid recent advances in the power of computational methods for protein interaction prediction motivate a review of the state-of-the-art. We review the major approaches, organized according to the primary source of data utilized: protein sequence, protein structure, and protein co-abundance. The advent of deep learning (DL) has brought with it significant advances in interaction prediction, and we show how DL is used for each source data type. We review the literature taxonomically, present example case studies in each category, and conclude with observations about the strengths and weaknesses of machine learning methods in the context of the principal sources of data for protein interaction prediction.
Collapse
Affiliation(s)
- Neal Kewalramani
- Program in Bioinformatics, Boston University, Boston, Massachusetts, USA
| | - Andrew Emili
- OHSU Knight Cancer Institute, Portland, Oregon, USA
| | - Mark Crovella
- Department of Computer Science and Program in Bioinformatics, Boston University, Boston, Massachusetts, USA
| |
Collapse
|
27
|
Liu Z, Zhu YH, Shen LC, Xiao X, Qiu WR, Yu DJ. Integrating unsupervised language model with multi-view multiple sequence alignments for high-accuracy inter-chain contact prediction. Comput Biol Med 2023; 166:107529. [PMID: 37748220 DOI: 10.1016/j.compbiomed.2023.107529] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/01/2023] [Revised: 08/30/2023] [Accepted: 09/19/2023] [Indexed: 09/27/2023]
Abstract
Accurate identification of inter-chain contacts in the protein complex is critical to determine the corresponding 3D structures and understand the biological functions. We proposed a new deep learning method, ICCPred, to deduce the inter-chain contacts from the amino acid sequences of the protein complex. This pipeline was built on the designed deep residual network architecture, integrating the pre-trained language model with three multiple sequence alignments (MSAs) from different biological views. Experimental results on 709 non-redundant benchmarking protein complexes showed that the proposed ICCPred significantly increased inter-chain contact prediction accuracy compared to the state-of-the-art approaches. Detailed data analyses showed that the significant advantage of ICCPred lies in the utilization of pre-trained transformer language models which can effectively extract the complementary co-evolution diversity from three MSAs. Meanwhile, the designed deep residual network enhances the correlation between the co-evolution diversity and the patterns of inter-chain contacts. These results demonstrated a new avenue for high-accuracy deep-learning inter-chain contact prediction that is applicable to large-scale protein-protein interaction annotations from sequence alone.
Collapse
Affiliation(s)
- Zi Liu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing, 210094, China; Computer Department, Jingdezhen Ceramic University, Jingdezhen, 333403 , China
| | - Yi-Heng Zhu
- College of Artificial Intelligence, Nanjing Agricultural University, Nanjing, 210095 , China
| | - Long-Chen Shen
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing, 210094, China
| | - Xuan Xiao
- Computer Department, Jingdezhen Ceramic University, Jingdezhen, 333403 , China
| | - Wang-Ren Qiu
- Computer Department, Jingdezhen Ceramic University, Jingdezhen, 333403 , China.
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Xiaolingwei 200, Nanjing, 210094, China.
| |
Collapse
|
28
|
Kilian M, Bischofs IB. Co-evolution at protein-protein interfaces guides inference of stoichiometry of oligomeric protein complexes by de novo structure prediction. Mol Microbiol 2023; 120:763-782. [PMID: 37777474 DOI: 10.1111/mmi.15169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2023] [Revised: 09/10/2023] [Accepted: 09/11/2023] [Indexed: 10/02/2023]
Abstract
The quaternary structure with specific stoichiometry is pivotal to the specific function of protein complexes. However, determining the structure of many protein complexes experimentally remains a major bottleneck. Structural bioinformatics approaches, such as the deep learning algorithm Alphafold2-multimer (AF2-multimer), leverage the co-evolution of amino acids and sequence-structure relationships for accurate de novo structure and contact prediction. Pseudo-likelihood maximization direct coupling analysis (plmDCA) has been used to detect co-evolving residue pairs by statistical modeling. Here, we provide evidence that combining both methods can be used for de novo prediction of the quaternary structure and stoichiometry of a protein complex. We achieve this by augmenting the existing AF2-multimer confidence metrics with an interpretable score to identify the complex with an optimal fraction of native contacts of co-evolving residue pairs at intermolecular interfaces. We use this strategy to predict the quaternary structure and non-trivial stoichiometries of Bacillus subtilis spore germination protein complexes with unknown structures. Co-evolution at intermolecular interfaces may therefore synergize with AI-based de novo quaternary structure prediction of structurally uncharacterized bacterial protein complexes.
Collapse
Affiliation(s)
- Max Kilian
- Max-Planck-Institute for Terrestrial Microbiology, Marburg, Germany
- BioQuant Center for Quantitative Analysis of Molecular and Cellular Biosystems, Heidelberg University, Heidelberg, Germany
- Center for Molecular Biology of Heidelberg University (ZMBH), Heidelberg, Germany
| | - Ilka B Bischofs
- Max-Planck-Institute for Terrestrial Microbiology, Marburg, Germany
- BioQuant Center for Quantitative Analysis of Molecular and Cellular Biosystems, Heidelberg University, Heidelberg, Germany
- Center for Molecular Biology of Heidelberg University (ZMBH), Heidelberg, Germany
| |
Collapse
|
29
|
Zheng K, Liang Y, Paez-Espino D, Zou X, Gao C, Shao H, Sung YY, Mok WJ, Wong LL, Zhang YZ, Tian J, Chen F, Jiao N, Suttle CA, He J, McMinn A, Wang M. Identification of hidden N4-like viruses and their interactions with hosts. mSystems 2023; 8:e0019723. [PMID: 37702511 PMCID: PMC10654107 DOI: 10.1128/msystems.00197-23] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2023] [Accepted: 07/19/2023] [Indexed: 09/14/2023] Open
Abstract
IMPORTANCE The findings of this study are significant, as N4-like viruses represent a unique viral lineage with a distinct replication mechanism and a conserved core genome. This work has resulted in a comprehensive global map of the entire N4-like viral lineage, including information on their distribution in different biomes, evolutionary divergence, genomic diversity, and the potential for viral-mediated host metabolic reprogramming. As such, this work significantly contributes to our understanding of the ecological function and viral-host interactions of bacteriophages.
Collapse
Affiliation(s)
- Kaiyang Zheng
- Key Laboratory of Polar Oceanography and Global Ocean Change, Frontiers Science Center for Deep Ocean Multispheres and Earth System, College of Marine Life Sciences, Institute of Evolution and Marine Biodiversity, Ocean University of China, Qingdao, China
- UMT-OUC Joint Centre for Marine Studies, Qingdao, China
| | - Yantao Liang
- Key Laboratory of Polar Oceanography and Global Ocean Change, Frontiers Science Center for Deep Ocean Multispheres and Earth System, College of Marine Life Sciences, Institute of Evolution and Marine Biodiversity, Ocean University of China, Qingdao, China
- UMT-OUC Joint Centre for Marine Studies, Qingdao, China
| | - David Paez-Espino
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, California, USA
- Mammoth Biosciences Inc., South San Francisco, California, USA
| | - Xiao Zou
- Qingdao Central Hospital, Qingdao, China
| | - Chen Gao
- Key Laboratory of Polar Oceanography and Global Ocean Change, Frontiers Science Center for Deep Ocean Multispheres and Earth System, College of Marine Life Sciences, Institute of Evolution and Marine Biodiversity, Ocean University of China, Qingdao, China
- UMT-OUC Joint Centre for Marine Studies, Qingdao, China
| | - Hongbing Shao
- Key Laboratory of Polar Oceanography and Global Ocean Change, Frontiers Science Center for Deep Ocean Multispheres and Earth System, College of Marine Life Sciences, Institute of Evolution and Marine Biodiversity, Ocean University of China, Qingdao, China
- UMT-OUC Joint Centre for Marine Studies, Qingdao, China
| | - Yeong Yik Sung
- UMT-OUC Joint Centre for Marine Studies, Qingdao, China
- Institute of Marine Biotechnology, Universiti Malaysia Terengganu (UMT), Kuala Terengganu, Malaysia
| | - Wen Jye Mok
- UMT-OUC Joint Centre for Marine Studies, Qingdao, China
- Institute of Marine Biotechnology, Universiti Malaysia Terengganu (UMT), Kuala Terengganu, Malaysia
| | - Li Lian Wong
- UMT-OUC Joint Centre for Marine Studies, Qingdao, China
- Institute of Marine Biotechnology, Universiti Malaysia Terengganu (UMT), Kuala Terengganu, Malaysia
| | - Yu-Zhong Zhang
- Key Laboratory of Polar Oceanography and Global Ocean Change, Frontiers Science Center for Deep Ocean Multispheres and Earth System, College of Marine Life Sciences, Institute of Evolution and Marine Biodiversity, Ocean University of China, Qingdao, China
- State Key Laboratory of Microbial Technology, Marine Biotechnology Research Center, Shandong University, Qingdao, China
| | - Jiwei Tian
- Key Laboratory of Physical Oceanography, Ministry of Education, Ocean University of China, Qingdao, China
| | - Feng Chen
- Institute of Marine and Environmental Technology, University of Maryland Center for Environmental Science, Baltimore, Maryland, USA
| | - Nianzhi Jiao
- State Key Laboratory of Marine Environmental Sciences, Institute of Marine Microbes and Ecospheres, Xiamen University, Xiamen, China
| | - Curtis A. Suttle
- Department of Earth, Ocean and Atmospheric Sciences, Institute for the Oceans and Fisheries, The University of British Columbia, Vancouver, British Columbia, Canada
- Department of Microbiology and Immunology, Institute for the Oceans and Fisheries, The University of British Columbia, Vancouver, British Columbia, Canada
- Department of Botany, Institute for the Oceans and Fisheries, The University of British Columbia, Vancouver, British Columbia, Canada
| | - Jianfeng He
- SOA Key Laboratory for Polar Science, Polar Research Institute of China, Shanghai, China
| | - Andrew McMinn
- Key Laboratory of Polar Oceanography and Global Ocean Change, Frontiers Science Center for Deep Ocean Multispheres and Earth System, College of Marine Life Sciences, Institute of Evolution and Marine Biodiversity, Ocean University of China, Qingdao, China
- Institute for Marine and Antarctic Studies, University of Tasmania, Hobart, Tasmania, Australia
| | - Min Wang
- Key Laboratory of Polar Oceanography and Global Ocean Change, Frontiers Science Center for Deep Ocean Multispheres and Earth System, College of Marine Life Sciences, Institute of Evolution and Marine Biodiversity, Ocean University of China, Qingdao, China
- UMT-OUC Joint Centre for Marine Studies, Qingdao, China
- The Affiliated Hospital of Qingdao University, Qingdao, China
| |
Collapse
|
30
|
Lynn CW, Yu Q, Pang R, Bialek W, Palmer SE. Exactly solvable statistical physics models for large neuronal populations. ARXIV 2023:arXiv:2310.10860v1. [PMID: 37904743 PMCID: PMC10614989] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Subscribe] [Scholar Register] [Indexed: 11/01/2023]
Abstract
Maximum entropy methods provide a principled path connecting measurements of neural activity directly to statistical physics models, and this approach has been successful for populations of N ~ 100 neurons. As N increases in new experiments, we enter an undersampled regime where we have to choose which observables should be constrained in the maximum entropy construction. The best choice is the one that provides the greatest reduction in entropy, defining a "minimax entropy" principle. This principle becomes tractable if we restrict attention to correlations among pairs of neurons that link together into a tree; we can find the best tree efficiently, and the underlying statistical physics models are exactly solved. We use this approach to analyze experiments on N ~ 1500 neurons in the mouse hippocampus, and show that the resulting model captures the distribution of synchronous activity in the network.
Collapse
Affiliation(s)
- Christopher W. Lynn
- Initiative for the Theoretical Sciences, The Graduate Center, City University of New York, New York, NY 10016, USA
- Joseph Henry Laboratories of Physics, Princeton University, Princeton, NJ 08544, USA
- Department of Physics, Quantitative Biology Institute, and Wu Tsai Institute, Yale University, New Haven, CT 06520, USA
| | - Qiwei Yu
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Rich Pang
- Princeton Neuroscience Institute, Princeton University, Princeton, NJ 08544, USA
| | - William Bialek
- Joseph Henry Laboratories of Physics, Princeton University, Princeton, NJ 08544, USA
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
- Center for Studies in Physics and Biology, Rockefeller University, New York, NY 10065 USA
| | - Stephanie E. Palmer
- Department of Organismal Biology and Anatomy, University of Chicago, Chicago, IL 60637, USA
- Department of Physics, University of Chicago, Chicago, IL 60637, USA
| |
Collapse
|
31
|
Xie WJ, Warshel A. Harnessing Generative AI to Decode Enzyme Catalysis and Evolution for Enhanced Engineering. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.10.10.561808. [PMID: 37873334 PMCID: PMC10592750 DOI: 10.1101/2023.10.10.561808] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/25/2023]
Abstract
Enzymes, as paramount protein catalysts, occupy a central role in fostering remarkable progress across numerous fields. However, the intricacy of sequence-function relationships continues to obscure our grasp of enzyme behaviors and curtails our capabilities in rational enzyme engineering. Generative artificial intelligence (AI), known for its proficiency in handling intricate data distributions, holds the potential to offer novel perspectives in enzyme research. By applying generative models, we could discern elusive patterns within the vast sequence space and uncover new functional enzyme sequences. This review highlights the recent advancements in employing generative AI for enzyme sequence analysis. We delve into the impact of generative AI in predicting mutation effects on enzyme fitness, activity, and stability, rationalizing the laboratory evolution of de novo enzymes, decoding protein sequence semantics, and its applications in enzyme engineering. Notably, the prediction of enzyme activity and stability using natural enzyme sequences serves as a vital link, indicating how enzyme catalysis shapes enzyme evolution. Overall, we foresee that the integration of generative AI into enzyme studies will remarkably enhance our knowledge of enzymes and expedite the creation of superior biocatalysts.
Collapse
Affiliation(s)
- Wen Jun Xie
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
- Departmet of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development (CNPD3), Genetics Institute, University of Florida, Gainesville, FL, USA
| | - Arieh Warshel
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
| |
Collapse
|
32
|
Xie WJ, Liu D, Wang X, Zhang A, Wei Q, Nandi A, Dong S, Warshel A. Enhancing Luciferase Activity and Stability through Generative Modeling of Natural Enzyme Sequences. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.09.18.558367. [PMID: 37786693 PMCID: PMC10541610 DOI: 10.1101/2023.09.18.558367] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/04/2023]
Abstract
The availability of natural protein sequences synergized with generative artificial intelligence (AI) provides new paradigms to create enzymes. Although active enzyme variants with numerous mutations have been produced using generative models, their performance often falls short compared to their wild-type counterparts. Additionally, in practical applications, choosing fewer mutations that can rival the efficacy of extensive sequence alterations is usually more advantageous. Pinpointing beneficial single mutations continues to be a formidable task. In this study, using the generative maximum entropy model to analyze Renilla luciferase homologs, and in conjunction with biochemistry experiments, we demonstrated that natural evolutionary information could be used to predictively improve enzyme activity and stability by engineering the active center and protein scaffold, respectively. The success rate of designed single mutants is ~50% to improve either luciferase activity or stability. These finding highlights nature's ingenious approach to evolving proficient enzymes, wherein diverse evolutionary pressures are preferentially applied to distinct regions of the enzyme, ultimately culminating in an overall high performance. We also reveal an evolutionary preference in Renilla luciferase towards emitting blue light that holds advantages in terms of water penetration compared to other light spectrum. Taken together, our approach facilitates navigation through enzyme sequence space and offers effective strategies for computer-aided rational enzyme engineering.
Collapse
Affiliation(s)
- Wen Jun Xie
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
- Departmet of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development (CNPD3), Genetics Institute, University of Florida, Gainesville, FL, USA
| | - Dangliang Liu
- State Key Laboratory of Natural and Biomimetic Drugs, Chemical Biology Center, and School of Pharmaceutical Sciences, Peking University, Beijing, China
| | - Xiaoya Wang
- State Key Laboratory of Natural and Biomimetic Drugs, Chemical Biology Center, and School of Pharmaceutical Sciences, Peking University, Beijing, China
| | - Aoxuan Zhang
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
| | - Qijia Wei
- State Key Laboratory of Natural and Biomimetic Drugs, Chemical Biology Center, and School of Pharmaceutical Sciences, Peking University, Beijing, China
| | - Ashim Nandi
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
| | - Suwei Dong
- State Key Laboratory of Natural and Biomimetic Drugs, Chemical Biology Center, and School of Pharmaceutical Sciences, Peking University, Beijing, China
| | - Arieh Warshel
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
| |
Collapse
|
33
|
Posani L, Rizzato F, Monasson R, Cocco S. Infer global, predict local: Quantity-relevance trade-off in protein fitness predictions from sequence data. PLoS Comput Biol 2023; 19:e1011521. [PMID: 37883593 PMCID: PMC10645369 DOI: 10.1371/journal.pcbi.1011521] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2023] [Revised: 11/14/2023] [Accepted: 09/15/2023] [Indexed: 10/28/2023] Open
Abstract
Predicting the effects of mutations on protein function is an important issue in evolutionary biology and biomedical applications. Computational approaches, ranging from graphical models to deep-learning architectures, can capture the statistical properties of sequence data and predict the outcome of high-throughput mutagenesis experiments probing the fitness landscape around some wild-type protein. However, how the complexity of the models and the characteristics of the data combine to determine the predictive performance remains unclear. Here, based on a theoretical analysis of the prediction error, we propose descriptors of the sequence data, characterizing their quantity and relevance relative to the model. Our theoretical framework identifies a trade-off between these two quantities, and determines the optimal subset of data for the prediction task, showing that simple models can outperform complex ones when inferred from adequately-selected sequences. We also show how repeated subsampling of the sequence data is informative about how much epistasis in the fitness landscape is not captured by the computational model. Our approach is illustrated on several protein families, as well as on in silico solvable protein models.
Collapse
Affiliation(s)
- Lorenzo Posani
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 & PSL Research, Sorbonne Université, Paris, France
| | - Francesca Rizzato
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 & PSL Research, Sorbonne Université, Paris, France
| | - Rémi Monasson
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 & PSL Research, Sorbonne Université, Paris, France
| | - Simona Cocco
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 & PSL Research, Sorbonne Université, Paris, France
| |
Collapse
|
34
|
Wang H, Zang Y, Kang Y, Zhang J, Zhang L, Zhang S. ETLD: an encoder-transformation layer-decoder architecture for protein contact and mutation effects prediction. Brief Bioinform 2023; 24:bbad290. [PMID: 37598423 DOI: 10.1093/bib/bbad290] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2023] [Revised: 06/21/2023] [Accepted: 07/26/2023] [Indexed: 08/22/2023] Open
Abstract
The latent features extracted from the multiple sequence alignments (MSAs) of homologous protein families are useful for identifying residue-residue contacts, predicting mutation effects, shaping protein evolution, etc. Over the past three decades, a growing body of supervised and unsupervised machine learning methods have been applied to this field, yielding fruitful results. Here, we propose a novel self-supervised model, called encoder-transformation layer-decoder (ETLD) architecture, capable of capturing protein sequence latent features directly from MSAs. Compared to the typical autoencoder model, ETLD introduces a transformation layer with the ability to learn inter-site couplings, which can be used to parse out the two-dimensional residue-residue contacts map after a simple mathematical derivation or an additional supervised neural network. ETLD retains the process of encoding and decoding sequences, and the predicted probabilities of amino acids at each site can be further used to construct the mutation landscapes for mutation effects prediction, outperforming advanced models such as GEMME, DeepSequence and EVmutation in general. Overall, ETLD is a highly interpretable unsupervised model with great potential for improvement and can be further combined with supervised methods for more extensive and accurate predictions.
Collapse
Affiliation(s)
- He Wang
- MOE Key Laboratory for Nonequilibrium Synthesis and Modulation of Condensed Matter, School of Physics, Xi'an Jiaotong University, Xi'an 710049, China
| | - Yongjian Zang
- MOE Key Laboratory for Nonequilibrium Synthesis and Modulation of Condensed Matter, School of Physics, Xi'an Jiaotong University, Xi'an 710049, China
| | - Ying Kang
- MOE Key Laboratory for Nonequilibrium Synthesis and Modulation of Condensed Matter, School of Physics, Xi'an Jiaotong University, Xi'an 710049, China
| | - Jianwen Zhang
- MOE Key Laboratory for Nonequilibrium Synthesis and Modulation of Condensed Matter, School of Physics, Xi'an Jiaotong University, Xi'an 710049, China
| | - Lei Zhang
- MOE Key Laboratory for Nonequilibrium Synthesis and Modulation of Condensed Matter, School of Physics, Xi'an Jiaotong University, Xi'an 710049, China
| | - Shengli Zhang
- MOE Key Laboratory for Nonequilibrium Synthesis and Modulation of Condensed Matter, School of Physics, Xi'an Jiaotong University, Xi'an 710049, China
| |
Collapse
|
35
|
Ghoreyshi ZS, George JT. Quantitative approaches for decoding the specificity of the human T cell repertoire. Front Immunol 2023; 14:1228873. [PMID: 37781387 PMCID: PMC10539903 DOI: 10.3389/fimmu.2023.1228873] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2023] [Accepted: 08/17/2023] [Indexed: 10/03/2023] Open
Abstract
T cell receptor (TCR)-peptide-major histocompatibility complex (pMHC) interactions play a vital role in initiating immune responses against pathogens, and the specificity of TCRpMHC interactions is crucial for developing optimized therapeutic strategies. The advent of high-throughput immunological and structural evaluation of TCR and pMHC has provided an abundance of data for computational approaches that aim to predict favorable TCR-pMHC interactions. Current models are constructed using information on protein sequence, structures, or a combination of both, and utilize a variety of statistical learning-based approaches for identifying the rules governing specificity. This review examines the current theoretical, computational, and deep learning approaches for identifying TCR-pMHC recognition pairs, placing emphasis on each method's mathematical approach, predictive performance, and limitations.
Collapse
Affiliation(s)
- Zahra S. Ghoreyshi
- Department of Biomedical Engineering, Texas A&M University, College Station, TX, United States
| | - Jason T. George
- Department of Biomedical Engineering, Texas A&M University, College Station, TX, United States
- Engineering Medicine Program, Texas A&M University, Houston, TX, United States
- Center for Theoretical Biological Physics, Rice University, Houston, TX, United States
| |
Collapse
|
36
|
Herrington NB, Stein D, Li YC, Pandey G, Schlessinger A. Exploring the Druggable Conformational Space of Protein Kinases Using AI-Generated Structures. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.08.31.555779. [PMID: 37693436 PMCID: PMC10491245 DOI: 10.1101/2023.08.31.555779] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/12/2023]
Abstract
Protein kinase function and interactions with drugs are controlled in part by the movement of the DFG and ɑC-Helix motifs, which enable kinases to adopt various conformational states. Small molecule ligands elicit therapeutic effects with distinct selectivity profiles and residence times that often depend on the kinase conformation(s) they bind. However, the limited availability of experimentally determined structural data for kinases in inactive states restricts drug discovery efforts for this major protein family. Modern AI-based structural modeling methods hold potential for exploring the previously experimentally uncharted druggable conformational space for kinases. Here, we first evaluated the currently explored conformational space of kinases in the PDB and models generated by AlphaFold2 (AF2) (1) and ESMFold (2), two prominent AI-based structure prediction methods. We then investigated AF2's ability to predict kinase structures in different conformations at various multiple sequence alignment (MSA) depths, based on this parameter's ability to explore conformational diversity. Our results showed a bias within the PDB and predicted structural models generated by AF2 and ESMFold toward structures of kinases in the active state over alternative conformations, particularly those conformations controlled by the DFG motif. Finally, we demonstrate that predicting kinase structures using AF2 at lower MSA depths allows the exploration of the space of these alternative conformations, including identifying previously unobserved conformations for 398 kinases. The results of our analysis of structural modeling by AF2 create a new avenue for the pursuit of new therapeutic agents against a notoriously difficult-to-target family of proteins. Significance Statement Greater abundance of kinase structural data in inactive conformations, currently lacking in structural databases, would improve our understanding of how protein kinases function and expand drug discovery and development for this family of therapeutic targets. Modern approaches utilizing artificial intelligence and machine learning have potential for efficiently capturing novel protein conformations. We provide evidence for a bias within AlphaFold2 and ESMFold to predict structures of kinases in their active states, similar to their overrepresentation in the PDB. We show that lowering the AlphaFold2 algorithm's multiple sequence alignment depth can help explore kinase conformational space more broadly. It can also enable the prediction of hundreds of kinase structures in novel conformations, many of whose models are likely viable for drug discovery.
Collapse
|
37
|
Simpkin AJ, Caballero I, McNicholas S, Stevenson K, Jiménez E, Sánchez Rodríguez F, Fando M, Uski V, Ballard C, Chojnowski G, Lebedev A, Krissinel E, Usón I, Rigden DJ, Keegan RM. Predicted models and CCP4. Acta Crystallogr D Struct Biol 2023; 79:806-819. [PMID: 37594303 PMCID: PMC10478639 DOI: 10.1107/s2059798323006289] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2023] [Accepted: 07/19/2023] [Indexed: 08/19/2023] Open
Abstract
In late 2020, the results of CASP14, the 14th event in a series of competitions to assess the latest developments in computational protein structure-prediction methodology, revealed the giant leap forward that had been made by Google's Deepmind in tackling the prediction problem. The level of accuracy in their predictions was the first instance of a competitor achieving a global distance test score of better than 90 across all categories of difficulty. This achievement represents both a challenge and an opportunity for the field of experimental structural biology. For structure determination by macromolecular X-ray crystallography, access to highly accurate structure predictions is of great benefit, particularly when it comes to solving the phase problem. Here, details of new utilities and enhanced applications in the CCP4 suite, designed to allow users to exploit predicted models in determining macromolecular structures from X-ray diffraction data, are presented. The focus is mainly on applications that can be used to solve the phase problem through molecular replacement.
Collapse
Affiliation(s)
- Adam J. Simpkin
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, United Kingdom
| | - Iracema Caballero
- Crystallographic Methods, Institute of Molecular Biology of Barcelona (IBMB–CSIC), Barcelona, Spain
| | - Stuart McNicholas
- York Structural Biology Laboratory, Department of Chemistry, The University of York, York YO10 5DD, United Kingdom
| | - Kyle Stevenson
- UKRI–STFC, Rutherford Appleton Laboratory, Research Complex at Harwell, Didcot OX11 0FA, United Kingdom
| | - Elisabet Jiménez
- Crystallographic Methods, Institute of Molecular Biology of Barcelona (IBMB–CSIC), Barcelona, Spain
| | - Filomeno Sánchez Rodríguez
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, United Kingdom
- York Structural Biology Laboratory, Department of Chemistry, The University of York, York YO10 5DD, United Kingdom
| | - Maria Fando
- UKRI–STFC, Rutherford Appleton Laboratory, Research Complex at Harwell, Didcot OX11 0FA, United Kingdom
| | - Ville Uski
- UKRI–STFC, Rutherford Appleton Laboratory, Research Complex at Harwell, Didcot OX11 0FA, United Kingdom
| | - Charles Ballard
- UKRI–STFC, Rutherford Appleton Laboratory, Research Complex at Harwell, Didcot OX11 0FA, United Kingdom
| | - Grzegorz Chojnowski
- European Molecular Biology Laboratory, Hamburg Unit, Notkestrasse 85, 22607 Hamburg, Germany
| | - Andrey Lebedev
- UKRI–STFC, Rutherford Appleton Laboratory, Research Complex at Harwell, Didcot OX11 0FA, United Kingdom
| | - Eugene Krissinel
- UKRI–STFC, Rutherford Appleton Laboratory, Research Complex at Harwell, Didcot OX11 0FA, United Kingdom
| | - Isabel Usón
- Crystallographic Methods, Institute of Molecular Biology of Barcelona (IBMB–CSIC), Barcelona, Spain
- ICREA, Institució Catalana de Recerca i Estudis Avançats, Passeig Lluís Companys 23, 08003 Barcelona, Spain
| | - Daniel J. Rigden
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, United Kingdom
| | - Ronan M. Keegan
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, United Kingdom
- UKRI–STFC, Rutherford Appleton Laboratory, Research Complex at Harwell, Didcot OX11 0FA, United Kingdom
| |
Collapse
|
38
|
Putnam CD, Kolodner RD. Insights into DNA cleavage by MutL homologs from analysis of conserved motifs in eukaryotic Mlh1. Bioessays 2023; 45:e2300031. [PMID: 37424007 PMCID: PMC10530380 DOI: 10.1002/bies.202300031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/11/2023] [Revised: 06/02/2023] [Accepted: 06/15/2023] [Indexed: 07/11/2023]
Abstract
MutL family proteins contain an N-terminal ATPase domain (NTD), an unstructured interdomain linker, and a C-terminal domain (CTD), which mediates constitutive dimerization between subunits and often contains an endonuclease active site. Most MutL homologs direct strand-specific DNA mismatch repair by cleaving the error-containing daughter DNA strand. The strand cleavage reaction is poorly understood; however, the structure of the endonuclease active site is consistent with a two- or three-metal ion cleavage mechanism. A motif required for this endonuclease activity is present in the unstructured linker of Mlh1 and is conserved in all eukaryotic Mlh1 proteins, except those from metamonads, which also lack the almost absolutely conserved Mlh1 C-terminal phenylalanine-glutamate-arginine-cysteine (FERC) sequence. We hypothesize that the cysteine in the FERC sequence is autoinhibitory, as it sequesters the active site. We further hypothesize that the evolutionary co-occurrence of the conserved linker motif with the FERC sequence indicates a functional interaction, possibly by linker motif-mediated displacement of the inhibitory cysteine. This role is consistent with available data for interactions between the linker motif with DNA and the CTDs in the vicinity of the active site.
Collapse
Affiliation(s)
- Christopher D. Putnam
- Ludwig Institute for Cancer Research San Diego Branch, University of California School of Medicine, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0660
- Departments of Medicine, University of California School of Medicine, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0660
- Moores Cancer Center, University of California School of Medicine, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0660
| | - Richard D. Kolodner
- Ludwig Institute for Cancer Research San Diego Branch, University of California School of Medicine, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0660
- Cellular and Molecular Medicine, University of California School of Medicine, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0660
- Moores Cancer Center, University of California School of Medicine, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0660
- Institute of Genomic Medicine, University of California School of Medicine, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0660
| |
Collapse
|
39
|
Porter LL. Fluid protein fold space and its implications. Bioessays 2023; 45:e2300057. [PMID: 37431685 PMCID: PMC10529699 DOI: 10.1002/bies.202300057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2023] [Revised: 06/21/2023] [Accepted: 06/23/2023] [Indexed: 07/12/2023]
Abstract
Fold-switching proteins, which remodel their secondary and tertiary structures in response to cellular stimuli, suggest a new view of protein fold space. For decades, experimental evidence has indicated that protein fold space is discrete: dissimilar folds are encoded by dissimilar amino acid sequences. Challenging this assumption, fold-switching proteins interconnect discrete groups of dissimilar protein folds, making protein fold space fluid. Three recent observations support the concept of fluid fold space: (1) some amino acid sequences interconvert between folds with distinct secondary structures, (2) some naturally occurring sequences have switched folds by stepwise mutation, and (3) fold switching is evolutionarily selected and likely confers advantage. These observations indicate that minor amino acid sequence modifications can transform protein structure and function. Consequently, proteomic structural and functional diversity may be expanded by alternative splicing, small nucleotide polymorphisms, post-translational modifications, and modified translation rates.
Collapse
Affiliation(s)
- Lauren L. Porter
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD
- National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD
| |
Collapse
|
40
|
Rahman M, Nemenman I. Inferring local structure from pairwise correlations. Phys Rev E 2023; 108:034410. [PMID: 37849214 DOI: 10.1103/physreve.108.034410] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2023] [Accepted: 09/11/2023] [Indexed: 10/19/2023]
Abstract
To construct models of large, multivariate complex systems, such as those in biology, one needs to constrain which variables are allowed to interact. This can be viewed as detecting "local" structures among the variables. In the context of a simple toy model of two-dimensional natural and synthetic images, we show that pairwise correlations between the variables-even when severely undersampled-provide enough information to recover local relations, including the dimensionality of the data, and to reconstruct arrangement of pixels in fully scrambled images. This proves to be successful even though higher order interaction structures are present in our data. We build intuition behind the success, which we hope might contribute to modeling complex, multivariate systems and to explaining the success of modern attention-based machine learning approaches.
Collapse
Affiliation(s)
- Mahajabin Rahman
- Department of Physics, Emory University, Atlanta, Georgia 30322, USA
| | - Ilya Nemenman
- Department of Physics, Department of Biology, and Initiative in Theory and Modeling of Living Systems, Emory University, Atlanta, Georgia 30322, USA
| |
Collapse
|
41
|
Ahdritz G, Bouatta N, Kadyan S, Jarosch L, Berenberg D, Fisk I, Watkins AM, Ra S, Bonneau R, AlQuraishi M. OpenProteinSet: Training data for structural biology at scale. ARXIV 2023:arXiv:2308.05326v1. [PMID: 37608940 PMCID: PMC10441447] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Subscribe] [Scholar Register] [Indexed: 08/24/2023]
Abstract
Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.
Collapse
Affiliation(s)
| | - Nazim Bouatta
- Laboratory of Systems Pharmacology, Harvard Medical School
| | | | | | - Daniel Berenberg
- Prescient Design, Genentech & Department of Computer Science, New York University
| | | | | | | | | | | |
Collapse
|
42
|
Shome S, Jia K, Sivasankar S, Jernigan RL. Characterizing interactions in E-cadherin assemblages. Biophys J 2023; 122:3069-3077. [PMID: 37345249 PMCID: PMC10432173 DOI: 10.1016/j.bpj.2023.06.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2022] [Revised: 09/26/2022] [Accepted: 06/14/2023] [Indexed: 06/23/2023] Open
Abstract
Cadherin intermolecular interactions are critical for cell-cell adhesion and play essential roles in tissue formation and the maintenance of tissue structures. In this study, we focus on E-cadherin, a classical cadherin that connects epithelial cells, to understand how they interact in cis and trans conformations when attached to the same cell or opposing cells. We employ coevolutionary sequence analysis and molecular dynamics simulations to confirm previously known interaction sites as well as to identify new interaction sites. The sequence coevolutionary results yield a surprising result indicating that there are no strongly favored intermolecular interaction sites, which is unusual and suggests that many interaction sites may be possible, with none being strongly preferred over others. By using molecular dynamics, we test the persistence of these interactions and how they facilitate adhesion. We build several types of cadherin assemblages, with different numbers and combinations of cis and trans interfaces to understand how these conformations act to facilitate adhesion. Our results suggest that, in addition to the established interaction sites on the EC1 and EC2 domains, an additional plausible cis interface at the EC3-EC5 domain exists. Furthermore, we identify specific mutations at cis/trans binding sites that impair adhesion within E-cadherin assemblages.
Collapse
Affiliation(s)
- Sayane Shome
- Roy J. Carver Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, Iowa
| | - Kejue Jia
- Roy J. Carver Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, Iowa
| | - Sanjeevi Sivasankar
- Department of Biomedical Engineering, University of California, Davis, Davis, California
| | - Robert L Jernigan
- Roy J. Carver Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, Iowa.
| |
Collapse
|
43
|
Erlandson SC, Rawson S, Osei-Owusu J, Brock KP, Liu X, Paulo JA, Mintseris J, Gygi SP, Marks DS, Cong X, Kruse AC. The relaxin receptor RXFP1 signals through a mechanism of autoinhibition. Nat Chem Biol 2023; 19:1013-1021. [PMID: 37081311 PMCID: PMC10530065 DOI: 10.1038/s41589-023-01321-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2022] [Accepted: 03/27/2023] [Indexed: 04/22/2023]
Abstract
The relaxin family peptide receptor 1 (RXFP1) is the receptor for relaxin-2, an important regulator of reproductive and cardiovascular physiology. RXFP1 is a multi-domain G protein-coupled receptor (GPCR) with an ectodomain consisting of a low-density lipoprotein receptor class A (LDLa) module and leucine-rich repeats. The mechanism of RXFP1 signal transduction is clearly distinct from that of other GPCRs, but remains very poorly understood. In the present study, we determine the cryo-electron microscopy structure of active-state human RXFP1, bound to a single-chain version of the endogenous agonist relaxin-2 and the heterotrimeric Gs protein. Evolutionary coupling analysis and structure-guided functional experiments reveal that RXFP1 signals through a mechanism of autoinhibition. Our results explain how an unusual GPCR family functions, providing a path to rational drug development targeting the relaxin receptors.
Collapse
Affiliation(s)
- Sarah C Erlandson
- Department of Biological Chemistry and Molecular Pharmacology, Blavatnik Institute, Harvard Medical School, Boston, MA, USA
| | - Shaun Rawson
- Department of Biological Chemistry and Molecular Pharmacology, Blavatnik Institute, Harvard Medical School, Boston, MA, USA
| | - James Osei-Owusu
- Department of Biological Chemistry and Molecular Pharmacology, Blavatnik Institute, Harvard Medical School, Boston, MA, USA
| | - Kelly P Brock
- Department of Systems Biology, Blavatnik Institute, Harvard Medical School, Boston, MA, USA
| | - Xinyue Liu
- Department of Biological Chemistry and Molecular Pharmacology, Blavatnik Institute, Harvard Medical School, Boston, MA, USA
| | - Joao A Paulo
- Department of Biological Chemistry and Molecular Pharmacology, Blavatnik Institute, Harvard Medical School, Boston, MA, USA
| | - Julian Mintseris
- Department of Biological Chemistry and Molecular Pharmacology, Blavatnik Institute, Harvard Medical School, Boston, MA, USA
| | - Steven P Gygi
- Department of Biological Chemistry and Molecular Pharmacology, Blavatnik Institute, Harvard Medical School, Boston, MA, USA
| | - Debora S Marks
- Department of Systems Biology, Blavatnik Institute, Harvard Medical School, Boston, MA, USA
| | - Xiaojing Cong
- Institut de Génomique Fonctionnelle, Université de Montpellier, CNRS, INSERM, Montpellier, France
| | - Andrew C Kruse
- Department of Biological Chemistry and Molecular Pharmacology, Blavatnik Institute, Harvard Medical School, Boston, MA, USA.
| |
Collapse
|
44
|
Montezano D, Bernstein R, Copeland MM, Slusky JSG. General features of transmembrane beta barrels from a large database. Proc Natl Acad Sci U S A 2023; 120:e2220762120. [PMID: 37432995 PMCID: PMC10629564 DOI: 10.1073/pnas.2220762120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2022] [Accepted: 06/03/2023] [Indexed: 07/13/2023] Open
Abstract
Large datasets contribute new insights to subjects formerly investigated by exemplars. We used coevolution data to create a large, high-quality database of transmembrane β-barrels (TMBB). By applying simple feature detection on generated evolutionary contact maps, our method (IsItABarrel) achieves 95.88% balanced accuracy when discriminating among protein classes. Moreover, comparison with IsItABarrel revealed a high rate of false positives in previous TMBB algorithms. In addition to being more accurate than previous datasets, our database (available online) contains 1,938,936 bacterial TMBB proteins from 38 phyla, respectively, 17 and 2.2 times larger than the previous sets TMBB-DB and OMPdb. We anticipate that due to its quality and size, the database will serve as a useful resource where high-quality TMBB sequence data are required. We found that TMBBs can be divided into 11 types, three of which have not been previously reported. We find tremendous variance in proteome percentage among TMBB-containing organisms with some using 6.79% of their proteome for TMBBs and others using as little as 0.27% of their proteome. The distribution of the lengths of the TMBBs is suggestive of previously hypothesized duplication events. In addition, we find that the C-terminal β-signal varies among different classes of bacteria though its consensus sequence is LGLGYRF. However, this β-signal is only characteristic of prototypical TMBBs. The ten non-prototypical barrel types have other C-terminal motifs, and it remains to be determined if these alternative motifs facilitate TMBB insertion or perform any other signaling function.
Collapse
Affiliation(s)
- Daniel Montezano
- Computational Biology Program, University of Kansas, Lawrence, KS66045
| | - Rebecca Bernstein
- Computational Biology Program, University of Kansas, Lawrence, KS66045
| | | | - Joanna S. G. Slusky
- Computational Biology Program, University of Kansas, Lawrence, KS66045
- Department of Molecular Biosciences, University of Kansas, Lawrence, KS66045
| |
Collapse
|
45
|
Konecki DM, Hamrick S, Wang C, Agosto MA, Wensel TG, Lichtarge O. CovET: A covariation-evolutionary trace method that identifies protein structure-function modules. J Biol Chem 2023; 299:104896. [PMID: 37290531 PMCID: PMC10338321 DOI: 10.1016/j.jbc.2023.104896] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/25/2023] [Revised: 06/01/2023] [Accepted: 06/02/2023] [Indexed: 06/10/2023] Open
Abstract
Measuring the relative effect that any two sequence positions have on each other may improve protein design or help better interpret coding variants. Current approaches use statistics and machine learning but rarely consider phylogenetic divergences which, as shown by Evolutionary Trace studies, provide insight into the functional impact of sequence perturbations. Here, we reframe covariation analyses in the Evolutionary Trace framework to measure the relative tolerance to perturbation of each residue pair during evolution. This approach (CovET) systematically accounts for phylogenetic divergences: at each divergence event, we penalize covariation patterns that belie evolutionary coupling. We find that while CovET approximates the performance of existing methods to predict individual structural contacts, it performs significantly better at finding structural clusters of coupled residues and ligand binding sites. For example, CovET found more functionally critical residues when we examined the RNA recognition motif and WW domains. It correlates better with large-scale epistasis screen data. In the dopamine D2 receptor, top CovET residue pairs recovered accurately the allosteric activation pathway characterized for Class A G protein-coupled receptors. These data suggest that CovET ranks highest the sequence position pairs that play critical functional roles through epistatic and allosteric interactions in evolutionarily relevant structure-function motifs. CovET complements current methods and may shed light on fundamental molecular mechanisms of protein structure and function.
Collapse
Affiliation(s)
- Daniel M Konecki
- Quantitative and Computational Biosciences Graduate Program, Baylor College of Medicine, Houston, Texas, USA
| | - Spencer Hamrick
- Chemical, Physical, and Structural Biology Graduate Program, Baylor College of Medicine, Houston, Texas, USA
| | - Chen Wang
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
| | - Melina A Agosto
- Verna and Marrs McLean Department of Biochemistry and Molecular Biology, Baylor College of Medicine, Houston, Texas, USA
| | - Theodore G Wensel
- Quantitative and Computational Biosciences Graduate Program, Baylor College of Medicine, Houston, Texas, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA; Verna and Marrs McLean Department of Biochemistry and Molecular Biology, Baylor College of Medicine, Houston, Texas, USA; Cancer and Cell Biology Graduate Program, Baylor College of Medicine, Houston, Texas, USA
| | - Olivier Lichtarge
- Quantitative and Computational Biosciences Graduate Program, Baylor College of Medicine, Houston, Texas, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA; Verna and Marrs McLean Department of Biochemistry and Molecular Biology, Baylor College of Medicine, Houston, Texas, USA; Cancer and Cell Biology Graduate Program, Baylor College of Medicine, Houston, Texas, USA; Computational and Integrative Biomedical Research Center, Baylor College of Medicine, Houston, Texas, USA.
| |
Collapse
|
46
|
Li EH, Spaman LE, Tejero R, Janet Huang Y, Ramelot TA, Fraga KJ, Prestegard JH, Kennedy MA, Montelione GT. Blind assessment of monomeric AlphaFold2 protein structure models with experimental NMR data. JOURNAL OF MAGNETIC RESONANCE (SAN DIEGO, CALIF. : 1997) 2023; 352:107481. [PMID: 37257257 PMCID: PMC10659763 DOI: 10.1016/j.jmr.2023.107481] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/22/2023] [Revised: 05/08/2023] [Accepted: 05/15/2023] [Indexed: 06/02/2023]
Abstract
Recent advances in molecular modeling of protein structures are changing the field of structural biology. AlphaFold-2 (AF2), an AI system developed by DeepMind, Inc., utilizes attention-based deep learning to predict models of protein structures with high accuracy relative to structures determined by X-ray crystallography and cryo-electron microscopy (cryoEM). Comparing AF2 models to structures determined using solution NMR data, both high similarities and distinct differences have been observed. Since AF2 was trained on X-ray crystal and cryoEM structures, we assessed how accurately AF2 can model small, monomeric, solution protein NMR structures which (i) were not used in the AF2 training data set, and (ii) did not have homologous structures in the Protein Data Bank at the time of AF2 training. We identified nine open-source protein NMR data sets for such "blind" targets, including chemical shift, raw NMR FID data, NOESY peak lists, and (for 1 case) 15N-1H residual dipolar coupling data. For these nine small (70-108 residues) monomeric proteins, we generated AF2 prediction models and assessed how well these models fit to these experimental NMR data, using several well-established NMR structure validation tools. In most of these cases, the AF2 models fit the NMR data nearly as well, or sometimes better than, the corresponding NMR structure models previously deposited in the Protein Data Bank. These results provide benchmark NMR data for assessing new NMR data analysis and protein structure prediction methods. They also document the potential for using AF2 as a guiding tool in protein NMR data analysis, and more generally for hypothesis generation in structural biology research.
Collapse
Affiliation(s)
- Ethan H Li
- Department of Chemistry and Chemical Biology, Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
| | - Laura E Spaman
- Department of Chemistry and Chemical Biology, Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180, USA.
| | - Roberto Tejero
- Department of Chemistry and Chemical Biology, Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180, USA.
| | - Yuanpeng Janet Huang
- Department of Chemistry and Chemical Biology, Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180, USA.
| | - Theresa A Ramelot
- Department of Chemistry and Chemical Biology, Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180, USA.
| | - Keith J Fraga
- Department of Chemistry and Chemical Biology, Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180, USA.
| | - James H Prestegard
- Complex Carbohydrate Research Center, University of Georgia, Athens, GA 30602, USA.
| | - Michael A Kennedy
- Department of Chemistry and Biochemistry, Miami University, Oxford, OH 45056, USA.
| | - Gaetano T Montelione
- Department of Chemistry and Chemical Biology, Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180, USA.
| |
Collapse
|
47
|
McWhite CD, Armour-Garb I, Singh M. Leveraging protein language models for accurate multiple sequence alignments. Genome Res 2023; 33:1145-1153. [PMID: 37414576 PMCID: PMC10538487 DOI: 10.1101/gr.277675.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2023] [Accepted: 06/29/2023] [Indexed: 07/08/2023]
Abstract
Multiple sequence alignment (MSA) is a critical step in the study of protein sequence and function. Typically, MSA algorithms progressively align pairs of sequences and combine these alignments with the aid of a guide tree. These alignment algorithms use scoring systems based on substitution matrices to measure amino acid similarities. Although successful, standard methods struggle on sets of proteins with low sequence identity: the so-called twilight zone of protein alignment. For these difficult cases, another source of information is needed. Protein language models are a powerful new approach that leverages massive sequence data sets to produce high-dimensional contextual embeddings for each amino acid in a sequence. These embeddings have been shown to reflect physicochemical and higher-order structural and functional attributes of amino acids within proteins. Here, we present a novel approach to MSA, based on clustering and ordering amino acid contextual embeddings. Our method for aligning semantically consistent groups of proteins circumvents the need for many standard components of MSA algorithms, avoiding initial guide tree construction, intermediate pairwise alignments, gap penalties, and substitution matrices. The added information from contextual embeddings leads to higher accuracy alignments for structurally similar proteins with low amino-acid similarity. We anticipate that protein language models will become a fundamental component of the next generation of algorithms for generating MSAs.
Collapse
Affiliation(s)
- Claire D McWhite
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544, USA;
| | - Isabel Armour-Garb
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544, USA
- Department of Computer Science, Princeton University, Princeton, New Jersey 08544, USA
| | - Mona Singh
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544, USA;
- Department of Computer Science, Princeton University, Princeton, New Jersey 08544, USA
| |
Collapse
|
48
|
Williams JA, Biancucci M, Lessen L, Tian S, Balsaraf A, Chen L, Chesterman C, Maruggi G, Vandepaer S, Huang Y, Mallett CP, Steff AM, Bottomley MJ, Malito E, Wahome N, Harshbarger WD. Structural and computational design of a SARS-CoV-2 spike antigen with improved expression and immunogenicity. SCIENCE ADVANCES 2023; 9:eadg0330. [PMID: 37285422 PMCID: PMC10246912 DOI: 10.1126/sciadv.adg0330] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/29/2022] [Accepted: 05/02/2023] [Indexed: 06/09/2023]
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants of concern challenge the efficacy of approved vaccines, emphasizing the need for updated spike antigens. Here, we use an evolutionary-based design aimed at boosting protein expression levels of S-2P and improving immunogenic outcomes in mice. Thirty-six prototype antigens were generated in silico and 15 were produced for biochemical analysis. S2D14, which contains 20 computationally designed mutations within the S2 domain and a rationally engineered D614G mutation in the SD2 domain, has an ~11-fold increase in protein yield and retains RBD antigenicity. Cryo-electron microscopy structures reveal a mixture of populations in various RBD conformational states. Vaccination of mice with adjuvanted S2D14 elicited higher cross-neutralizing antibody titers than adjuvanted S-2P against the SARS-CoV-2 Wuhan strain and four variants of concern. S2D14 may be a useful scaffold or tool for the design of future coronavirus vaccines, and the approaches used for the design of S2D14 may be broadly applicable to streamline vaccine discovery.
Collapse
|
49
|
Gao H, Hamp T, Ede J, Schraiber JG, McRae J, Singer-Berk M, Yang Y, Dietrich ASD, Fiziev PP, Kuderna LFK, Sundaram L, Wu Y, Adhikari A, Field Y, Chen C, Batzoglou S, Aguet F, Lemire G, Reimers R, Balick D, Janiak MC, Kuhlwilm M, Orkin JD, Manu S, Valenzuela A, Bergman J, Rousselle M, Silva FE, Agueda L, Blanc J, Gut M, de Vries D, Goodhead I, Harris RA, Raveendran M, Jensen A, Chuma IS, Horvath JE, Hvilsom C, Juan D, Frandsen P, de Melo FR, Bertuol F, Byrne H, Sampaio I, Farias I, do Amaral JV, Messias M, da Silva MNF, Trivedi M, Rossi R, Hrbek T, Andriaholinirina N, Rabarivola CJ, Zaramody A, Jolly CJ, Phillips-Conroy J, Wilkerson G, Abee C, Simmons JH, Fernandez-Duque E, Kanthaswamy S, Shiferaw F, Wu D, Zhou L, Shao Y, Zhang G, Keyyu JD, Knauf S, Le MD, Lizano E, Merker S, Navarro A, Bataillon T, Nadler T, Khor CC, Lee J, Tan P, Lim WK, Kitchener AC, Zinner D, Gut I, Melin A, Guschanski K, Schierup MH, Beck RMD, Umapathy G, Roos C, Boubli JP, Lek M, Sunyaev S, O'Donnell-Luria A, Rehm HL, Xu J, Rogers J, Marques-Bonet T, Farh KKH. The landscape of tolerated genetic variation in humans and primates. Science 2023; 380:eabn8153. [PMID: 37262156 DOI: 10.1126/science.abn8197] [Citation(s) in RCA: 36] [Impact Index Per Article: 36.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2021] [Accepted: 03/22/2023] [Indexed: 06/03/2023]
Abstract
Personalized genome sequencing has revealed millions of genetic differences between individuals, but our understanding of their clinical relevance remains largely incomplete. To systematically decipher the effects of human genetic variants, we obtained whole-genome sequencing data for 809 individuals from 233 primate species and identified 4.3 million common protein-altering variants with orthologs in humans. We show that these variants can be inferred to have nondeleterious effects in humans based on their presence at high allele frequencies in other primate populations. We use this resource to classify 6% of all possible human protein-altering variants as likely benign and impute the pathogenicity of the remaining 94% of variants with deep learning, achieving state-of-the-art accuracy for diagnosing pathogenic variants in patients with genetic diseases.
Collapse
Affiliation(s)
- Hong Gao
- Illumina Artificial Intelligence Laboratory, Illumina Inc., Foster City, CA, 94404, USA
| | - Tobias Hamp
- Illumina Artificial Intelligence Laboratory, Illumina Inc., Foster City, CA, 94404, USA
| | - Jeffrey Ede
- Illumina Artificial Intelligence Laboratory, Illumina Inc., Foster City, CA, 94404, USA
| | - Joshua G Schraiber
- Illumina Artificial Intelligence Laboratory, Illumina Inc., Foster City, CA, 94404, USA
| | - Jeremy McRae
- Illumina Artificial Intelligence Laboratory, Illumina Inc., Foster City, CA, 94404, USA
| | - Moriel Singer-Berk
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Boston, MA, 02142, USA
| | - Yanshen Yang
- Illumina Artificial Intelligence Laboratory, Illumina Inc., Foster City, CA, 94404, USA
| | | | - Petko P Fiziev
- Illumina Artificial Intelligence Laboratory, Illumina Inc., Foster City, CA, 94404, USA
| | - Lukas F K Kuderna
- Illumina Artificial Intelligence Laboratory, Illumina Inc., Foster City, CA, 94404, USA
- Institute of Evolutionary Biology (UPF-CSIC), PRBB, Dr. Aiguader 88, 08003 Barcelona, Spain
| | - Laksshman Sundaram
- Illumina Artificial Intelligence Laboratory, Illumina Inc., Foster City, CA, 94404, USA
| | - Yibing Wu
- Illumina Artificial Intelligence Laboratory, Illumina Inc., Foster City, CA, 94404, USA
| | - Aashish Adhikari
- Illumina Artificial Intelligence Laboratory, Illumina Inc., Foster City, CA, 94404, USA
| | - Yair Field
- Illumina Artificial Intelligence Laboratory, Illumina Inc., Foster City, CA, 94404, USA
| | - Chen Chen
- Illumina Artificial Intelligence Laboratory, Illumina Inc., Foster City, CA, 94404, USA
| | - Serafim Batzoglou
- Illumina Artificial Intelligence Laboratory, Illumina Inc., Foster City, CA, 94404, USA
| | - Francois Aguet
- Illumina Artificial Intelligence Laboratory, Illumina Inc., Foster City, CA, 94404, USA
| | - Gabrielle Lemire
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Boston, MA, 02142, USA
- Division of Genetics and Genomics, Department of Pediatrics, Boston Children's Hospital, Harvard Medical School, Boston, MA, 02115, USA
| | - Rebecca Reimers
- Division of Genetics and Genomics, Department of Pediatrics, Boston Children's Hospital, Harvard Medical School, Boston, MA, 02115, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, 02115, USA
| | - Daniel Balick
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, 02115, USA
- Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, 02115, USA
| | - Mareike C Janiak
- School of Science, Engineering & Environment, University of Salford, Salford M5 4WT, UK
| | - Martin Kuhlwilm
- Institute of Evolutionary Biology (UPF-CSIC), PRBB, Dr. Aiguader 88, 08003 Barcelona, Spain
- Department of Evolutionary Anthropology, University of Vienna, Djerassiplatz 1, 1030 Vienna, Austria
- Human Evolution and Archaeological Sciences (HEAS), University of Vienna, 1030 Vienna, Austria
| | - Joseph D Orkin
- Institute of Evolutionary Biology (UPF-CSIC), PRBB, Dr. Aiguader 88, 08003 Barcelona, Spain
- Département d'anthropologie, Université de Montréal, 3150 Jean-Brillant, Montréal, QC H3T 1N8, Canada
| | - Shivakumara Manu
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, India
- Laboratory for the Conservation of Endangered Species, CSIR-Centre for Cellular and Molecular Biology, Hyderabad 500007, India
| | - Alejandro Valenzuela
- Institute of Evolutionary Biology (UPF-CSIC), PRBB, Dr. Aiguader 88, 08003 Barcelona, Spain
| | - Juraj Bergman
- Bioinformatics Research Centre, Aarhus University, Aarhus 8000, Denmark
- Section for Ecoinformatics & Biodiversity, Department of Biology, Aarhus University, 8000 Aarhus, Denmark
| | | | - Felipe Ennes Silva
- Research Group on Primate Biology and Conservation, Mamirauá Institute for Sustainable Development, Estrada da Bexiga 2584, Tefé, Amazonas, CEP 69553-225, Brazil
- Evolutionary Biology and Ecology (EBE), Département de Biologie des Organismes, Université libre de Bruxelles (ULB), Av. Franklin D. Roosevelt 50, CP 160/12, B-1050 Brussels, Belgium
| | - Lidia Agueda
- CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Baldiri i Reixac 4, 08028 Barcelona, Spain
| | - Julie Blanc
- CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Baldiri i Reixac 4, 08028 Barcelona, Spain
| | - Marta Gut
- CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Baldiri i Reixac 4, 08028 Barcelona, Spain
| | - Dorien de Vries
- School of Science, Engineering & Environment, University of Salford, Salford M5 4WT, UK
| | - Ian Goodhead
- School of Science, Engineering & Environment, University of Salford, Salford M5 4WT, UK
| | - R Alan Harris
- Human Genome Sequencing Center and Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Muthuswamy Raveendran
- Human Genome Sequencing Center and Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Axel Jensen
- Department of Ecology and Genetics, Animal Ecology, Uppsala University, SE-75236 Uppsala, Sweden
| | | | - Julie E Horvath
- North Carolina Museum of Natural Sciences, Raleigh, NC 27601, USA
- Department of Biological and Biomedical Sciences, North Carolina Central University, Durham, NC 27707, USA
- Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, USA
- Department of Evolutionary Anthropology, Duke University, Durham, NC 27708, USA
- Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
| | | | - David Juan
- Institute of Evolutionary Biology (UPF-CSIC), PRBB, Dr. Aiguader 88, 08003 Barcelona, Spain
| | | | | | - Fabrício Bertuol
- Universidade Federal do Amazonas, Departamento de Genética, Laboratório de Evolução e Genética Animal (LEGAL), Manaus, Amazonas, 69080-900, Brazil
| | - Hazel Byrne
- Department of Anthropology, University of Utah, Salt Lake City, UT 84102, USA
| | - Iracilda Sampaio
- Universidade Federal do Para, Guamá, Belém - PA, 66075-110, Brazil
| | - Izeni Farias
- Universidade Federal do Amazonas, Departamento de Genética, Laboratório de Evolução e Genética Animal (LEGAL), Manaus, Amazonas, 69080-900, Brazil
| | - João Valsecchi do Amaral
- Research Group on Terrestrial Vertebrate Ecology, Mamirauá Institute for Sustainable Development, Tefé, Amazonas, 69553-225, Brazil
- Rede de Pesquisa para Estudos sobre Diversidade, Conservação e Uso da Fauna na Amazônia - RedeFauna, Manaus, Amazonas, 69080-900, Brazil
- Comunidad de Manejo de Fauna Silvestre en la Amazonía y en Latinoamérica - ComFauna, Iquitos, Loreto, 16001, Peru
| | - Mariluce Messias
- Universidade Federal de Rondonia, Porto Velho, Rondônia, 78900-000, Brazil
- PPGREN - Programa de Pós-Graduação "Conservação e Uso dos Recursos Naturais and BIONORTE - Programa de Pós-Graduação em Biodiversidade e Biotecnologia da Rede BIONORTE, Universidade Federal de Rondonia, Porto Velho, Rondônia, 78900-000, Brazil
| | - Maria N F da Silva
- Instituto Nacional de Pesquisas da Amazonia, Petrópolis, Manaus - AM, 69067-375, Brazil
| | - Mihir Trivedi
- Laboratory for the Conservation of Endangered Species, CSIR-Centre for Cellular and Molecular Biology, Hyderabad 500007, India
| | - Rogerio Rossi
- Universidade Federal do Mato Grosso, Boa Esperança, Cuiabá - MT, 78060-900, Brazil
| | - Tomas Hrbek
- Universidade Federal do Amazonas, Departamento de Genética, Laboratório de Evolução e Genética Animal (LEGAL), Manaus, Amazonas, 69080-900, Brazil
- Department of Biology, Trinity University, San Antonio, TX 78212, USA
| | - Nicole Andriaholinirina
- Life Sciences and Environment, Technology and Environment of Mahajanga, University of Mahajanga, Mahajanga, 401, Madagascar
| | - Clément J Rabarivola
- Life Sciences and Environment, Technology and Environment of Mahajanga, University of Mahajanga, Mahajanga, 401, Madagascar
| | - Alphonse Zaramody
- Life Sciences and Environment, Technology and Environment of Mahajanga, University of Mahajanga, Mahajanga, 401, Madagascar
| | | | | | - Gregory Wilkerson
- Keeling Center for Comparative Medicine and Research, MD Anderson Cancer Center, Houston, TX 77030, USA
| | - Christian Abee
- Keeling Center for Comparative Medicine and Research, MD Anderson Cancer Center, Houston, TX 77030, USA
| | - Joe H Simmons
- Keeling Center for Comparative Medicine and Research, MD Anderson Cancer Center, Houston, TX 77030, USA
| | - Eduardo Fernandez-Duque
- Yale University, New Haven, CT 06520, USA
- Universidad Nacional de Formosa, Argentina Fundacion ECO, Formosa, Argentina
| | | | - Fekadu Shiferaw
- Guinea Worm Eradication Program, The Carter Center Ethiopia, PoB 16316, Addis Ababa 1000, Ethiopia
| | - Dongdong Wu
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650223, China
| | - Long Zhou
- Center for Evolutionary & Organismal Biology, Zhejiang University School of Medicine, Hangzhou 310058, China
| | - Yong Shao
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650223, China
| | - Guojie Zhang
- Center for Evolutionary & Organismal Biology, Zhejiang University School of Medicine, Hangzhou 310058, China
- Villum Center for Biodiversity Genomics, Section for Ecology and Evolution, Department of Biology, University of Copenhagen, DK-2100 Copenhagen, Denmark
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, Yunnan 650223, China
- Liangzhu Laboratory, Zhejiang University Medical Center, 1369 West Wenyi Road, Hangzhou 311121, China
- Women's Hospital, School of Medicine, Zhejiang University, 1 Xueshi Road, Shangcheng District, Hangzhou 310006, China
| | - Julius D Keyyu
- Tanzania Wildlife Research Institute (TAWIRI), Head Office, P.O. Box 661, Arusha, Tanzania
| | - Sascha Knauf
- Institute of International Animal Health/One Health, Friedrich-Loeffler-Institut, Federal Research Institute for Animal Health, 17493 Greifswald - Insei Riems, Germany
| | - Minh D Le
- Department of Environmental Ecology, Faculty of Environmental Sciences, University of Science and Central Institute for Natural Resources and Environmental Studies, Vietnam National University, Hanoi 100000, Vietnam
| | - Esther Lizano
- Institute of Evolutionary Biology (UPF-CSIC), PRBB, Dr. Aiguader 88, 08003 Barcelona, Spain
- Catalan Institution of Research and Advanced Studies (ICREA), Passeig de Lluís Companys, 23, 08010 Barcelona, Spain
| | - Stefan Merker
- Department of Zoology, State Museum of Natural History Stuttgart, 70191 Stuttgart, Germany
| | - Arcadi Navarro
- Institute of Evolutionary Biology (UPF-CSIC), PRBB, Dr. Aiguader 88, 08003 Barcelona, Spain
- Institut Català de Paleontologia Miquel Crusafont, Universitat Autònoma de Barcelona, Edifici ICTA-ICP, c/ Columnes s/n, 08193 Cerdanyola del Vallès, Barcelona, Spain
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Av. Doctor Aiguader, N88, 08003 Barcelona, Spain
- BarcelonaBeta Brain Research Center, Pasqual Maragall Foundation, C. Wellington 30, 08005 Barcelona, Spain
| | - Thomas Bataillon
- Bioinformatics Research Centre, Aarhus University, Aarhus 8000, Denmark
| | - Tilo Nadler
- Cuc Phuong Commune, Nho Quan District, Ninh Binh Province 430000, Vietnam
| | - Chiea Chuen Khor
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), 60 Biopolis Street, Genome, Singapore 138672, Republic of Singapore
| | - Jessica Lee
- Mandai Nature, 80 Mandai Lake Road, Singapore 729826, Republic of Singapore
| | - Patrick Tan
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), 60 Biopolis Street, Genome, Singapore 138672, Republic of Singapore
- SingHealth Duke-NUS Institute of Precision Medicine (PRISM), Singapore 168582, Republic of Singapore
- Cancer and Stem Cell Biology Program, Duke-NUS Medical School, Singapore 168582, Republic of Singapore
| | - Weng Khong Lim
- SingHealth Duke-NUS Institute of Precision Medicine (PRISM), Singapore 168582, Republic of Singapore
- Cancer and Stem Cell Biology Program, Duke-NUS Medical School, Singapore 168582, Republic of Singapore
- SingHealth Duke-NUS Genomic Medicine Centre, Singapore 168582, Republic of Singapore
| | - Andrew C Kitchener
- Department of Natural Sciences, National Museums Scotland, Chambers Street, Edinburgh EH1 1JF, UK
- School of Geosciences, University of Edinburgh, Drummond Street, Edinburgh EH8 9XP, UK
| | - Dietmar Zinner
- Cognitive Ethology Laboratory, Germany Primate Center, Leibniz Institute for Primate Research, 37077 Göttingen, Germany
- Department of Primate Cognition, Georg-August-Universität Göttingen, 37077 Göttingen, Germany
- Leibniz Science Campus Primate Cognition, 37077 Göttingen, Germany
| | - Ivo Gut
- CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Baldiri i Reixac 4, 08028 Barcelona, Spain
- Universitat Pompeu Fabra, Pg. Luís Companys 23, 08010 Barcelona, Spain
| | - Amanda Melin
- Department of Anthropology & Archaeology, University of Calgary, 2500 University Dr NW, Calgary, AB T2N 1N4, Canada
- Department of Medical Genetics, 3330 Hospital Drive NW, HMRB 202, Calgary, AB T2N 4N1, Canada
- Alberta Children's Hospital Research Institute, University of Calgary, 2500 University Dr NW, Calgary, AB T2N 1N4, Canada
| | - Katerina Guschanski
- Department of Ecology and Genetics, Animal Ecology, Uppsala University, SE-75236 Uppsala, Sweden
- Institute of Ecology and Evolution, School of Biological Sciences, University of Edinburgh, Edinburgh EH8 9XP, UK
| | | | - Robin M D Beck
- School of Science, Engineering & Environment, University of Salford, Salford M5 4WT, UK
| | - Govindhaswamy Umapathy
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, India
- Laboratory for the Conservation of Endangered Species, CSIR-Centre for Cellular and Molecular Biology, Hyderabad 500007, India
| | - Christian Roos
- Gene Bank of Primates and Primate Genetics Laboratory, German Primate Center, Leibniz Institute for Primate Research, Kellnerweg 4, 37077 Göttingen, Germany
| | - Jean P Boubli
- School of Science, Engineering & Environment, University of Salford, Salford M5 4WT, UK
| | - Monkol Lek
- Department of Genetics, Yale School of Medicine, New Haven, CT 06520, USA
| | - Shamil Sunyaev
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, 02115, USA
- Division of Genetics, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, 02115, USA
| | - Anne O'Donnell-Luria
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Boston, MA, 02142, USA
- Division of Genetics and Genomics, Department of Pediatrics, Boston Children's Hospital, Harvard Medical School, Boston, MA, 02115, USA
- Analytic and Translational Genetics Unit, Department of Medicine, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02115, USA
| | - Heidi L Rehm
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Boston, MA, 02142, USA
- Department of Genetics, Yale School of Medicine, New Haven, CT 06520, USA
- Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
| | - Jinbo Xu
- Illumina Artificial Intelligence Laboratory, Illumina Inc., Foster City, CA, 94404, USA
- Toyota Technological Institute at Chicago, Chicago, IL 60637, USA
| | - Jeffrey Rogers
- Human Genome Sequencing Center and Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Tomas Marques-Bonet
- Institute of Evolutionary Biology (UPF-CSIC), PRBB, Dr. Aiguader 88, 08003 Barcelona, Spain
- CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Baldiri i Reixac 4, 08028 Barcelona, Spain
- Catalan Institution of Research and Advanced Studies (ICREA), Passeig de Lluís Companys, 23, 08010 Barcelona, Spain
- Institut Català de Paleontologia Miquel Crusafont, Universitat Autònoma de Barcelona, Edifici ICTA-ICP, c/ Columnes s/n, 08193 Cerdanyola del Vallès, Barcelona, Spain
| | - Kyle Kai-How Farh
- Illumina Artificial Intelligence Laboratory, Illumina Inc., Foster City, CA, 94404, USA
| |
Collapse
|
50
|
Gao W, Yang A, Rivas E. Thirteen dubious ways to detect conserved structural RNAs. IUBMB Life 2023; 75:471-492. [PMID: 36495545 DOI: 10.1002/iub.2694] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2022] [Accepted: 10/24/2022] [Indexed: 12/14/2022]
Abstract
Covariation induced by compensatory base substitutions in RNA alignments is a great way to deduce conserved RNA structure, in principle. In practice, success depends on many factors, importantly the quality and depth of the alignment and the choice of covariation statistic. Measuring covariation between pairs of aligned positions is easy. However, using covariation to infer evolutionarily conserved RNA structure is complicated by other extraneous sources of covariation such as that resulting from homologous sequences having evolved from a common ancestor. In order to provide evidence of evolutionarily conserved RNA structure, a method to distinguish covariation due to sources other than RNA structure is necessary. Moreover, there are several sorts of artifactually generated covariation signals that can further confound the analysis. Additionally, some covariation signal is difficult to detect due to incomplete comparative data. Here, we investigate and critically discuss the practice of inferring conserved RNA structure by comparative sequence analysis. We provide new methods on how to approach and decide which of the numerous long non-coding RNAs (lncRNAs) have biologically relevant structures.
Collapse
Affiliation(s)
- William Gao
- Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Ann Yang
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, Massachusetts, USA
| | - Elena Rivas
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, Massachusetts, USA
| |
Collapse
|