1
Doran BA, Chen RY, Giba H, Behera V, Barat B, Sundararajan A, Lin H, Sidebottom A, Pamer EG, Raman AS. Subspecies phylogeny in the human gut revealed by co-evolutionary constraints across the bacterial kingdom. Cell Syst 2025:S2405-4712(24)00402-2. [PMID: 39826551 DOI: 10.1016/j.cels.2024.12.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/26/2023] [Revised: 02/16/2024] [Accepted: 12/18/2024] [Indexed: 01/22/2025]
Abstract
The human gut microbiome contains many bacterial strains of the same species ("strain-level variants") that shape microbiome function. The tremendous scale and molecular resolution at which microbial communities are being interrogated motivates addressing how to describe strain-level variants. We introduce the "Spectral Tree", an inferred tree of relatedness built from patterns of co-evolutionary constraint among more than 7,000 diverse bacteria. Using the Spectral Tree to describe over 600 diverse gut commensal strains that we isolated, whole-genome sequenced, and metabolically profiled revealed (1) widespread phylogenetic structure among strain-level variants, (2) the origins of subspecies phylogeny as a shared history of phage infections across humans, and (3) the key role of inter-human strain variation in predicting strain-level metabolic qualities. Overall, our work demonstrates the existence and metabolic importance of structured phylogeny below the level of species for commensal gut bacteria, motivating a redefinition of individual strains according to their evolutionary context. A record of this paper's transparent peer review process is included in the supplemental information.
Affiliation(s)
- Benjamin A Doran
  - Duchossois Family Institute, University of Chicago, Chicago, IL 60637, USA; Pritzker School of Molecular Engineering, University of Chicago, Chicago, IL 60637, USA
- Robert Y Chen
  - Department of Psychiatry, University of Washington, Seattle, WA 98195, USA
- Hannah Giba
  - Duchossois Family Institute, University of Chicago, Chicago, IL 60637, USA; Department of Pathology, University of Chicago, Chicago, IL 60637, USA
- Vivek Behera
  - Department of Medicine, University of Chicago, Chicago, IL 60637, USA
- Bidisha Barat
  - Duchossois Family Institute, University of Chicago, Chicago, IL 60637, USA
- Huaiying Lin
  - Duchossois Family Institute, University of Chicago, Chicago, IL 60637, USA
- Ashley Sidebottom
  - Duchossois Family Institute, University of Chicago, Chicago, IL 60637, USA
- Eric G Pamer
  - Duchossois Family Institute, University of Chicago, Chicago, IL 60637, USA; Department of Medicine, University of Chicago, Chicago, IL 60637, USA
- Arjun S Raman
  - Duchossois Family Institute, University of Chicago, Chicago, IL 60637, USA; Department of Pathology, University of Chicago, Chicago, IL 60637, USA; Center for the Physics of Evolving Systems, University of Chicago, Chicago, IL 60637, USA
2
Ma E, Guo X, Hu M, Wang P, Wang X, Wei C, Cheng G. A predictive language model for SARS-CoV-2 evolution. Signal Transduct Target Ther 2024; 9:353. [PMID: 39710752 DOI: 10.1038/s41392-024-02066-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2024] [Revised: 11/05/2024] [Accepted: 11/13/2024] [Indexed: 12/24/2024] Open
Abstract
Modeling and predicting mutations are critical for COVID-19 and similar pandemic preparedness. However, existing predictive models have yet to integrate the regularity and randomness of viral mutations with minimal data requirements. Here, we develop a non-demanding language model utilizing both regularity and randomness to predict candidate SARS-CoV-2 variants and mutations that might prevail. We constructed the "grammatical frameworks" of the available S1 sequences for dimension reduction and semantic representation to grasp the model's latent regularity. The mutational profile, defined as the frequency of mutations, was introduced into the model to incorporate randomness. With this model, we successfully identified and validated several variants with significantly enhanced viral infectivity and immune evasion by wet-lab experiments. By inputting the sequence data from three different time points, we detected circulating strains or vital mutations for XBB.1.16, EG.5, JN.1, and BA.2.86 strains before their emergence. In addition, our results also predicted previously unknown variants that may cause future epidemics. With both data validation and experimental evidence, our study represents a fast-responding, concise, and promising language model, potentially generalizable to other viral pathogens, to forecast viral evolution and detect crucial mutation hot spots, thereby warning of emerging variants that might raise public health concern.
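As a rough illustration of a "mutational profile, defined as the frequency of mutations", the sketch below counts how often each substitution occurs across a set of sequences aligned to a reference. The function name `mutational_profile`, the per-substitution keying, and the toy sequences are assumptions made for illustration. The abstract does not specify how the authors compute or normalize the profile.

```python
from collections import Counter


def mutational_profile(reference, sequences):
    """Return {(position, ref_residue, alt_residue): frequency}.

    Hypothetical helper: assumes every sequence is already aligned to the
    reference (same length) and defines frequency as the fraction of
    sequences carrying a given substitution.
    """
    if not sequences:
        raise ValueError("need at least one sequence")
    counts = Counter()
    for seq in sequences:
        if len(seq) != len(reference):
            raise ValueError("sequences must be aligned to the reference")
        # Positions are reported 1-based, as is conventional for residues.
        for pos, (ref, alt) in enumerate(zip(reference, seq), start=1):
            if alt != ref:
                counts[(pos, ref, alt)] += 1
    total = len(sequences)
    return {mutation: n / total for mutation, n in counts.items()}


# Toy input (made-up fragments, not real S1 data).
reference = "NLVKNK"
observed = ["NLVKNK", "NLVRNK", "NLVRNK", "KLVKNK"]
profile = mutational_profile(reference, observed)
```

Here `profile` maps each observed substitution to the fraction of input sequences that carry it, which is one simple way a frequency term could be fed into a downstream model.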
Affiliation(s)
- Enhao Ma
  - School of Basic Medical Science, Tsinghua University, 30 Shuangqing Rd., Haidian District, Beijing, 100084, China
- Xuan Guo
  - School of Basic Medical Science, Tsinghua University, 30 Shuangqing Rd., Haidian District, Beijing, 100084, China
  - Institute of Infectious Diseases, Shenzhen Bay Laboratory, Guangqiao Rd., Guangming District, Shenzhen, Guangdong, 518000, China
- Mingda Hu
  - Beijing Institute of Biotechnology, 20 Dongdajie, Fengtai District, Beijing, 100071, China
- Penghua Wang
  - Department of Immunology, School of Medicine, University of Connecticut Health Center, Farmington, CT, 06030, USA
- Xin Wang
  - Beijing Institute of Biotechnology, 20 Dongdajie, Fengtai District, Beijing, 100071, China
- Congwen Wei
  - Beijing Institute of Biotechnology, 20 Dongdajie, Fengtai District, Beijing, 100071, China
- Gong Cheng
  - School of Basic Medical Science, Tsinghua University, 30 Shuangqing Rd., Haidian District, Beijing, 100084, China
  - Institute of Infectious Diseases, Shenzhen Bay Laboratory, Guangqiao Rd., Guangming District, Shenzhen, Guangdong, 518000, China
3
Kenlay H, Dreyer FA, Kovaltsuk A, Miketa D, Pires D, Deane CM. Large scale paired antibody language models. PLoS Comput Biol 2024; 20:e1012646. [PMID: 39642174 DOI: 10.1371/journal.pcbi.1012646] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2024] [Revised: 12/18/2024] [Accepted: 11/18/2024] [Indexed: 12/08/2024] Open
Abstract
Antibodies are proteins produced by the immune system that can identify and neutralise a wide variety of antigens with high specificity and affinity, and constitute the most successful class of biotherapeutics. With the advent of next-generation sequencing, billions of antibody sequences have been collected in recent years, though their application in the design of better therapeutics has been constrained by the sheer volume and complexity of the data. To address this challenge, we present IgBert and IgT5, the best performing antibody-specific language models developed to date which can consistently handle both paired and unpaired variable region sequences as input. These models are trained comprehensively using the more than two billion unpaired sequences and two million paired sequences of light and heavy chains present in the Observed Antibody Space dataset. We show that our models outperform existing antibody and protein language models on a diverse range of design and regression tasks relevant to antibody engineering. This advancement marks a significant leap forward in leveraging machine learning, large scale data sets and high-performance computing for enhancing antibody design for therapeutic development.
Affiliation(s)
- Henry Kenlay
  - Exscientia, Oxford Science Park, Oxford, United Kingdom
- Dom Miketa
  - Exscientia, Oxford Science Park, Oxford, United Kingdom
- Douglas Pires
  - Exscientia, Oxford Science Park, Oxford, United Kingdom
- Charlotte M Deane
  - Exscientia, Oxford Science Park, Oxford, United Kingdom
  - Department of Statistics, University of Oxford, Oxford, United Kingdom
4
Qiang H, Wang F, Lu W, Xing X, Kim H, Merette SAM, Ayres LB, Oler E, AbuSalim JE, Roichman A, Neinast M, Cordova RA, Lee WD, Herbst E, Gupta V, Neff S, Hiebert-Giesbrecht M, Young A, Gautam V, Tian S, Wang B, Röst H, Greiner R, Chen L, Johnston CW, Foster LJ, Shapiro AM, Wishart DS, Rabinowitz JD, Skinnider MA. Language model-guided anticipation and discovery of unknown metabolites. bioRxiv 2024:2024.11.13.623458. [PMID: 39605668 PMCID: PMC11601323 DOI: 10.1101/2024.11.13.623458] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2024]
Abstract
Despite decades of study, large parts of the mammalian metabolome remain unexplored. Mass spectrometry-based metabolomics routinely detects thousands of small molecule-associated peaks within human tissues and biofluids, but typically only a small fraction of these can be identified, and structure elucidation of novel metabolites remains a low-throughput endeavor. Biochemical large language models have transformed the interpretation of DNA, RNA, and protein sequences, but have not yet had a comparable impact on understanding small molecule metabolism. Here, we present an approach that leverages chemical language models to discover previously uncharacterized metabolites. We introduce DeepMet, a chemical language model that learns the latent biosynthetic logic embedded within the structures of known metabolites and exploits this understanding to anticipate the existence of as-of-yet undiscovered metabolites. Prospective chemical synthesis of metabolites predicted to exist by DeepMet directs their targeted discovery. Integrating DeepMet with tandem mass spectrometry (MS/MS) data enables automated metabolite discovery within complex tissues. We harness DeepMet to discover several dozen structurally diverse mammalian metabolites. Our work demonstrates the potential for language models to accelerate the mapping of the metabolome.
5
Praljak N, Yeh H, Moore M, Socolich M, Ranganathan R, Ferguson AL. Natural Language Prompts Guide the Design of Novel Functional Protein Sequences. bioRxiv 2024:2024.11.11.622734. [PMID: 39605414 PMCID: PMC11601239 DOI: 10.1101/2024.11.11.622734] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2024]
Abstract
The advent of natural language interaction with machines has ushered in new innovations in text-guided generation of images, audio, video, and more. In this arena, we introduce Biological Multi-Modal Model (BioM3), as a novel framework for designing functional proteins via natural language prompts. This framework integrates natural language with protein design through a three-stage process: aligning protein and text representations in a joint embedding space learned using contrastive learning, refinement of the text embeddings, and conditional generation of protein sequences via a discrete autoregressive diffusion model. BioM3 synthesizes protein sequences with detailed descriptions of the protein structure, lineage, and function from text annotations to enable the conditional generation of novel sequences with desired attributes through natural language prompts. We present in silico validation of the model predictions for subcellular localization prediction, reaction classification, remote homology detection, scaffold in-painting, and structural plausibility, and in vivo and in vitro experimental tests of natural language prompt-designed synthetic analogs of Src-homology 3 (SH3) domain proteins that mediate signaling in the Sho1 osmotic stress response pathway in baker's yeast. BioM3 possesses state-of-the-art performance in zero-shot prediction and homology detection tasks, and generates proteins with native-like tertiary folds and wild-type levels of experimentally assayed function.
6
Liu Z, Shen Y, Jiang Y, Zhu H, Hu H, Kang Y, Chen M, Li Z. Variation and evolution analysis of SARS-CoV-2 using self-game sequence optimization. Front Microbiol 2024; 15:1485748. [PMID: 39588108 PMCID: PMC11586374 DOI: 10.3389/fmicb.2024.1485748] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2024] [Accepted: 10/18/2024] [Indexed: 11/27/2024] Open
Abstract
Introduction: The evolution of SARS-CoV-2 has precipitated the emergence of new mutant strains, some exhibiting enhanced transmissibility and immune evasion capabilities, thus escalating the infection risk and diminishing vaccine efficacy. Given the continuous impact of SARS-CoV-2 mutations on global public health, the economy, and society, a profound comprehension of potential variations is crucial to effectively mitigate the impact of viral evolution. Yet, this task still faces considerable challenges. Methods: This study introduces DARSEP, a method based on Deep learning Associates with Reinforcement learning for SARS-CoV-2 Evolution Prediction, combined with self-game sequence optimization and RetNet-based model. Results: DARSEP accurately predicts evolutionary sequences and investigates the virus's evolutionary trajectory. It filters spike protein sequences with optimal fitness values from an extensive mutation space, selectively identifies those with a higher likelihood of evading immune detection, and devises a superior evolutionary analysis model for SARS-CoV-2 spike protein sequences. Comprehensive downstream task evaluations corroborate the model's efficacy in predicting potential mutation sites, elucidating SARS-CoV-2's evolutionary direction, and analyzing the development trends of Omicron variant strains through semantic changes. Conclusion: Overall, DARSEP enriches our understanding of the dynamic evolution of SARS-CoV-2 and provides robust support for addressing present and future epidemic challenges.
Affiliation(s)
- Ziyu Liu
  - School of Information Engineering, Huzhou University, Huzhou, Zhejiang, China
- Yi Shen
  - College of Life Sciences, Zhejiang University, Hangzhou, Zhejiang, China
- Yunliang Jiang
  - School of Computer Science and Technology, Zhejiang Normal University, Jinhua, Zhejiang, China
- Hancan Zhu
  - School of Mathematics, Physics and Information, Shaoxing University, Shaoxing, Zhejiang, China
- Hailong Hu
  - School of Information Engineering, Huzhou University, Huzhou, Zhejiang, China
- Yanlei Kang
  - School of Information Engineering, Huzhou University, Huzhou, Zhejiang, China
- Ming Chen
  - College of Life Sciences, Zhejiang University, Hangzhou, Zhejiang, China
- Zhong Li
  - School of Information Engineering, Huzhou University, Huzhou, Zhejiang, China
7
Muir DF, Asper GPR, Notin P, Posner JA, Marks DS, Keiser MJ, Pinney MM. Evolutionary-Scale Enzymology Enables Biochemical Constant Prediction Across a Multi-Peaked Catalytic Landscape. bioRxiv 2024:2024.10.23.619915. [PMID: 39484523 PMCID: PMC11526920 DOI: 10.1101/2024.10.23.619915] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2024]
Abstract
Quantitatively mapping enzyme sequence-catalysis landscapes remains a critical challenge in understanding enzyme function, evolution, and design. Here, we expand an emerging microfluidic platform to measure catalytic constants, k_cat and K_M, for hundreds of diverse naturally occurring sequences and mutants of the model enzyme Adenylate Kinase (ADK). This enables us to dissect the sequence-catalysis landscape's topology, navigability, and mechanistic underpinnings, revealing distinct catalytic peaks organized by structural motifs. These results challenge long-standing hypotheses in enzyme adaptation, demonstrating that thermophilic enzymes are not slower than their mesophilic counterparts. Combining the rich representations of protein sequences provided by deep-learning models with our custom high-throughput kinetic data yields semi-supervised models that significantly outperform existing models at predicting catalytic parameters of naturally occurring ADK sequences. Our work demonstrates a promising strategy for dissecting sequence-catalysis landscapes across enzymatic evolution and building family-specific models capable of accurately predicting catalytic constants, opening new avenues for enzyme engineering and functional prediction.
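For readers unfamiliar with the two catalytic constants named above, the following minimal sketch shows how the turnover number and the Michaelis constant enter the standard single-substrate Michaelis-Menten rate law, v = k_cat [E] [S] / (K_M + [S]). The function name and the numeric values are illustrative assumptions only. The abstract does not state that the authors fit this exact form, and ADK is a two-substrate enzyme, so this is a textbook simplification.

```python
def michaelis_menten_rate(k_cat, k_m, enzyme_conc, substrate_conc):
    """Initial reaction rate under the single-substrate Michaelis-Menten model.

    k_cat: turnover number (1/s); k_m: Michaelis constant (same units as
    substrate_conc). All values used below are made up for illustration.
    """
    return k_cat * enzyme_conc * substrate_conc / (k_m + substrate_conc)


# When [S] equals K_M the rate is half of v_max = k_cat * [E].
rate_at_km = michaelis_menten_rate(
    k_cat=100.0, k_m=50.0, enzyme_conc=0.01, substrate_conc=50.0
)
```

With these toy numbers v_max is 1.0, so `rate_at_km` evaluates to half of that, which is the defining property of K_M.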
Affiliation(s)
- Duncan F Muir
  - Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, CA, USA
  - Program in Biophysics, University of California, San Francisco, San Francisco, CA, USA
- Garrison P R Asper
  - Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, CA, USA
- Pascal Notin
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
  - Department of Computer Science, University of Oxford, Oxford, UK
- Jacob A Posner
  - Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, CA, USA
  - Department of Biology, San Francisco State University, San Francisco, CA, USA
- Debora S Marks
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
  - Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Michael J Keiser
  - Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, CA, USA
  - Institute for Neurodegenerative Diseases, University of California, San Francisco, San Francisco, CA, USA
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Margaux M Pinney
  - Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, CA, USA
  - Valhalla Fellow, University of California San Francisco, San Francisco, CA, USA
8
Vazzana G, Savojardo C, Martelli PL, Casadio R. Testing the Capability of Embedding-Based Alignments on the GST Superfamily Classification: The Role of Protein Length. Molecules 2024; 29:4616. [PMID: 39407545 PMCID: PMC11478096 DOI: 10.3390/molecules29194616] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2024] [Revised: 09/19/2024] [Accepted: 09/20/2024] [Indexed: 10/20/2024] Open
Abstract
To shed light on the usage of protein language model-based alignment procedures, we attempted the classification of Glutathione S-transferases (GST; EC 2.5.1.18) and compared our results with the ARBA/UNI rule-based annotation in UniProt. GST is a protein superfamily involved in cellular detoxification from harmful xenobiotics and endobiotics, widely distributed in prokaryotes and eukaryotes. What is particularly interesting is that the superfamily is characterized by different classes, comprising proteins from different taxa that can act in different cell locations (cytosolic, mitochondrial and microsomal compartments) with different folds and different levels of sequence identity with remote homologs. For this reason, GST functional annotation in a specific class is problematic: unless a structure is released, the protein can be classified only on the basis of sequence similarity, which excludes the annotation of remote homologs. Here, we adopt an embedding-based alignment to classify 15,061 GST proteins automatically annotated by the UniProt-ARBA/UNI rules. Embedding is based on the Meta ESM2-15b protein language model. The embedding-based alignment reaches more than a 99% rate of perfect matching with the UniProt automatic procedure. Data analysis indicates that 46% of the UniProt automatically classified proteins do not conserve the typical length of canonical GSTs, whose structure is known. Therefore, 46% of the classified proteins do not conserve the template/s structure required for their family classification. Our approach finds that 41% of 64,207 GST UniProt proteins not yet assigned to any class can be classified consistently with the structural template length.
Affiliation(s)
- Pier Luigi Martelli
  - Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, 40126 Bologna, Italy
- Rita Casadio
  - Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, 40126 Bologna, Italy
9
Tang Z, Somia N, Yu Y, Koo PK. Evaluating the representational power of pre-trained DNA language models for regulatory genomics. bioRxiv 2024:2024.02.29.582810. [PMID: 38464101 PMCID: PMC10925287 DOI: 10.1101/2024.02.29.582810] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2024]
Abstract
The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that probing the representations of pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.
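The "one-hot encoded sequences" used as the conventional baseline input can be sketched as follows. This is a generic illustration rather than the authors' preprocessing: the A, C, G, T channel order and the choice to encode ambiguous bases such as N as an all-zero row are assumptions, and other conventions exist.

```python
NUCLEOTIDES = "ACGT"  # assumed channel order; other orders are equally valid


def one_hot_encode(sequence):
    """Encode a DNA string as a list of 4-element rows, one row per base.

    Bases outside A/C/G/T (for example N) become an all-zero row, which is
    one common convention and an assumption of this sketch.
    """
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(NUCLEOTIDES)
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


matrix = one_hot_encode("acgtN")
```

The resulting length-by-4 matrix carries no learned information at all, which is what makes it a useful reference point when judging whether pre-trained representations add anything.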
Affiliation(s)
- Ziqi Tang
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA
- Nirali Somia
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA
- Yiyang Yu
  - The Fu Foundation School of Engineering and Applied Science, Columbia University, New York, NY, USA
- Peter K Koo
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA
10
Kabir A, Moldwin A, Bromberg Y, Shehu A. In the twilight zone of protein sequence homology: do protein language models learn protein structure? Bioinformatics Advances 2024; 4:vbae119. [PMID: 39183802 PMCID: PMC11344590 DOI: 10.1093/bioadv/vbae119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2024] [Revised: 08/01/2024] [Accepted: 08/12/2024] [Indexed: 08/27/2024]
Abstract
Motivation: Protein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent. Results: We address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the "twilight zone" of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak. Availability and implementation: We believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.
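"Sequence identity" in the passage above is usually computed from a pairwise alignment. The sketch below shows one simple definition, assuming the two sequences are already aligned with `-` as the gap character and that gapped columns are excluded from the denominator. Other definitions divide by alignment length or by the shorter sequence, and the abstract does not say which one the authors use. The toy sequences are made up.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap.

    Hypothetical helper for illustration; the denominator convention
    (ungapped columns only) is an assumption of this sketch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy alignment: 7 ungapped columns, 5 of them identical.
identity = percent_identity("MKT-AYIAK", "MRTQAY-AR")
```

Pairs whose identity falls to roughly 20 to 35% are the ones commonly described as lying in the twilight zone, where alignment-based homology detection becomes unreliable.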
Affiliation(s)
- Anowarul Kabir
  - Department of Computer Science, George Mason University, Fairfax, VA 22030, United States
- Asher Moldwin
  - Department of Computer Science, George Mason University, Fairfax, VA 22030, United States
- Yana Bromberg
  - Department of Computer Science, Emory University, Atlanta, GA 30307, United States
- Amarda Shehu
  - Department of Computer Science, George Mason University, Fairfax, VA 22030, United States
11
Nambiar A, Forsyth JM, Liu S, Maslov S. DR-BERT: A protein language model to annotate disordered regions. Structure 2024; 32:1260-1268.e3. [PMID: 38701796 DOI: 10.1016/j.str.2024.04.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2023] [Revised: 06/16/2023] [Accepted: 04/08/2024] [Indexed: 05/05/2024]
Abstract
Despite their lack of a rigid structure, intrinsically disordered regions (IDRs) in proteins play important roles in cellular functions, including mediating protein-protein interactions. Therefore, it is important to computationally annotate IDRs with high accuracy. In this study, we present Disordered Region prediction using Bidirectional Encoder Representations from Transformers (DR-BERT), a compact protein language model. Unlike most popular tools, DR-BERT is pretrained on unannotated proteins and trained to predict IDRs without relying on explicit evolutionary or biophysical data. Despite this, DR-BERT demonstrates significant improvement over existing methods on the Critical Assessment of protein Intrinsic Disorder (CAID) evaluation dataset and outperforms competitors on two out of four test cases in the CAID 2 dataset, while maintaining competitiveness in the others. This performance is due to the information learned during pretraining and DR-BERT's ability to use contextual information.
Affiliation(s)
- Ananthan Nambiar
  - Department of Bioengineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, Urbana, IL 61801, USA
- John Malcolm Forsyth
  - Carl R. Woese Institute for Genomic Biology, Urbana, IL 61801, USA; Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Simon Liu
  - Carl R. Woese Institute for Genomic Biology, Urbana, IL 61801, USA; Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Sergei Maslov
  - Department of Bioengineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, Urbana, IL 61801, USA; Department of Physics, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA; Computing, Environment and Life Sciences, Argonne National Laboratory, Lemont, IL 60439, USA
12
Cuturello F, Celoria M, Ansuini A, Cazzaniga A. Enhancing predictions of protein stability changes induced by single mutations using MSA-based Language Models. Bioinformatics 2024; 40:btae447. [PMID: 39012369 PMCID: PMC11269464 DOI: 10.1093/bioinformatics/btae447] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2024] [Revised: 06/19/2024] [Accepted: 07/10/2024] [Indexed: 07/17/2024] Open
Abstract
Motivation: Protein Language Models offer a new perspective for addressing challenges in structural biology, while relying solely on sequence information. Recent studies have investigated their effectiveness in forecasting shifts in thermodynamic stability caused by single amino acid mutations, a task known for its complexity due to the sparse availability of data, constrained by experimental limitations. To tackle this problem, we introduce two key novelties: leveraging a Protein Language Model that incorporates Multiple Sequence Alignments to capture evolutionary information, and using a recently released mega-scale dataset with rigorous data pre-processing to mitigate overfitting. Results: We ensure comprehensive comparisons by fine-tuning various pre-trained models, taking advantage of analyses such as ablation studies and baselines evaluation. Our methodology introduces a stringent policy to reduce the widespread issue of data leakage, rigorously removing sequences from the training set when they exhibit significant similarity with the test set. The MSA Transformer emerges as the most accurate among the models under investigation, given its capability to leverage co-evolution signals encoded in aligned homologous sequences. Moreover, the optimized MSA Transformer outperforms existing methods and exhibits enhanced generalization power, leading to a notable improvement in predicting changes in protein stability resulting from point mutations. Availability and implementation: Code and data at https://github.com/RitAreaSciencePark/PLM4Muts. Supplementary information: Supplementary Information is available at Bioinformatics online.
Affiliation(s)
- Francesca Cuturello
  - Research and Technology Institute, AREA Science Park, Trieste 34149, Italy
- Marco Celoria
  - Research and Technology Institute, AREA Science Park, Trieste 34149, Italy
  - HPC Department, CINECA National Supercomputing Center, Bologna 40033, Italy
- Alessio Ansuini
  - Research and Technology Institute, AREA Science Park, Trieste 34149, Italy
- Alberto Cazzaniga
  - Research and Technology Institute, AREA Science Park, Trieste 34149, Italy
13
Telenti A, Auli M, Hie BL, Maher C, Saria S, Ioannidis JPA. Large language models for science and medicine. Eur J Clin Invest 2024; 54:e14183. [PMID: 38381530 DOI: 10.1111/eci.14183] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2023] [Revised: 02/06/2024] [Accepted: 02/10/2024] [Indexed: 02/23/2024]
Abstract
Large language models (LLMs) are a type of machine learning model that learns statistical patterns over text, for example by predicting the next words in a sequence of text. Both general purpose and task-specific LLMs have demonstrated potential across diverse applications. Science and medicine have many data types that are highly suitable for LLMs, such as scientific texts (publications, patents and textbooks), electronic medical records, large databases of DNA and protein sequences and chemical compounds. Carefully validated systems that can understand and reason across all these modalities may maximize benefits. Despite the inevitable limitations and caveats of any new technology and some uncertainties specific to LLMs, LLMs have the potential to be transformative in science and medicine.
Affiliation(s)
- Amalio Telenti
- Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, California, USA
- Vir Biotechnology, Inc., San Francisco, California, USA
| | | | - Brian L Hie
- FAIR, Meta, Menlo Park, California, USA
- Department of Chemical Engineering, Stanford University, Stanford, California, USA
| | - Cyrus Maher
- Vir Biotechnology, Inc., San Francisco, California, USA
| | - Suchi Saria
- Malone Center for Engineering and Healthcare, Johns Hopkins University, Baltimore, Maryland, USA
| | - John P A Ioannidis
- Department of Medicine, Stanford University, Stanford, California, USA
- Department of Epidemiology and Population Health, Stanford University, Stanford, California, USA
- Department of Biomedical Data Science, Stanford University, Stanford, California, USA
- Department of Statistics, Stanford University, Stanford, California, USA
- Meta-Research Innovation Center at Stanford (METRICS), Stanford University, Stanford, California, USA
| |
|
14
|
Burnim AA, Dufault-Thompson K, Jiang X. The three-sided right-handed β-helix is a versatile fold for glycan interactions. Glycobiology 2024; 34:cwae037. [PMID: 38767844 PMCID: PMC11129586 DOI: 10.1093/glycob/cwae037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 05/13/2024] [Accepted: 05/17/2024] [Indexed: 05/22/2024] Open
Abstract
Interactions between proteins and glycans are critical to various biological processes. With databases of carbohydrate-interacting proteins and increasing amounts of structural data, the three-sided right-handed β-helix (RHBH) has emerged as a significant structural fold for glycan interactions. In this review, we provide an overview of the sequence, mechanistic, and structural features that enable the RHBH to interact with glycans. The RHBH is a prevalent fold that exists in eukaryotes, prokaryotes, and viruses associated with adhesin and carbohydrate-active enzyme (CAZyme) functions. An evolutionary trajectory analysis on structurally characterized RHBH-containing proteins shows that they likely evolved from carbohydrate-binding proteins with their carbohydrate-degrading activities evolving later. By examining three polysaccharide lyase and three glycoside hydrolase structures, we provide a detailed view of the modes of glycan binding in RHBH proteins. The 3-dimensional shape of the RHBH creates an electrostatically and spatially favorable glycan binding surface that allows for extensive hydrogen bonding interactions, leading to favorable and stable glycan binding. The RHBH is observed to be an adaptable domain capable of being modified with loop insertions and charge inversions to accommodate heterogeneous and flexible glycans and diverse reaction mechanisms. Understanding this prevalent protein fold can advance our knowledge of glycan binding in biological systems and help guide the efficient design and utilization of RHBH-containing proteins in glycobiology research.
Affiliation(s)
- Audrey A Burnim
- National Library of Medicine, National Institutes of Health, Building 38A, Room 6N607, 8600 Rockville Pike, Bethesda, MD 20894 United States
| | - Keith Dufault-Thompson
- National Library of Medicine, National Institutes of Health, Building 38A, Room 6N607, 8600 Rockville Pike, Bethesda, MD 20894 United States
| | - Xiaofang Jiang
- National Library of Medicine, National Institutes of Health, Building 38A, Room 6N607, 8600 Rockville Pike, Bethesda, MD 20894 United States
| |
|
15
|
Jing H, Gao Z, Xu S, Shen T, Peng Z, He S, You T, Ye S, Lin W, Sun S. Accurate prediction of antibody function and structure using bio-inspired antibody language model. Brief Bioinform 2024; 25:bbae245. [PMID: 38797969 PMCID: PMC11128484 DOI: 10.1093/bib/bbae245] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2023] [Revised: 04/08/2024] [Accepted: 05/07/2024] [Indexed: 05/29/2024] Open
Abstract
In recent decades, antibodies have emerged as indispensable therapeutics for combating diseases, particularly viral infections. However, their development has been hindered by limited structural information and labor-intensive engineering processes. Fortunately, significant advancements in deep learning methods have facilitated the precise prediction of protein structure and function by leveraging co-evolution information from homologous proteins. Despite these advances, predicting the conformation of antibodies remains challenging due to their unique evolution and the high flexibility of their antigen-binding regions. Here, to address this challenge, we present the Bio-inspired Antibody Language Model (BALM). This model is trained on a vast dataset comprising 336 million 40% nonredundant unlabeled antibody sequences, capturing both unique and conserved properties specific to antibodies. Notably, BALM showcases exceptional performance across four antigen-binding prediction tasks. Moreover, we introduce BALMFold, an end-to-end method derived from BALM, capable of swiftly predicting full atomic antibody structures from individual sequences. Remarkably, BALMFold outperforms those well-established methods like AlphaFold2, IgFold, ESMFold and OmegaFold in the antibody benchmark, demonstrating significant potential to advance innovative engineering and streamline therapeutic antibody development by reducing the need for unnecessary trials. The BALMFold structure prediction server is freely available at https://beamlab-sh.com/models/BALMFold.
Affiliation(s)
- Hongtai Jing
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
- MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200032, China
| | - Zhengtao Gao
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
| | - Sheng Xu
- Shanghai AI Laboratory, Shanghai 200232, China
| | - Tao Shen
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Zelixir Biotech, Shanghai 201206, China
| | - Zhangzhi Peng
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
| | - Shwai He
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
| | - Tao You
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
| | - Shuang Ye
- Department of Gynecologic Oncology, Fudan University Shanghai Cancer Center, Shanghai 200032, China
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai 200032, China
| | - Wei Lin
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
- MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200032, China
- Shanghai AI Laboratory, Shanghai 200232, China
- School of Mathematical Sciences and Shanghai Center for Mathematical Sciences, Fudan University, Shanghai 200433, China
| | - Siqi Sun
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Shanghai AI Laboratory, Shanghai 200232, China
| |
|
16
|
Quddusi DM, Hiremath SA, Bajcinca N. Mutation prediction in the SARS-CoV-2 genome using attention-based neural machine translation. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2024; 21:5996-6018. [PMID: 38872567 DOI: 10.3934/mbe.2024264] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2024]
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2) has been evolving rapidly after causing havoc worldwide in 2020. Since then, it has been very hard to contain the virus owing to its frequently mutating nature. Changes in its genome lead to viral evolution, rendering it more resistant to existing vaccines and drugs. Predicting viral mutations beforehand will help in gearing up against more infectious and virulent versions of the virus in turn decreasing the damage caused by them. In this paper, we have proposed different NMT (neural machine translation) architectures based on RNNs (recurrent neural networks) to predict mutations in the SARS-CoV-2-selected non-structural proteins (NSP), i.e., NSP1, NSP3, NSP5, NSP8, NSP9, NSP13, and NSP15. First, we created and pre-processed the pairs of sequences from two languages using k-means clustering and nearest neighbors for training a neural translation machine. We also provided insights for training NMTs on long biological sequences. In addition, we evaluated and benchmarked our models to demonstrate their efficiency and reliability.
Affiliation(s)
- Darrak Moin Quddusi
- Chair of Mechatronics in the Faculty of Mechanical and Process Engineering, Rheinland-Pfalz Technical University of Kaiserslautern-Landau, Kaiserslautern 67663, Germany
| | - Sandesh Athni Hiremath
- Chair of Mechatronics in the Faculty of Mechanical and Process Engineering, Rheinland-Pfalz Technical University of Kaiserslautern-Landau, Kaiserslautern 67663, Germany
| | - Naim Bajcinca
- Chair of Mechatronics in the Faculty of Mechanical and Process Engineering, Rheinland-Pfalz Technical University of Kaiserslautern-Landau, Kaiserslautern 67663, Germany
| |
|
17
|
Zhao N, Zhao W, Tang X, Jiao C, Zhang Z. Design of health information management model for elderly care using an advanced higher-order hybrid clustering algorithm from the perspective of sports and medicine integration. PLoS One 2024; 19:e0302741. [PMID: 38758774 PMCID: PMC11101068 DOI: 10.1371/journal.pone.0302741] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2024] [Accepted: 04/11/2024] [Indexed: 05/19/2024] Open
Abstract
In the context of integrating sports and medicine domains, the urgent resolution of elderly health supervision requires effective data clustering algorithms. This paper introduces a novel higher-order hybrid clustering algorithm that combines density values and the particle swarm optimization (PSO) algorithm. Initially, the traditional PSO algorithm is enhanced by integrating the Global Evolution Dynamic Model (GEDM) into the Distribution Estimation Algorithm (EDA), constructing a weighted covariance matrix-based GEDM. This adapted PSO algorithm dynamically selects between the Global Evolution Dynamic Model and the standard PSO algorithm to update population information, significantly enhancing convergence speed while mitigating the risk of local optima entrapment. Subsequently, the higher-order hybrid clustering algorithm is formulated based on the density value and the refined PSO algorithm. The PSO clustering algorithm is adopted in the initial clustering phase, culminating in class clusters after a finite number of iterations. These clusters then undergo the application of the density peak search algorithm to identify candidate centroids. The final centroids are determined through a fusion of the initial class clusters and the identified candidate centroids. Results showcase remarkable improvements: achieving 99.13%, 82.22%, and 99.22% for F-measure, recall, and precision on dataset S1, and 75.22%, 64.0%, and 64.4% on dataset CMC. Notably, the proposed algorithm yields a 75.22%, 64.4%, and 64.6% rate on dataset S, significantly surpassing the comparative schemes' performance. Moreover, employing the text vector representation of the LDA topic vector model underscores the efficacy of the higher-order hybrid clustering algorithm in efficiently clustering text information. This innovative approach facilitates swift and accurate clustering of elderly health data from the perspective of sports and medicine integration. It enables the identification of patterns and regularities within the data, facilitating the formulation of personalized health management strategies and addressing latent health concerns among the elderly population.
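The density peak search step mentioned in this abstract can be sketched as follows. This is only an illustration of a commonly used density-peak formulation (local density `rho` times distance `delta` to the nearest denser point), assumed here because the abstract does not give the paper's exact variant. The function name, the cutoff, and the toy points are hypothetical, and the PSO stage and the fusion of centroids are not shown.

```python
import math


def density_peak_candidates(points, cutoff, n_candidates):
    """Rank points by rho * delta and return the indices of the top candidates."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho: number of other points strictly within the cutoff distance (local density).
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff) for i in range(n)]
    # delta: distance to the nearest point of strictly higher density; a point with
    # no denser neighbor gets its largest distance to any point (ties are not
    # handled specially in this sketch).
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    score = [r * d for r, d in zip(rho, delta)]
    return sorted(range(n), key=lambda i: score[i], reverse=True)[:n_candidates]


# Hypothetical toy data: two well-separated groups, each with one central point.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.5, 0.5),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (10.5, 10.5),
]
candidates = density_peak_candidates(points, cutoff=1.0, n_candidates=2)
print(sorted(candidates))
```

Points that are both dense and far from any denser point score highest, which is why the central point of each toy group is selected as a candidate centroid.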
Affiliation(s)
- Ning Zhao
- Physical Education Department, Qiqihar Medical University, Qiqihar, Heilongjiang, China
| | - Wenkai Zhao
- The Third Affiliated Hospital of Qiqihar Medical University, Qiqihar, Heilongjiang, China
| | - Xiaoliang Tang
- Physical Education Department, Qiqihar Medical University, Qiqihar, Heilongjiang, China
| | - Chuanming Jiao
- Physical Education Department, Qiqihar Medical University, Qiqihar, Heilongjiang, China
| | - Zhong Zhang
- Physical Education Department, Qiqihar Medical University, Qiqihar, Heilongjiang, China
| |
|
18
|
Johnson SR, Fu X, Viknander S, Goldin C, Monaco S, Zelezniak A, Yang KK. Computational scoring and experimental evaluation of enzymes generated by neural networks. Nat Biotechnol 2024:10.1038/s41587-024-02214-2. [PMID: 38653796 DOI: 10.1038/s41587-024-02214-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Accepted: 03/20/2024] [Indexed: 04/25/2024]
Abstract
In recent years, generative protein sequence models have been developed to sample novel sequences. However, predicting whether generated proteins will fold and function remains challenging. We evaluate a set of 20 diverse computational metrics to assess the quality of enzyme sequences produced by three contrasting generative models: ancestral sequence reconstruction, a generative adversarial network and a protein language model. Focusing on two enzyme families, we expressed and purified over 500 natural and generated sequences with 70-90% identity to the most similar natural sequences to benchmark computational metrics for predicting in vitro enzyme activity. Over three rounds of experiments, we developed a computational filter that improved the rate of experimental success by 50-150%. The proposed metrics and models will drive protein engineering research by serving as a benchmark for generative protein sequence models and helping to select active variants for experimental testing.
Affiliation(s)
| | - Xiaozhi Fu
- Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden
| | - Sandra Viknander
- Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden
| | - Clara Goldin
- Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden
| | | | - Aleksej Zelezniak
- Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden.
- Institute of Biotechnology, Life Sciences Centre, Vilnius University, Vilnius, Lithuania.
- Randall Centre for Cell & Molecular Biophysics, King's College London, Guy's Campus, London, UK.
| | | |
|
19
|
Ertelt M, Meiler J, Schoeder CT. Combining Rosetta Sequence Design with Protein Language Model Predictions Using Evolutionary Scale Modeling (ESM) as Restraint. ACS Synth Biol 2024; 13:1085-1092. [PMID: 38568188 PMCID: PMC11036486 DOI: 10.1021/acssynbio.3c00753] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2023] [Revised: 02/16/2024] [Accepted: 03/20/2024] [Indexed: 04/20/2024]
Abstract
Computational protein sequence design has the ambitious goal of modifying existing or creating new proteins; however, designing stable and functional proteins is challenging without predictability of protein dynamics and allostery. Informing protein design methods with evolutionary information limits the mutational space to more native-like sequences and results in increased stability while maintaining functions. Recently, language models, trained on millions of protein sequences, have shown impressive performance in predicting the effects of mutations. Assessing Rosetta-designed sequences with a language model showed scores that were worse than those of their original sequence. To inform Rosetta design protocols with language model predictions, we added a new metric to restrain the energy function during design using the Evolutionary Scale Modeling (ESM) model. The resulting sequences have better language model scores and similar sequence recovery, with only a minor decrease in the fitness as assessed by Rosetta energy. In conclusion, our work combines the strength of recent machine learning approaches with the Rosetta protein design toolbox.
Affiliation(s)
- Moritz Ertelt
- Institute for Drug Discovery, University Leipzig Medicine Faculty, Liebigstr. 19, D-04103 Leipzig, Germany
- Center for Scalable Data Analytics and Artificial Intelligence ScaDS.AI, D-04105 Leipzig, Germany
| | - Jens Meiler
- Institute for Drug Discovery, University Leipzig Medicine Faculty, Liebigstr. 19, D-04103 Leipzig, Germany
- Center for Scalable Data Analytics and Artificial Intelligence ScaDS.AI, D-04105 Leipzig, Germany
- Department of Chemistry, Vanderbilt University, Nashville, Tennessee 37235, United States
- Center for Structural Biology, Vanderbilt University, Nashville, Tennessee 37235, United States
| | - Clara T. Schoeder
- Institute for Drug Discovery, University Leipzig Medicine Faculty, Liebigstr. 19, D-04103 Leipzig, Germany
- Center for Scalable Data Analytics and Artificial Intelligence ScaDS.AI, D-04105 Leipzig, Germany
| |
|
20
|
He Y, Zhou X, Chang C, Chen G, Liu W, Li G, Fan X, Sun M, Miao C, Huang Q, Ma Y, Yuan F, Chang X. Protein language models-assisted optimization of a uracil-N-glycosylase variant enables programmable T-to-G and T-to-C base editing. Mol Cell 2024; 84:1257-1270.e6. [PMID: 38377993 DOI: 10.1016/j.molcel.2024.01.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2023] [Revised: 12/20/2023] [Accepted: 01/24/2024] [Indexed: 02/22/2024]
Abstract
Current base editors (BEs) use DNA deaminases, including cytidine deaminase in cytidine BE (CBE) or adenine deaminase in adenine BE (ABE), to facilitate transition nucleotide substitutions. Combining CBE or ABE with glycosylase enzymes can induce limited transversion mutations. Nonetheless, a critical demand remains for BEs capable of generating alternative mutation types, such as T>G corrections. In this study, we leveraged pre-trained protein language models to optimize a uracil-N-glycosylase (UNG) variant with altered specificity for thymines (eTDG). Notably, after two rounds of testing fewer than 50 top-ranking variants, more than 50% exhibited over 1.5-fold enhancement in enzymatic activities. When eTDG was fused with nCas9, it induced programmable T-to-S (G/C) substitutions and corrected db/db diabetic mutation in mice (up to 55%). Our findings not only establish orthogonal strategies for developing novel BEs but also demonstrate the capacities of protein language models for optimizing enzymes without extensive task-specific training data.
Affiliation(s)
- Yan He
- Fudan University, 220 Handan Road, Shanghai 200433, China; School of Medicine, Westlake University, Hangzhou, Zhejiang 310014, China; School of Life Sciences, Westlake University, Hangzhou, Zhejiang 310014, China; Research Center for Industries of the Future (RCIF), Westlake University, Hangzhou, Zhejiang 310014, China; Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang 310014, China; Westlake Center for Genome Editing, Westlake Laboratory of Life Sciences and Biomedicine, Westlake University, 18 Shilongshan Road, Hangzhou, Zhejiang 310024, China
| | - Xibin Zhou
- School of Engineering, Westlake University, Hangzhou, Zhejiang 310014, China
| | - Chong Chang
- School of Medicine, Westlake University, Hangzhou, Zhejiang 310014, China; School of Life Sciences, Westlake University, Hangzhou, Zhejiang 310014, China; Research Center for Industries of the Future (RCIF), Westlake University, Hangzhou, Zhejiang 310014, China; Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang 310014, China; Westlake Center for Genome Editing, Westlake Laboratory of Life Sciences and Biomedicine, Westlake University, 18 Shilongshan Road, Hangzhou, Zhejiang 310024, China
| | - Ge Chen
- School of Medicine, Westlake University, Hangzhou, Zhejiang 310014, China; School of Life Sciences, Westlake University, Hangzhou, Zhejiang 310014, China; Research Center for Industries of the Future (RCIF), Westlake University, Hangzhou, Zhejiang 310014, China; Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang 310014, China; Westlake Center for Genome Editing, Westlake Laboratory of Life Sciences and Biomedicine, Westlake University, 18 Shilongshan Road, Hangzhou, Zhejiang 310024, China
| | - Weikuan Liu
- Fudan University, 220 Handan Road, Shanghai 200433, China; School of Medicine, Westlake University, Hangzhou, Zhejiang 310014, China; School of Life Sciences, Westlake University, Hangzhou, Zhejiang 310014, China; Research Center for Industries of the Future (RCIF), Westlake University, Hangzhou, Zhejiang 310014, China; Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang 310014, China; Westlake Center for Genome Editing, Westlake Laboratory of Life Sciences and Biomedicine, Westlake University, 18 Shilongshan Road, Hangzhou, Zhejiang 310024, China
| | - Geng Li
- School of Medicine, Westlake University, Hangzhou, Zhejiang 310014, China; School of Life Sciences, Westlake University, Hangzhou, Zhejiang 310014, China; Research Center for Industries of the Future (RCIF), Westlake University, Hangzhou, Zhejiang 310014, China; Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang 310014, China; Westlake Center for Genome Editing, Westlake Laboratory of Life Sciences and Biomedicine, Westlake University, 18 Shilongshan Road, Hangzhou, Zhejiang 310024, China
| | - Xiaoqi Fan
- School of Medicine, Westlake University, Hangzhou, Zhejiang 310014, China; School of Life Sciences, Westlake University, Hangzhou, Zhejiang 310014, China; Research Center for Industries of the Future (RCIF), Westlake University, Hangzhou, Zhejiang 310014, China; Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang 310014, China; Westlake Center for Genome Editing, Westlake Laboratory of Life Sciences and Biomedicine, Westlake University, 18 Shilongshan Road, Hangzhou, Zhejiang 310024, China
| | - Mingsun Sun
- School of Medicine, Westlake University, Hangzhou, Zhejiang 310014, China; School of Life Sciences, Westlake University, Hangzhou, Zhejiang 310014, China; Research Center for Industries of the Future (RCIF), Westlake University, Hangzhou, Zhejiang 310014, China; Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang 310014, China; Westlake Center for Genome Editing, Westlake Laboratory of Life Sciences and Biomedicine, Westlake University, 18 Shilongshan Road, Hangzhou, Zhejiang 310024, China
| | - Chensi Miao
- School of Medicine, Westlake University, Hangzhou, Zhejiang 310014, China; School of Life Sciences, Westlake University, Hangzhou, Zhejiang 310014, China; Research Center for Industries of the Future (RCIF), Westlake University, Hangzhou, Zhejiang 310014, China; Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang 310014, China; Westlake Center for Genome Editing, Westlake Laboratory of Life Sciences and Biomedicine, Westlake University, 18 Shilongshan Road, Hangzhou, Zhejiang 310024, China
| | - Qianyue Huang
- School of Medicine, Westlake University, Hangzhou, Zhejiang 310014, China; School of Life Sciences, Westlake University, Hangzhou, Zhejiang 310014, China; Research Center for Industries of the Future (RCIF), Westlake University, Hangzhou, Zhejiang 310014, China; Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang 310014, China; Westlake Center for Genome Editing, Westlake Laboratory of Life Sciences and Biomedicine, Westlake University, 18 Shilongshan Road, Hangzhou, Zhejiang 310024, China
| | - Yunqing Ma
- School of Medicine, Westlake University, Hangzhou, Zhejiang 310014, China; School of Life Sciences, Westlake University, Hangzhou, Zhejiang 310014, China; Research Center for Industries of the Future (RCIF), Westlake University, Hangzhou, Zhejiang 310014, China; Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang 310014, China; Westlake Center for Genome Editing, Westlake Laboratory of Life Sciences and Biomedicine, Westlake University, 18 Shilongshan Road, Hangzhou, Zhejiang 310024, China
| | - Fajie Yuan
- School of Engineering, Westlake University, Hangzhou, Zhejiang 310014, China.
| | - Xing Chang
- School of Medicine, Westlake University, Hangzhou, Zhejiang 310014, China; School of Life Sciences, Westlake University, Hangzhou, Zhejiang 310014, China; Research Center for Industries of the Future (RCIF), Westlake University, Hangzhou, Zhejiang 310014, China; Institute of Basic Medical Sciences, Westlake Institute for Advanced Study, Hangzhou, Zhejiang 310014, China; Westlake Center for Genome Editing, Westlake Laboratory of Life Sciences and Biomedicine, Westlake University, 18 Shilongshan Road, Hangzhou, Zhejiang 310024, China.
| |
|
21
|
Yang KK, Fusi N, Lu AX. Convolutions are competitive with transformers for protein sequence pretraining. Cell Syst 2024; 15:286-294.e2. [PMID: 38428432 DOI: 10.1016/j.cels.2024.01.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2023] [Revised: 11/08/2023] [Accepted: 01/24/2024] [Indexed: 03/03/2024]
Abstract
Pretrained protein sequence language models have been shown to improve the performance of many prediction tasks and are now routinely integrated into bioinformatics tools. However, these models largely rely on the transformer architecture, which scales quadratically with sequence length in both run-time and memory. Therefore, state-of-the-art models have limitations on sequence length. To address this limitation, we investigated whether convolutional neural network (CNN) architectures, which scale linearly with sequence length, could be as effective as transformers in protein language models. With masked language model pretraining, CNNs are competitive with, and occasionally superior to, transformers across downstream applications while maintaining strong performance on sequences longer than those allowed in the current state-of-the-art transformer models. Our work suggests that computational efficiency can be improved without sacrificing performance, simply by using a CNN architecture instead of a transformer, and emphasizes the importance of disentangling pretraining task and model architecture. A record of this paper's transparent peer review process is included in the supplemental information.
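The scaling contrast stated in this abstract (quadratic for self-attention, linear for convolution) can be made concrete with a short arithmetic sketch. The function names and the kernel size of 5 are assumptions chosen for illustration, and the counts are per layer and ignore constant factors such as heads and channels.

```python
def attention_pair_count(seq_len: int) -> int:
    """Self-attention scores every position against every position: L * L pairs."""
    return seq_len * seq_len


def convolution_window_ops(seq_len: int, kernel_size: int) -> int:
    """A 1D convolution reads kernel_size positions per output position: L * k."""
    return seq_len * kernel_size


# Doubling the sequence length quadruples the attention cost but only doubles the convolution cost.
for length in (512, 1024, 2048):
    print(length, attention_pair_count(length), convolution_window_ops(length, kernel_size=5))
```

The gap between the two columns widens as sequences get longer, which is the reason length limits appear in transformer-based models.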
Affiliation(s)
- Kevin K Yang
- Microsoft Research New England, Cambridge, MA 02139, USA.
| | - Nicolo Fusi
- Microsoft Research New England, Cambridge, MA 02139, USA
| | - Alex X Lu
- Microsoft Research New England, Cambridge, MA 02139, USA
| |
|
22
|
Barton J, Gaspariunas A, Galson JD, Leem J. Building Representation Learning Models for Antibody Comprehension. Cold Spring Harb Perspect Biol 2024; 16:a041462. [PMID: 38012013 PMCID: PMC10910360 DOI: 10.1101/cshperspect.a041462] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
Antibodies are versatile proteins with both the capacity to bind a broad range of targets and a proven track record as some of the most successful therapeutics. However, the development of novel antibody therapeutics is a lengthy and costly process. It is challenging to predict the functional and biophysical properties of antibodies from their amino acid sequence alone, requiring numerous experiments for full characterization. Machine learning, specifically deep representation learning, has emerged as a family of methods that can complement wet lab approaches and accelerate the overall discovery and engineering process. Here, we review advances in antibody sequence representation learning, and how this has improved antibody structure prediction and facilitated antibody optimization. We discuss challenges in the development and implementation of such models, such as the lack of publicly available, well-curated antibody function data and highlight opportunities for improvement. These and future advances in machine learning for antibody sequences have the potential to increase the success rate in developing new therapeutics, resulting in broader access to transformative medicines and improved patient outcomes.
Affiliation(s)
- Justin Barton
- Alchemab Therapeutics Ltd, London N1C 4AX, United Kingdom
| | | | - Jacob D Galson
- Alchemab Therapeutics Ltd, London N1C 4AX, United Kingdom
| | - Jinwoo Leem
- Alchemab Therapeutics Ltd, London N1C 4AX, United Kingdom
| |
|
23
|
Yang J, Li FZ, Arnold FH. Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering. ACS CENTRAL SCIENCE 2024; 10:226-241. [PMID: 38435522 PMCID: PMC10906252 DOI: 10.1021/acscentsci.3c01275] [Citation(s) in RCA: 25] [Impact Index Per Article: 25.0] [Received: 10/17/2023] [Revised: 12/26/2023] [Accepted: 01/16/2024] [Indexed: 03/05/2024]
Abstract
Enzymes can be engineered at the level of their amino acid sequences to optimize key properties such as expression, stability, substrate range, and catalytic efficiency-or even to unlock new catalytic activities not found in nature. Because the search space of possible proteins is vast, enzyme engineering usually involves discovering an enzyme starting point that has some level of the desired activity followed by directed evolution to improve its "fitness" for a desired application. Recently, machine learning (ML) has emerged as a powerful tool to complement this empirical process. ML models can contribute to (1) starting point discovery by functional annotation of known protein sequences or generating novel protein sequences with desired functions and (2) navigating protein fitness landscapes for fitness optimization by learning mappings between protein sequences and their associated fitness values. In this Outlook, we explain how ML complements enzyme engineering and discuss its future potential to unlock improved engineering outcomes.
Affiliation(s)
- Jason Yang
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Francesca-Zhoufan Li
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Frances H. Arnold
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
| |
|
24
|
Chu HY, Fong JHC, Thean DGL, Zhou P, Fung FKC, Huang Y, Wong ASL. Accurate top protein variant discovery via low-N pick-and-validate machine learning. Cell Syst 2024; 15:193-203.e6. [PMID: 38340729 DOI: 10.1016/j.cels.2024.01.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 03/21/2023] [Revised: 10/11/2023] [Accepted: 01/18/2024] [Indexed: 02/12/2024]
Abstract
A strategy to obtain the greatest number of best-performing variants with the least amount of experimental effort over the vast combinatorial mutational landscape would have enormous utility in boosting resource producibility for protein engineering. Toward this goal, we present a simple and effective machine learning-based strategy that outperforms other state-of-the-art methods. Our strategy integrates zero-shot prediction and multi-round sampling to direct active learning via experimenting with only a few predicted top variants. We find that four rounds of low-N pick-and-validate sampling of 12 variants for machine learning yielded the best accuracy of up to 92.6% in selecting the true top 1% variants in combinatorial mutant libraries, whereas two rounds of 24 variants can also be used. We demonstrate our strategy in successfully discovering high-performance protein variants from diverse families including CRISPR-based genome editors, supporting its generalizable application for solving protein engineering tasks. A record of this paper's transparent peer review process is included in the supplemental information.
Affiliation(s)
- Hoi Yee Chu
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
| | - John H C Fong
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
| | - Dawn G L Thean
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
| | - Peng Zhou
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
| | - Frederic K C Fung
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
| | - Yuanhua Huang
- School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Department of Statistics and Actuarial Science, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
| | - Alan S L Wong
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China.
| |
|
25
|
Hie BL, Shanker VR, Xu D, Bruun TUJ, Weidenbacher PA, Tang S, Wu W, Pak JE, Kim PS. Efficient evolution of human antibodies from general protein language models. Nat Biotechnol 2024; 42:275-283. [PMID: 37095349 PMCID: PMC10869273 DOI: 10.1038/s41587-023-01763-2] [Citation(s) in RCA: 104] [Impact Index Per Article: 104.0] [Received: 11/23/2022] [Accepted: 03/28/2023] [Indexed: 04/26/2023]
Abstract
Natural evolution must explore a vast landscape of possible sequences for desirable yet rare mutations, suggesting that learning from natural evolutionary strategies could guide artificial evolution. Here we report that general protein language models can efficiently evolve human antibodies by suggesting mutations that are evolutionarily plausible, despite providing the model with no information about the target antigen, binding specificity or protein structure. We performed language-model-guided affinity maturation of seven antibodies, screening 20 or fewer variants of each antibody across only two rounds of laboratory evolution, and improved the binding affinities of four clinically relevant, highly mature antibodies up to sevenfold and three unmatured antibodies up to 160-fold, with many designs also demonstrating favorable thermostability and viral neutralization activity against Ebola and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pseudoviruses. The same models that improve antibody binding also guide efficient evolution across diverse protein families and selection pressures, including antibiotic resistance and enzyme activity, suggesting that these results generalize to many settings.
Affiliation(s)
- Brian L Hie
- Department of Biochemistry, Stanford University School of Medicine, Stanford, CA, USA.
- Sarafan ChEM-H, Stanford University, Stanford, CA, USA.
| | - Varun R Shanker
- Sarafan ChEM-H, Stanford University, Stanford, CA, USA
- Stanford Medical Scientist Training Program, Stanford University School of Medicine, Stanford, CA, USA
| | - Duo Xu
- Department of Biochemistry, Stanford University School of Medicine, Stanford, CA, USA
- Sarafan ChEM-H, Stanford University, Stanford, CA, USA
| | - Theodora U J Bruun
- Department of Biochemistry, Stanford University School of Medicine, Stanford, CA, USA
- Sarafan ChEM-H, Stanford University, Stanford, CA, USA
- Stanford Medical Scientist Training Program, Stanford University School of Medicine, Stanford, CA, USA
| | - Payton A Weidenbacher
- Sarafan ChEM-H, Stanford University, Stanford, CA, USA
- Department of Chemistry, Stanford University, Stanford, CA, USA
| | - Shaogeng Tang
- Department of Biochemistry, Stanford University School of Medicine, Stanford, CA, USA
- Sarafan ChEM-H, Stanford University, Stanford, CA, USA
| | - Wesley Wu
- Chan Zuckerberg Biohub, San Francisco, CA, USA
| | - John E Pak
- Chan Zuckerberg Biohub, San Francisco, CA, USA
| | - Peter S Kim
- Department of Biochemistry, Stanford University School of Medicine, Stanford, CA, USA.
- Sarafan ChEM-H, Stanford University, Stanford, CA, USA.
- Chan Zuckerberg Biohub, San Francisco, CA, USA.
| |
|
26
|
Harari S, Miller D, Fleishon S, Burstein D, Stern A. Using big sequencing data to identify chronic SARS-Coronavirus-2 infections. Nat Commun 2024; 15:648. [PMID: 38245511 PMCID: PMC10799923 DOI: 10.1038/s41467-024-44803-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 09/04/2023] [Accepted: 01/04/2024] [Indexed: 01/22/2024]
Abstract
The evolution of SARS-Coronavirus-2 (SARS-CoV-2) has been characterized by the periodic emergence of highly divergent variants. One leading hypothesis suggests these variants may have emerged during chronic infections of immunocompromised individuals, but limited data from these cases hinders comprehensive analyses. Here, we harnessed millions of SARS-CoV-2 genomes to identify potential chronic infections and used language models (LM) to infer chronic-associated mutations. First, we mined the SARS-CoV-2 phylogeny and identified chronic-like clades with identical metadata (location, age, and sex) spanning over 21 days, suggesting a prolonged infection. We inferred 271 chronic-like clades, which exhibited characteristics similar to confirmed chronic infections. Chronic-associated mutations were often high-fitness immune-evasive mutations located in the spike receptor-binding domain (RBD), yet a minority were unique to chronic infections and absent in global settings. The probability of observing high-fitness RBD mutations was 10-20 times higher in chronic infections than in global transmission chains. The majority of RBD mutations in BA.1/BA.2 chronic-like clades bore predictive value, i.e., went on to display global success. Finally, we used our LM to infer hundreds of additional chronic-like clades in the absence of metadata. Our approach allows mining extensive sequencing data and providing insights into future evolutionary patterns of SARS-CoV-2.
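The metadata criterion mentioned in this abstract (identical location, age, and sex, with sampling dates spanning more than 21 days) can be illustrated with a minimal Python sketch. This is not the authors' pipeline, which works on clades mined from the SARS-CoV-2 phylogeny; the function name `is_chronic_like`, the dictionary keys, and the example records are hypothetical and only show the filtering idea.

```python
from datetime import date


def is_chronic_like(samples, min_span_days=21):
    """Return True if all samples share identical metadata and their
    sampling dates span more than `min_span_days` days.

    `samples` is a list of dicts with the (assumed) keys
    'location', 'age', 'sex', and 'date' (a datetime.date).
    """
    if len(samples) < 2:
        return False  # a span needs at least two sampling dates
    # All samples must carry exactly the same (location, age, sex) triple.
    metadata = {(s["location"], s["age"], s["sex"]) for s in samples}
    if len(metadata) != 1:
        return False
    dates = [s["date"] for s in samples]
    return (max(dates) - min(dates)).days > min_span_days


# Hypothetical clade: same person-like metadata, sampled 30 days apart.
clade = [
    {"location": "Tel Aviv", "age": 67, "sex": "F", "date": date(2022, 1, 1)},
    {"location": "Tel Aviv", "age": 67, "sex": "F", "date": date(2022, 1, 31)},
]
print(is_chronic_like(clade))
```

The strict inequality mirrors the abstract's "spanning over 21 days"; any real analysis would also need the phylogenetic step of grouping sequences into clades, which is omitted here.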
Affiliation(s)
- Sheri Harari
- The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Tel Aviv, Israel
- Edmond J. Safra Center for Bioinformatics, Tel Aviv University, Tel Aviv, Israel
| | - Danielle Miller
- The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Tel Aviv, Israel
- Edmond J. Safra Center for Bioinformatics, Tel Aviv University, Tel Aviv, Israel
| | - Shay Fleishon
- Israeli Health Intelligence Agency, Public Health Division, Ministry of Health, Jerusalem, Israel
| | - David Burstein
- The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Tel Aviv, Israel
- Edmond J. Safra Center for Bioinformatics, Tel Aviv University, Tel Aviv, Israel
| | - Adi Stern
- The Shmunis School of Biomedicine and Cancer Research, Tel Aviv University, Tel Aviv, Israel.
- Edmond J. Safra Center for Bioinformatics, Tel Aviv University, Tel Aviv, Israel.
| |
|
27
|
Irvine EB, Reddy ST. Advancing Antibody Engineering through Synthetic Evolution and Machine Learning. J Immunol 2024; 212:235-243. [PMID: 38166249 DOI: 10.4049/jimmunol.2300492] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2023] [Accepted: 10/20/2023] [Indexed: 01/04/2024]
Abstract
Abs are versatile molecules with the potential to achieve exceptional binding to target Ags, while also possessing biophysical properties suitable for therapeutic drug development. Protein display and directed evolution systems have transformed synthetic Ab discovery, engineering, and optimization, vastly expanding the number of Ab clones able to be experimentally screened for binding. Moreover, the burgeoning integration of high-throughput screening, deep sequencing, and machine learning has further augmented in vitro Ab optimization, promising to accelerate the design process and massively expand the Ab sequence space interrogated. In this Brief Review, we discuss the experimental and computational tools employed in synthetic Ab engineering and optimization. We also explore the therapeutic challenges posed by developing Abs for infectious diseases, and the prospects for leveraging machine learning-guided protein engineering to prospectively design Abs resistant to viral escape.
Affiliation(s)
- Edward B Irvine
- Department of Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland
| | - Sai T Reddy
- Department of Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland
| |
|
28
|
Pantolini L, Studer G, Pereira J, Durairaj J, Tauriello G, Schwede T. Embedding-based alignment: combining protein language models with dynamic programming alignment to detect structural similarities in the twilight-zone. Bioinformatics 2024; 40:btad786. [PMID: 38175775 PMCID: PMC10792726 DOI: 10.1093/bioinformatics/btad786] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 07/10/2023] [Revised: 10/27/2023] [Accepted: 12/29/2023] [Indexed: 01/06/2024]
Abstract
MOTIVATION: Language models are routinely used for text classification and generative tasks. Recently, the same architectures were applied to protein sequences, unlocking powerful new approaches in the bioinformatics field. Protein language models (pLMs) generate high-dimensional embeddings on a per-residue level and encode a "semantic meaning" of each individual amino acid in the context of the full protein sequence. These representations have been used as a starting point for downstream learning tasks and, more recently, for identifying distant homologous relationships between proteins. RESULTS: In this work, we introduce a new method that generates embedding-based protein sequence alignments (EBA) and show how these capture structural similarities even in the twilight zone, outperforming both classical methods as well as other approaches based on pLMs. The method shows excellent accuracy despite the absence of training and parameter optimization. We demonstrate that the combination of pLMs with alignment methods is a valuable approach for the detection of relationships between proteins in the twilight zone. AVAILABILITY AND IMPLEMENTATION: The code to run EBA and reproduce the analysis described in this article is available at: https://git.scicore.unibas.ch/schwede/EBA and https://git.scicore.unibas.ch/schwede/eba_benchmark.
Affiliation(s)
- Lorenzo Pantolini
- Biozentrum, University of Basel, Basel 4056, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland
| | - Gabriel Studer
- Biozentrum, University of Basel, Basel 4056, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland
| | - Joana Pereira
- Biozentrum, University of Basel, Basel 4056, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland
| | - Janani Durairaj
- Biozentrum, University of Basel, Basel 4056, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland
| | - Gerardo Tauriello
- Biozentrum, University of Basel, Basel 4056, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland
| | - Torsten Schwede
- Biozentrum, University of Basel, Basel 4056, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel 4056, Switzerland
| |
|
29
|
Doran BA, Chen RY, Giba H, Behera V, Barat B, Sundararajan A, Lin H, Sidebottom A, Pamer EG, Raman AS. An evolution-based framework for describing human gut bacteria. bioRxiv 2023:2023.12.04.569969. [PMID: 38105970 PMCID: PMC10723311 DOI: 10.1101/2023.12.04.569969] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/19/2023]
Abstract
The human gut microbiome contains many bacterial strains of the same species ('strain-level variants'). Describing strains in a biologically meaningful way rather than purely taxonomically is an important goal but challenging due to the genetic complexity of strain-level variation. Here, we measured patterns of co-evolution across >7,000 strains spanning the bacterial tree-of-life. Using these patterns as a prior for studying hundreds of gut commensal strains that we isolated, sequenced, and metabolically profiled revealed widespread structure beneath the phylogenetic level of species. Defining strains by their co-evolutionary signatures enabled predicting their metabolic phenotypes and engineering consortia from strain genome content alone. Our findings demonstrate a biologically relevant organization to strain-level variation and motivate a new schema for describing bacterial strains based on their evolutionary history.
Affiliation(s)
- Benjamin A. Doran
- Duchossois Family Institute, University of Chicago, Chicago, IL, 60637
- Pritzker School of Molecular Engineering, University of Chicago, Chicago, IL, 60637
| | - Robert Y. Chen
- Department of Psychiatry, University of Washington, Seattle, WA, 98195
| | - Hannah Giba
- Duchossois Family Institute, University of Chicago, Chicago, IL, 60637
- Department of Pathology, University of Chicago, Chicago, IL, 60637
| | - Vivek Behera
- Department of Medicine, University of Chicago, Chicago, IL, 60637
| | - Bidisha Barat
- Duchossois Family Institute, University of Chicago, Chicago, IL, 60637
| | | | - Huaiying Lin
- Duchossois Family Institute, University of Chicago, Chicago, IL, 60637
| | - Ashley Sidebottom
- Duchossois Family Institute, University of Chicago, Chicago, IL, 60637
| | - Eric G. Pamer
- Duchossois Family Institute, University of Chicago, Chicago, IL, 60637
- Department of Medicine, University of Chicago, Chicago, IL, 60637
| | - Arjun S. Raman
- Duchossois Family Institute, University of Chicago, Chicago, IL, 60637
- Department of Pathology, University of Chicago, Chicago, IL, 60637
- Center for the Physics of Evolving Systems, University of Chicago, Chicago, IL, 60637
| |
|
30
|
Zhao W, Luo X, Tong F, Zheng X, Li J, Zhao G, Zhao D. Improving antibody optimization ability of generative adversarial network through large language model. Comput Struct Biotechnol J 2023; 21:5839-5850. [PMID: 38074472 PMCID: PMC10698008 DOI: 10.1016/j.csbj.2023.11.041] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/07/2023] [Revised: 11/21/2023] [Accepted: 11/21/2023] [Indexed: 10/16/2024]
Abstract
Generative adversarial networks (GANs) have successfully generated functional protein sequences. However, traditional GANs often suffer from inherent randomness, resulting in a lower probability of obtaining desirable sequences. Due to the high cost of wet-lab experiments, the main goal of computer-aided antibody optimization is to identify high-quality candidate antibodies from a large range of possibilities, yet improving the ability of GANs to generate these desired antibodies is a challenge. In this study, we propose and evaluate a new GAN called the Language Model Guided Antibody Generative Adversarial Network (AbGAN-LMG). This GAN uses a language model as an input, harnessing such models' powerful representational capabilities to improve the GAN's generation of high-quality antibodies. We conducted a comprehensive evaluation of the antibody libraries and sequences generated by AbGAN-LMG for COVID-19 (SARS-CoV-2) and Middle East Respiratory Syndrome (MERS-CoV). Results indicate that AbGAN-LMG has learned the fundamental characteristics of antibodies and that it improved the diversity of the generated libraries. Additionally, when generating sequences using AZD-8895 as the target antibody for optimization, over 50% of the generated sequences exhibited better developability than AZD-8895 itself. Through molecular docking, we identified 70 antibodies that demonstrated higher affinity for the wild-type receptor-binding domain (RBD) of SARS-CoV-2 compared to AZD-8895. In conclusion, AbGAN-LMG demonstrates that language models used in conjunction with GANs can enable the generation of higher-quality libraries and candidate sequences, thereby improving the efficiency of antibody optimization. AbGAN-LMG is available at http://39.102.71.224:88/.
Affiliation(s)
- Wenbin Zhao
- Academy of Military Medical Sciences, Beijing 100850, China
| | - Xiaowei Luo
- Academy of Military Medical Sciences, Beijing 100850, China
| | - Fan Tong
- Academy of Military Medical Sciences, Beijing 100850, China
| | - Xiangwen Zheng
- Academy of Military Medical Sciences, Beijing 100850, China
| | - Jing Li
- Beijing Institute of Microbiology and Epidemiology, State Key Laboratory of Pathogen and Biosecurity, Beijing 100071, China
| | - Guangyu Zhao
- Beijing Institute of Microbiology and Epidemiology, State Key Laboratory of Pathogen and Biosecurity, Beijing 100071, China
| | - Dongsheng Zhao
- Academy of Military Medical Sciences, Beijing 100850, China
| |
|
31
|
Umerenkov D, Nikolaev F, Shashkova TI, Strashnov PV, Sindeeva M, Shevtsov A, Ivanisenko NV, Kardymon OL. PROSTATA: a framework for protein stability assessment using transformers. Bioinformatics 2023; 39:btad671. [PMID: 37935419 PMCID: PMC10651431 DOI: 10.1093/bioinformatics/btad671] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2023] [Revised: 10/25/2023] [Accepted: 11/02/2023] [Indexed: 11/09/2023]
Abstract
MOTIVATION: Accurate prediction of change in protein stability due to point mutations is an attractive goal that remains unachieved. Despite the high interest in this area, little consideration has been given to the transformer architecture, which is dominant in many fields of machine learning. RESULTS: In this work, we introduce PROSTATA, a predictive model built in a knowledge-transfer fashion on a new curated dataset. PROSTATA demonstrates an advantage over existing solutions based on neural networks. We show that the large improvement margin is due to both the architecture of the model and the quality of the new training dataset. This work opens up opportunities to develop new lightweight and accurate models for protein stability assessment. AVAILABILITY AND IMPLEMENTATION: PROSTATA is available at https://github.com/AIRI-Institute/PROSTATA and https://prostata.airi.net.
Affiliation(s)
| | | | | | - Pavel V Strashnov
- Bioinformatics Group, AIRI, Moscow 121170, Russia
- Department of Computer Design and Technology, Bauman Moscow State Technical University, Moscow 105005, Russia
| | | | - Andrey Shevtsov
- Bioinformatics Group, AIRI, Moscow 121170, Russia
- Regulatory Transcriptomics and Epigenomics Group, Institute of Bioengineering, Research Center of Biotechnology RAS, Moscow 117036, Russia
| | - Nikita V Ivanisenko
- Bioinformatics Group, AIRI, Moscow 121170, Russia
- Laboratory of Computational Proteomics, Institute of Cytology and Genetics SB RAS, Novosibirsk 630090, Russia
| | | |
|
32
|
Yang J, Ducharme J, Johnston KE, Li FZ, Yue Y, Arnold FH. DeCOIL: Optimization of Degenerate Codon Libraries for Machine Learning-Assisted Protein Engineering. ACS Synth Biol 2023; 12:2444-2454. [PMID: 37524064 DOI: 10.1021/acssynbio.3c00301] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Indexed: 08/02/2023]
Abstract
With advances in machine learning (ML)-assisted protein engineering, models based on data, biophysics, and natural evolution are being used to propose informed libraries of protein variants to explore. Synthesizing these libraries for experimental screens is a major bottleneck, as the cost of obtaining large numbers of exact gene sequences is often prohibitive. Degenerate codon (DC) libraries are a cost-effective alternative for generating combinatorial mutagenesis libraries where mutations are targeted to a handful of amino acid sites. However, existing computational methods to optimize DC libraries to include desired protein variants are not well suited to design libraries for ML-assisted protein engineering. To address these drawbacks, we present DEgenerate Codon Optimization for Informed Libraries (DeCOIL), a generalized method that directly optimizes DC libraries to be useful for protein engineering: to sample protein variants that are likely to have both high fitness and high diversity in the sequence search space. Using computational simulations and wet-lab experiments, we demonstrate that DeCOIL is effective across two specific case studies, with the potential to be applied to many other use cases. DeCOIL offers several advantages over existing methods, as it is direct, easy to use, generalizable, and scalable. With accompanying software (https://github.com/jsunn-y/DeCOIL), DeCOIL can be readily implemented to generate desired informed libraries.
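As background for the degenerate codon (DC) libraries discussed in this abstract, the following minimal Python sketch expands a degenerate codon into the concrete codons and amino acids it encodes. It is not DeCOIL and does not reproduce its optimization; it only illustrates what a DC encodes, using the standard IUPAC nucleotide ambiguity codes and the standard genetic code. The function names (`expand_codon`, `amino_acid_counts`, `library_size`) are hypothetical.

```python
from collections import Counter
from itertools import product

# IUPAC nucleotide ambiguity codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, with codons enumerated in TCAG order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def expand_codon(degenerate_codon):
    """List every concrete codon encoded by a 3-letter degenerate codon."""
    options = [IUPAC[base] for base in degenerate_codon.upper()]
    return ["".join(bases) for bases in product(*options)]


def amino_acid_counts(degenerate_codon):
    """Count how many concrete codons translate to each amino acid."""
    return Counter(CODON_TABLE[c] for c in expand_codon(degenerate_codon))


def library_size(degenerate_codons):
    """Number of distinct DNA variants when one DC is used per site."""
    size = 1
    for dc in degenerate_codons:
        size *= len(expand_codon(dc))
    return size


counts = amino_acid_counts("NNK")
print(len(expand_codon("NNK")), "codons;", counts["*"], "stop codon")
```

For the widely used NNK codon this gives 32 codons covering all 20 amino acids plus one stop codon (TAG), which shows why the choice of DC at each site shapes which protein variants a library can sample.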
Affiliation(s)
- Jason Yang
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Julie Ducharme
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Kadina E Johnston
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Francesca-Zhoufan Li
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Yisong Yue
- Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, California 91125, United States
| | - Frances H Arnold
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
| |
|
33
|
Han W, Chen N, Xu X, Sahil A, Zhou J, Li Z, Zhong H, Gao E, Zhang R, Wang Y, Sun S, Cheung PPH, Gao X. Predicting the antigenic evolution of SARS-COV-2 with deep learning. Nat Commun 2023; 14:3478. [PMID: 37311849 PMCID: PMC10261845 DOI: 10.1038/s41467-023-39199-6] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 06/29/2022] [Accepted: 05/31/2023] [Indexed: 06/15/2023]
Abstract
The relentless evolution of SARS-CoV-2 poses a significant threat to public health, as it adapts to immune pressure from vaccines and natural infections. Gaining insights into potential antigenic changes is critical but challenging due to the vast sequence space. Here, we introduce the Machine Learning-guided Antigenic Evolution Prediction (MLAEP), which combines structure modeling, multi-task learning, and genetic algorithms to predict the viral fitness landscape and explore antigenic evolution via in silico directed evolution. By analyzing existing SARS-CoV-2 variants, MLAEP accurately infers variant order along antigenic evolutionary trajectories, correlating with corresponding sampling time. Our approach identified novel mutations in immunocompromised COVID-19 patients and emerging variants like XBB.1.5. Additionally, MLAEP predictions were validated through in vitro neutralizing antibody binding assays, demonstrating that the predicted variants exhibited enhanced immune evasion. By profiling existing variants and predicting potential antigenic changes, MLAEP aids in vaccine development and enhances preparedness against future SARS-CoV-2 variants.
Affiliation(s)
- Wenkai Han
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Kingdom of Saudi Arabia
| | - Ningning Chen
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Kingdom of Saudi Arabia
| | - Xinzhou Xu
- Department of Chemical Pathology, Faculty of Medicine, Chinese University of Hong Kong, Hong Kong, China
- Li Ka Shing Institute of Health Sciences, Chinese University of Hong Kong, Hong Kong, China
| | - Adil Sahil
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Kingdom of Saudi Arabia
| | - Juexiao Zhou
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Kingdom of Saudi Arabia
| | - Zhongxiao Li
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Kingdom of Saudi Arabia
| | - Huawen Zhong
- Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Kingdom of Saudi Arabia
| | - Elva Gao
- The KAUST School, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Kingdom of Saudi Arabia
| | | | - Yu Wang
- Syneron Technology, Guangzhou, 510000, China
| | - Shiwei Sun
- Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China.
- University of Chinese Academy of Sciences, Beijing, 100049, China.
| | - Peter Pak-Hang Cheung
- Department of Chemical Pathology, Faculty of Medicine, Chinese University of Hong Kong, Hong Kong, China.
- Li Ka Shing Institute of Health Sciences, Chinese University of Hong Kong, Hong Kong, China.
| | - Xin Gao
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Kingdom of Saudi Arabia.
- Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Kingdom of Saudi Arabia.
| |
|
34
|
Ruffolo JA, Chu LS, Mahajan SP, Gray JJ. Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Nat Commun 2023; 14:2389. [PMID: 37185622 PMCID: PMC10129313 DOI: 10.1038/s41467-023-38063-x] [Citation(s) in RCA: 76] [Impact Index Per Article: 38.0] [Received: 04/21/2022] [Accepted: 04/14/2023] [Indexed: 05/17/2023]
Abstract
Antibodies have the capacity to bind a diverse set of antigens, and they have become critical therapeutics and diagnostic molecules. The binding of antibodies is facilitated by a set of six hypervariable loops that are diversified through genetic recombination and mutation. Even with recent advances, accurate structural prediction of these loops remains a challenge. Here, we present IgFold, a fast deep learning method for antibody structure prediction. IgFold consists of a pre-trained language model trained on 558 million natural antibody sequences followed by graph networks that directly predict backbone atom coordinates. IgFold predicts structures of similar or better quality than alternative methods (including AlphaFold) in significantly less time (under 25 s). Accurate structure prediction on this timescale makes possible avenues of investigation that were previously infeasible. As a demonstration of IgFold's capabilities, we predicted structures for 1.4 million paired antibody sequences, providing structural insights to 500-fold more antibodies than have experimentally determined structures.
Affiliation(s)
- Jeffrey A Ruffolo
- Program in Molecular Biophysics, The Johns Hopkins University, Baltimore, MD, 21218, USA
| | - Lee-Shin Chu
- Department of Chemical and Biomolecular Engineering, The Johns Hopkins University, Baltimore, MD, 21218, USA
| | - Sai Pooja Mahajan
- Department of Chemical and Biomolecular Engineering, The Johns Hopkins University, Baltimore, MD, 21218, USA
| | - Jeffrey J Gray
- Program in Molecular Biophysics, The Johns Hopkins University, Baltimore, MD, 21218, USA.
- Department of Chemical and Biomolecular Engineering, The Johns Hopkins University, Baltimore, MD, 21218, USA.
| |
|
35
|
Li M, Kang L, Xiong Y, Wang YG, Fan G, Tan P, Hong L. SESNet: sequence-structure feature-integrated deep learning method for data-efficient protein engineering. J Cheminform 2023; 15:12. [PMID: 36737798 PMCID: PMC9898993 DOI: 10.1186/s13321-023-00688-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2022] [Accepted: 01/23/2023] [Indexed: 02/05/2023] Open
Abstract
Deep learning has been widely used for protein engineering. However, it is limited by the lack of sufficient experimental data to train an accurate model for predicting the functional fitness of high-order mutants. Here, we develop SESNet, a supervised deep-learning model that predicts the fitness of protein mutants by leveraging both sequence and structure information and exploiting an attention mechanism. Our model integrates the local evolutionary context from homologous sequences, the global evolutionary context encoding rich semantics from the universal protein sequence space, and the structure information accounting for the microenvironment around each residue in a protein. We show that SESNet outperforms state-of-the-art models for predicting the sequence-function relationship on 26 deep mutational scanning datasets. More importantly, we propose a data augmentation strategy that leverages data from unsupervised models to pre-train our model. After that, our model can achieve strikingly high accuracy in predicting the fitness of protein mutants, especially for higher-order variants (> 4 mutation sites), when fine-tuned using only a small number of experimental mutation data points (< 50). The proposed strategy is of great practical value, as the required experimental effort, i.e., producing a few tens of experimental mutation data points for a given protein, is generally affordable for an ordinary biochemical group and can be applied to almost any protein.
Affiliation(s)
- Mingchen Li
- Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China
- School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200240, China
| | - Liqi Kang
- Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China
- School of Physics and Astronomy & School of Pharmacy, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Yi Xiong
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Yu Guang Wang
- Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China
- Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China
| | - Guisheng Fan
- School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200240, China
| | - Pan Tan
- Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China.
- Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China.
| | - Liang Hong
- Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China.
- Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China.
- School of Physics and Astronomy & School of Pharmacy, Shanghai Jiao Tong University, Shanghai, 200240, China.
| |
|
36
|
Yang KK, Zanichelli N, Yeh H. Masked inverse folding with sequence transfer for protein representation learning. Protein Eng Des Sel 2023; 36:gzad015. [PMID: 37883472 DOI: 10.1093/protein/gzad015] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 07/31/2023] [Revised: 10/10/2023] [Accepted: 10/11/2023] [Indexed: 10/28/2023] Open
Abstract
Self-supervised pretraining on protein sequences has led to state-of-the-art performance on protein function and fitness prediction. However, sequence-only methods ignore the rich information contained in experimental and predicted protein structures. Meanwhile, inverse folding methods reconstruct a protein's amino-acid sequence given its structure, but do not take advantage of sequences that do not have known structures. In this study, we train a masked inverse folding protein masked language model parameterized as a structured graph neural network. During pretraining, this model learns to reconstruct corrupted sequences conditioned on the backbone structure. We then show that using the outputs from a pretrained sequence-only protein masked language model as input to the inverse folding model further improves pretraining perplexity. We evaluate both of these models on downstream protein engineering tasks and analyze the effect of using information from experimental or predicted structures on performance.
Affiliation(s)
- Kevin K Yang
- Microsoft Research, 1 Memorial Drive, Cambridge, MA, USA
| | | | - Hugh Yeh
- Pritzker School of Medicine, University of Chicago, 924 E 57th Street, Chicago, IL, USA
| |
|
37
|
Olenyi T, Marquet C, Heinzinger M, Kröger B, Nikolova T, Bernhofer M, Sändig P, Schütze K, Littmann M, Mirdita M, Steinegger M, Dallago C, Rost B. LambdaPP: Fast and accessible protein-specific phenotype predictions. Protein Sci 2023; 32:e4524. [PMID: 36454227 PMCID: PMC9793974 DOI: 10.1002/pro.4524] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 08/04/2022] [Revised: 11/09/2022] [Accepted: 11/21/2022] [Indexed: 12/04/2022]
Abstract
The availability of accurate and fast artificial intelligence (AI) solutions predicting aspects of proteins is revolutionizing experimental and computational molecular biology. The webserver LambdaPP aspires to supersede PredictProtein, the first internet server making AI protein predictions available in 1992. Given a protein sequence as input, LambdaPP provides easily accessible visualizations of protein 3D structure, along with predictions at the protein level (GeneOntology, subcellular location), and the residue level (binding to metal ions, small molecules, and nucleotides; conservation; intrinsic disorder; secondary structure; alpha-helical and beta-barrel transmembrane segments; signal-peptides; variant effect) in seconds. The structure prediction provided by LambdaPP, leveraging ColabFold and computed in minutes, is based on MMseqs2 multiple sequence alignments. All other feature prediction methods are based on the pLM ProtT5. Queried by a protein sequence, LambdaPP computes protein and residue predictions almost instantly for various phenotypes, including 3D structure and aspects of protein function. LambdaPP is freely available for everyone to use under embed.predictprotein.org; the interactive results for the case study can be found under https://embed.predictprotein.org/o/Q9NZC2. The frontend of LambdaPP can be found on GitHub (github.com/sacdallago/embed.predictprotein.org), and can be freely used and distributed under the academic free use license (AFL-2). For high-throughput applications, all methods can be executed locally via the bio-embeddings (bioembeddings.com) python package, or docker image at ghcr.io/bioembeddings/bio_embeddings, which also includes the backend of LambdaPP.
Affiliation(s)
- Tobias Olenyi
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Germany
- TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching, Germany
| | - Céline Marquet
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Germany
- TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching, Germany
| | - Michael Heinzinger
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Germany
- TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching, Germany
| | - Benjamin Kröger
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Germany
| | - Tiha Nikolova
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Germany
| | - Michael Bernhofer
- TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching, Germany
| | - Philip Sändig
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Germany
| | - Konstantin Schütze
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Germany
| | - Maria Littmann
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Germany
| | - Milot Mirdita
- School of Biological Sciences, Seoul National University, Seoul, South Korea
| | - Martin Steinegger
- School of Biological Sciences, Seoul National University, Seoul, South Korea
- Korea Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
- Korea Institute of Molecular Biology and Genetics, Seoul National University, Seoul, South Korea
| | - Christian Dallago
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Germany
- VantAI, New York, USA
| | - Burkhard Rost
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology - i12, Garching, Germany
- Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748 Garching/Munich, Germany & TUM School of Life Sciences Weihenstephan (WZW), Freising, Germany
| |
|
38
|
Burnim AA, Xu D, Spence MA, Jackson CJ, Ando N. Analysis of insertions and extensions in the functional evolution of the ribonucleotide reductase family. Protein Sci 2022; 31:e4483. [PMID: 36307939 PMCID: PMC9669993 DOI: 10.1002/pro.4483] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/20/2022] [Accepted: 10/22/2022] [Indexed: 12/14/2022]
Abstract
Ribonucleotide reductases (RNRs) are used by all free-living organisms and many viruses to catalyze an essential step in the de novo biosynthesis of DNA precursors. RNRs are remarkably diverse by primary sequence and cofactor requirement, while sharing a conserved fold and radical-based mechanism for nucleotide reduction. In this work, we expand on our recent phylogenetic inference of the entire RNR family and describe the evolutionary relatedness of insertions and extensions around the structurally homologous catalytic barrel. Using evo-velocity and sequence similarity network (SSN) analyses, we show that the N-terminal regulatory motif known as the ATP-cone domain was likely inherited from an ancestral RNR. By combining SSN analysis with AlphaFold2 predictions, we also show that the C-terminal extensions of class II RNRs can contain folded domains that share homology with an Fe-S cluster assembly protein. Finally, using sequence analysis and AlphaFold2, we show that the sequence motif of a catalytically essential insertion known as the finger loop is tightly coupled to the catalytic mechanism. Based on these results, we propose an evolutionary model for the diversification of the RNR family.
Affiliation(s)
- Audrey A. Burnim
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, New York, USA
| | - Da Xu
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, New York, USA
| | - Matthew A. Spence
- Research School of Chemistry, Australian National University, Canberra, Australian Capital Territory, Australia
| | - Colin J. Jackson
- Research School of Chemistry, Australian National University, Canberra, Australian Capital Territory, Australia
- Australian Research Council Centre of Excellence for Innovations in Peptide and Protein Science, Australian National University, Canberra, Australian Capital Territory, Australia
- Australian Research Council Centre of Excellence in Synthetic Biology, Australian National University, Canberra, Australian Capital Territory, Australia
| | - Nozomi Ando
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, New York, USA
| |
|
39
|
Kabir A, Shehu A. GOProFormer: A Multi-Modal Transformer Method for Gene Ontology Protein Function Prediction. Biomolecules 2022; 12:1709. [PMID: 36421723 PMCID: PMC9687818 DOI: 10.3390/biom12111709] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/24/2022] [Revised: 11/14/2022] [Accepted: 11/15/2022] [Indexed: 09/19/2023] Open
Abstract
Protein Language Models (PLMs) are shown to be capable of learning sequence representations useful for various prediction tasks, from subcellular localization, evolutionary relationships, family membership, and more. They have yet to be demonstrated useful for protein function prediction. In particular, the problem of automatic annotation of proteins under the Gene Ontology (GO) framework remains open. This paper makes two key contributions. It debuts a novel method that leverages the transformer architecture in two ways. A sequence transformer encodes protein sequences in a task-agnostic feature space. A graph transformer learns a representation of GO terms while respecting their hierarchical relationships. The learned sequence and GO terms representations are combined and utilized for multi-label classification, with the labels corresponding to GO terms. The method is shown superior over recent representative GO prediction methods. The second major contribution in this paper is a deep investigation of different ways of constructing training and testing datasets. The paper shows that existing approaches under- or over-estimate the generalization power of a model. A novel approach is proposed to address these issues, resulting in a new benchmark dataset to rigorously evaluate and compare methods and advance the state-of-the-art.
Affiliation(s)
- Anowarul Kabir
- Department of Computer Science, George Mason University, Fairfax, VA 22030, USA
| | - Amarda Shehu
- Department of Computer Science, George Mason University, Fairfax, VA 22030, USA
- Center for Advancing Human-Machine Partnerships, George Mason University, Fairfax, VA 22030, USA
- Department of Bioengineering, George Mason University, Fairfax, VA 22030, USA
- School of Systems Biology, George Mason University, Fairfax, VA 22030, USA
| |
|
40
|
Lupo U, Sgarbossa D, Bitbol AF. Protein language models trained on multiple sequence alignments learn phylogenetic relationships. Nat Commun 2022; 13:6298. [PMID: 36273003 PMCID: PMC9588007 DOI: 10.1038/s41467-022-34032-y] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 04/08/2022] [Accepted: 10/07/2022] [Indexed: 12/25/2022] Open
Abstract
Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold's EvoFormer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple, and universal, combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. We further show that these models can separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations reflecting historical contingency. To assess this, we generate synthetic MSAs, either without or with phylogeny, from Potts models trained on natural MSAs. We find that unsupervised contact prediction is substantially more resilient to phylogenetic noise when using MSA Transformer versus inferred Potts models.
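For readers unfamiliar with the quantity the abstract above compares column attentions against, the following is a minimal sketch of pairwise Hamming distances between aligned sequences. It is not the authors' code; the function names and the toy alignment are hypothetical, and gaps are simply treated as ordinary characters, which is an assumption made for illustration.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count the alignment columns at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_hamming(msa: list[str]) -> list[list[int]]:
    """Return the symmetric matrix of Hamming distances for all rows of an MSA."""
    n = len(msa)
    dist = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = hamming(msa[i], msa[j])
    return dist


# Hypothetical toy alignment; '-' marks a gap and is compared like any residue.
toy_msa = ["MKV-LA", "MKI-LA", "MRVQLA"]
distances = pairwise_hamming(toy_msa)
```

Dividing each entry by the alignment length would give a normalized distance per site, which is easier to compare across alignments of different lengths.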
Affiliation(s)
- Umberto Lupo
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland.
- SIB Swiss Institute of Bioinformatics, CH-1015, Lausanne, Switzerland.
| | - Damiano Sgarbossa
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, CH-1015, Lausanne, Switzerland
| | - Anne-Florence Bitbol
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland.
- SIB Swiss Institute of Bioinformatics, CH-1015, Lausanne, Switzerland.
| |
|
41
|
Burnim AA, Spence MA, Xu D, Jackson CJ, Ando N. Comprehensive phylogenetic analysis of the ribonucleotide reductase family reveals an ancestral clade. eLife 2022; 11:79790. [PMID: 36047668 PMCID: PMC9531940 DOI: 10.7554/elife.79790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2022] [Accepted: 08/31/2022] [Indexed: 11/30/2022] Open
Abstract
Ribonucleotide reductases (RNRs) are used by all free-living organisms and many viruses to catalyze an essential step in the de novo biosynthesis of DNA precursors. RNRs are remarkably diverse by primary sequence and cofactor requirement, while sharing a conserved fold and radical-based mechanism for nucleotide reduction. Here, we structurally aligned the diverse RNR family by the conserved catalytic barrel to reconstruct the first large-scale phylogeny consisting of 6779 sequences that unites all extant classes of the RNR family and performed evo-velocity analysis to independently validate our evolutionary model. With a robust phylogeny in-hand, we uncovered a novel, phylogenetically distinct clade that is placed as ancestral to the classes I and II RNRs, which we have termed clade Ø. We employed small-angle X-ray scattering (SAXS), cryogenic-electron microscopy (cryo-EM), and AlphaFold2 to investigate a member of this clade from Synechococcus phage S-CBP4 and report the most minimal RNR architecture to-date. Based on our analyses, we propose an evolutionary model of diversification in the RNR family and delineate how our phylogeny can be used as a roadmap for targeted future study.
Billions of years ago, the Earth’s atmosphere had very little oxygen. It was only after some bacteria and early plants evolved to harness energy from sunlight that oxygen began to fill the Earth’s environment. Oxygen is highly reactive and can interfere with enzymes and other molecules that are essential to life. Organisms living at this point in history therefore had to adapt to survive in this new oxygen-rich world.
An ancient family of enzymes known as ribonucleotide reductases are used by all free-living organisms and many viruses to repair and replicate their DNA. Because of their essential role in managing DNA, these enzymes have been around on Earth for billions of years. Understanding how they evolved could therefore shed light on how nature adapted to increasing oxygen levels and other environmental changes at the molecular level.
One approach to study how proteins evolved is to use computational analysis to construct a phylogenetic tree. This reveals how existing members of a family are related to one another based on the chain of molecules (known as amino acids) that make up each protein. Despite having similar structures and all having the same function, ribonucleotide reductases have remarkably diverse sequences of amino acids. This makes it computationally very demanding to build a phylogenetic tree. To overcome this, Burnim, Spence, Xu et al. created a phylogenetic tree using structural information from a part of the enzyme that is relatively similar in many modern-day ribonucleotide reductases. The final result took seven continuous months on a supercomputer to generate, and includes over 6,000 members of the enzyme family.
The phylogenetic tree revealed a new distinct group of ribonucleotide reductases that may explain how one adaptation to increasing levels of oxygen emerged in some family members, while another adaptation emerged in others. The approach used in this work also opens up a new way to study how other highly diverse enzymes and other protein families evolved, potentially revealing new insights about our planet’s past.
Affiliation(s)
- Audrey A Burnim
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, United States
| | - Matthew A Spence
- Research School of Chemistry, Australian National University, Canberra, Australia
| | - Da Xu
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, United States
| | - Colin J Jackson
- Research School of Chemistry, Australian National University, Canberra, Australia
| | - Nozomi Ando
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, United States
| |
|