1
Kuang Z, Yan X, Yuan Y, Wang R, Zhu H, Wang Y, Li J, Ye J, Yue H, Yang X. Advances in stress-tolerance elements for microbial cell factories. Synth Syst Biotechnol 2024; 9:793-808. [PMID: 39072145 PMCID: PMC11277822 DOI: 10.1016/j.synbio.2024.06.008] [Received: 03/13/2024] [Revised: 06/10/2024] [Accepted: 06/27/2024] [Indexed: 07/30/2024] Open
Abstract
Microorganisms, particularly extremophiles, have evolved multiple adaptation mechanisms to address diverse stress conditions during survival in unique environments. Their responses to environmental coercion decide not only survival in severe conditions but are also an essential factor determining bioproduction performance. The design of robust cell factories should take the balance of their growing and bioproduction into account. Thus, mining and redesigning stress-tolerance elements to optimize the performance of cell factories under various extreme conditions is necessary. Here, we reviewed several stress-tolerance elements, including acid-tolerant elements, saline-alkali-resistant elements, thermotolerant elements, antioxidant elements, and so on, providing potential materials for the construction of cell factories and the development of synthetic biology. Strategies for mining and redesigning stress-tolerance elements were also discussed. Moreover, several applications of stress-tolerance elements were provided, and perspectives and discussions for potential strategies for screening stress-tolerance elements were made.
Affiliation(s)
- Zheyi Kuang
  - School of Intelligence Science and Technology, Xinjiang University, Urumqi, 830017, China
- Xiaofang Yan
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, 510006, China
- Yanfei Yuan
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, 510006, China
- Ruiqi Wang
  - School of Intelligence Science and Technology, Xinjiang University, Urumqi, 830017, China
- Haifan Zhu
  - School of Intelligence Science and Technology, Xinjiang University, Urumqi, 830017, China
- Youyang Wang
  - School of Intelligence Science and Technology, Xinjiang University, Urumqi, 830017, China
- Jianfeng Li
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, 510006, China
- Jianwen Ye
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, 510006, China
- Haitao Yue
  - School of Intelligence Science and Technology, Xinjiang University, Urumqi, 830017, China
  - Laboratory of Synthetic Biology, School of Life Science and Technology, Xinjiang University, Urumqi, 830017, China
- Xiaofeng Yang
  - School of Biology and Biological Engineering, South China University of Technology, Guangzhou, 510006, China
2
Li W, Almirantis Y, Provata A. Range-limited Heaps' law for functional DNA words in the human genome. J Theor Biol 2024; 592:111878. [PMID: 38901778 DOI: 10.1016/j.jtbi.2024.111878] [Received: 09/12/2023] [Revised: 05/31/2024] [Accepted: 06/10/2024] [Indexed: 06/22/2024]
Abstract
Heaps' or Herdan-Heaps' law is a linguistic law describing the relationship between the vocabulary/dictionary size (type) and word counts (token) to be a power-law function. Its existence in genomes with certain definition of DNA words is unclear partly because the dictionary size in genome could be much smaller than that in a human language. We define a DNA word as a coding region in a genome that codes for a protein domain. Using human chromosomes and chromosome arms as individual samples, we establish the existence of Heaps' law in the human genome within limited range. Our definition of words in a genomic or proteomic context is different from other definitions such as over-represented k-mers which are much shorter in length. Although an approximate power-law distribution of protein domain sizes due to gene duplication and the related Zipf's law is well known, their translation to the Heaps' law in DNA words is not automatic. Several other animal genomes are shown herein also to exhibit range-limited Heaps' law with our definition of DNA words, though with various exponents. When tokens were randomly sampled and sample sizes reach to the maximum level, a deviation from the Heaps' law was observed, but a quadratic regression in log-log type-token plot fits the data perfectly. Investigation of type-token plot and its regression coefficients could provide an alternative narrative of reusage and redundancy of protein domains as well as creation of new protein domains from a linguistic perspective.
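The type-token analysis described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names and the toy list of domain labels are hypothetical, and the fit is an ordinary least-squares line in log-log space, assuming the power-law form V(n) = K * n^beta for vocabulary size V after n tokens.

```python
import math


def type_token_curve(tokens):
    """Return [(n, V(n))]: the number of distinct types seen after n tokens."""
    seen = set()
    curve = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        curve.append((n, len(seen)))
    return curve


def fit_heaps(curve):
    """Least-squares fit of log V = log K + beta * log n; returns (K, beta)."""
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta


# Hypothetical toy data: each token stands for one protein-domain occurrence.
tokens = ["kinase", "SH3", "kinase", "zinc_finger",
          "SH3", "kinase", "PDZ", "kinase"]
curve = type_token_curve(tokens)
K, beta = fit_heaps(curve)
print(f"types={curve[-1][1]} tokens={curve[-1][0]} K={K:.3f} beta={beta:.3f}")
```

An exponent below 1 means new types appear more slowly than tokens accumulate. The quadratic log-log regression mentioned in the abstract would add a squared log term to the same fit.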
Affiliation(s)
- Wentian Li
  - Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA; The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA.
- Yannis Almirantis
  - Theoretical Biology and Computational Genomics Laboratory, Institute of Bioscience and Applications, National Center for Scientific Research "Demokritos", 15341 Athens, Greece
- Astero Provata
  - Statistical Mechanics and Dynamical Systems Laboratory, Institute of Nanoscience and Nanotechnology, National Center for Scientific Research "Demokritos", 15341 Athens, Greece
3
Zhang R, Chai N, Liu T, Zheng Z, Lin Q, Xie X, Wen J, Yang Z, Liu YG, Zhu Q. The type V effectors for CRISPR/Cas-mediated genome engineering in plants. Biotechnol Adv 2024; 74:108382. [PMID: 38801866 DOI: 10.1016/j.biotechadv.2024.108382] [Received: 01/15/2024] [Revised: 05/07/2024] [Accepted: 05/24/2024] [Indexed: 05/29/2024]
Abstract
A plethora of CRISPR effectors, such as Cas3, Cas9, and Cas12a, are commonly employed as gene editing tools. Among these, Cas12 effectors developed based on Class II type V proteins exhibit distinct characteristics compared to Class II type VI and type II effectors, such as their ability to generate non-allelic DNA double-strand breaks, their compact structures, and the presence of a single RuvC-like nuclease domain. Capitalizing on these advantages, Cas12 family proteins have been increasingly explored and utilized in recent years. However, the characteristics and applications of different subfamilies within the type V protein family have not been systematically summarized. In this review, we focus on the characteristics of type V effector (CRISPR/Cas12) proteins and the current methods used to discover new effector proteins. We also summarize recent modifications based on engineering of type V effectors. In addition, we introduce the applications of type V effectors for gene editing in animals and plants, including the development of base editors, tools for regulating gene expression, methods for gene targeting, and biosensors. We emphasize the prospects for development and application of CRISPR/Cas12 effectors with the goal of better utilizing toolkits based on this protein family for crop improvement and enhanced agricultural production.
Affiliation(s)
- Ruixiang Zhang
  - State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources, Guangdong Laboratory for Lingnan Modern Agriculture, College of Life Sciences, South China Agricultural University, Guangzhou 510642, China
- Nan Chai
  - State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources, Guangdong Laboratory for Lingnan Modern Agriculture, College of Life Sciences, South China Agricultural University, Guangzhou 510642, China
- Taoli Liu
  - State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources, Guangdong Laboratory for Lingnan Modern Agriculture, College of Life Sciences, South China Agricultural University, Guangzhou 510642, China
- Zhiye Zheng
  - State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources, Guangdong Laboratory for Lingnan Modern Agriculture, College of Life Sciences, South China Agricultural University, Guangzhou 510642, China
- Qiupeng Lin
  - College of Agriculture, South China Agricultural University, Guangzhou 510642, China
- Xianrong Xie
  - State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources, Guangdong Laboratory for Lingnan Modern Agriculture, College of Life Sciences, South China Agricultural University, Guangzhou 510642, China; College of Agriculture, South China Agricultural University, Guangzhou 510642, China
- Jun Wen
  - State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources, Guangdong Laboratory for Lingnan Modern Agriculture, College of Life Sciences, South China Agricultural University, Guangzhou 510642, China
- Zi Yang
  - College of Natural & Agricultural Sciences, University of California, Riverside, 900 University Ave, Riverside, CA 92507, USA
- Yao-Guang Liu
  - State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources, Guangdong Laboratory for Lingnan Modern Agriculture, College of Life Sciences, South China Agricultural University, Guangzhou 510642, China; College of Agriculture, South China Agricultural University, Guangzhou 510642, China.
- Qinlong Zhu
  - State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources, Guangdong Laboratory for Lingnan Modern Agriculture, College of Life Sciences, South China Agricultural University, Guangzhou 510642, China; College of Agriculture, South China Agricultural University, Guangzhou 510642, China.
4
Gong X, Zhang J, Gan Q, Teng Y, Hou J, Lyu Y, Liu Z, Wu Z, Dai R, Zou Y, Wang X, Zhu D, Zhu H, Liu T, Yan Y. Advancing microbial production through artificial intelligence-aided biology. Biotechnol Adv 2024; 74:108399. [PMID: 38925317 DOI: 10.1016/j.biotechadv.2024.108399] [Received: 01/03/2024] [Revised: 05/20/2024] [Accepted: 06/23/2024] [Indexed: 06/28/2024]
Abstract
Microbial cell factories (MCFs) have been leveraged to construct sustainable platforms for value-added compound production. To optimize metabolism and reach optimal productivity, synthetic biology has developed various genetic devices to engineer microbial systems by gene editing, high-throughput protein engineering, and dynamic regulation. However, current synthetic biology methodologies still rely heavily on manual design, laborious testing, and exhaustive analysis. The emerging interdisciplinary field of artificial intelligence (AI) and biology has become pivotal in addressing the remaining challenges. AI-aided microbial production harnesses the power of processing, learning, and predicting vast amounts of biological data within seconds, providing outputs with high probability. With well-trained AI models, the conventional Design-Build-Test (DBT) cycle has been transformed into a multidimensional Design-Build-Test-Learn-Predict (DBTLP) workflow, leading to significantly improved operational efficiency and reduced labor consumption. Here, we comprehensively review the main components and recent advances in AI-aided microbial production, focusing on genome annotation, AI-aided protein engineering, artificial functional protein design, and AI-enabled pathway prediction. Finally, we discuss the challenges of integrating novel AI techniques into biology and propose the potential of large language models (LLMs) in advancing microbial production.
Affiliation(s)
- Xinyu Gong
  - School of Chemical, Materials, and Biomedical Engineering, College of Engineering, The University of Georgia, Athens, GA 30602, USA
- Jianli Zhang
  - School of Chemical, Materials, and Biomedical Engineering, College of Engineering, The University of Georgia, Athens, GA 30602, USA
- Qi Gan
  - School of Chemical, Materials, and Biomedical Engineering, College of Engineering, The University of Georgia, Athens, GA 30602, USA
- Yuxi Teng
  - School of Chemical, Materials, and Biomedical Engineering, College of Engineering, The University of Georgia, Athens, GA 30602, USA
- Jixin Hou
  - School of ECAM, College of Engineering, University of Georgia, Athens, GA 30602, USA
- Yanjun Lyu
  - Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington 76019, USA
- Zhengliang Liu
  - School of Computing, The University of Georgia, Athens, GA 30602, USA
- Zihao Wu
  - School of Computing, The University of Georgia, Athens, GA 30602, USA
- Runpeng Dai
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Yusong Zou
  - School of Chemical, Materials, and Biomedical Engineering, College of Engineering, The University of Georgia, Athens, GA 30602, USA
- Xianqiao Wang
  - School of ECAM, College of Engineering, University of Georgia, Athens, GA 30602, USA
- Dajiang Zhu
  - Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington 76019, USA
- Hongtu Zhu
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Tianming Liu
  - School of Computing, The University of Georgia, Athens, GA 30602, USA
- Yajun Yan
  - School of Chemical, Materials, and Biomedical Engineering, College of Engineering, The University of Georgia, Athens, GA 30602, USA.
5
Peng S, Rajjou L. Advancing plant biology through deep learning-powered natural language processing. Plant Cell Rep 2024; 43:208. [PMID: 39102077 DOI: 10.1007/s00299-024-03294-9] [Received: 05/07/2024] [Accepted: 07/19/2024] [Indexed: 08/06/2024]
Abstract
The application of deep learning methods, specifically the utilization of Large Language Models (LLMs), in the field of plant biology holds significant promise for generating novel knowledge on plant cell systems. The LLM framework exhibits exceptional potential, particularly with the development of Protein Language Models (PLMs), allowing for in-depth analyses of nucleic acid and protein sequences. This analytical capacity facilitates the discernment of intricate patterns and relationships within biological data, encompassing multi-scale information within DNA or protein sequences. The contribution of PLMs extends beyond mere sequence patterns and structure-function recognition; it also supports advancements in genetic improvements for agriculture. The integration of deep learning approaches into the domain of plant sciences offers opportunities for major breakthroughs in basic research across multi-scale plant traits. Consequently, the strategic application of deep learning methodologies, particularly leveraging the potential of LLMs, will undoubtedly play a pivotal role in advancing plant sciences, plant production, plant uses and propelling the trajectory toward sustainable agroecological and agro-food transitions.
Affiliation(s)
- Shuang Peng
  - Université Paris-Saclay, INRAE, AgroParisTech, Institut Jean-Pierre Bourgin for Plant Sciences (IJPB), 78000, Versailles, France
- Loïc Rajjou
  - Université Paris-Saclay, INRAE, AgroParisTech, Institut Jean-Pierre Bourgin for Plant Sciences (IJPB), 78000, Versailles, France.
6
Dickson A, Mofrad MRK. Fine-tuning protein embeddings for functional similarity evaluation. Bioinformatics 2024; 40:btae445. [PMID: 38985218 PMCID: PMC11299545 DOI: 10.1093/bioinformatics/btae445] [Received: 01/16/2024] [Revised: 06/25/2024] [Accepted: 07/09/2024] [Indexed: 07/11/2024] Open
Abstract
MOTIVATION: Proteins with unknown function are frequently compared to better characterized relatives, either using sequence similarity, or recently through similarity in a learned embedding space. Through comparison, protein sequence embeddings allow for interpretable and accurate annotation of proteins, as well as for downstream tasks such as clustering for unsupervised discovery of protein families. However, it is unclear whether embeddings can be deliberately designed to improve their use in these downstream tasks.
RESULTS: We find that for functional annotation of proteins, as represented by Gene Ontology (GO) terms, direct fine-tuning of language models on a simple classification loss has an immediate positive impact on protein embedding quality. Fine-tuned embeddings show stronger performance as representations for K-nearest neighbor classifiers, reaching stronger performance for GO annotation than even directly comparable fine-tuned classifiers, while maintaining interpretability through protein similarity comparisons. They also maintain their quality in related tasks, such as rediscovering protein families with clustering.
AVAILABILITY AND IMPLEMENTATION: github.com/mofradlab/go_metric.
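The K-nearest neighbor annotation idea mentioned in the abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method in the go_metric repository: the two-dimensional embeddings, protein names, and term labels are made up, cosine similarity is one possible distance choice, and each term is scored by the fraction of neighbors that carry it.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_annotate(query, references, k=3):
    """Score terms for a query embedding from its k most similar references.

    references: list of (name, embedding, set_of_terms).
    Returns ({term: fraction of neighbors carrying it}, [neighbor names]).
    """
    ranked = sorted(references, key=lambda r: cosine(query, r[1]),
                    reverse=True)[:k]
    votes = Counter(term for _, _, terms in ranked for term in terms)
    scores = {term: count / len(ranked) for term, count in votes.items()}
    return scores, [name for name, _, _ in ranked]


# Hypothetical toy reference set; real embeddings have hundreds of dimensions.
references = [
    ("P1", [1.0, 0.1], {"term_A"}),
    ("P2", [0.9, 0.2], {"term_A", "term_B"}),
    ("P3", [0.1, 1.0], {"term_C"}),
    ("P4", [0.0, 0.9], {"term_C"}),
]
scores, neighbors = knn_annotate([1.0, 0.0], references, k=2)
print(neighbors, scores)
```

Returning the neighbor names alongside the scores reflects the interpretability point in the abstract: each prediction can be traced back to the annotated proteins that produced it.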
Affiliation(s)
- Andrew Dickson
  - Departments of Bioengineering and Mechanical Engineering, Molecular Cell Biomechanics Laboratory, University of California, Berkeley, CA 94720, United States
- Mohammad R K Mofrad
  - Departments of Bioengineering and Mechanical Engineering, Molecular Cell Biomechanics Laboratory, University of California, Berkeley, CA 94720, United States
7
Tan Y, Li M, Zhou Z, Tan P, Yu H, Fan G, Hong L. PETA: evaluating the impact of protein transfer learning with sub-word tokenization on downstream applications. J Cheminform 2024; 16:92. [PMID: 39095917 PMCID: PMC11297785 DOI: 10.1186/s13321-024-00884-3] [Received: 03/21/2024] [Accepted: 07/13/2024] [Indexed: 08/04/2024] Open
Abstract
Protein language models (PLMs) play a dominant role in protein representation learning. Most existing PLMs regard proteins as sequences of 20 natural amino acids. The problem with this representation method is that it simply divides the protein sequence into sequences of individual amino acids, ignoring the fact that certain residues often occur together. Therefore, it is inappropriate to view amino acids as isolated tokens. Instead, the PLMs should recognize the frequently occurring combinations of amino acids as a single token. In this study, we use the byte-pair-encoding algorithm and unigram to construct advanced residue vocabularies for protein sequence tokenization, and we have shown that PLMs pre-trained using these advanced vocabularies exhibit superior performance on downstream tasks when compared to those trained with simple vocabularies. Furthermore, we introduce PETA, a comprehensive benchmark for systematically evaluating PLMs. We find that vocabularies comprising 50 and 200 elements achieve optimal performance. Our code, model weights, and datasets are available at https://github.com/ginnm/ProteinPretraining.
SCIENTIFIC CONTRIBUTION: This study introduces advanced protein sequence tokenization analysis, leveraging the byte-pair-encoding algorithm and unigram. By recognizing frequently occurring combinations of amino acids as single tokens, our proposed method enhances the performance of PLMs on downstream tasks. Additionally, we present PETA, a new comprehensive benchmark for the systematic evaluation of PLMs, demonstrating that vocabularies of 50 and 200 elements offer optimal performance.
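To make the byte-pair-encoding idea concrete, here is a minimal Python sketch of BPE applied to amino-acid strings. It is a simplified illustration, not the PETA implementation: the function names and toy sequences are hypothetical, adjacent pairs are counted with overlaps, and merging stops once no pair occurs at least twice.

```python
from collections import Counter


def merge_pair(tokens, a, b):
    """Replace each adjacent occurrence of (a, b) with the merged token a+b."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges):
    """Learn up to num_merges merge rules, most frequent adjacent pair first."""
    corpus = [list(seq) for seq in sequences]  # start from single residues
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing recurs, so further merges add no value
            break
        merges.append((a, b))
        corpus = [merge_pair(tokens, a, b) for tokens in corpus]
    return merges


def tokenize(seq, merges):
    """Apply learned merges, in order, to a new sequence."""
    tokens = list(seq)
    for a, b in merges:
        tokens = merge_pair(tokens, a, b)
    return tokens


# Hypothetical toy sequences (one-letter amino-acid codes).
sequences = ["GSGSAK", "AKGS", "MAKAK"]
merges = learn_bpe(sequences, num_merges=5)
print(merges)
print(tokenize("MAKGSGS", merges))
```

The final vocabulary is the 20 single residues plus one entry per learned merge, so the number of merges is what controls vocabulary sizes such as those compared in the abstract.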
Affiliation(s)
- Yang Tan
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
  - Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China
  - Chongqing Artificial Intelligence Research Institute of Shanghai Jiao Tong University, Chongqing, 200240, China
- Mingchen Li
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
  - Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China
  - Chongqing Artificial Intelligence Research Institute of Shanghai Jiao Tong University, Chongqing, 200240, China
- Ziyi Zhou
  - Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China
- Pan Tan
  - Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China
- Huiqun Yu
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China.
- Guisheng Fan
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China.
- Liang Hong
  - Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China.
  - Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China.
  - Chongqing Artificial Intelligence Research Institute of Shanghai Jiao Tong University, Chongqing, 200240, China.
8
Jiang H, Jude KM, Wu K, Fallas J, Ueda G, Brunette TJ, Hicks DR, Pyles H, Yang A, Carter L, Lamb M, Li X, Levine PM, Stewart L, Garcia KC, Baker D. De novo design of buttressed loops for sculpting protein functions. Nat Chem Biol 2024; 20:974-980. [PMID: 38816644 PMCID: PMC11288887 DOI: 10.1038/s41589-024-01632-2] [Received: 02/18/2023] [Accepted: 04/29/2024] [Indexed: 06/01/2024]
Abstract
In natural proteins, structured loops have central roles in molecular recognition, signal transduction and enzyme catalysis. However, because of the intrinsic flexibility and irregularity of loop regions, organizing multiple structured loops at protein functional sites has been very difficult to achieve by de novo protein design. Here we describe a solution to this problem that designs tandem repeat proteins with structured loops (9-14 residues) buttressed by extensive hydrogen bonding interactions. Experimental characterization shows that the designs are monodisperse, highly soluble, folded and thermally stable. Crystal structures are in close agreement with the design models, with the loops structured and buttressed as designed. We demonstrate the functionality afforded by loop buttressing by designing and characterizing binders for extended peptides in which the loops form one side of an extended binding pocket. The ability to design multiple structured loops should contribute generally to efforts to design new protein functions.
Affiliation(s)
- Hanlun Jiang
  - Department of Biochemistry, University of Washington, Seattle, WA, USA
  - Institute for Protein Design, University of Washington, Seattle, WA, USA
- Kevin M Jude
  - Howard Hughes Medical Institute, Stanford University School of Medicine, Stanford, CA, USA
  - Department of Molecular and Cellular Physiology, Stanford University School of Medicine, Stanford, CA, USA
- Kejia Wu
  - Department of Biochemistry, University of Washington, Seattle, WA, USA
  - Institute for Protein Design, University of Washington, Seattle, WA, USA
  - Biological Physics, Structure and Design Graduate Program, University of Washington, Seattle, WA, USA
- Jorge Fallas
  - Department of Biochemistry, University of Washington, Seattle, WA, USA
  - Institute for Protein Design, University of Washington, Seattle, WA, USA
- George Ueda
  - Department of Biochemistry, University of Washington, Seattle, WA, USA
  - Institute for Protein Design, University of Washington, Seattle, WA, USA
- T J Brunette
  - Department of Biochemistry, University of Washington, Seattle, WA, USA
  - Institute for Protein Design, University of Washington, Seattle, WA, USA
- Derrick R Hicks
  - Department of Biochemistry, University of Washington, Seattle, WA, USA
  - Institute for Protein Design, University of Washington, Seattle, WA, USA
- Harley Pyles
  - Department of Biochemistry, University of Washington, Seattle, WA, USA
  - Institute for Protein Design, University of Washington, Seattle, WA, USA
- Aerin Yang
  - Department of Molecular and Cellular Physiology, Stanford University School of Medicine, Stanford, CA, USA
- Lauren Carter
  - Department of Biochemistry, University of Washington, Seattle, WA, USA
  - Institute for Protein Design, University of Washington, Seattle, WA, USA
- Mila Lamb
  - Department of Biochemistry, University of Washington, Seattle, WA, USA
  - Institute for Protein Design, University of Washington, Seattle, WA, USA
- Xinting Li
  - Department of Biochemistry, University of Washington, Seattle, WA, USA
  - Institute for Protein Design, University of Washington, Seattle, WA, USA
- Paul M Levine
  - Department of Biochemistry, University of Washington, Seattle, WA, USA
  - Institute for Protein Design, University of Washington, Seattle, WA, USA
- Lance Stewart
  - Department of Biochemistry, University of Washington, Seattle, WA, USA
  - Institute for Protein Design, University of Washington, Seattle, WA, USA
- K Christopher Garcia
  - Howard Hughes Medical Institute, Stanford University School of Medicine, Stanford, CA, USA.
  - Department of Molecular and Cellular Physiology, Stanford University School of Medicine, Stanford, CA, USA.
  - Department of Structural Biology, Stanford University School of Medicine, Stanford, CA, USA.
- David Baker
  - Department of Biochemistry, University of Washington, Seattle, WA, USA.
  - Institute for Protein Design, University of Washington, Seattle, WA, USA.
  - Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA.
9
Bashour H, Smorodina E, Pariset M, Zhong J, Akbar R, Chernigovskaya M, Lê Quý K, Snapkow I, Rawat P, Krawczyk K, Sandve GK, Gutierrez-Marcos J, Gutierrez DNZ, Andersen JT, Greiff V. Biophysical cartography of the native and human-engineered antibody landscapes quantifies the plasticity of antibody developability. Commun Biol 2024; 7:922. [PMID: 39085379 PMCID: PMC11291509 DOI: 10.1038/s42003-024-06561-3] [Received: 01/25/2024] [Accepted: 07/05/2024] [Indexed: 08/02/2024] Open
Abstract
Designing effective monoclonal antibody (mAb) therapeutics faces a multi-parameter optimization challenge known as "developability", which reflects an antibody's ability to progress through development stages based on its physicochemical properties. While natural antibodies may provide valuable guidance for mAb selection, we lack a comprehensive understanding of natural developability parameter (DP) plasticity (redundancy, predictability, sensitivity) and how the DP landscapes of human-engineered and natural antibodies relate to one another. These gaps hinder fundamental developability profile cartography. To chart natural and engineered DP landscapes, we computed 40 sequence- and 46 structure-based DPs of over two million native and human-engineered single-chain antibody sequences. We find lower redundancy among structure-based compared to sequence-based DPs. Sequence DP sensitivity to single amino acid substitutions varied by antibody region and DP, and structure DP values varied across the conformational ensemble of antibody structures. We show that sequence DPs are more predictable than structure-based ones across different machine-learning tasks and embeddings, indicating a constrained sequence-based design space. Human-engineered antibodies localize within the developability and sequence landscapes of natural antibodies, suggesting that human-engineered antibodies explore mere subspaces of the natural one. Our work quantifies the plasticity of antibody developability, providing a fundamental resource for multi-parameter therapeutic mAb design.
Affiliation(s)
- Habib Bashour
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway.
  - School of Life Sciences, University of Warwick, Coventry, UK.
- Eva Smorodina
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Jahn Zhong
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
  - Division of Genetics, Department Biology, Friedrich-Alexander University Erlangen-Nürnberg, Erlangen, Germany
- Rahmad Akbar
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Maria Chernigovskaya
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Khang Lê Quý
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Igor Snapkow
  - Department of Chemical Toxicology, Norwegian Institute of Public Health, Oslo, Norway
- Puneet Rawat
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Jan Terje Andersen
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
  - Department of Pharmacology, University of Oslo and Oslo University Hospital, Oslo, Norway
  - Precision Immunotherapy Alliance (PRIMA), University of Oslo, Oslo, Norway
- Victor Greiff
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway.
10
Roche R, Tarafder S, Bhattacharya D. Single-sequence protein-RNA complex structure prediction by geometric attention-enabled pairing of biological language models. bioRxiv 2024:2024.07.27.605468. [PMID: 39091736 PMCID: PMC11291176 DOI: 10.1101/2024.07.27.605468] [Indexed: 08/04/2024]
Abstract
Ground-breaking progress has been made in structure prediction of biomolecular assemblies, including the recent breakthrough of AlphaFold 3. However, it remains challenging for AlphaFold 3 and other state-of-the-art deep learning-based methods to accurately predict protein-RNA complex structures, in part due to the limited availability of evolutionary and structural information related to protein-RNA interactions that are used as inputs to the existing approaches. Here, we introduce ProRNA3D-single, a new deep-learning framework for protein-RNA complex structure prediction with only single-sequence input. Using a novel geometric attention-enabled pairing of biological language models of protein and RNA, a previously unexplored avenue, ProRNA3D-single enables the prediction of interatomic protein-RNA interaction maps, which are then transformed into multi-scale geometric restraints for modeling 3D structures of protein-RNA complexes via geometry optimization. Benchmark tests show that ProRNA3D-single convincingly outperforms current state-of-the-art methods including AlphaFold 3, particularly when evolutionary information is limited; and exhibits remarkable robustness and performance resilience by attaining better accuracy with only single-sequence input than what most methods can achieve even with explicit evolutionary information. Freely available at https://github.com/Bhattacharya-Lab/ProRNA3D-single, ProRNA3D-single should be broadly useful for modeling 3D structures of protein-RNA complexes at scale, regardless of the availability of evolutionary information.
11
Hu Y, Pan D, Xu F, Huang B, Chen X, Lin S. Gene synthesis design: a pythonic approach. PeerJ 2024; 12:e17750. [PMID: 39076781 PMCID: PMC11285356 DOI: 10.7717/peerj.17750] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2024] [Accepted: 06/24/2024] [Indexed: 07/31/2024] Open
Abstract
Researchers often need to synthesize genes of interest in this era of synthetic biology. Gene synthesis by PCR assembly of multiple DNA fragments is a quick and economical method that is widely applied. Up to now, there have been a few software solutions for designing fragments in gene synthesis. However, some of these software solutions use programming languages that are not popular now, other software products are commercial or require users to visit servers. In this study, we propose a Python program to design DNA fragments for gene synthesis. The algorithm is designed to meet the experimental needs. Also, the source code with detailed annotation is freely available for all users. Furthermore, the feasibility of the algorithm and the program is validated by experiments. Our program can be useful for the design of gene synthesis in the labs and help the study of gene structure and function.
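The abstract does not describe the fragment-design algorithm itself, so the following is only a minimal sketch of the general idea behind PCR assembly: a gene is cut into overlapping fragments on alternating strands so that neighbours can anneal. The fixed fragment length, the fixed overlap, and the names `reverse_complement` and `design_fragments` are assumptions made for illustration; they are not taken from the authors' program, which presumably also accounts for experimental constraints that this sketch ignores.

```python
from typing import List


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def design_fragments(gene: str, frag_len: int = 60, overlap: int = 20) -> List[str]:
    """Split a gene into overlapping fragments on alternating strands.

    Hypothetical helper: even-numbered fragments are taken from the
    forward strand, odd-numbered ones from the reverse strand, and
    consecutive fragments share `overlap` bases.
    """
    if not gene:
        raise ValueError("gene must not be empty")
    if not 0 < overlap < frag_len:
        raise ValueError("overlap must be positive and shorter than frag_len")
    gene = gene.upper()
    step = frag_len - overlap
    fragments: List[str] = []
    start = 0
    while True:
        piece = gene[start:start + frag_len]
        if len(fragments) % 2 == 1:
            # Odd-numbered fragments are written as reverse-strand oligos.
            piece = reverse_complement(piece)
        fragments.append(piece)
        if start + frag_len >= len(gene):
            break
        start += step
    return fragments
```

A real design would likely vary fragment boundaries, for example to balance the melting temperatures of the overlaps; the fixed step used here only keeps the sketch short.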
Affiliation(s)
- Yunzhuo Hu
  - Agricultural Product Quality Institute, Fujian Agriculture and Forestry University, Fuzhou, Fujian, China
  - College of Agronomy, Fujian Agriculture and Forestry University, Fuzhou, Fujian, China
- Danni Pan
  - Agricultural Product Quality Institute, Fujian Agriculture and Forestry University, Fuzhou, Fujian, China
  - College of Agronomy, Fujian Agriculture and Forestry University, Fuzhou, Fujian, China
- Fei Xu
  - Agricultural Product Quality Institute, Fujian Agriculture and Forestry University, Fuzhou, Fujian, China
  - College of Agronomy, Fujian Agriculture and Forestry University, Fuzhou, Fujian, China
- Bifang Huang
  - College of Life Science, Fujian Agriculture and Forestry University, Fuzhou, Fujian, China
- Xuanyang Chen
  - Agricultural Product Quality Institute, Fujian Agriculture and Forestry University, Fuzhou, Fujian, China
  - College of Agronomy, Fujian Agriculture and Forestry University, Fuzhou, Fujian, China
- Shiqiang Lin
  - Agricultural Product Quality Institute, Fujian Agriculture and Forestry University, Fuzhou, Fujian, China
  - College of Life Science, Fujian Agriculture and Forestry University, Fuzhou, Fujian, China
12
Cobley JN, Margaritelis NV, Chatzinikolaou PN, Nikolaidis MG, Davison GW. Ten "Cheat Codes" for Measuring Oxidative Stress in Humans. Antioxidants (Basel) 2024; 13:877. [PMID: 39061945 PMCID: PMC11273696 DOI: 10.3390/antiox13070877] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2024] [Revised: 07/17/2024] [Accepted: 07/18/2024] [Indexed: 07/28/2024] Open
Abstract
Formidable and often seemingly insurmountable conceptual, technical, and methodological challenges hamper the measurement of oxidative stress in humans. For instance, fraught and flawed methods, such as the thiobarbituric acid reactive substances assay kits for lipid peroxidation, rate-limit progress. To advance translational redox research, we present ten comprehensive "cheat codes" for measuring oxidative stress in humans. The cheat codes include analytical approaches to assess reactive oxygen species, antioxidants, oxidative damage, and redox regulation. They provide essential conceptual, technical, and methodological information inclusive of curated "do" and "don't" guidelines. Given the biochemical complexity of oxidative stress, we present a research question-grounded decision tree guide for selecting the most appropriate cheat code(s) to implement in a prospective human experiment. Worked examples demonstrate the benefits of the decision tree-based cheat code selection tool. The ten cheat codes define an invaluable resource for measuring oxidative stress in humans.
Affiliation(s)
- James N. Cobley
  - The University of Dundee, Dundee DD1 4HN, UK
  - Ulster University, Belfast BT15 1ED, Northern Ireland, UK
- Nikos V. Margaritelis
  - Aristotle University of Thessaloniki, 62122 Serres, Greece; (N.V.M.); (P.N.C.); (M.G.N.)
- Michalis G. Nikolaidis
  - Aristotle University of Thessaloniki, 62122 Serres, Greece; (N.V.M.); (P.N.C.); (M.G.N.)
13
Bhat S, Palepu K, Hong L, Mao J, Ye T, Iyer R, Zhao L, Chen T, Vincoff S, Watson R, Wang T, Srijay D, Kavirayuni VS, Kholina K, Goel S, Vure P, Desphande AJ, Soderling SH, DeLisa MP, Chatterjee P. De Novo Design of Peptide Binders to Conformationally Diverse Targets with Contrastive Language Modeling. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.06.26.546591. [PMID: 39091799 PMCID: PMC11291000 DOI: 10.1101/2023.06.26.546591] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 08/04/2024]
Abstract
Designing binders to target undruggable proteins presents a formidable challenge in drug discovery, requiring innovative approaches to overcome the lack of putative binding sites. Recently, generative models have been trained to design binding proteins via three-dimensional structures of target proteins, but as a result, struggle to design binders to disordered or conformationally unstable targets. In this work, we provide a generalizable algorithmic framework to design short, target-binding linear peptides, requiring only the amino acid sequence of the target protein. To do this, we propose a process to generate naturalistic peptide candidates through Gaussian perturbation of the peptidic latent space of the ESM-2 protein language model, and subsequently screen these novel linear sequences for target-selective interaction activity via a CLIP-based contrastive learning architecture. By integrating these generative and discriminative steps, we create a Peptide Prioritization via CLIP (PepPrCLIP) pipeline and validate highly-ranked, target-specific peptides experimentally, both as inhibitory peptides and as fusions to E3 ubiquitin ligase domains, demonstrating functionally potent binding and degradation of conformationally diverse protein targets in vitro. Overall, our design strategy provides a modular toolkit for designing short binding linear peptides to any target protein without the reliance on stable and ordered tertiary structure, enabling generation of programmable modulators to undruggable and disordered proteins such as transcription factors and fusion oncoproteins.
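The generate-then-screen idea in this abstract can be illustrated with a small sketch: candidate latent vectors are produced by adding Gaussian noise to a seed embedding, then ranked by cosine similarity to a target embedding, which is the kind of score a CLIP-style contrastive model yields. This is not the PepPrCLIP implementation. The vectors here are plain lists of floats assumed to lie in one shared space already, whereas the real pipeline uses ESM-2 embeddings, trained encoders, and a decoding step back to peptide sequences. The names `perturb`, `cosine`, `rank_candidates`, and the `sigma` parameter are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def perturb(latent: Sequence[float], sigma: float, rng: random.Random) -> List[float]:
    """Add independent Gaussian noise (standard deviation sigma) to each coordinate."""
    return [x + rng.gauss(0.0, sigma) for x in latent]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(
    seed_latent: Sequence[float],
    target_embedding: Sequence[float],
    n_candidates: int = 100,
    sigma: float = 0.1,
    seed: int = 0,
) -> List[Tuple[float, List[float]]]:
    """Generate perturbed candidates and sort them by similarity to the target.

    Assumption: candidate and target vectors are directly comparable,
    which in a real contrastive model is what the learned encoders provide.
    """
    rng = random.Random(seed)
    candidates = [perturb(seed_latent, sigma, rng) for _ in range(n_candidates)]
    scored = [(cosine(c, target_embedding), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

In this toy setting, `sigma` controls how far candidates stray from the seed: small values keep them close to known peptides, larger values explore more of the latent space.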
14
Catacutan DB, Alexander J, Arnold A, Stokes JM. Machine learning in preclinical drug discovery. Nat Chem Biol 2024:10.1038/s41589-024-01679-1. [PMID: 39030362 DOI: 10.1038/s41589-024-01679-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2023] [Accepted: 06/13/2024] [Indexed: 07/21/2024]
Abstract
Drug-discovery and drug-development endeavors are laborious, costly and time consuming. These programs can take upward of 12 years and cost US $2.5 billion, with a failure rate of more than 90%. Machine learning (ML) presents an opportunity to improve the drug-discovery process. Indeed, with the growing abundance of public and private large-scale biological and chemical datasets, ML techniques are becoming well positioned as useful tools that can augment the traditional drug-development process. In this Perspective, we discuss the integration of algorithmic methods throughout the preclinical phases of drug discovery. Specifically, we highlight an array of ML-based efforts, across diverse disease areas, to accelerate initial hit discovery, mechanism-of-action (MOA) elucidation and chemical property optimization. With advances in the application of ML across diverse therapeutic areas, we posit that fully ML-integrated drug-discovery pipelines will define the future of drug-development programs.
Affiliation(s)
- Denise B Catacutan
  - Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Ontario, Canada
  - Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, Ontario, Canada
  - David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
- Jeremie Alexander
  - Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Ontario, Canada
  - Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, Ontario, Canada
  - David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
- Autumn Arnold
  - Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Ontario, Canada
  - Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, Ontario, Canada
  - David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
- Jonathan M Stokes
  - Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Ontario, Canada
  - Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, Ontario, Canada
  - David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
15
Jiang K, Yan Z, Di Bernardo M, Sgrizzi SR, Villiger L, Kayabolen A, Kim B, Carscadden JK, Hiraizumi M, Nishimasu H, Gootenberg JS, Abudayyeh OO. Rapid protein evolution by few-shot learning with a protein language model. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.07.17.604015. [PMID: 39071429 PMCID: PMC11275896 DOI: 10.1101/2024.07.17.604015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/30/2024]
Abstract
Directed evolution of proteins is critical for applications in basic biological research, therapeutics, diagnostics, and sustainability. However, directed evolution methods are labor intensive, cannot efficiently optimize over multiple protein properties, and are often trapped by local maxima. In silico-directed evolution methods incorporating protein language models (PLMs) have the potential to accelerate this engineering process, but current approaches fail to generalize across diverse protein families. We introduce EVOLVEpro, a few-shot active learning framework to rapidly improve protein activity using a combination of PLMs and protein activity predictors, achieving improved activity with as few as four rounds of evolution. EVOLVEpro substantially enhances the efficiency and effectiveness of in silico protein evolution, surpassing current state-of-the-art methods and yielding proteins with up to 100-fold improvement of desired properties. We showcase EVOLVEpro for five proteins across three applications: T7 RNA polymerase for RNA production, a miniature CRISPR nuclease, a prime editor, and an integrase for genome editing, and a monoclonal antibody for epitope binding. These results demonstrate the advantages of few-shot active learning with small amounts of experimental data over zero-shot predictions. EVOLVEpro paves the way for broader applications of AI-guided protein engineering in biology and medicine.
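The round-based loop that this abstract describes (measure a few variants, fit an activity predictor on their embeddings, then pick the next batch) can be sketched as follows. This is a toy illustration, not EVOLVEpro. The predictor is a plain k-nearest-neighbour average swapped in for whatever model the authors use, the embeddings are arbitrary float vectors rather than protein language model outputs, and `assay` is a hypothetical stand-in for a wet-lab measurement. All function and parameter names are assumptions.

```python
import random
from typing import Callable, Dict, List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_knn(
    query: Sequence[float],
    measured: Dict[int, float],
    embeddings: List[Sequence[float]],
    k: int = 3,
) -> float:
    """Predict activity as the mean activity of the k nearest measured variants."""
    nearest = sorted(measured, key=lambda i: sq_dist(query, embeddings[i]))[:k]
    return sum(measured[i] for i in nearest) / len(nearest)


def active_learning(
    embeddings: List[Sequence[float]],
    assay: Callable[[int], float],
    rounds: int = 4,
    batch_size: int = 5,
    seed: int = 0,
) -> Dict[int, float]:
    """Greedy few-shot loop: measure a random batch, then the top predictions each round."""
    if batch_size < 1 or batch_size > len(embeddings):
        raise ValueError("batch_size must be between 1 and the number of variants")
    rng = random.Random(seed)
    # Round 0: no data yet, so the first batch is chosen at random.
    measured = {i: assay(i) for i in rng.sample(range(len(embeddings)), batch_size)}
    for _ in range(rounds):
        unmeasured = [i for i in range(len(embeddings)) if i not in measured]
        if not unmeasured:
            break
        # Rank untested variants by predicted activity and test the best ones.
        ranked = sorted(
            unmeasured,
            key=lambda i: predict_knn(embeddings[i], measured, embeddings),
            reverse=True,
        )
        for i in ranked[:batch_size]:
            measured[i] = assay(i)
    return measured
```

A purely greedy selection like this one can get stuck near the first good variants it finds; practical schemes usually add some exploration, which is omitted here for brevity.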
Affiliation(s)
- Kaiyi Jiang
  - Department of Medicine Division of Engineering in Medicine Brigham and Women’s Hospital Harvard Medical School Boston, 02115 MA, USA
  - Gene and Cell Therapy Institute Mass General Brigham Cambridge, 02139 MA, USA
  - Center for Virology and Vaccine Research Beth Israel Deaconess Medical Center Harvard Medical School Boston, 02115 MA, USA
  - Department of Bioengineering Massachusetts Institute of Technology Cambridge, 02139 MA, USA
- Zhaoqing Yan
  - Department of Medicine Division of Engineering in Medicine Brigham and Women’s Hospital Harvard Medical School Boston, 02115 MA, USA
  - Gene and Cell Therapy Institute Mass General Brigham Cambridge, 02139 MA, USA
  - Center for Virology and Vaccine Research Beth Israel Deaconess Medical Center Harvard Medical School Boston, 02115 MA, USA
- Matteo Di Bernardo
  - Department of Bioengineering Massachusetts Institute of Technology Cambridge, 02139 MA, USA
- Samantha R. Sgrizzi
  - Department of Medicine Division of Engineering in Medicine Brigham and Women’s Hospital Harvard Medical School Boston, 02115 MA, USA
  - Gene and Cell Therapy Institute Mass General Brigham Cambridge, 02139 MA, USA
  - Center for Virology and Vaccine Research Beth Israel Deaconess Medical Center Harvard Medical School Boston, 02115 MA, USA
- Lukas Villiger
  - Department of Dermatology and Allergology Kantonspital St. Gallen St. Gallen, 9000, Switzerland
- Alisan Kayabolen
  - Department of Medicine Division of Engineering in Medicine Brigham and Women’s Hospital Harvard Medical School Boston, 02115 MA, USA
  - Gene and Cell Therapy Institute Mass General Brigham Cambridge, 02139 MA, USA
  - Center for Virology and Vaccine Research Beth Israel Deaconess Medical Center Harvard Medical School Boston, 02115 MA, USA
- Byungji Kim
  - Koch Institute for Integrative Cancer Research At MIT Massachusetts Institute of Technology Cambridge, 02139 MA, USA
- Josephine K. Carscadden
  - Department of Medicine Division of Engineering in Medicine Brigham and Women’s Hospital Harvard Medical School Boston, 02115 MA, USA
  - Gene and Cell Therapy Institute Mass General Brigham Cambridge, 02139 MA, USA
  - Center for Virology and Vaccine Research Beth Israel Deaconess Medical Center Harvard Medical School Boston, 02115 MA, USA
- Masahiro Hiraizumi
  - Department of Chemistry and Biotechnology, Graduate School of Engineering, The University of Tokyo 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
- Hiroshi Nishimasu
  - Department of Chemistry and Biotechnology, Graduate School of Engineering, The University of Tokyo 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
  - Structural Biology Division, Research Center for Advanced Science and Technology, The University of Tokyo 4-6-1 Komaba, Meguro-ku, Tokyo 153-8904, Japan
  - Inamori Research Institute for Science 620 Suiginya-cho, Shimogyo-ku, Kyoto 600-8411, Japan
- Jonathan S. Gootenberg
  - Department of Medicine Division of Engineering in Medicine Brigham and Women’s Hospital Harvard Medical School Boston, 02115 MA, USA
  - Gene and Cell Therapy Institute Mass General Brigham Cambridge, 02139 MA, USA
  - Center for Virology and Vaccine Research Beth Israel Deaconess Medical Center Harvard Medical School Boston, 02115 MA, USA
- Omar O. Abudayyeh
  - Department of Medicine Division of Engineering in Medicine Brigham and Women’s Hospital Harvard Medical School Boston, 02115 MA, USA
  - Gene and Cell Therapy Institute Mass General Brigham Cambridge, 02139 MA, USA
  - Center for Virology and Vaccine Research Beth Israel Deaconess Medical Center Harvard Medical School Boston, 02115 MA, USA
16
Zhang H, Zhou Y, Zhang Z, Sun H, Pan Z, Mou M, Zhang W, Ye Q, Hou T, Li H, Hsieh CY, Zhu F. Large Language Model-Based Natural Language Encoding Could Be All You Need for Drug Biomedical Association Prediction. Anal Chem 2024. [PMID: 39011990 DOI: 10.1021/acs.analchem.4c01793] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/17/2024]
Abstract
Analyzing drug-related interactions in the field of biomedicine has been a critical aspect of drug discovery and development. While various artificial intelligence (AI)-based tools have been proposed to analyze drug biomedical associations (DBAs), their feature encoding did not adequately account for crucial biomedical functions and semantic concepts, thereby still hindering their progress. Since the advent of ChatGPT by OpenAI in 2022, large language models (LLMs) have demonstrated rapid growth and significant success across various applications. Herein, LEDAP was introduced, which uniquely leveraged LLM-based biotext feature encoding for predicting drug-disease associations, drug-drug interactions, and drug-side effect associations. Benefiting from the large-scale knowledgebase pre-training, LLMs had great potential in drug development analysis owing to their holistic understanding of natural language and human topics. LEDAP illustrated its notable competitiveness in comparison with other popular DBA analysis tools. Specifically, even in simple conjunction with classical machine learning methods, LLM-based feature representations consistently enabled satisfactory performance across diverse DBA tasks like binary classification, multiclass classification, and regression. Our findings underpinned the considerable potential of LLMs in drug development research, indicating a catalyst for further progress in related fields.
Affiliation(s)
- Hanyu Zhang
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, State Key Laboratory of Advanced Drug Delivery and Release Systems, Zhejiang University, Hangzhou 310058, China
- Yuan Zhou
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Zhichao Zhang
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Huaicheng Sun
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Ziqi Pan
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Minjie Mou
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Wei Zhang
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Qing Ye
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Tingjun Hou
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Honglin Li
  - Innovation Center for AI and Drug Discovery, East China Normal University, Shanghai 200062, China
- Chang-Yu Hsieh
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Feng Zhu
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, State Key Laboratory of Advanced Drug Delivery and Release Systems, Zhejiang University, Hangzhou 310058, China
17
Kantroo P, Wagner GP, Machta BB. Pseudo-perplexity in One Fell Swoop for Protein Fitness Estimation. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.07.09.602754. [PMID: 39026871 PMCID: PMC11257618 DOI: 10.1101/2024.07.09.602754] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/20/2024]
Abstract
Protein language models trained on the masked language modeling objective learn to predict the identity of hidden amino acid residues within a sequence using the remaining observable sequence as context. They do so by embedding the residues into a high dimensional space that encapsulates the relevant contextual cues. These embedding vectors serve as an informative context-sensitive representation that not only aids with the defined training objective, but can also be used for other tasks by downstream models. We propose a scheme to use the embeddings of an unmasked sequence to estimate the corresponding masked probability vectors for all the positions in a single forward pass through the language model. This One Fell Swoop (OFS) approach allows us to efficiently estimate the pseudo-perplexity of the sequence, a measure of the model's uncertainty in its predictions, that can also serve as a fitness estimate. We find that ESM2 OFS pseudo-perplexity performs nearly as well as the true pseudo-perplexity at fitness estimation, and more notably it defines a new state of the art on the ProteinGym Indels benchmark. The strong performance of the fitness measure prompted us to investigate if it could be used to detect the elevated stability reported in reconstructed ancestral sequences. We find that this measure ranks ancestral reconstructions as more fit than extant sequences. Finally, we show that the computational efficiency of the technique allows for the use of Monte Carlo methods that can rapidly explore functional sequence space.
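For readers unfamiliar with the quantity, pseudo-perplexity is commonly defined as the exponential of the average negative log-probability that a masked language model assigns to each true residue when that position is hidden. The sketch below computes it from per-position probability vectors that are assumed to be available already, whether they come from one masked forward pass per position or from a single-pass estimate such as the OFS scheme described above. The function name `pseudo_perplexity` and the dictionary-based input format are assumptions made for illustration, not the authors' code.

```python
import math
from typing import Dict, List


def pseudo_perplexity(sequence: str, masked_probs: List[Dict[str, float]]) -> float:
    """Compute exp(-(1/L) * sum_i log p(x_i | rest of sequence)).

    masked_probs[i] maps each residue letter to the probability the model
    assigns to it at position i when that position is masked (hypothetical
    input format).
    """
    if not sequence or len(sequence) != len(masked_probs):
        raise ValueError("need one probability vector per residue")
    total_log_prob = 0.0
    for residue, probs in zip(sequence, masked_probs):
        p = probs.get(residue, 0.0)
        if p <= 0.0:
            # A zero-probability residue makes the perplexity unbounded.
            return math.inf
        total_log_prob += math.log(p)
    return math.exp(-total_log_prob / len(sequence))
```

As a sanity check on the definition, a model that is uniformly uncertain over the 20 standard amino acids gives a pseudo-perplexity of about 20, while a model that predicts every residue with certainty gives 1; lower values therefore indicate sequences the model finds more plausible.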
Affiliation(s)
- Pranav Kantroo
  - Computational Biology and Bioinformatics Program, Yale University, New Haven, CT-06520, USA
  - Quantitative Biology Institute, Yale University, New Haven, CT-06520, USA
- Günter P. Wagner
  - Emeritus, Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT-06520, USA
  - Department of Evolutionary Biology, University of Vienna, Djerassi Platz 1, A-1030 Vienna, Austria
  - Hagler Institute for Advanced Studies, Texas A&M, College Station, TX-77843, USA
- Benjamin B. Machta
  - Department of Physics, Yale University, New Haven, CT-06520, USA
  - Quantitative Biology Institute, Yale University, New Haven, CT-06520, USA
18
Zhou J, Huang M. Navigating the landscape of enzyme design: from molecular simulations to machine learning. Chem Soc Rev 2024. [PMID: 38990263 DOI: 10.1039/d4cs00196f] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/12/2024]
Abstract
Global environmental issues and sustainable development call for new technologies for fine chemical synthesis and waste valorization. Biocatalysis has attracted great attention as the alternative to the traditional organic synthesis. However, it is challenging to navigate the vast sequence space to identify those proteins with admirable biocatalytic functions. The recent development of deep-learning based structure prediction methods such as AlphaFold2 reinforced by different computational simulations or multiscale calculations has largely expanded the 3D structure databases and enabled structure-based design. While structure-based approaches shed light on site-specific enzyme engineering, they are not suitable for large-scale screening of potential biocatalysts. Effective utilization of big data using machine learning techniques opens up a new era for accelerated predictions. Here, we review the approaches and applications of structure-based and machine-learning guided enzyme design. We also provide our view on the challenges and perspectives on effectively employing enzyme design approaches integrating traditional molecular simulations and machine learning, and the importance of database construction and algorithm development in attaining predictive ML models to explore the sequence fitness landscape for the design of admirable biocatalysts.
Affiliation(s)
- Jiahui Zhou
  - School of Chemistry and Chemical Engineering, Queen's University, David Keir Building, Stranmillis Road, Belfast BT9 5AG, Northern Ireland, UK
- Meilan Huang
  - School of Chemistry and Chemical Engineering, Queen's University, David Keir Building, Stranmillis Road, Belfast BT9 5AG, Northern Ireland, UK
19
Kantroo P, Wagner GP, Machta BB. Pseudo-perplexity in One Fell Swoop for Protein Fitness Estimation. ARXIV 2024:arXiv:2407.07265v1. [PMID: 39040648 PMCID: PMC11261985] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/24/2024]
20
Wardman JF, Withers SG. Carbohydrate-active enzyme (CAZyme) discovery and engineering via (Ultra)high-throughput screening. RSC Chem Biol 2024; 5:595-616. [PMID: 38966674 PMCID: PMC11221537 DOI: 10.1039/d4cb00024b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2024] [Accepted: 05/16/2024] [Indexed: 07/06/2024] Open
Abstract
Carbohydrate-active enzymes (CAZymes) constitute a diverse set of enzymes that catalyze the assembly, degradation, and modification of carbohydrates. These enzymes have been fashioned into potent, selective catalysts by millennia of evolution, and yet are also highly adaptable and readily evolved in the laboratory. To identify and engineer CAZymes for different purposes, (ultra)high-throughput screening campaigns have been frequently utilized with great success. This review provides an overview of the different approaches taken in screening for CAZymes and how mechanistic understandings of CAZymes can enable new approaches to screening. Within, we also cover how cutting-edge techniques such as microfluidics, advances in computational approaches and synthetic biology, as well as novel assay designs are leading the field towards more informative and effective screening approaches.
Affiliation(s)
- Jacob F Wardman
  - Department of Biochemistry and Molecular Biology, University of British Columbia Vancouver BC V6T 1Z3 Canada
  - Michael Smith Laboratories, University of British Columbia Vancouver BC V6T 1Z4 Canada
- Stephen G Withers
  - Department of Biochemistry and Molecular Biology, University of British Columbia Vancouver BC V6T 1Z3 Canada
  - Michael Smith Laboratories, University of British Columbia Vancouver BC V6T 1Z4 Canada
  - Department of Chemistry, University of British Columbia Vancouver BC V6T 1Z1 Canada
21
Wang J, Watson JL, Lisanza SL. Protein Design Using Structure-Prediction Networks: AlphaFold and RoseTTAFold as Protein Structure Foundation Models. Cold Spring Harb Perspect Biol 2024; 16:a041472. [PMID: 38438190 PMCID: PMC11216169 DOI: 10.1101/cshperspect.a041472] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/06/2024]
Abstract
Designing proteins with tailored structures and functions is a long-standing goal in bioengineering. Recently, deep learning advances have enabled protein structure prediction at near-experimental accuracy, which has catalyzed progress in protein design as well. We review recent studies that use structure-prediction neural networks to design proteins, via approaches such as activation maximization, inpainting, or denoising diffusion. These methods have led to major improvements over previous methods in wet-lab success rates for designing protein binders, metalloproteins, enzymes, and oligomeric assemblies. These results show that structure-prediction models are a powerful foundation for developing protein-design tools and suggest that continued improvement of their accuracy and generality will be key to unlocking the full potential of protein design.
Affiliation(s)
- Jue Wang
  - Department of Biochemistry, University of Washington, Seattle, Washington 98195, USA
  - Institute for Protein Design, University of Washington, Seattle, Washington 98195, USA
  - Graduate Program in Biological Physics, Structure and Design, University of Washington, Seattle, Washington 98195, USA
  - DeepMind, London EC4A 3BF, United Kingdom
- Joseph L Watson
  - Department of Biochemistry, University of Washington, Seattle, Washington 98195, USA
  - Institute for Protein Design, University of Washington, Seattle, Washington 98195, USA
- Sidney L Lisanza
  - Department of Biochemistry, University of Washington, Seattle, Washington 98195, USA
  - Institute for Protein Design, University of Washington, Seattle, Washington 98195, USA
  - Graduate Program in Biological Physics, Structure and Design, University of Washington, Seattle, Washington 98195, USA
22
Si Y, Zou J, Gao Y, Chuai G, Liu Q, Chen L. Foundation models in molecular biology. BIOPHYSICS REPORTS 2024; 10:135-151. [PMID: 39027316 PMCID: PMC11252241 DOI: 10.52601/bpr.2024.240006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2024] [Accepted: 03/04/2024] [Indexed: 07/20/2024] Open
Abstract
Determining correlations between molecules at various levels is an important topic in molecular biology. Large language models have demonstrated a remarkable ability to capture correlations from large amounts of data in the field of natural language processing as well as image generation, and correlations captured from data using large language models can also be applicable to solving a wide range of specific tasks, hence large language models are also referred to as foundation models. The massive amount of data that exists in the field of molecular biology provides an excellent basis for the development of foundation models, and the recent emergence of foundation models in the field of molecular biology has really pushed the entire field forward. We summarize the foundation models developed based on RNA sequence data, DNA sequence data, protein sequence data, single-cell transcriptome data, and spatial transcriptome data respectively, and further discuss the research directions for the development of foundation models in molecular biology.
Affiliation(s)
- Yunda Si
- Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou 310024, China
| | - Jiawei Zou
- Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200031, China
| | - Yicheng Gao
- Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai 201804, China
| | - Guohui Chuai
- Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai 201804, China
| | - Qi Liu
- Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai 201804, China
| | - Luonan Chen
- Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou 310024, China
- Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200031, China
| |
|
23
|
Li H, Jiang L, Yang K, Shang S, Li M, Lv Z. iNP_ESM: Neuropeptide Identification Based on Evolutionary Scale Modeling and Unified Representation Embedding Features. Int J Mol Sci 2024; 25:7049. [PMID: 39000158 PMCID: PMC11240975 DOI: 10.3390/ijms25137049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2024] [Revised: 06/17/2024] [Accepted: 06/25/2024] [Indexed: 07/16/2024] Open
Abstract
Neuropeptides are biomolecules with crucial physiological functions. Accurate identification of neuropeptides is essential for understanding nervous system regulatory mechanisms. However, traditional analysis methods are expensive and laborious, and the development of effective machine learning models continues to be a subject of current research. Hence, in this research, we constructed an SVM-based machine learning neuropeptide predictor, iNP_ESM, by integrating protein language models Evolutionary Scale Modeling (ESM) and Unified Representation (UniRep) for the first time. Our model utilized feature fusion and feature selection strategies to improve prediction accuracy during optimization. In addition, we validated the effectiveness of the optimization strategy with UMAP (Uniform Manifold Approximation and Projection) visualization. iNP_ESM outperforms existing models on a variety of machine learning evaluation metrics, with an accuracy of up to 0.937 in cross-validation and 0.928 in independent testing, demonstrating optimal neuropeptide recognition capabilities. We anticipate improved neuropeptide data in the future, and we believe that the iNP_ESM model will have broader applications in the research and clinical treatment of neurological diseases.
Affiliation(s)
- Honghao Li
- College of Biomedical Engineering, Sichuan University, Chengdu 610041, China
| | - Liangzhen Jiang
- College of Food and Biological Engineering, Chengdu University, Chengdu 610106, China
- Country Key Laboratory of Coarse Cereal Processing, Ministry of Agriculture and Rural Affairs, Chengdu 610106, China
| | - Kaixiang Yang
- College of Software Engineering, Sichuan University, Chengdu 610041, China
| | - Shulin Shang
- College of Biomedical Engineering, Sichuan University, Chengdu 610041, China
| | - Mingxin Li
- College of Biomedical Engineering, Sichuan University, Chengdu 610041, China
| | - Zhibin Lv
- College of Biomedical Engineering, Sichuan University, Chengdu 610041, China
| |
|
24
|
Chen H, Fan X, Zhu S, Pei Y, Zhang X, Zhang X, Liu L, Qian F, Tian B. Accurate prediction of CDR-H3 loop structures of antibodies with deep learning. eLife 2024; 12:RP91512. [PMID: 38921957 PMCID: PMC11208048 DOI: 10.7554/elife.91512] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/27/2024] Open
Abstract
Accurate prediction of the structurally diverse complementarity determining region heavy chain 3 (CDR-H3) loop structure remains a primary and long-standing challenge for antibody modeling. Here, we present the H3-OPT toolkit for predicting the 3D structures of monoclonal antibodies and nanobodies. H3-OPT combines the strengths of AlphaFold2 with a pre-trained protein language model and provides an average Cα RMSD of 2.24 Å between predicted and experimentally determined CDR-H3 loops, thus outperforming other current computational methods in our non-redundant high-quality dataset. The model was validated by experimentally solving three structures of anti-VEGF nanobodies predicted by H3-OPT. We examined the potential applications of H3-OPT through analyzing antibody surface properties and antibody-antigen interactions. This structural prediction tool can be used to optimize antibody-antigen binding and engineer therapeutic antibodies with biophysical properties for specialized drug administration routes.
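As a reading aid for the RMSD figure quoted in this abstract, the following minimal sketch computes a Cα RMSD between a predicted and a reference loop. It assumes the two structures are already superimposed and that each loop is given as an equally long list of (x, y, z) Cα coordinates in Å; the function name `ca_rmsd` and the toy coordinates are illustrative and are not taken from H3-OPT, whose alignment protocol the abstract does not describe.

```python
import math

def ca_rmsd(pred, ref):
    """Root-mean-square deviation between two lists of (x, y, z) points.

    Assumption: the structures are already superimposed, so no
    alignment step is performed here.
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and equally long")
    # Sum of squared coordinate differences over all atoms and axes
    squared = sum(
        (p - r) ** 2
        for atom_p, atom_r in zip(pred, ref)
        for p, r in zip(atom_p, atom_r)
    )
    return math.sqrt(squared / len(pred))

# Toy example: every atom is displaced by exactly 1 Å along z
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
rmsd = ca_rmsd(predicted, reference)
print(f"{rmsd:.2f}")
```

In practice a superposition step (for example on the antibody framework residues) would normally precede this calculation, and that choice affects the reported value.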
Affiliation(s)
- Hedi Chen
- MOE Key Laboratory of Bioinformatics, State Key Laboratory of Molecular Oncology, School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
| | - Xiaoyu Fan
- MOE Key Laboratory of Bioinformatics, State Key Laboratory of Molecular Oncology, School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
| | - Shuqian Zhu
- MOE Key Laboratory of Bioinformatics, State Key Laboratory of Molecular Oncology, School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
| | - Yuchan Pei
- Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing, China
| | - Xiaochun Zhang
- MOE Key Laboratory of Bioinformatics, State Key Laboratory of Molecular Oncology, School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
| | - Xiaonan Zhang
- Department of Natural Language Processing, Baidu International Technology (Shenzhen) Co Ltd, Shenzhen, China
| | - Lihang Liu
- Department of Natural Language Processing, Baidu International Technology (Shenzhen) Co Ltd, Shenzhen, China
| | - Feng Qian
- MOE Key Laboratory of Bioinformatics, State Key Laboratory of Molecular Oncology, School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
| | - Boxue Tian
- MOE Key Laboratory of Bioinformatics, State Key Laboratory of Molecular Oncology, School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
| |
|
25
|
Kim HJ, Yang JH, Chang DG, Lenke LG, Pizones J, Castelein R, Watanabe K, Trobisch PD, Mundis GM, Suh SW, Suk SI. Assessing the Reproducibility of the Structured Abstracts Generated by ChatGPT and Bard Compared to Human-Written Abstracts in the Field of Spine Surgery: Comparative Analysis. J Med Internet Res 2024; 26:e52001. [PMID: 38924787 PMCID: PMC11237793 DOI: 10.2196/52001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/20/2023] [Revised: 01/15/2024] [Accepted: 04/26/2024] [Indexed: 06/28/2024] Open
Abstract
BACKGROUND Due to recent advances in artificial intelligence (AI), language model applications can generate logical text output that is difficult to distinguish from human writing. ChatGPT (OpenAI) and Bard (subsequently rebranded as "Gemini"; Google AI) were developed using distinct approaches, but little has been studied about the difference in their capability to generate the abstract. The use of AI to write scientific abstracts in the field of spine surgery is the center of much debate and controversy. OBJECTIVE The objective of this study is to assess the reproducibility of the structured abstracts generated by ChatGPT and Bard compared to human-written abstracts in the field of spine surgery. METHODS In total, 60 abstracts dealing with spine sections were randomly selected from 7 reputable journals and used as ChatGPT and Bard input statements to generate abstracts based on supplied paper titles. A total of 174 abstracts, divided into human-written abstracts, ChatGPT-generated abstracts, and Bard-generated abstracts, were evaluated for compliance with the structured format of journal guidelines and consistency of content. The likelihood of plagiarism and AI output was assessed using the iThenticate and ZeroGPT programs, respectively. A total of 8 reviewers in the spinal field evaluated 30 randomly extracted abstracts to determine whether they were produced by AI or human authors. RESULTS The proportion of abstracts that met journal formatting guidelines was greater among ChatGPT abstracts (34/60, 56.6%) compared with those generated by Bard (6/54, 11.1%; P<.001). However, a higher proportion of Bard abstracts (49/54, 90.7%) had word counts that met journal guidelines compared with ChatGPT abstracts (30/60, 50%; P<.001). The similarity index was significantly lower among ChatGPT-generated abstracts (20.7%) compared with Bard-generated abstracts (32.1%; P<.001). 
The AI-detection program predicted that 21.7% (13/60) of the human group, 63.3% (38/60) of the ChatGPT group, and 87% (47/54) of the Bard group were possibly generated by AI, with an area under the curve value of 0.863 (P<.001). The mean detection rate by human reviewers was 53.8% (SD 11.2%), achieving a sensitivity of 56.3% and a specificity of 48.4%. A total of 56.3% (63/112) of the actual human-written abstracts and 55.9% (62/128) of AI-generated abstracts were recognized as human-written and AI-generated by human reviewers, respectively. CONCLUSIONS Both ChatGPT and Bard can be used to help write abstracts, but most AI-generated abstracts are currently considered unethical due to high plagiarism and AI-detection rates. ChatGPT-generated abstracts appear to be superior to Bard-generated abstracts in meeting journal formatting guidelines. Because humans are unable to accurately distinguish abstracts written by humans from those produced by AI programs, it is crucial to exercise special caution and examine the ethical boundaries of using AI programs, including ChatGPT and Bard.
Affiliation(s)
- Hong Jin Kim
- Department of Orthopedic Surgery, Inje University Sanggye Paik Hospital, College of Medicine, Inje University, Seoul, Republic of Korea
| | - Jae Hyuk Yang
- Department of Orthopedic Surgery, Korea University Anam Hospital, College of Medicine, Korea University, Seoul, Republic of Korea
| | - Dong-Gune Chang
- Department of Orthopedic Surgery, Inje University Sanggye Paik Hospital, College of Medicine, Inje University, Seoul, Republic of Korea
| | - Lawrence G Lenke
- Department of Orthopedic Surgery, The Daniel and Jane Och Spine Hospital, Columbia University, New York, NY, United States
| | - Javier Pizones
- Department of Orthopedic Surgery, Hospital Universitario La Paz, Madrid, Spain
| | - René Castelein
- Department of Orthopedic Surgery, University Medical Centre Utrecht, Utrecht, Netherlands
| | - Kota Watanabe
- Department of Orthopedic Surgery, Keio University School of Medicine, Tokyo, Japan
| | - Per D Trobisch
- Department of Spine Surgery, Eifelklinik St. Brigida, Simmerath, Germany
| | - Gregory M Mundis
- Department of Orthopaedic Surgery, Scripps Clinic, La Jolla, CA, United States
| | - Seung Woo Suh
- Department of Orthopedic Surgery, Korea University Guro Hospital, College of Medicine, Korea University, Seoul, Republic of Korea
| | - Se-Il Suk
- Department of Orthopedic Surgery, Inje University Sanggye Paik Hospital, College of Medicine, Inje University, Seoul, Republic of Korea
| |
|
26
|
Sledzieski S, Kshirsagar M, Baek M, Dodhia R, Lavista Ferres J, Berger B. Democratizing protein language models with parameter-efficient fine-tuning. Proc Natl Acad Sci U S A 2024; 121:e2405840121. [PMID: 38900798 PMCID: PMC11214071 DOI: 10.1073/pnas.2405840121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2024] [Accepted: 05/09/2024] [Indexed: 06/22/2024] Open
Abstract
Proteomics has been revolutionized by large protein language models (PLMs), which learn unsupervised representations from large corpora of sequences. These models are typically fine-tuned in a supervised setting to adapt the model to specific downstream tasks. However, the computational and memory footprint of fine-tuning (FT) large PLMs presents a barrier for many research groups with limited computational resources. Natural language processing has seen a similar explosion in the size of models, where these challenges have been addressed by methods for parameter-efficient fine-tuning (PEFT). In this work, we introduce this paradigm to proteomics through leveraging the parameter-efficient method LoRA and training new models for two important tasks: predicting protein-protein interactions (PPIs) and predicting the symmetry of homooligomer quaternary structures. We show that these approaches are competitive with traditional FT while requiring reduced memory and substantially fewer parameters. We additionally show that for the PPI prediction task, training only the classification head also remains competitive with full FT, using five orders of magnitude fewer parameters, and that each of these methods outperform state-of-the-art PPI prediction methods with substantially reduced compute. We further perform a comprehensive evaluation of the hyperparameter space, demonstrate that PEFT of PLMs is robust to variations in these hyperparameters, and elucidate where best practices for PEFT in proteomics differ from those in natural language processing. All our model adaptation and evaluation code is available open-source at https://github.com/microsoft/peft_proteomics. Thus, we provide a blueprint to democratize the power of PLM adaptation to groups with limited computational resources.
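The low-rank idea behind LoRA, which this abstract relies on, can be illustrated with a small standard-library sketch. It is not the authors' code (their implementation is in the repository linked above); the layer sizes, rank, scaling factor `alpha`, and helper names below are all hypothetical choices for illustration. A frozen weight matrix W is left untouched, and only two small matrices A and B are trained, so the layer output becomes xWᵀ + (alpha/rank)·xAᵀBᵀ.

```python
import random

random.seed(0)

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

# Hypothetical sizes: an 8 -> 6 linear layer adapted with rank 2
d_in, d_out, rank = 8, 6, 2
alpha = 4.0  # hypothetical LoRA scaling factor

# Frozen pretrained weight (d_out x d_in); never updated during fine-tuning
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable low-rank factors: A gets a small random init, B starts at zero
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]

def lora_forward(x):
    """Frozen projection plus the scaled low-rank update."""
    base = matmul(x, transpose(W))                          # (n, d_out)
    update = matmul(matmul(x, transpose(A)), transpose(B))  # (n, d_out)
    scale = alpha / rank
    return [[b + scale * u for b, u in zip(row_b, row_u)]
            for row_b, row_u in zip(base, update)]

x = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(3)]
y = lora_forward(x)

full_params = d_out * d_in                # parameters touched by full fine-tuning
lora_params = rank * d_in + d_out * rank  # parameters trained by the adapter
print(len(y), len(y[0]), full_params, lora_params)
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly, and the trainable parameter count grows with the rank rather than with the product of the layer dimensions.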
Affiliation(s)
- Samuel Sledzieski
- AI for Good Research Lab, Microsoft Corporation, Redmond, WA 98052
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139
| | | | - Minkyung Baek
- Department of Biological Sciences, Seoul National University, Seoul 08826, South Korea
| | - Rahul Dodhia
- AI for Good Research Lab, Microsoft Corporation, Redmond, WA 98052
| | | | - Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139
| |
|
27
|
Sela M, Church JR, Schapiro I, Schneidman-Duhovny D. RhoMax: Computational Prediction of Rhodopsin Absorption Maxima Using Geometric Deep Learning. J Chem Inf Model 2024; 64:4630-4639. [PMID: 38829021 PMCID: PMC11200256 DOI: 10.1021/acs.jcim.4c00467] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2024] [Revised: 05/15/2024] [Accepted: 05/17/2024] [Indexed: 06/05/2024]
Abstract
Microbial rhodopsins (MRs) are a diverse and abundant family of photoactive membrane proteins that serve as model systems for biophysical techniques. Optogenetics utilizes genetic engineering to insert specialized proteins into specific neurons or brain regions, allowing for manipulation of their activity through light and enabling the mapping and control of specific brain areas in living organisms. The obstacle of optogenetics lies in the fact that light has a limited ability to penetrate biological tissues, particularly blue light in the visible spectrum. Despite this challenge, most optogenetic systems rely on blue light due to the scarcity of red-shifted opsins. Finding additional red-shifted rhodopsins would represent a major breakthrough in overcoming the challenge of limited light penetration in optogenetics. However, determining the wavelength absorption maxima for rhodopsins based on their protein sequence is a significant hurdle. Current experimental methods are time-consuming, while computational methods lack accuracy. The paper introduces a new computational approach called RhoMax that utilizes structure-based geometric deep learning to predict the absorption wavelength of rhodopsins solely based on their sequences. The method takes advantage of AlphaFold2 for accurate modeling of rhodopsin structures. Once trained on a balanced train set, RhoMax rapidly and precisely predicted the maximum absorption wavelength of more than half of the sequences in our test set with an accuracy of 0.03 eV. By leveraging computational methods for absorption maxima determination, we can drastically reduce the time needed for designing new red-shifted microbial rhodopsins, thereby facilitating advances in the field of optogenetics.
Affiliation(s)
- Meitar Sela
- The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel
| | - Jonathan R. Church
- Fritz Haber Center for Molecular Dynamics Research, Institute of Chemistry, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel
| | - Igor Schapiro
- Fritz Haber Center for Molecular Dynamics Research, Institute of Chemistry, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel
| | - Dina Schneidman-Duhovny
- The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel
| |
|
28
|
Fram B, Su Y, Truebridge I, Riesselman AJ, Ingraham JB, Passera A, Napier E, Thadani NN, Lim S, Roberts K, Kaur G, Stiffler MA, Marks DS, Bahl CD, Khan AR, Sander C, Gauthier NP. Simultaneous enhancement of multiple functional properties using evolution-informed protein design. Nat Commun 2024; 15:5141. [PMID: 38902262 PMCID: PMC11190266 DOI: 10.1038/s41467-024-49119-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2023] [Accepted: 05/24/2024] [Indexed: 06/22/2024] Open
Abstract
A major challenge in protein design is to augment existing functional proteins with multiple property enhancements. Altering several properties likely necessitates numerous primary sequence changes, and novel methods are needed to accurately predict combinations of mutations that maintain or enhance function. Models of sequence co-variation (e.g., EVcouplings), which leverage extensive information about various protein properties and activities from homologous protein sequences, have proven effective for many applications including structure determination and mutation effect prediction. We apply EVcouplings to computationally design variants of the model protein TEM-1 β-lactamase. Nearly all the 14 experimentally characterized designs were functional, including one with 84 mutations from the nearest natural homolog. The designs also had large increases in thermostability, increased activity on multiple substrates, and nearly identical structure to the wild type enzyme. This study highlights the efficacy of evolutionary models in guiding large sequence alterations to generate functional diversity for protein design applications.
Affiliation(s)
- Benjamin Fram
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
| | - Yang Su
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
| | - Ian Truebridge
- Institute for Protein Innovation, Boston, MA, USA
- Division of Hematology/Oncology, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
- AI Proteins, Boston, MA, USA
| | - Adam J Riesselman
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Program in Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - John B Ingraham
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
| | - Alessandro Passera
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Research Institute of Molecular Pathology (IMP), Vienna BioCenter (VBC), Campus-Vienna-Biocenter 1, 1030, Vienna, Austria
| | - Eve Napier
- School of Biochemistry and Immunology, Trinity College Dublin, Dublin 2, Ireland
| | - Nicole N Thadani
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Apriori Bio, Cambridge, MA, USA
| | - Samuel Lim
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
| | - Kristen Roberts
- Selux Diagnostics Inc., 56 Roland Street, Charlestown, MA, USA
| | - Gurleen Kaur
- Selux Diagnostics Inc., 56 Roland Street, Charlestown, MA, USA
| | - Michael A Stiffler
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Dyno Therapeutics, 343 Arsenal Street, Watertown, MA, USA
| | - Debora S Marks
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Christopher D Bahl
- Institute for Protein Innovation, Boston, MA, USA
- Division of Hematology/Oncology, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
- AI Proteins, Boston, MA, USA
| | - Amir R Khan
- School of Biochemistry and Immunology, Trinity College Dublin, Dublin 2, Ireland
- Division of Newborn Medicine, Boston Children's Hospital, Boston, MA, USA
| | - Chris Sander
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Nicholas P Gauthier
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| |
|
29
|
Huang D, Xie J. EMPDTA: An End-to-End Multimodal Representation Learning Framework with Pocket Online Detection for Drug-Target Affinity Prediction. Molecules 2024; 29:2912. [PMID: 38930976 PMCID: PMC11206982 DOI: 10.3390/molecules29122912] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2024] [Revised: 06/15/2024] [Accepted: 06/17/2024] [Indexed: 06/28/2024] Open
Abstract
Accurately predicting drug-target interactions is a critical yet challenging task in drug discovery. Traditionally, pocket detection and drug-target affinity prediction have been treated as separate aspects of drug-target interaction, with few methods combining these tasks within a unified deep learning system to accelerate drug development. In this study, we propose EMPDTA, an end-to-end framework that integrates protein pocket prediction and drug-target affinity prediction to provide a comprehensive understanding of drug-target interactions. The EMPDTA framework consists of three main modules: pocket online detection, multimodal representation learning for affinity prediction, and multi-task joint training. The performance and potential of the proposed framework have been validated across diverse benchmark datasets, achieving robust results in both tasks. Furthermore, the visualization results of the predicted pockets demonstrate accurate pocket detection, confirming the effectiveness of our framework.
Affiliation(s)
| | - Jiang Xie
- School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
| |
|
30
|
Calvanese F, Lambert CN, Nghe P, Zamponi F, Weigt M. Towards parsimonious generative modeling of RNA families. Nucleic Acids Res 2024; 52:5465-5477. [PMID: 38661206 PMCID: PMC11162787 DOI: 10.1093/nar/gkae289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2023] [Revised: 03/05/2024] [Accepted: 04/05/2024] [Indexed: 04/26/2024] Open
Abstract
Generative probabilistic models emerge as a new paradigm in data-driven, evolution-informed design of biomolecular sequences. This paper introduces a novel approach, called Edge Activation Direct Coupling Analysis (eaDCA), tailored to the characteristics of RNA sequences, with a strong emphasis on simplicity, efficiency, and interpretability. eaDCA explicitly constructs sparse coevolutionary models for RNA families, achieving performance levels comparable to more complex methods while utilizing a significantly lower number of parameters. Our approach demonstrates efficiency in generating artificial RNA sequences that closely resemble their natural counterparts in both statistical analyses and SHAPE-MaP experiments, and in predicting the effect of mutations. Notably, eaDCA provides a unique feature: estimating the number of potential functional sequences within a given RNA family. For example, in the case of cyclic di-AMP riboswitches (RF00379), our analysis suggests the existence of approximately 10^39 functional nucleotide sequences. While huge compared to the known <4000 natural sequences, this number represents only a tiny fraction of the vast pool of nearly 10^82 possible nucleotide sequences of the same length (136 nucleotides). These results underscore the promise of sparse and interpretable generative models, such as eaDCA, in enhancing our understanding of the expansive RNA sequence space.
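The scale comparison in this abstract can be checked with a few lines of arithmetic. The sketch below assumes a four-letter alphabet (A, C, G, U) with no gap symbol and takes the functional-sequence estimate and the sequence length directly from the abstract; the variable names are illustrative.

```python
import math

ALPHABET_SIZE = 4      # assumption: A, C, G, U only, no gap symbol
LENGTH = 136           # sequence length quoted in the abstract
LOG10_FUNCTIONAL = 39  # functional-sequence estimate quoted in the abstract

# log10 of the number of possible sequences: LENGTH * log10(4)
log10_total = LENGTH * math.log10(ALPHABET_SIZE)

# log10 of the fraction of sequence space estimated to be functional
log10_fraction = LOG10_FUNCTIONAL - log10_total

print(f"possible sequences  ~ 10^{log10_total:.1f}")
print(f"functional fraction ~ 10^{log10_fraction:.1f}")
```

Under these assumptions the total comes out just under 10^82, consistent with the figure in the abstract, and the estimated functional fraction is on the order of 10^-43.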
Affiliation(s)
- Francesco Calvanese
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative – LCQB, Paris, France
- Laboratoire de Biophysique et Evolution, UMR CNRS-ESPCI 8231 Chimie Biologie Innovation, PSL University, Paris, France
| | - Camille N Lambert
- Laboratoire de Biophysique et Evolution, UMR CNRS-ESPCI 8231 Chimie Biologie Innovation, PSL University, Paris, France
| | - Philippe Nghe
- Laboratoire de Biophysique et Evolution, UMR CNRS-ESPCI 8231 Chimie Biologie Innovation, PSL University, Paris, France
| | - Francesco Zamponi
- Dipartimento di Fisica, Sapienza Università di Roma, Rome, Italy
- Laboratoire de Physique de l’Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université de Paris, Paris, France
| | - Martin Weigt
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative – LCQB, Paris, France
| |
|
31
|
Zhai J, Gokaslan A, Schiff Y, Berthel A, Liu ZY, Miller ZR, Scheben A, Stitzer MC, Romay MC, Buckler ES, Kuleshov V. Cross-species modeling of plant genomes at single nucleotide resolution using a pre-trained DNA language model. bioRxiv 2024:2024.06.04.596709. [PMID: 38895432 PMCID: PMC11185591 DOI: 10.1101/2024.06.04.596709] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/21/2024]
Abstract
Understanding the function and fitness effects of diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and are thus expected to offer better cross-species prediction through fine-tuning on limited labeled data than supervised deep learning models. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a carefully curated dataset consisting of 16 diverse Angiosperm genomes. Fine-tuning PlantCaduceus on limited labeled Arabidopsis data for four tasks involving transcription and translation modeling demonstrated high transferability to maize, which diverged 160 million years ago, outperforming the best baseline model by 1.45-fold to 7.23-fold. PlantCaduceus also enables genome-wide deleterious mutation identification without multiple sequence alignment (MSA). PlantCaduceus demonstrated a threefold enrichment of rare alleles in prioritized deleterious mutations compared to MSA-based methods and matched state-of-the-art protein LMs. PlantCaduceus is a versatile pre-trained DNA LM expected to accelerate plant genomics and crop breeding applications.
Affiliation(s)
- Jingjing Zhai
- Institute for Genomic Diversity, Cornell University, Ithaca, NY USA 14853
| | - Aaron Gokaslan
- Department of Computer Science, Cornell University, Ithaca, NY, USA 14853
| | - Yair Schiff
- Department of Computer Science, Cornell University, Ithaca, NY, USA 14853
| | - Ana Berthel
- Institute for Genomic Diversity, Cornell University, Ithaca, NY USA 14853
| | - Zong-Yan Liu
- Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY USA 14853
| | - Zachary R. Miller
- Institute for Genomic Diversity, Cornell University, Ithaca, NY USA 14853
| | - Armin Scheben
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY USA 11724
| | | | - M. Cinta Romay
- Institute for Genomic Diversity, Cornell University, Ithaca, NY USA 14853
- Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY USA 14853
| | - Edward S. Buckler
- Institute for Genomic Diversity, Cornell University, Ithaca, NY USA 14853
- Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY USA 14853
- USDA-ARS; Ithaca, NY, USA 14853
| | - Volodymyr Kuleshov
- Department of Computer Science, Cornell University, Ithaca, NY, USA 14853
| |
|
32
|
Kalantar M, Kalanther I, Kumar S, Buxton EK, Raeeszadeh-Sarmazdeh M. Elucidating key determinants of engineered scFv antibody in MMP-9 binding using high throughput screening and machine learning. bioRxiv 2024:2024.06.04.597476. [PMID: 38895413 PMCID: PMC11185642 DOI: 10.1101/2024.06.04.597476] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/21/2024]
Abstract
An imbalance in matrix metalloproteinase-9 (MMP-9) regulation can lead to numerous diseases, including neurological disorders, cancer, and pre-term labor. Engineering single-chain antibody fragments (scFvs) targeting MMP-9 to develop novel therapeutics for such diseases is desirable. We screened a synthetic scFv antibody library displayed on the yeast surface for improved binding to MMP-9 using FACS (fluorescence-activated cell sorting). The scFv antibody clones isolated after FACS showed improved binding to MMP-9 compared to the endogenous inhibitor. To understand the molecular determinants of binding between engineered scFv antibody variants and MMP-9, next-generation DNA sequencing and computational protein structure analysis were used. Additionally, a deep-learning language model was trained on the synthetic library to predict the binding of scFv variants using their CDR-H3 sequences.
|
33
|
Meador K, Castells-Graells R, Aguirre R, Sawaya MR, Arbing MA, Sherman T, Senarathne C, Yeates TO. A suite of designed protein cages using machine learning and protein fragment-based protocols. Structure 2024; 32:751-765.e11. [PMID: 38513658 PMCID: PMC11162342 DOI: 10.1016/j.str.2024.02.017] [Received: 10/19/2023] [Revised: 01/22/2024] [Accepted: 02/23/2024] [Indexed: 03/23/2024]
Abstract
Designed protein cages and related materials provide unique opportunities for applications in biotechnology and medicine, but their creation remains challenging. Here, we apply computational approaches to design a suite of tetrahedrally symmetric, self-assembling protein cages. For the generation of docked conformations, we emphasize a protein fragment-based approach, while for sequence design of the de novo interface, a comparison of knowledge-based and machine learning protocols highlights the power and increased experimental success achieved using ProteinMPNN. An analysis of design outcomes provides insights for improving interface design protocols, including prioritizing fragment-based motifs, balancing interface hydrophobicity and polarity, and identifying preferred polar contact patterns. In all, we report five structures for seven protein cages, along with two structures of intermediate assemblies, with the highest resolution reaching 2.0 Å using cryo-EM. This set of designed cages adds substantially to the body of available protein nanoparticles, and to methodologies for their creation.
Affiliation(s)
- Kyle Meador
- Department of Chemistry and Biochemistry, University of California, Los Angeles, CA 90095, USA
| | | | - Roman Aguirre
- Department of Chemistry and Biochemistry, University of California, Los Angeles, CA 90095, USA
| | - Michael R Sawaya
- UCLA-DOE Institute for Genomics and Proteomics, Los Angeles, CA 90095, USA
| | - Mark A Arbing
- UCLA-DOE Institute for Genomics and Proteomics, Los Angeles, CA 90095, USA
| | - Trent Sherman
- Department of Chemistry and Biochemistry, University of California, Los Angeles, CA 90095, USA
| | - Chethaka Senarathne
- Department of Chemistry and Biochemistry, University of California, Los Angeles, CA 90095, USA
| | - Todd O Yeates
- Department of Chemistry and Biochemistry, University of California, Los Angeles, CA 90095, USA; UCLA-DOE Institute for Genomics and Proteomics, Los Angeles, CA 90095, USA.
|
34
|
Vincoff S, Goel S, Kholina K, Pulugurta R, Vure P, Chatterjee P. FusOn-pLM: A Fusion Oncoprotein-Specific Language Model via Focused Probabilistic Masking. bioRxiv 2024:2024.06.03.597245. [PMID: 38895377 PMCID: PMC11185609 DOI: 10.1101/2024.06.03.597245] [Indexed: 06/21/2024]
Abstract
Fusion oncoproteins, a class of chimeric proteins arising from chromosomal translocations, drive and sustain various cancers, particularly those impacting children. Unfortunately, due to their intrinsically disordered nature, large size, and lack of well-defined, druggable pockets, they have been historically challenging to target therapeutically: neither small molecule-based methods nor structure-based approaches for binder design are strong options for this class of molecules. Recently, protein language models (pLMs) have demonstrated success at representing protein sequences with information-rich embeddings, enabling downstream design applications from sequence alone. However, no current pLM has been trained on fusion oncoprotein sequences and thus may not produce optimal representations for these proteins. In this work, we introduce FusOn-pLM, a novel pLM that fine-tunes the state-of-the-art ESM-2 model on fusion oncoprotein sequences. We specifically introduce a novel masked language modeling (MLM) strategy, employing a binding-site probability predictor to focus masking on key amino acid residues, thereby generating more optimal fusion oncoprotein-aware embeddings. Our model improves performance on both fusion oncoprotein-specific benchmarks and disorder prediction tasks in comparison to baseline ESM-2 representations, as well as manually-constructed biophysical embeddings, motivating downstream usage of FusOn-pLM embeddings for therapeutic design tasks targeting these fusions. We have made our model publicly available to the community at https://huggingface.co/ChatterjeeLab/FusOn-pLM.
Affiliation(s)
| | - Shrey Goel
- Department of Computer Science, Duke University
| | | | | | - Pranay Vure
- Department of Biomedical Engineering, Duke University
| | - Pranam Chatterjee
- Department of Biomedical Engineering, Duke University
- Department of Computer Science, Duke University
- Department of Biostatistics and Bioinformatics, Duke University
|
35
|
Xu K, Feng H, Zhang H, He C, Kang H, Yuan T, Shi L, Zhou C, Hua G, Cao Y, Zuo Z, Zuo E. Structure-guided discovery of highly efficient cytidine deaminases with sequence-context independence. Nat Biomed Eng 2024:10.1038/s41551-024-01220-8. [PMID: 38831042 DOI: 10.1038/s41551-024-01220-8] [Received: 08/12/2023] [Accepted: 04/20/2024] [Indexed: 06/05/2024]
Abstract
The applicability of cytosine base editors is hindered by their dependence on sequence context and by off-target effects. Here, by using AlphaFold2 to predict the three-dimensional structure of 1,483 cytidine deaminases and by experimentally characterizing representative deaminases (selected from each structural cluster after categorizing them via partitional clustering), we report the discovery of a few deaminases with high editing efficiencies, diverse editing windows and increased ratios of on-target to off-target effects. Specifically, several deaminases induced C-to-T conversions with comparable efficiency at AC/TC/CC/GC sites, the deaminases could introduce stop codons in single-copy and multi-copy genes in mammalian cells without double-strand breaks, and some residue conversions at predicted DNA-interacting sites reduced off-target effects. Structure-based generative machine learning could be further leveraged to expand the applicability of base editors in gene therapies.
Affiliation(s)
- Kui Xu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen Chinese Academy of Agricultural Sciences, Shenzhen, China
| | - Hu Feng
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen Chinese Academy of Agricultural Sciences, Shenzhen, China
| | - Haihang Zhang
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen Chinese Academy of Agricultural Sciences, Shenzhen, China
| | - Chenfei He
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen Chinese Academy of Agricultural Sciences, Shenzhen, China
| | - Huifang Kang
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen Chinese Academy of Agricultural Sciences, Shenzhen, China
| | - Tanglong Yuan
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen Chinese Academy of Agricultural Sciences, Shenzhen, China
| | - Lei Shi
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen Chinese Academy of Agricultural Sciences, Shenzhen, China
| | - Chikai Zhou
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen Chinese Academy of Agricultural Sciences, Shenzhen, China
| | - Guoying Hua
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen Chinese Academy of Agricultural Sciences, Shenzhen, China
| | - Yaqi Cao
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen Chinese Academy of Agricultural Sciences, Shenzhen, China
| | - Zhenrui Zuo
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen Chinese Academy of Agricultural Sciences, Shenzhen, China
| | - Erwei Zuo
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen Chinese Academy of Agricultural Sciences, Shenzhen, China.
|
36
|
Han Y, Zhang H, Zeng Z, Liu Z, Lu D, Liu Z. Descriptor-augmented machine learning for enzyme-chemical interaction predictions. Synth Syst Biotechnol 2024; 9:259-268. [PMID: 38450325 PMCID: PMC10915406 DOI: 10.1016/j.synbio.2024.02.006] [Received: 11/01/2023] [Revised: 02/21/2024] [Accepted: 02/22/2024] [Indexed: 03/08/2024] Open
Abstract
Descriptors play a pivotal role in enzyme design for the greener synthesis of biochemicals, as they can characterize enzymes and chemicals from physicochemical and evolutionary perspectives. This study examined the effects of various descriptors on the performance of a Random Forest model used to predict enzyme-chemical relationships. We curated activity data for seven specific enzyme families from the literature and developed a pipeline for evaluating machine learning model performance using 10-fold cross-validation. The influence of protein and chemical descriptors was assessed in three scenarios: predicting the activity of unknown relations between known enzymes and known chemicals (new relationship evaluation), predicting the activity of novel enzymes on known chemicals (new enzyme evaluation), and predicting the activity of new chemicals on known enzymes (new chemical evaluation). The results showed that protein descriptors significantly enhanced the classification performance of the model in the new enzyme evaluation in three out of the seven datasets with the greatest number of enzymes, whereas chemical descriptors appeared to have no effect. A variety of sequence-based and structure-based protein descriptors were constructed, among which the ESM-2 descriptor achieved the best results. Using enzyme families as labels showed that descriptors could cluster proteins well, which could explain the contributions of descriptors to the machine learning model. Conversely, in the new chemical evaluation, chemical descriptors yielded significant improvements in four out of the seven datasets, while protein descriptors appeared to have no effect. We attempted to evaluate the generalization ability of the model by correlating the statistics of the datasets with the performance of the models. The results showed that datasets with higher sequence similarity were more likely to get better results in the new enzyme evaluation and datasets with more enzymes were more likely to benefit from the protein descriptor strategy. This work provides guidance for the development of machine learning models for specific enzyme families.
Affiliation(s)
- Yilei Han
- Department of Chemical Engineering, Tsinghua University, Beijing, 100084, China
| | - Haoye Zhang
- Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
| | - Zheni Zeng
- Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
| | - Zhiyuan Liu
- Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
| | - Diannan Lu
- Department of Chemical Engineering, Tsinghua University, Beijing, 100084, China
| | - Zheng Liu
- Department of Chemical Engineering, Tsinghua University, Beijing, 100084, China
|
37
|
Chen Z, Wang R, Guo J, Wang X. The role and future prospects of artificial intelligence algorithms in peptide drug development. Biomed Pharmacother 2024; 175:116709. [PMID: 38713945 DOI: 10.1016/j.biopha.2024.116709] [Received: 03/10/2024] [Revised: 05/01/2024] [Accepted: 05/02/2024] [Indexed: 05/09/2024] Open
Abstract
Peptide medications have become better known in recent years due to their many benefits, including low side effects, high biological activity, specificity, and effectiveness. Over 100 peptide medications have been introduced to the market to treat a variety of illnesses. Most of these peptide medications are developed on the basis of endogenous or natural peptides, which frequently require expensive, time-consuming, and extensive tests to confirm. As artificial intelligence advances quickly, it is now possible to build machine learning or deep learning models that screen a large number of candidate sequences for therapeutic peptides. Therapeutic peptides, such as those with antibacterial or anticancer properties, have been developed through the application of artificial intelligence algorithms. The process of finding and developing peptide drugs is outlined in this review, along with a few related cases that were helped by AI and conventional methods. These resources will open up new avenues for peptide drug development and discovery, helping to meet the pressing needs of clinical patients for disease treatment. Although peptide drugs are a new class of biopharmaceuticals distinct from chemical and small-molecule drugs, their clinical purpose and value cannot be ignored. However, traditional peptide drug research and development has a long development cycle and requires high investment; the AI-assisted (AI+) mode will substantially hasten the creation of peptide medications, offering a new boost for combating diseases.
Affiliation(s)
- Zhiheng Chen
- School of Biological Science and Medical Engineering, Beihang University, Beijing 100083, China.
| | - Ruoxi Wang
- School of Biological Science and Medical Engineering, Beihang University, Beijing 100083, China.
| | - Junqi Guo
- School of Biological Science and Medical Engineering, Beihang University, Beijing 100083, China.
| | - Xiaogang Wang
- Guangdong Provincial Key Laboratory of Bone and Joint Degenerative Diseases, The Third Affiliated Hospital of Southern Medical University, Guangzhou, Guangdong 510630, China.
|
38
|
Telenti A, Auli M, Hie BL, Maher C, Saria S, Ioannidis JPA. Large language models for science and medicine. Eur J Clin Invest 2024; 54:e14183. [PMID: 38381530 DOI: 10.1111/eci.14183] [Received: 12/22/2023] [Revised: 02/06/2024] [Accepted: 02/10/2024] [Indexed: 02/23/2024]
Abstract
Large language models (LLMs) are a type of machine learning model that learns statistical patterns over text, such as predicting the next words in a sequence of text. Both general purpose and task-specific LLMs have demonstrated potential across diverse applications. Science and medicine have many data types that are highly suitable for LLMs, such as scientific texts (publications, patents and textbooks), electronic medical records, large databases of DNA and protein sequences and chemical compounds. Carefully validated systems that can understand and reason across all these modalities may maximize benefits. Despite the inevitable limitations and caveats of any new technology and some uncertainties specific to LLMs, LLMs have the potential to be transformative in science and medicine.
Affiliation(s)
- Amalio Telenti
- Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, California, USA
- Vir Biotechnology, Inc., San Francisco, California, USA
| | | | - Brian L Hie
- FAIR, Meta, Menlo Park, California, USA
- Department of Chemical Engineering, Stanford University, Stanford, California, USA
| | - Cyrus Maher
- Vir Biotechnology, Inc., San Francisco, California, USA
| | - Suchi Saria
- Malone Center for Engineering and Healthcare, Johns Hopkins University, Baltimore, Maryland, USA
| | - John P A Ioannidis
- Department of Medicine, Stanford University, Stanford, California, USA
- Department of Epidemiology and Population Health, Stanford University, Stanford, California, USA
- Department of Biomedical Data Science, Stanford University, Stanford, California, USA
- Department of Statistics, Stanford University, Stanford, California, USA
- Meta-Research Innovation Center at Stanford (METRICS), Stanford University, Stanford, California, USA
|
39
|
Su Z, Dhusia K, Wu Y. Encoding the space of protein-protein binding interfaces by artificial intelligence. Comput Biol Chem 2024; 110:108080. [PMID: 38643609 DOI: 10.1016/j.compbiolchem.2024.108080] [Received: 12/15/2023] [Revised: 04/03/2024] [Accepted: 04/17/2024] [Indexed: 04/23/2024]
Abstract
The physical interactions between proteins are largely determined by the structural properties at their binding interfaces. It was found that the binding interfaces in distinctive protein complexes are highly similar. The structural properties underlying different binding interfaces could be further captured by artificial intelligence. In order to test this hypothesis, we broke protein-protein binding interfaces into pairs of interacting fragments. We employed a generative model to encode these interface fragment pairs in a low-dimensional latent space. After training, new conformations of interface fragment pairs were generated. We found that, by only using a small number of interface fragment pairs that were generated by artificial intelligence, we were able to guide the assembly of protein complexes into their native conformations. These results demonstrate that the conformational space of fragment pairs at protein-protein binding interfaces is highly degenerate. Features in this degenerate space can be well characterized by artificial intelligence. In summary, our machine learning method will be potentially useful to search for and predict the conformations of unknown protein-protein interactions.
Affiliation(s)
- Zhaoqian Su
- Data Science Institute, Vanderbilt University, 1001 19th Ave S, Nashville, TN 37212, USA
| | - Kalyani Dhusia
- Department of Systems and Computational Biology, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA
| | - Yinghao Wu
- Department of Systems and Computational Biology, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA.
|
40
|
Winnifrith A, Outeiral C, Hie BL. Generative artificial intelligence for de novo protein design. Curr Opin Struct Biol 2024; 86:102794. [PMID: 38663170 DOI: 10.1016/j.sbi.2024.102794] [Received: 12/21/2023] [Revised: 01/31/2024] [Accepted: 02/19/2024] [Indexed: 05/19/2024]
Abstract
Engineering new molecules with desirable functions and properties has the potential to extend our ability to engineer proteins beyond what nature has so far evolved. Advances in the so-called 'de novo' design problem have recently been brought forward by developments in artificial intelligence. Generative architectures, such as language models and diffusion processes, seem adept at generating novel, yet realistic proteins that display desirable properties and perform specified functions. State-of-the-art design protocols now achieve experimental success rates nearing 20%, thus widening the access to de novo designed proteins. Despite extensive progress, there are clear field-wide challenges, for example, in determining the best in silico metrics to prioritise designs for experimental testing, and in designing proteins that can undergo large conformational changes or be regulated by post-translational modifications. With an increase in the number of models being developed, this review provides a framework to understand how these tools fit into the overall process of de novo protein design. Throughout, we highlight the power of incorporating biochemical knowledge to improve performance and interpretability.
Affiliation(s)
- Adam Winnifrith
- Department of Biochemistry, University of Oxford, South Parks Rd, Oxford, OX1 3QU, United Kingdom; Evolvere Biosciences, Innovation Building, Old Road Campus, Oxford, OX3 7FZ, United Kingdom.
| | - Carlos Outeiral
- Department of Statistics, University of Oxford, 24-29 St Giles', Oxford OX1 3LB, United Kingdom.
| | - Brian L Hie
- Department of Chemical Engineering, Stanford University, 443 Via Ortega, Stanford, CA 94305, USA; Stanford Data Science, 475 Via Ortega, Stanford CA 94305, USA; Arc Institute, 3181 Porter Dr, Palo Alto, CA, USA.
|
41
|
Jones AA, Snow CD. Porous protein crystals: synthesis and applications. Chem Commun (Camb) 2024; 60:5790-5803. [PMID: 38756076 DOI: 10.1039/d4cc00183d] [Indexed: 05/18/2024]
Abstract
Large-pore protein crystals (LPCs) are an emerging class of biomaterials. The inherent diversity of proteins translates to a diversity of crystal lattice structures, many of which display large pores and solvent channels. These pores can, in turn, be functionalized via directed evolution and rational redesign based on the known crystal structures. LPCs possess extremely high solvent content, as well as extremely high surface area to volume ratios. Because of these characteristics, LPCs continue to be explored in diverse applications including catalysis, targeted therapeutic delivery, templating of nanostructures, and structural biology. This Feature review article will describe several of the existing platforms in detail, with particular focus on LPC synthesis approaches and reported applications.
Affiliation(s)
- Alec Arthur Jones
- School of Biomedical Engineering, Colorado State University, Fort Collins, CO 80523-1301, USA.
| | - Christopher D Snow
- School of Biomedical Engineering, Colorado State University, Fort Collins, CO 80523-1301, USA.
- Department of Chemical and Biological Engineering, Colorado State University, Fort Collins, CO 80523-1301, USA
|
42
|
Aguilera-Puga MDC, Plisson F. Structure-aware machine learning strategies for antimicrobial peptide discovery. Sci Rep 2024; 14:11995. [PMID: 38796582 PMCID: PMC11127937 DOI: 10.1038/s41598-024-62419-y] [Received: 02/08/2024] [Accepted: 05/16/2024] [Indexed: 05/28/2024] Open
Abstract
Machine learning models are revolutionizing our approaches to discovering and designing bioactive peptides. These models often need protein structure awareness, as they heavily rely on sequential data. The models excel at identifying sequences of a particular biological nature or activity, but they frequently fail to comprehend their intricate mechanism(s) of action. To solve two problems at once, we studied the mechanisms of action and structural landscape of antimicrobial peptides as (i) membrane-disrupting peptides, (ii) membrane-penetrating peptides, and (iii) protein-binding peptides. By analyzing critical features such as dipeptides and physicochemical descriptors, we developed models with high accuracy (86-88%) in predicting these categories. However, our initial models (1.0 and 2.0) exhibited a bias towards α-helical and coiled structures, influencing predictions. To address this structural bias, we implemented subset selection and data reduction strategies. The former gave three structure-specific models for peptides likely to fold into α-helices (models 1.1 and 2.1), coils (1.3 and 2.3), or mixed structures (1.4 and 2.4). The latter depleted over-represented structures, leading to structure-agnostic predictors 1.5 and 2.5. Additionally, our research highlights the sensitivity of important features to different structure classes across models.
Affiliation(s)
- Mariana D C Aguilera-Puga
- Department of Biotechnology and Biochemistry, Center for Research and Advanced Studies of the National Polytechnic Institute (CINVESTAV-IPN), Irapuato Unit, 36824, Irapuato, Guanajuato, Mexico
| | - Fabien Plisson
- Department of Biotechnology and Biochemistry, Center for Research and Advanced Studies of the National Polytechnic Institute (CINVESTAV-IPN), Irapuato Unit, 36824, Irapuato, Guanajuato, Mexico.
|
43
|
Jin R, Ye Q, Wang J, Cao Z, Jiang D, Wang T, Kang Y, Xu W, Hsieh CY, Hou T. AttABseq: an attention-based deep learning prediction method for antigen-antibody binding affinity changes based on protein sequences. Brief Bioinform 2024; 25:bbae304. [PMID: 38960407 PMCID: PMC11221889 DOI: 10.1093/bib/bbae304] [Received: 11/09/2023] [Revised: 04/15/2024] [Accepted: 06/11/2024] [Indexed: 07/05/2024] Open
Abstract
The optimization of therapeutic antibodies through traditional techniques, such as candidate screening via hybridoma or phage display, is resource-intensive and time-consuming. In recent years, computational and artificial intelligence-based methods have been actively developed to accelerate and improve the development of therapeutic antibodies. In this study, we developed an end-to-end sequence-based deep learning model, termed AttABseq, for predicting the antigen-antibody binding affinity changes associated with antibody mutations. AttABseq is a highly efficient and generic attention-based model that takes diverse antigen-antibody complex sequences as input to predict the binding affinity changes of residue mutations. The assessment on the three benchmark datasets illustrates that AttABseq is 120% more accurate than other sequence-based models in terms of the Pearson correlation coefficient between the predicted and experimental binding affinity changes. Moreover, AttABseq also either outperforms or competes favorably with the structure-based approaches. Furthermore, AttABseq consistently demonstrates robust predictive capabilities across a diverse array of conditions, underscoring its remarkable capacity for generalization across a wide spectrum of antigen-antibody complexes. It imposes no constraints on the quantity of altered residues, rendering it particularly applicable in scenarios where crystallographic structures remain unavailable. The attention-based interpretability analysis indicates that the causal effects of point mutations on antibody-antigen binding affinity changes can be visualized at the residue level, which might assist automated antibody sequence optimization. We believe that AttABseq provides a fiercely competitive answer to therapeutic antibody optimization.
Affiliation(s)
- Ruofan Jin
- College of Pharmaceutical Science, Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
- College of Life Science, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
| | - Qing Ye
- College of Pharmaceutical Science, Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
| | - Jike Wang
- College of Pharmaceutical Science, Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
| | - Zheng Cao
- College of Computer Science and Technology, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
| | - Dejun Jiang
- College of Pharmaceutical Science, Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
| | - Tianyue Wang
- College of Pharmaceutical Science, Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
| | - Yu Kang
- College of Pharmaceutical Science, Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
| | - Wanting Xu
- College of Pharmaceutical Science, Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
| | - Chang-Yu Hsieh
- College of Pharmaceutical Science, Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
| | - Tingjun Hou
- College of Pharmaceutical Science, Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Zhejiang University, Yuhangtang Road 866, Hangzhou 310058, Zhejiang, China
|
44
|
Jing H, Gao Z, Xu S, Shen T, Peng Z, He S, You T, Ye S, Lin W, Sun S. Accurate prediction of antibody function and structure using bio-inspired antibody language model. Brief Bioinform 2024; 25:bbae245. [PMID: 38797969 PMCID: PMC11128484 DOI: 10.1093/bib/bbae245] [Received: 10/12/2023] [Revised: 04/08/2024] [Accepted: 05/07/2024] [Indexed: 05/29/2024] Open
Abstract
In recent decades, antibodies have emerged as indispensable therapeutics for combating diseases, particularly viral infections. However, their development has been hindered by limited structural information and labor-intensive engineering processes. Fortunately, significant advancements in deep learning methods have facilitated the precise prediction of protein structure and function by leveraging co-evolution information from homologous proteins. Despite these advances, predicting the conformation of antibodies remains challenging due to their unique evolution and the high flexibility of their antigen-binding regions. Here, to address this challenge, we present the Bio-inspired Antibody Language Model (BALM). This model is trained on a vast dataset comprising 336 million 40% nonredundant unlabeled antibody sequences, capturing both unique and conserved properties specific to antibodies. Notably, BALM showcases exceptional performance across four antigen-binding prediction tasks. Moreover, we introduce BALMFold, an end-to-end method derived from BALM, capable of swiftly predicting full atomic antibody structures from individual sequences. Remarkably, BALMFold outperforms well-established methods such as AlphaFold2, IgFold, ESMFold and OmegaFold in the antibody benchmark, demonstrating significant potential to advance innovative engineering and streamline therapeutic antibody development by reducing the need for unnecessary trials. The BALMFold structure prediction server is freely available at https://beamlab-sh.com/models/BALMFold.
Affiliation(s)
- Hongtai Jing
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
- MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200032, China
- Zhengtao Gao
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Sheng Xu
- Shanghai AI Laboratory, Shanghai 200232, China
- Tao Shen
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Zelixir Biotech, Shanghai 201206, China
- Zhangzhi Peng
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Shwai He
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Tao You
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Shuang Ye
- Department of Gynecologic Oncology, Fudan University Shanghai Cancer Center, Shanghai 200032, China
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai 200032, China
- Wei Lin
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
- MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200032, China
- Shanghai AI Laboratory, Shanghai 200232, China
- School of Mathematical Sciences and Shanghai Center for Mathematical Sciences, Fudan University, Shanghai 200433, China
- Siqi Sun
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Shanghai AI Laboratory, Shanghai 200232, China
45
Song C, Zhang L. Intelligent Design of Antithrombotic Peptide Targeting Collagen. Langmuir 2024; 40:9661-9668. [PMID: 38664943 DOI: 10.1021/acs.langmuir.4c00543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2024]
Abstract
Binding of blood components to collagen was proved to be a key step in thrombus formation. Intelligent Design of Protein Matcher (IDProMat), a neural network model, was then developed based on the principle of seq2seq to design an antithrombotic peptide targeting collagen. The encoding and decoding of peptide sequence data and the interaction patterns of peptide chains at the interface were studied, and then, IDProMat was applied to the design of peptides to cover collagen. The 99.3% decrease in seq2seq loss and 58.3% decrease in MLP loss demonstrated that IDProMat learned the interaction patterns between residues at the binding interface. An efficient peptide, LRWNSYY, was then designed using this model. Validations on its binding on collagen and its inhibition of platelet adhesion were obtained using docking, MD simulations, and experimental approaches.
Affiliation(s)
- Changwei Song
- Department of Biochemical Engineering and Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (MOE), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300350, People's Republic of China
- Lin Zhang
- Department of Biochemical Engineering and Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (MOE), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300350, People's Republic of China
46
Omelchenko AA, Siwek JC, Chhibbar P, Arshad S, Nazarali I, Nazarali K, Rosengart A, Rahimikollu J, Tilstra J, Shlomchik MJ, Koes DR, Joglekar AV, Das J. Sliding Window INteraction Grammar (SWING): a generalized interaction language model for peptide and protein interactions. bioRxiv 2024:2024.05.01.592062. [PMID: 38746274 PMCID: PMC11092674 DOI: 10.1101/2024.05.01.592062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/16/2024]
Abstract
The explosion of sequence data has allowed the rapid growth of protein language models (pLMs). pLMs have now been employed in many frameworks including variant-effect and peptide-specificity prediction. Traditionally, for protein-protein or peptide-protein interactions (PPIs), corresponding sequences are either co-embedded followed by post-hoc integration or the sequences are concatenated prior to embedding. Interestingly, no method utilizes a language representation of the interaction itself. We developed an interaction LM (iLM), which uses a novel language to represent interactions between protein/peptide sequences. Sliding Window Interaction Grammar (SWING) leverages differences in amino acid properties to generate an interaction vocabulary. This vocabulary is the input into a LM followed by a supervised prediction step where the LM's representations are used as features. SWING was first applied to predicting peptide:MHC (pMHC) interactions. SWING was not only successful at generating Class I and Class II models that have comparable prediction to state-of-the-art approaches, but the unique Mixed Class model was also successful at jointly predicting both classes. Further, the SWING model trained only on Class I alleles was predictive for Class II, a complex prediction task not attempted by any existing approach. For de novo data, using only Class I or Class II data, SWING also accurately predicted Class II pMHC interactions in murine models of SLE (MRL/lpr model) and T1D (NOD model), that were validated experimentally. To further evaluate SWING's generalizability, we tested its ability to predict the disruption of specific protein-protein interactions by missense mutations. Although modern methods like AlphaMissense and ESM1b can predict interfaces and variant effects/pathogenicity per mutation, they are unable to predict interaction-specific disruptions. 
SWING was successful at accurately predicting the impact of both Mendelian mutations and population variants on PPIs. This is the first generalizable approach that can accurately predict interaction-specific disruptions by missense mutations with only sequence information. Overall, SWING is a first-in-class generalizable zero-shot iLM that learns the language of PPIs.
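The abstract describes building an "interaction vocabulary" from differences in amino acid properties over a sliding window. The following is only a toy sketch of that general idea, not the authors' implementation. It assumes Kyte-Doolittle hydropathy as the property, rounds the absolute difference to a single digit, and splits the result into k-mers; the function name `interaction_tokens` and all of its parameters are hypothetical.

```python
# Kyte-Doolittle hydropathy values (an assumed choice of property;
# the abstract does not say which amino acid properties SWING uses).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def interaction_tokens(peptide: str, protein: str, k: int = 3) -> list[str]:
    """Slide `peptide` along `protein` and return k-mer "words".

    At each window position, every peptide residue is paired with the
    protein residue under it, and the absolute hydropathy difference is
    rounded to one digit (0-9). The digit string for each window is then
    cut into overlapping k-mers.
    """
    words = []
    width = len(peptide)
    for start in range(len(protein) - width + 1):
        window = protein[start:start + width]
        encoded = "".join(
            str(min(9, round(abs(KD[p] - KD[w]))))
            for p, w in zip(peptide, window)
        )
        words.extend(encoded[i:i + k] for i in range(len(encoded) - k + 1))
    return words


if __name__ == "__main__":
    print(interaction_tokens("AL", "LAI", k=2))
```

In a pipeline like the one the abstract outlines, token lists of this kind would be fed to a language model whose representations serve as features for a supervised predictor; that step is not shown here.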
Affiliation(s)
- Alisa A. Omelchenko
- Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, PA, USA
- The joint CMU-Pitt PhD program in computational biology, School of Medicine, University of Pittsburgh, PA, USA
- Jane C. Siwek
- Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, PA, USA
- The joint CMU-Pitt PhD program in computational biology, School of Medicine, University of Pittsburgh, PA, USA
- Prabal Chhibbar
- Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Integrative systems biology PhD program, School of Medicine, University of Pittsburgh, PA, USA
- Sanya Arshad
- Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Iliyan Nazarali
- Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Kiran Nazarali
- Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- AnnaElaine Rosengart
- Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Javad Rahimikollu
- Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, PA, USA
- The joint CMU-Pitt PhD program in computational biology, School of Medicine, University of Pittsburgh, PA, USA
- Jeremy Tilstra
- Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Division of Rheumatology and Clinical Immunology, Department of Medicine, School of Medicine, University of Pittsburgh, PA, USA
- Mark J. Shlomchik
- Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- David R. Koes
- Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, PA, USA
- Alok V. Joglekar
- Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, PA, USA
- Jishnu Das
- Center for Systems immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, PA, USA
47
Peng D, Zheng L, Liu D, Han C, Wang X, Yang Y, Song L, Zhao M, Wei Y, Li J, Ye X, Wei Y, Feng Z, Huang X, Chen M, Gou Y, Xue Y, Zhang L. Large-language models facilitate discovery of the molecular signatures regulating sleep and activity. Nat Commun 2024; 15:3685. [PMID: 38693116 PMCID: PMC11063160 DOI: 10.1038/s41467-024-48005-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Accepted: 04/17/2024] [Indexed: 05/03/2024] Open
Abstract
Sleep, locomotor and social activities are essential animal behaviors, but their reciprocal relationships and underlying mechanisms remain poorly understood. Here, we elicit information from a cutting-edge large-language model (LLM), generative pre-trained transformer (GPT) 3.5, which interprets 10.2-13.8% of Drosophila genes known to regulate the 3 behaviors. We develop an instrument for simultaneous video tracking of multiple moving objects, and conduct a genome-wide screen. We have identified 758 fly genes that regulate sleep and activities, including mre11 which regulates sleep only in the presence of conspecifics, and NELF-B which regulates sleep regardless of whether conspecifics are present. Based on LLM-reasoning, an educated signal web is modeled for understanding of potential relationships between its components, presenting comprehensive molecular signatures that control sleep, locomotor and social activities. This LLM-aided strategy may also be helpful for addressing other complex scientific questions.
Affiliation(s)
- Di Peng
- Key Laboratory of Molecular Biophysics of Ministry of Education, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Liubin Zheng
- Key Laboratory of Molecular Biophysics of Ministry of Education, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Dan Liu
- Key Laboratory of Molecular Biophysics of Ministry of Education, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Cheng Han
- Key Laboratory of Molecular Biophysics of Ministry of Education, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Xin Wang
- Key Laboratory of Molecular Biophysics of Ministry of Education, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Yan Yang
- Key Laboratory of Molecular Biophysics of Ministry of Education, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Li Song
- Key Laboratory of Molecular Biophysics of Ministry of Education, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Miaoying Zhao
- Key Laboratory of Molecular Biophysics of Ministry of Education, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Yanfeng Wei
- Key Laboratory of Molecular Biophysics of Ministry of Education, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Jiayi Li
- Key Laboratory of Molecular Biophysics of Ministry of Education, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Xiaoxue Ye
- Key Laboratory of Molecular Biophysics of Ministry of Education, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Yuxiang Wei
- Key Laboratory of Molecular Biophysics of Ministry of Education, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Zihao Feng
- Key Laboratory of Molecular Biophysics of Ministry of Education, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Xinhe Huang
- Key Laboratory of Molecular Biophysics of Ministry of Education, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Miaomiao Chen
- Key Laboratory of Molecular Biophysics of Ministry of Education, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Yujie Gou
- Key Laboratory of Molecular Biophysics of Ministry of Education, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Yu Xue
- Key Laboratory of Molecular Biophysics of Ministry of Education, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China.
- Nanjing University Institute of Artificial Intelligence Biomedicine, Nanjing, Jiangsu, 210031, China.
- Luoying Zhang
- Key Laboratory of Molecular Biophysics of Ministry of Education, Hubei Bioinformatics and Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China.
- Hubei Province Key Laboratory of Oral and Maxillofacial Development and Regeneration, Wuhan, Hubei, 430022, China.
48
Callaway E. 'ChatGPT for CRISPR' creates new gene-editing tools. Nature 2024; 629:272. [PMID: 38684833 DOI: 10.1038/d41586-024-01243-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/02/2024]
49
Harrigan WL, Ferrell BD, Wommack KE, Polson SW, Schreiber ZD, Belcaid M. Improvements in viral gene annotation using large language models and soft alignments. BMC Bioinformatics 2024; 25:165. [PMID: 38664627 PMCID: PMC11046836 DOI: 10.1186/s12859-024-05779-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Accepted: 04/12/2024] [Indexed: 04/28/2024] Open
Abstract
BACKGROUND The annotation of protein sequences in public databases has long posed a challenge in molecular biology. This issue is particularly acute for viral proteins, which demonstrate limited homology to known proteins when using alignment, k-mer, or profile-based homology search approaches. A novel methodology employing Large Language Models (LLMs) addresses this methodological challenge by annotating protein sequences based on embeddings. RESULTS Central to our contribution is the soft alignment algorithm, drawing from traditional protein alignment but leveraging embedding similarity at the amino acid level to bypass the need for conventional scoring matrices. This method not only surpasses pooled embedding-based models in efficiency but also in interpretability, enabling users to easily trace homologous amino acids and delve deeper into the alignments. Far from being a black box, our approach provides transparent, BLAST-like alignment visualizations, combining traditional biological research with AI advancements to elevate protein annotation through embedding-based analysis while ensuring interpretability. Tests using the Virus Orthologous Groups and ViralZone protein databases indicated that the novel soft alignment approach recognized and annotated sequences that both blastp and pooling-based methods, which are commonly used for sequence annotation, failed to detect. CONCLUSION The embeddings approach shows the great potential of LLMs for enhancing protein sequence annotation, especially in viral genomics. These findings present a promising avenue for more efficient and accurate protein function inference in molecular biology.
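The abstract above describes a soft alignment that scores residue pairs by embedding similarity instead of a conventional scoring matrix. One way to illustrate the idea is a Smith-Waterman-style local alignment over cosine similarities. This is a minimal sketch under stated assumptions, not the authors' algorithm: the function names, the `gap` and `offset` parameters, and the toy 2-D embeddings are all hypothetical, and real per-residue embeddings would come from a protein language model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def soft_align(emb_a, emb_b, gap=0.5, offset=0.5):
    """Local alignment where the match score is cosine similarity minus `offset`.

    emb_a, emb_b: lists of per-residue embedding vectors.
    Returns (best_score, aligned_pairs) with 0-based residue index pairs.
    """
    n, m = len(emb_a), len(emb_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = cosine(emb_a[i - 1], emb_b[j - 1]) - offset
            score[i][j] = max(
                0.0,
                score[i - 1][j - 1] + s,  # align residue i with residue j
                score[i - 1][j] - gap,    # gap in sequence b
                score[i][j - 1] - gap,    # gap in sequence a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the local alignment starts.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        s = cosine(emb_a[i - 1], emb_b[j - 1]) - offset
        if score[i][j] == score[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


if __name__ == "__main__":
    # Toy 2-D "embeddings" standing in for language-model outputs.
    a = [[1, 0], [0, 1], [1, 1]]
    b = [[0, 1], [1, 1]]
    print(soft_align(a, b))
```

Because the traceback yields explicit residue pairs, the result can be rendered residue by residue, which is the kind of interpretability the abstract attributes to soft alignment over pooled embeddings.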
Affiliation(s)
- William L Harrigan
- Hawai'i Institute of Marine Biology, University of Hawai'i at Mānoa, Honolulu, HI, 96822, USA
- Barbra D Ferrell
- Department of Plant & Soil Sciences, University of Delaware, Newark, DE, 19713, USA
- K Eric Wommack
- Department of Plant & Soil Sciences, University of Delaware, Newark, DE, 19713, USA
- Shawn W Polson
- Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19713, USA
- Zachary D Schreiber
- Department of Plant & Soil Sciences, University of Delaware, Newark, DE, 19713, USA
- Mahdi Belcaid
- Department of Computer Science, University of Hawai'i at Mānoa, Honolulu, HI, 96822, USA.
50
Assessing the laboratory performance of AI-generated enzymes. Nat Biotechnol 2024:10.1038/s41587-024-02239-7. [PMID: 38653799 DOI: 10.1038/s41587-024-02239-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/25/2024]