151
Shuai RW, Ruffolo JA, Gray JJ. IgLM: Infilling language modeling for antibody sequence design. Cell Syst 2023; 14:979-989.e4. [PMID: 37909045 PMCID: PMC11018345 DOI: 10.1016/j.cels.2023.10.001] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Received: 01/09/2023] [Revised: 06/14/2023] [Accepted: 10/02/2023] [Indexed: 11/02/2023]
Abstract
Discovery and optimization of monoclonal antibodies for therapeutic applications relies on large sequence libraries but is hindered by developability issues such as low solubility, high aggregation, and high immunogenicity. Generative language models, trained on millions of protein sequences, are a powerful tool for the on-demand generation of realistic, diverse sequences. We present the Immunoglobulin Language Model (IgLM), a deep generative language model for creating synthetic antibody libraries. Compared with prior methods that leverage unidirectional context for sequence generation, IgLM formulates antibody design based on text-infilling in natural language, allowing it to re-design variable-length spans within antibody sequences using bidirectional context. We trained IgLM on 558 million (M) antibody heavy- and light-chain variable sequences, conditioning on each sequence's chain type and species of origin. We demonstrate that IgLM can generate full-length antibody sequences from a variety of species and its infilling formulation allows it to generate infilled complementarity-determining region (CDR) loop libraries with improved in silico developability profiles. A record of this paper's transparent peer review process is included in the supplemental information.
Affiliation(s)
- Richard W Shuai
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
- Jeffrey A Ruffolo
  - Program in Molecular Biophysics, The Johns Hopkins University, Baltimore, MD, USA
- Jeffrey J Gray
  - Program in Molecular Biophysics, The Johns Hopkins University, Baltimore, MD, USA; Department of Chemical and Biomolecular Engineering, The Johns Hopkins University, Baltimore, MD, USA
152
Nijkamp E, Ruffolo JA, Weinstein EN, Naik N, Madani A. ProGen2: Exploring the boundaries of protein language models. Cell Syst 2023; 14:968-978.e3. [PMID: 37909046 DOI: 10.1016/j.cels.2023.10.002] [Citation(s) in RCA: 28] [Impact Index Per Article: 28.0] [Received: 01/12/2023] [Revised: 05/01/2023] [Accepted: 10/02/2023] [Indexed: 11/02/2023]
Abstract
Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial-intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. Our models and code are open sourced for widespread adoption in protein engineering. A record of this paper's Transparent Peer Review process is included in the supplemental information.
Affiliation(s)
- Jeffrey A Ruffolo
  - Program in Molecular Biophysics, The Johns Hopkins University, Baltimore, MD, USA; Profluent Bio, Berkeley, CA, USA
- Eli N Weinstein
  - Data Science Institute, Columbia University, New York, NY, USA
- Ali Madani
  - Salesforce Research, Palo Alto, CA, USA; Profluent Bio, Berkeley, CA, USA
153
Khakzad H, Igashov I, Schneuing A, Goverde C, Bronstein M, Correia B. A new age in protein design empowered by deep learning. Cell Syst 2023; 14:925-939. [PMID: 37972559 DOI: 10.1016/j.cels.2023.10.006] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 03/02/2023] [Revised: 06/22/2023] [Accepted: 10/11/2023] [Indexed: 11/19/2023]
Abstract
The rapid progress in the field of deep learning has had a significant impact on protein design. Deep learning methods have recently produced a breakthrough in protein structure prediction, leading to the availability of high-quality models for millions of proteins. Along with novel architectures for generative modeling and sequence analysis, they have revolutionized the protein design field in the past few years remarkably by improving the accuracy and ability to identify novel protein sequences and structures. Deep neural networks can now learn and extract the fundamental features of protein structures, predict how they interact with other biomolecules, and have the potential to create new effective drugs for treating disease. As their applicability in protein design is rapidly growing, we review the recent developments and technology in deep learning methods and provide examples of their performance to generate novel functional proteins.
Affiliation(s)
- Hamed Khakzad
  - Université de Lorraine, CNRS, Inria, LORIA, 54000 Nancy, France; École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Ilia Igashov
  - École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Arne Schneuing
  - École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Casper Goverde
  - École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Bruno Correia
  - École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
154
Kim GB, Kim JY, Lee JA, Norsigian CJ, Palsson BO, Lee SY. Functional annotation of enzyme-encoding genes using deep learning with transformer layers. Nat Commun 2023; 14:7370. [PMID: 37963869 PMCID: PMC10645960 DOI: 10.1038/s41467-023-43216-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/28/2023] [Accepted: 11/03/2023] [Indexed: 11/16/2023] Open
Abstract
Functional annotation of open reading frames in microbial genomes remains substantially incomplete. Enzymes constitute the most prevalent functional gene class in microbial genomes and can be described by their specific catalytic functions using the Enzyme Commission (EC) number. Consequently, the ability to predict EC numbers could substantially reduce the number of un-annotated genes. Here we present a deep learning model, DeepECtransformer, which utilizes transformer layers as a neural network architecture to predict EC numbers. Using the extensively studied Escherichia coli K-12 MG1655 genome, DeepECtransformer predicted EC numbers for 464 un-annotated genes. We experimentally validated the enzymatic activities predicted for three proteins (YgfF, YciO, and YjdM). Further examination of the neural network's reasoning process revealed that the trained neural network relies on functional motifs of enzymes to predict EC numbers. Thus, DeepECtransformer is a method that facilitates the functional annotation of uncharacterized genes.
Affiliation(s)
- Gi Bae Kim
  - Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 34141, Republic of Korea
  - Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), KAIST, Daejeon, 34141, Republic of Korea
  - KAIST Institute for the BioCentury and KAIST Institute for Artificial Intelligence, KAIST, Daejeon, 34141, Republic of Korea
- Ji Yeon Kim
  - Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 34141, Republic of Korea
  - Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), KAIST, Daejeon, 34141, Republic of Korea
  - KAIST Institute for the BioCentury and KAIST Institute for Artificial Intelligence, KAIST, Daejeon, 34141, Republic of Korea
- Jong An Lee
  - Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 34141, Republic of Korea
  - Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), KAIST, Daejeon, 34141, Republic of Korea
  - KAIST Institute for the BioCentury and KAIST Institute for Artificial Intelligence, KAIST, Daejeon, 34141, Republic of Korea
- Charles J Norsigian
  - Division of Biological Sciences, University of California San Diego, La Jolla, CA, 92093, USA
  - Department of Bioengineering, University of California San Diego, La Jolla, CA, 92093, USA
- Bernhard O Palsson
  - Department of Bioengineering, University of California San Diego, La Jolla, CA, 92093, USA
  - Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, 92093, USA
  - Novo Nordisk Foundation Center for Biosustainability, 2800, Kongens Lyngby, Denmark
- Sang Yup Lee
  - Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 34141, Republic of Korea
  - Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), KAIST, Daejeon, 34141, Republic of Korea
  - KAIST Institute for the BioCentury and KAIST Institute for Artificial Intelligence, KAIST, Daejeon, 34141, Republic of Korea
  - BioProcess Engineering Research Center and BioInformatics Research Center, KAIST, Daejeon, 34141, Republic of Korea
155
Rix G, Williams RL, Spinner H, Hu VJ, Marks DS, Liu CC. Continuous evolution of user-defined genes at 1-million-times the genomic mutation rate. bioRxiv 2023:2023.11.13.566922. [PMID: 38014077 PMCID: PMC10680746 DOI: 10.1101/2023.11.13.566922] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
When nature maintains or evolves a gene's function over millions of years at scale, it produces a diversity of homologous sequences whose patterns of conservation and change contain rich structural, functional, and historical information about the gene. However, natural gene diversity likely excludes vast regions of functional sequence space and includes phylogenetic and evolutionary eccentricities, limiting what information we can extract. We introduce an accessible experimental approach for compressing long-term gene evolution to laboratory timescales, allowing for the direct observation of extensive adaptation and divergence followed by inference of structural, functional, and environmental constraints for any selectable gene. To enable this approach, we developed a new orthogonal DNA replication (OrthoRep) system that durably hypermutates chosen genes at a rate of >10⁻⁴ substitutions per base in vivo. When OrthoRep was used to evolve a conditionally essential maladapted enzyme, we obtained thousands of unique multi-mutation sequences with many pairs >60 amino acids apart (>15% divergence), revealing known and new factors influencing enzyme adaptation. The fitness of evolved sequences was not predictable by advanced machine learning models trained on natural variation. We suggest that OrthoRep supports the prospective and systematic discovery of constraints shaping gene evolution, uncovering of new regions in fitness landscapes, and general applications in biomolecular engineering.
156
Kouba P, Kohout P, Haddadi F, Bushuiev A, Samusevich R, Sedlar J, Damborsky J, Pluskal T, Sivic J, Mazurenko S. Machine Learning-Guided Protein Engineering. ACS Catal 2023; 13:13863-13895. [PMID: 37942269 PMCID: PMC10629210 DOI: 10.1021/acscatal.3c02743] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Received: 06/15/2023] [Revised: 09/20/2023] [Indexed: 11/10/2023]
Abstract
Recent progress in engineering highly promising biocatalysts has increasingly involved machine learning methods. These methods leverage existing experimental and simulation data to aid in the discovery and annotation of promising enzymes, as well as in suggesting beneficial mutations for improving known targets. The field of machine learning for protein engineering is gathering steam, driven by recent success stories and notable progress in other areas. It already encompasses ambitious tasks such as understanding and predicting protein structure and function, catalytic efficiency, enantioselectivity, protein dynamics, stability, solubility, aggregation, and more. Nonetheless, the field is still evolving, with many challenges to overcome and questions to address. In this Perspective, we provide an overview of ongoing trends in this domain, highlight recent case studies, and examine the current limitations of machine learning-based methods. We emphasize the crucial importance of thorough experimental validation of emerging models before their use for rational protein design. We present our opinions on the fundamental problems and outline the potential directions for future research.
Affiliation(s)
- Petr Kouba
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
  - Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
  - Faculty of Electrical Engineering, Czech Technical University in Prague, Technicka 2, 166 27 Prague 6, Czech Republic
- Pavel Kohout
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
  - International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
- Faraneh Haddadi
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
  - International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
- Anton Bushuiev
  - Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
- Raman Samusevich
  - Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
  - Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo nám. 2, 160 00 Prague 6, Czech Republic
- Jiri Sedlar
  - Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
- Jiri Damborsky
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
  - International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
- Tomas Pluskal
  - Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo nám. 2, 160 00 Prague 6, Czech Republic
- Josef Sivic
  - Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
- Stanislav Mazurenko
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
  - International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
157
Braun M, Gruber CC, Krassnigg A, Kummer A, Lutz S, Oberdorfer G, Siirola E, Snajdrova R. Accelerating Biocatalysis Discovery with Machine Learning: A Paradigm Shift in Enzyme Engineering, Discovery, and Design. ACS Catal 2023; 13:14454-14469. [PMID: 37942268 PMCID: PMC10629211 DOI: 10.1021/acscatal.3c03417] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 07/25/2023] [Revised: 09/29/2023] [Accepted: 10/03/2023] [Indexed: 11/10/2023]
Abstract
Emerging computational tools promise to revolutionize protein engineering for biocatalytic applications and accelerate the development timelines previously needed to optimize an enzyme to its more efficient variant. For over a decade, the benefits of predictive algorithms have helped scientists and engineers navigate the complexity of functional protein sequence space. More recently, spurred by dramatic advances in underlying computational tools, the promise of faster, cheaper, and more accurate enzyme identification, characterization, and engineering has catapulted terms such as artificial intelligence and machine learning to the must-have vocabulary in the field. This Perspective aims to showcase the current status of applications in pharmaceutical industry and also to discuss and celebrate the innovative approaches in protein science by highlighting their potential in selected recent developments and offering thoughts on future opportunities for biocatalysis. It also critically assesses the technology's limitations, unanswered questions, and unmet challenges.
Affiliation(s)
- Markus Braun
  - Department of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010 Graz, Austria
- Christian C Gruber
  - Enzyme and Drug Discovery, Innophore, 1700 Montgomery Street, San Francisco, California 94111, United States
- Andreas Krassnigg
  - Enzyme and Drug Discovery, Innophore, 1700 Montgomery Street, San Francisco, California 94111, United States
- Arkadij Kummer
  - Moderna, Inc., 200 Technology Square, Cambridge, Massachusetts 02139, United States
- Stefan Lutz
  - Codexis Inc., 200 Penobscot Drive, Redwood City, California 94063, United States
- Gustav Oberdorfer
  - Department of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010 Graz, Austria
- Elina Siirola
  - Novartis Institute for Biomedical Research, Global Discovery Chemistry, Basel CH-4108, Switzerland
- Radka Snajdrova
  - Novartis Institute for Biomedical Research, Global Discovery Chemistry, Basel CH-4108, Switzerland
158
Braet F, Poger D. Let's have a chat about chatbot(s) in (biological) microscopy. J Microsc 2023; 292:59-63. [PMID: 37742291 DOI: 10.1111/jmi.13230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2023] [Revised: 08/30/2023] [Accepted: 09/20/2023] [Indexed: 09/26/2023]
Affiliation(s)
- Filip Braet
  - School of Medical Sciences (Molecular and Cellular Biomedicine), The University of Sydney, New South Wales, Australia
  - Australian Centre for Microscopy and Microanalysis, The University of Sydney, Sydney, New South Wales, Australia
- David Poger
  - Microscopy Australia, The University of Sydney, Sydney, New South Wales, Australia
159
Romero-Romero S, Lindner S, Ferruz N. Exploring the Protein Sequence Space with Global Generative Models. Cold Spring Harb Perspect Biol 2023; 15:a041471. [PMID: 37848247 PMCID: PMC10626256 DOI: 10.1101/cshperspect.a041471] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2023]
Abstract
Recent advancements in specialized large-scale architectures for training images and language have profoundly impacted the field of computer vision and natural language processing (NLP). Language models, such as the recent ChatGPT and GPT-4, have demonstrated exceptional capabilities in processing, translating, and generating human language. These breakthroughs have also been reflected in protein research, leading to the rapid development of numerous new methods in a short time, with unprecedented performance. Several of these models have been developed with the goal of generating sequences in novel regions of the protein space. In this work, we provide an overview of the use of protein generative models, reviewing (1) language models for the design of novel artificial proteins, (2) works that use non-transformer architectures, and (3) applications in directed evolution approaches.
Affiliation(s)
- Noelia Ferruz
  - Barcelona Institute of Molecular Biology, 08028 Barcelona, Spain
160
Capponi S, Daniels KG. Harnessing the power of artificial intelligence to advance cell therapy. Immunol Rev 2023; 320:147-165. [PMID: 37415280 DOI: 10.1111/imr.13236] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 05/02/2023] [Accepted: 06/17/2023] [Indexed: 07/08/2023]
Abstract
Cell therapies are powerful technologies in which human cells are reprogrammed for therapeutic applications such as killing cancer cells or replacing defective cells. The technologies underlying cell therapies are increasing in effectiveness and complexity, making rational engineering of cell therapies more difficult. Creating the next generation of cell therapies will require improved experimental approaches and predictive models. Artificial intelligence (AI) and machine learning (ML) methods have revolutionized several fields in biology including genome annotation, protein structure prediction, and enzyme design. In this review, we discuss the potential of combining experimental library screens and AI to build predictive models for the development of modular cell therapy technologies. Advances in DNA synthesis and high-throughput screening techniques enable the construction and screening of libraries of modular cell therapy constructs. AI and ML models trained on this screening data can accelerate the development of cell therapies by generating predictive models, design rules, and improved designs.
Affiliation(s)
- Sara Capponi
  - Department of Functional Genomics and Cellular Engineering, IBM Almaden Research Center, San Jose, California, USA
  - Center for Cellular Construction, San Francisco, California, USA
- Kyle G Daniels
  - Department of Cellular and Molecular Pharmacology, University of California, San Francisco, California, USA
  - Department of Genetics, Stanford University School of Medicine, Stanford, California, USA
161
Zhang Y, Guan J, Li C, Wang Z, Deng Z, Gasser RB, Song J, Ou HY. DeepSecE: A Deep-Learning-Based Framework for Multiclass Prediction of Secreted Proteins in Gram-Negative Bacteria. Research (Washington, D.C.) 2023; 6:0258. [PMID: 37886621 PMCID: PMC10599158 DOI: 10.34133/research.0258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2023] [Accepted: 10/08/2023] [Indexed: 10/28/2023]
Abstract
Proteins secreted by Gram-negative bacteria are tightly linked to the virulence and adaptability of these microbes to environmental changes. Accurate identification of such secreted proteins can facilitate the investigations of infections and diseases caused by these bacterial pathogens. However, current bioinformatic methods for predicting bacterial secreted substrate proteins have limited computational efficiency and application scope on a genome-wide scale. Here, we propose a novel deep-learning-based framework, DeepSecE, for the simultaneous inference of multiple distinct groups of secreted proteins produced by Gram-negative bacteria. DeepSecE remarkably improves their classification from nonsecreted proteins using a pretrained protein language model and transformer, achieving a macro-average accuracy of 0.883 on 5-fold cross-validation. Performance benchmarking suggests that DeepSecE achieves competitive performance with the state-of-the-art binary predictors specialized for individual types of secreted substrates. The attention mechanism corroborates salient patterns and motifs at the N or C termini of the protein sequences. Using this pipeline, we further investigate the genome-wide prediction of novel secreted proteins and their taxonomic distribution across ~1,000 Gram-negative bacterial genomes. The present analysis demonstrates that DeepSecE has major potential for the discovery of disease-associated secreted proteins in a diverse range of Gram-negative bacteria. An online web server of DeepSecE is also publicly available to predict and explore various secreted substrate proteins via the input of bacterial genome sequences.
Affiliation(s)
- Yumeng Zhang
  - State Key Laboratory of Microbial Metabolism, Joint International Laboratory on Metabolic & Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
  - Shanghai Key Laboratory of Veterinary Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Jiahao Guan
  - State Key Laboratory of Microbial Metabolism, Joint International Laboratory on Metabolic & Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Chen Li
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Zhikang Wang
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
- Zixin Deng
  - State Key Laboratory of Microbial Metabolism, Joint International Laboratory on Metabolic & Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Robin B. Gasser
  - Melbourne Veterinary School, Faculty of Science, The University of Melbourne, Parkville, VIC 3010, Australia
- Jiangning Song
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
  - Melbourne Veterinary School, Faculty of Science, The University of Melbourne, Parkville, VIC 3010, Australia
- Hong-Yu Ou
  - State Key Laboratory of Microbial Metabolism, Joint International Laboratory on Metabolic & Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
  - Shanghai Key Laboratory of Veterinary Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
162
Thurimella K, Mohamed AMT, Graham DB, Owens RM, La Rosa SL, Plichta DR, Bacallado S, Xavier RJ. Protein Language Models Uncover Carbohydrate-Active Enzyme Function in Metagenomics. bioRxiv 2023:2023.10.23.563620. [PMID: 37961379 PMCID: PMC10634757 DOI: 10.1101/2023.10.23.563620] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]
Abstract
In metagenomics, the pool of uncharacterized microbial enzymes presents a challenge for functional annotation. Among these, carbohydrate-active enzymes (CAZymes) stand out due to their pivotal roles in various biological processes related to host health and nutrition. Here, we present CAZyLingua, the first tool that harnesses protein language model embeddings to build a deep learning framework that facilitates the annotation of CAZymes in metagenomic datasets. Our benchmarking results showed on average a higher F1 score (reflecting an average of precision and recall) on the annotated genomes of Bacteroides thetaiotaomicron, Eggerthella lenta and Ruminococcus gnavus compared to the traditional sequence homology-based method in dbCAN2. We applied our tool to a paired mother/infant longitudinal dataset and revealed unannotated CAZymes linked to microbial development during infancy. When applied to metagenomic datasets derived from patients affected by fibrosis-prone diseases such as Crohn's disease and IgG4-related disease, CAZyLingua uncovered CAZymes associated with disease and healthy states. In each of these metagenomic catalogs, CAZyLingua discovered new annotations that were previously overlooked by traditional sequence homology tools. Overall, the deep learning model CAZyLingua can be applied in combination with existing tools to unravel intricate CAZyme evolutionary profiles and patterns, contributing to a more comprehensive understanding of microbial metabolic dynamics.
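Note on the F1 score mentioned in the abstract above: the "average of precision and recall" it refers to is conventionally the harmonic mean. A minimal sketch in Python follows; the function name and the numbers in the usage lines are illustrative assumptions and are not taken from the CAZyLingua benchmark.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    Returns 0.0 when both inputs are zero, to avoid dividing by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values, for illustration only.
balanced = f1_score(0.5, 0.5)   # equal precision and recall give the same value back
lopsided = f1_score(0.9, 0.6)   # the harmonic mean sits below the arithmetic mean (0.75)
print(balanced, lopsided)
```

Because the harmonic mean is pulled toward the smaller of the two inputs, a tool cannot reach a high F1 by maximizing precision or recall alone.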
Collapse
Affiliation(s)
- Kumar Thurimella
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Center for Computational and Integrative Biology and Department of Molecular Biology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge, UK
- School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
| | - Ahmed M. T. Mohamed
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Center for Computational and Integrative Biology and Department of Molecular Biology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
| | - Daniel B. Graham
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Center for Computational and Integrative Biology and Department of Molecular Biology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
| | - Róisín M. Owens
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge, UK
| | - Sabina Leanti La Rosa
- Faculty of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, Ås, Norway
| | - Damian R. Plichta
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Center for Computational and Integrative Biology and Department of Molecular Biology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
| | - Sergio Bacallado
- Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, Cambridge, UK
| | - Ramnik J. Xavier
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Center for Computational and Integrative Biology and Department of Molecular Biology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
| |
|
163
|
Lopez-Martinez E, Manteca A, Ferruz N, Cortajarena AL. Statistical Analysis and Tokenization of Epitopes to Construct Artificial Neoepitope Libraries. ACS Synth Biol 2023; 12:2812-2818. [PMID: 37703075 PMCID: PMC10594869 DOI: 10.1021/acssynbio.3c00201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Indexed: 09/14/2023]
Abstract
Epitopes are specific regions on an antigen's surface that the immune system recognizes. Epitopes are usually protein regions on foreign immune-stimulating entities such as viruses and bacteria, and in some cases, endogenous proteins may act as antigens. Identifying epitopes is crucial for accelerating the development of vaccines and immunotherapies. However, mapping epitopes in pathogen proteomes is challenging using conventional methods. Screening artificial neoepitope libraries against antibodies can overcome this issue. Here, we applied conventional sequence analysis and methods inspired by natural language processing to reveal specific sequence patterns in the linear epitopes deposited in the Immune Epitope Database (www.iedb.org) that can serve as building blocks for the design of universal epitope libraries. Our results reveal that amino acid frequency in annotated linear epitopes differs from that in the human proteome. Aromatic residues are overrepresented, while cysteines are practically absent from epitopes. Byte pair encoding tokenization shows high frequencies of tryptophan in tokens of 5, 6, and 7 amino acids, corroborating the findings of the conventional sequence analysis. These results can be applied to reduce the diversity of linear epitope libraries by orders of magnitude.
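The byte pair encoding step mentioned above can be sketched as follows. This is a generic, simplified BPE learner applied to peptide strings, not the authors' pipeline; the function name `learn_bpe` and the toy sequences are hypothetical and are not taken from the Immune Epitope Database.

```python
from collections import Counter


def learn_bpe(sequences, num_merges):
    """Learn BPE merges on peptide strings.

    Each sequence starts as a list of single-residue tokens. On every
    round the most frequent adjacent token pair is merged into one token.
    Returns the list of merged pairs and the re-tokenized corpus.
    """
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break  # every sequence has collapsed to a single token
        # Break ties on the pair itself so the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        merged = best[0] + best[1]
        new_corpus = []
        for tokens in corpus:
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus


# Toy peptides (hypothetical, for illustration only).
merges, tokenized = learn_bpe(["WYWY", "WYAA"], num_merges=1)
print(merges)
print(tokenized)
```

After enough merges, frequent multi-residue motifs become single tokens, so counting residues inside the longer tokens is one way to see which amino acids dominate recurring patterns.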
Affiliation(s)
- Elena Lopez-Martinez
- Centre for Cooperative Research in Biomaterials (CIC biomaGUNE), Basque Research and Technology Alliance (BRTA), Paseo de Miramón 194, Donostia-San Sebastián, 20014 Spain
| | - Aitor Manteca
- Centre for Cooperative Research in Biomaterials (CIC biomaGUNE), Basque Research and Technology Alliance (BRTA), Paseo de Miramón 194, Donostia-San Sebastián, 20014 Spain
| | - Noelia Ferruz
- Molecular Biology Institute of Barcelona (IBMB-CSIC), Barcelona Science Park, Baldiri Reixac, 15-21, 08028, Barcelona, Spain
| | - Aitziber L. Cortajarena
- Centre for Cooperative Research in Biomaterials (CIC biomaGUNE), Basque Research and Technology Alliance (BRTA), Paseo de Miramón 194, Donostia-San Sebastián, 20014 Spain
- IKERBASQUE, Basque Foundation for Science, Plaza Euskadi 5, 48009 Bilbao, Spain
| |
|
164
|
Xie WJ, Warshel A. Harnessing Generative AI to Decode Enzyme Catalysis and Evolution for Enhanced Engineering. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.10.10.561808. [PMID: 37873334 PMCID: PMC10592750 DOI: 10.1101/2023.10.10.561808] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2023]
Abstract
Enzymes, as paramount protein catalysts, occupy a central role in fostering remarkable progress across numerous fields. However, the intricacy of sequence-function relationships continues to obscure our grasp of enzyme behaviors and curtails our capabilities in rational enzyme engineering. Generative artificial intelligence (AI), known for its proficiency in handling intricate data distributions, holds the potential to offer novel perspectives in enzyme research. By applying generative models, we could discern elusive patterns within the vast sequence space and uncover new functional enzyme sequences. This review highlights the recent advancements in employing generative AI for enzyme sequence analysis. We delve into the impact of generative AI in predicting mutation effects on enzyme fitness, activity, and stability, rationalizing the laboratory evolution of de novo enzymes, decoding protein sequence semantics, and its applications in enzyme engineering. Notably, the prediction of enzyme activity and stability using natural enzyme sequences serves as a vital link, indicating how enzyme catalysis shapes enzyme evolution. Overall, we foresee that the integration of generative AI into enzyme studies will remarkably enhance our knowledge of enzymes and expedite the creation of superior biocatalysts.
Affiliation(s)
- Wen Jun Xie
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
- Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development (CNPD3), Genetics Institute, University of Florida, Gainesville, FL, USA
| | - Arieh Warshel
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
| |
|
165
|
Meador K, Castells-Graells R, Aguirre R, Sawaya MR, Arbing MA, Sherman T, Senarathne C, Yeates TO. A Suite of Designed Protein Cages Using Machine Learning Algorithms and Protein Fragment-Based Protocols. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.10.09.561468. [PMID: 37873110 PMCID: PMC10592684 DOI: 10.1101/2023.10.09.561468] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2023]
Abstract
Designed protein cages and related materials provide unique opportunities for applications in biotechnology and medicine, while methods for their creation remain challenging and unpredictable. In the present study, we apply new computational approaches to design a suite of new tetrahedrally symmetric, self-assembling protein cages. For the generation of docked poses, we emphasize a protein fragment-based approach, while for de novo interface design, a comparison of computational protocols highlights the power and increased experimental success achieved using the machine learning program ProteinMPNN. In relating information from docking and design, we observe that agreement between fragment-based sequence preferences and ProteinMPNN sequence inference correlates with experimental success. Additional insights for designing polar interactions are highlighted by experimentally testing larger and more polar interfaces. In all, using X-ray crystallography and cryo-EM, we report five structures for seven protein cages, with atomic resolution in the best case reaching 2.0 Å. We also report structures of two incompletely assembled protein cages, providing unique insights into one type of assembly failure. The new set of designed cages and their structures add substantially to the body of available protein nanoparticles, and to methodologies for their creation.
Affiliation(s)
- Kyle Meador
- Department of Chemistry and Biochemistry, University of California, Los Angeles, CA, USA 90095
| | | | - Roman Aguirre
- Department of Chemistry and Biochemistry, University of California, Los Angeles, CA, USA 90095
| | - Michael R. Sawaya
- UCLA-DOE Institute for Genomics and Proteomics, Los Angeles, CA, USA 90095
| | - Mark A. Arbing
- UCLA-DOE Institute for Genomics and Proteomics, Los Angeles, CA, USA 90095
| | - Trent Sherman
- Department of Chemistry and Biochemistry, University of California, Los Angeles, CA, USA 90095
| | - Chethaka Senarathne
- Department of Chemistry and Biochemistry, University of California, Los Angeles, CA, USA 90095
| | - Todd O. Yeates
- Department of Chemistry and Biochemistry, University of California, Los Angeles, CA, USA 90095
- UCLA-DOE Institute for Genomics and Proteomics, Los Angeles, CA, USA 90095
| |
|
166
|
Williams DO, Fadda E. Can ChatGPT pass Glycobiology? Glycobiology 2023; 33:606-614. [PMID: 37531256 DOI: 10.1093/glycob/cwad064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2023] [Revised: 06/30/2023] [Accepted: 07/12/2023] [Indexed: 08/04/2023] Open
Abstract
The release of text-generating applications based on interactive Large Language Models (LLMs) in late 2022 triggered an unprecedented and ever-growing interest worldwide. The almost instantaneous success of LLMs stimulated lively discussions in public media and in academic fora alike not only on the value and potentials of such tools in all areas of knowledge and information acquisition and distribution but also on the dangers posed by their uncontrolled and indiscriminate use. This conversation is now particularly active in the higher education sector, where LLMs are seen as a potential threat to academic integrity at all levels, from facilitating cheating by students in assignments to plagiarizing academic writing in the case of researchers and administrators. Within this framework, we are interested in testing the boundaries of the LLM ChatGPT (www.openai.com) in areas of our scientific interest and expertise and in analyzing the results from different perspectives, i.e. of a final year BSc student, of a research scientist, and of a lecturer in higher education. To this end, in this paper, we present and discuss a systematic evaluation on how ChatGPT addresses progressively complex scientific writing tasks and exam-type questions in Carbohydrate Chemistry and Glycobiology. The results of this project allowed us to gain insight on: (i) the strengths and limitations of the ChatGPT model to provide relevant and (most importantly) correct scientific information, (ii) the format(s) and complexity of the query required to obtain the desired output, and (iii) strategies to integrate LLMs in teaching and learning.
Affiliation(s)
| | - Elisa Fadda
- Department of Chemistry
- Hamilton Institute, Maynooth University, Maynooth, Co. Kildare, Ireland
| |
|
167
|
Xie WJ, Liu D, Wang X, Zhang A, Wei Q, Nandi A, Dong S, Warshel A. Enhancing Luciferase Activity and Stability through Generative Modeling of Natural Enzyme Sequences. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.09.18.558367. [PMID: 37786693 PMCID: PMC10541610 DOI: 10.1101/2023.09.18.558367] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/04/2023]
Abstract
The availability of natural protein sequences synergized with generative artificial intelligence (AI) provides new paradigms to create enzymes. Although active enzyme variants with numerous mutations have been produced using generative models, their performance often falls short compared to their wild-type counterparts. Additionally, in practical applications, choosing fewer mutations that can rival the efficacy of extensive sequence alterations is usually more advantageous. Pinpointing beneficial single mutations continues to be a formidable task. In this study, using the generative maximum entropy model to analyze Renilla luciferase homologs, and in conjunction with biochemistry experiments, we demonstrated that natural evolutionary information could be used to predictively improve enzyme activity and stability by engineering the active center and protein scaffold, respectively. Roughly 50% of the designed single mutants improved either luciferase activity or stability. These findings highlight nature's ingenious approach to evolving proficient enzymes, wherein diverse evolutionary pressures are preferentially applied to distinct regions of the enzyme, ultimately culminating in an overall high performance. We also reveal an evolutionary preference in Renilla luciferase towards emitting blue light, which holds advantages in terms of water penetration compared to other parts of the light spectrum. Taken together, our approach facilitates navigation through enzyme sequence space and offers effective strategies for computer-aided rational enzyme engineering.
Affiliation(s)
- Wen Jun Xie
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
- Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development (CNPD3), Genetics Institute, University of Florida, Gainesville, FL, USA
| | - Dangliang Liu
- State Key Laboratory of Natural and Biomimetic Drugs, Chemical Biology Center, and School of Pharmaceutical Sciences, Peking University, Beijing, China
| | - Xiaoya Wang
- State Key Laboratory of Natural and Biomimetic Drugs, Chemical Biology Center, and School of Pharmaceutical Sciences, Peking University, Beijing, China
| | - Aoxuan Zhang
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
| | - Qijia Wei
- State Key Laboratory of Natural and Biomimetic Drugs, Chemical Biology Center, and School of Pharmaceutical Sciences, Peking University, Beijing, China
| | - Ashim Nandi
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
| | - Suwei Dong
- State Key Laboratory of Natural and Biomimetic Drugs, Chemical Biology Center, and School of Pharmaceutical Sciences, Peking University, Beijing, China
| | - Arieh Warshel
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
| |
|
168
|
Feng X, Chang R, Zhu H, Yang Y, Ji Y, Liu D, Qin H, Yin J, Rong H. Engineering Proteins for Cell Entry. Mol Pharm 2023; 20:4868-4882. [PMID: 37708383 DOI: 10.1021/acs.molpharmaceut.3c00467] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/16/2023]
Abstract
Proteins are essential for life, as they participate in all vital processes in the body. In the past decade, delivery of active proteins to specific cells and organs has attracted increasing interest. However, most proteins cannot enter the cytoplasm due to the cell membrane acting as a natural barrier. To overcome this challenge, various proteins have been engineered to acquire cell-penetrating capacity by mimicking or modifying natural shuttling proteins. In this review, we provide an overview of the different types of engineered cell-penetrating proteins such as cell-penetrating peptides, supercharged proteins, receptor-binding proteins, and bacterial toxins. We also discuss some strategies for improving endosomal escape such as pore formation, the proton sponge effect, and hijacking intracellular trafficking pathways. Finally, we introduce some novel methods and technologies for designing and detecting engineered cell-penetrating proteins.
Affiliation(s)
- Xiaoyu Feng
- Jiangsu Key Laboratory of Druggability of Biopharmaceuticals and State Key Laboratory of Natural Medicines, School of Life Science and Technology, China Pharmaceutical University, Nanjing 210009, China
| | - Ruilong Chang
- Jiangsu Key Laboratory of Druggability of Biopharmaceuticals and State Key Laboratory of Natural Medicines, School of Life Science and Technology, China Pharmaceutical University, Nanjing 210009, China
| | - Haichao Zhu
- Jiangsu Key Laboratory of Druggability of Biopharmaceuticals and State Key Laboratory of Natural Medicines, School of Life Science and Technology, China Pharmaceutical University, Nanjing 210009, China
| | - Yifan Yang
- Jiangsu Key Laboratory of Druggability of Biopharmaceuticals and State Key Laboratory of Natural Medicines, School of Life Science and Technology, China Pharmaceutical University, Nanjing 210009, China
| | - Yue Ji
- Jiangsu Key Laboratory of Druggability of Biopharmaceuticals and State Key Laboratory of Natural Medicines, School of Life Science and Technology, China Pharmaceutical University, Nanjing 210009, China
| | - Dingkang Liu
- Jiangsu Key Laboratory of Druggability of Biopharmaceuticals and State Key Laboratory of Natural Medicines, School of Life Science and Technology, China Pharmaceutical University, Nanjing 210009, China
| | - Hai Qin
- Department of Clinical Laboratory, Beijing Jishuitan Hospital Guizhou Hospital, No. 206, Sixian Street, Baiyun District, Guiyang, Guizhou 550014, China
| | - Jun Yin
- Jiangsu Key Laboratory of Druggability of Biopharmaceuticals and State Key Laboratory of Natural Medicines, School of Life Science and Technology, China Pharmaceutical University, Nanjing 210009, China
| | - Haibo Rong
- Jiangsu Cancer Hospital & Jiangsu Institute of Cancer Research, The Affiliated Cancer Hospital of Nanjing Medical University, Nanjing 210009, China
| |
|
169
|
Zhou Y, Huang Z, Li W, Wei J, Jiang Q, Yang W, Huang J. Deep learning in preclinical antibody drug discovery and development. Methods 2023; 218:57-71. [PMID: 37454742 DOI: 10.1016/j.ymeth.2023.07.003] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 11/22/2022] [Revised: 03/20/2023] [Accepted: 07/10/2023] [Indexed: 07/18/2023] Open
Abstract
Antibody drugs have become a key part of biotherapeutics. Patients suffering from various diseases have benefited from antibody therapies. However, their development process is rather long, expensive and risky. To speed up the process, reduce cost and improve success rate, artificial intelligence, especially deep learning, has been widely used in all aspects of preclinical antibody drug development, from library generation to hit identification, developability screening, lead selection and optimization. In this review, we systematically summarize antibody encodings, deep learning architectures and models used in preclinical antibody drug discovery and development. We also critically discuss challenges and opportunities, problems and possible solutions, current applications and future directions of deep learning in antibody drug development.
Affiliation(s)
- Yuwei Zhou
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Ziru Huang
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Wenzhen Li
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Jinyi Wei
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Qianhu Jiang
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Wei Yang
- School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Jian Huang
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| |
|
170
|
D'Alessandro W, Lloyd HR, Sharadin N. Large Language Models and Biorisk. THE AMERICAN JOURNAL OF BIOETHICS : AJOB 2023; 23:115-118. [PMID: 37812092 DOI: 10.1080/15265161.2023.2250333] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2023]
|
171
|
Gomes B, Ashley EA. Artificial Intelligence in Molecular Medicine. Reply. N Engl J Med 2023; 389:1252. [PMID: 37754302 DOI: 10.1056/nejmc2308776] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/28/2023]
|
172
|
Faiz M, Khan SJ, Azim F, Ejaz N. Disclosing the locale of transmembrane proteins within cellular alcove by machine learning approach: systematic review and meta analysis. J Biomol Struct Dyn 2023:1-16. [PMID: 37768108 DOI: 10.1080/07391102.2023.2260490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2023] [Accepted: 09/13/2023] [Indexed: 09/29/2023]
Abstract
Protein subcellular localization is a promising research question in Proteomics and associated fields, including Biological Sciences, Biomedical Engineering, Computational Biology, Bioinformatics, Artificial Intelligence, and Biophysics. However, computational techniques are preferred to explore this attribute for a massive number of proteins. The byproduct of this conjunction yields diversified location identifiers of proteins. These protein subcellular localization identifiers are unique regarding the database used, organisms, Machine Learning Technique, and accuracy. Despite the availability of these identifiers, the majority of the work has been done on the subcellular localization of proteins, and less work has been done specifically on locations of transmembrane proteins. This systematic review accounts for computational techniques implemented on transmembrane protein localization. Moreover, a literature search on PubMed, Science Direct, and IEEE Databases disclosed no systematic review or meta-analysis on the cell's transmembrane protein locale. A systematic review was conducted under the guidelines of PRISMA by using Science Direct, PubMed, and IEEE Databases. Journal publications from 2000 to 2023 were taken into consideration and screened. This review has focused only on computational studies rather than experimental techniques. 1004 studies were reviewed and were categorized as relevant and non-relevant according to inclusion and exclusion criteria. All the screening was done through Endnote after importing citations. This systematic review characterizes the gap in targeting the locale of the transmembrane protein and will aid researchers in exploring its new horizons. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Mehwish Faiz
- Department of Biomedical Engineering, Ziauddin University (FESTM), Karachi, Pakistan
- Department of Electrical Engineering, Ziauddin University, (FESTM), Karachi, Pakistan
| | - Saad Jawaid Khan
- Department of Biomedical Engineering, Ziauddin University (FESTM), Karachi, Pakistan
| | - Fahad Azim
- Department of Electrical Engineering, Ziauddin University, (FESTM), Karachi, Pakistan
| | - Nazia Ejaz
- Balochistan University of Engineering and Technology, Khuzdar, Pakistan
| |
|
173
|
Liu T, Gao H, Ren X, Xu G, Liu B, Wu N, Luo H, Wang Y, Tu T, Yao B, Guan F, Teng Y, Huang H, Tian J. Protein-protein interaction and site prediction using transfer learning. Brief Bioinform 2023; 24:bbad376. [PMID: 37870286 DOI: 10.1093/bib/bbad376] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Revised: 09/14/2023] [Accepted: 10/02/2023] [Indexed: 10/24/2023] Open
Abstract
Advanced language models have enabled us to recognize protein-protein interactions (PPIs) and interaction sites using protein sequences or structures. Here, we trained the MindSpore ProteinBERT (MP-BERT) model, a Bidirectional Encoder Representation from Transformers, using protein pairs as inputs, making it suitable for identifying PPIs and their respective interaction sites. The pretrained model (MP-BERT) was fine-tuned as MPB-PPI (MP-BERT on PPI) and demonstrated its superiority over the state-of-the-art models on diverse benchmark datasets for predicting PPIs. Moreover, the model's capability to recognize PPIs was evaluated across multiple organisms. An amalgamated organism model was designed, exhibiting a high level of generalization across the majority of organisms and attaining an accuracy of 92.65%. The model was also customized to predict interaction site propensity by fine-tuning it with PPI site data as MPB-PPISP. Our method facilitates the prediction of both PPIs and their interaction sites, thereby illustrating the potency of transfer learning in dealing with the protein pair task.
Affiliation(s)
- Tuoyu Liu
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
| | - Han Gao
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China
| | - Xiaopu Ren
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
| | - Guoshun Xu
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China
| | - Bo Liu
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
| | - Ningfeng Wu
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
| | - Huiying Luo
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China
| | - Yuan Wang
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China
| | - Tao Tu
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China
| | - Bin Yao
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China
| | - Feifei Guan
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
| | - Yue Teng
- State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Academy of Military Medical Sciences, Beijing 100071, China
| | - Huoqing Huang
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China
| | - Jian Tian
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing 100193, China
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
| |
|
174
|
Mantena S, Pillai PP, Petros BA, Welch NL, Myhrvold C, Sabeti PC, Metsky HC. Model-directed generation of CRISPR-Cas13a guide RNAs designs artificial sequences that improve nucleic acid detection. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.09.20.557569. [PMID: 37786711 PMCID: PMC10541601 DOI: 10.1101/2023.09.20.557569] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/04/2023]
Abstract
Generating maximally-fit biological sequences has the potential to transform CRISPR guide RNA design as it has other areas of biomedicine. Here, we introduce model-directed exploration algorithms (MEAs) for designing maximally-fit, artificial CRISPR-Cas13a guides (with multiple mismatches to any natural sequence) that are tailored for desired properties around nucleic acid diagnostics. We find that MEA-designed guides offer more sensitive detection of diverse pathogens and discrimination of pathogen variants compared to guides derived directly from natural sequences, and illuminate interpretable design principles that broaden Cas13a targeting.
Affiliation(s)
- Sreekar Mantena
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Statistics, Harvard University, Cambridge, MA, USA
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, USA
| | | | - Brittany A. Petros
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Division of Health Sciences and Technology, Harvard Medical School and Massachusetts Institute of Technology, Cambridge, MA, USA
- Harvard/Massachusetts Institute of Technology, MD-PhD Program, Boston, MA, USA
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
| | | | - Cameron Myhrvold
- Department of Molecular Biology, Princeton University, Princeton, NJ, USA
| | - Pardis C. Sabeti
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA
- Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | | |
|
175
|
Roche R, Moussad B, Shuvo MH, Tarafder S, Bhattacharya D. EquiPNAS: improved protein-nucleic acid binding site prediction using protein-language-model-informed equivariant deep graph neural networks. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.09.14.557719. [PMID: 37745556 PMCID: PMC10515942 DOI: 10.1101/2023.09.14.557719] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/26/2023]
Abstract
Protein language models (pLMs) trained on a large corpus of protein sequences have shown unprecedented scalability and broad generalizability in a wide range of predictive modeling tasks, but their power has not yet been harnessed for predicting protein-nucleic acid binding sites, critical for characterizing the interactions between proteins and nucleic acids. Here we present EquiPNAS, a new pLM-informed E(3) equivariant deep graph neural network framework for improved protein-nucleic acid binding site prediction. By combining the strengths of pLM and symmetry-aware deep graph learning, EquiPNAS consistently outperforms the state-of-the-art methods for both protein-DNA and protein-RNA binding site prediction on multiple datasets across a diverse set of predictive modeling scenarios ranging from using experimental input to AlphaFold2 predictions. Our ablation study reveals that the pLM embeddings used in EquiPNAS are sufficiently powerful to dramatically reduce the dependence on the availability of evolutionary information without compromising on accuracy, and that the symmetry-aware nature of the E(3) equivariant graph-based neural architecture offers remarkable robustness and performance resilience. EquiPNAS is freely available at https://github.com/Bhattacharya-Lab/EquiPNAS.
Affiliation(s)
- Rahmatullah Roche
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, United States of America
| | - Bernard Moussad
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, United States of America
| | - Md Hossain Shuvo
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, United States of America
| | - Sumit Tarafder
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, United States of America
| | - Debswapna Bhattacharya
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, United States of America
| |
|
176
|
Wang Y, Lv H, Lei R, Yeung YH, Shen IR, Choi D, Teo QW, Tan TJ, Gopal AB, Chen X, Graham CS, Wu NC. An explainable language model for antibody specificity prediction using curated influenza hemagglutinin antibodies. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.09.11.557288. [PMID: 37745338 PMCID: PMC10515799 DOI: 10.1101/2023.09.11.557288] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/26/2023]
Abstract
Despite decades of antibody research, it remains challenging to predict the specificity of an antibody solely based on its sequence. Two major obstacles are the lack of appropriate models and inaccessibility of datasets for model training. In this study, we curated a dataset of >5,000 influenza hemagglutinin (HA) antibodies by mining research publications and patents, which revealed many distinct sequence features between antibodies to HA head and stem domains. We then leveraged this dataset to develop a lightweight memory B cell language model (mBLM) for sequence-based antibody specificity prediction. Model explainability analysis showed that mBLM captured key sequence motifs of HA stem antibodies. Additionally, by applying mBLM to HA antibodies with unknown epitopes, we discovered and experimentally validated many HA stem antibodies. Overall, this study not only advances our molecular understanding of antibody response to influenza virus, but also provides an invaluable resource for applying deep learning to antibody research.
Affiliation(s)
- Yiquan Wang
- Department of Biochemistry, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Huibin Lv
- Department of Biochemistry, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Ruipeng Lei
- Department of Biochemistry, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Yuen-Hei Yeung
- Department of Biochemistry, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR, China
| | - Ivana R. Shen
- Department of Biochemistry, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Danbi Choi
- Department of Biochemistry, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Qi Wen Teo
- Department of Biochemistry, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Timothy J.C. Tan
- Center for Biophysics and Quantitative Biology, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Akshita B. Gopal
- Department of Biochemistry, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Xin Chen
- Center for Biophysics and Quantitative Biology, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Claire S. Graham
- Department of Biochemistry, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Nicholas C. Wu
- Department of Biochemistry, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Carl R. Woese Institute for Genomic Biology, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Center for Biophysics and Quantitative Biology, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Carle Illinois College of Medicine, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| |
|
177
|
Chakraborty C, Bhattacharya M, Lee SS. Artificial intelligence enabled ChatGPT and large language models in drug target discovery, drug discovery, and development. MOLECULAR THERAPY. NUCLEIC ACIDS 2023; 33:866-868. [PMID: 37680991 PMCID: PMC10481150 DOI: 10.1016/j.omtn.2023.08.009] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Indexed: 09/09/2023]
Affiliation(s)
- Chiranjib Chakraborty
- Department of Biotechnology, School of Life Science and Biotechnology, Adamas University, Kolkata, West Bengal 700126, India
| | - Manojit Bhattacharya
- Department of Zoology, Fakir Mohan University, Vyasa Vihar, Balasore, Odisha 756020, India
| | - Sang-Soo Lee
- Institute for Skeletal Aging & Orthopaedic Surgery, Hallym University-Chuncheon Sacred Heart Hospital, Chuncheon-si, Gangwon-do 24252, Republic of Korea
| |
|
178
|
Pan Y, Ren H, Lan L, Li Y, Huang T. Review of Predicting Synergistic Drug Combinations. Life (Basel) 2023; 13:1878. [PMID: 37763281 PMCID: PMC10533134 DOI: 10.3390/life13091878] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/10/2023] [Revised: 08/31/2023] [Accepted: 09/05/2023] [Indexed: 09/29/2023]
Abstract
The prediction of drug combinations is of great clinical significance. In many diseases, such as high blood pressure, diabetes, and stomach ulcers, the simultaneous use of two or more drugs has shown clear efficacy and has greatly slowed the progression of drug resistance. This review presents the latest applications of methods for predicting the effects of drug combinations and the bioactivity databases commonly used in drug combination prediction. These studies have played a significant role in developing precision therapy. We first describe the concept of synergy and then survey various publicly available databases for drug combination prediction tasks. Next, we introduce five classes of algorithms applied to drug combination prediction: traditional machine learning methods, deep learning methods, mathematical methods, systems biology methods and search algorithms. Finally, we summarize the difficulties encountered by prediction models.
Affiliation(s)
- Yichen Pan
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China; (Y.P.); (H.R.)
| | - Haotian Ren
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China; (Y.P.); (H.R.)
| | - Liang Lan
- Department of Interactive Media, Hong Kong Baptist University, Hong Kong, China;
| | - Yixue Li
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China; (Y.P.); (H.R.)
- Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310024, China
- Guangzhou Laboratory, Guangzhou 510005, China
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai 200433, China
| | - Tao Huang
- Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China; (Y.P.); (H.R.)
| |
|
179
|
Zhang H, Zhang L, Lin A, Xu C, Li Z, Liu K, Liu B, Ma X, Zhao F, Jiang H, Chen C, Shen H, Li H, Mathews DH, Zhang Y, Huang L. Algorithm for optimized mRNA design improves stability and immunogenicity. Nature 2023; 621:396-403. [PMID: 37130545 PMCID: PMC10499610 DOI: 10.1038/s41586-023-06127-z] [Citation(s) in RCA: 58] [Impact Index Per Article: 58.0] [Received: 03/12/2022] [Accepted: 04/25/2023] [Indexed: 05/04/2023]
Abstract
Messenger RNA (mRNA) vaccines are being used to combat the spread of COVID-19 (refs. 1-3), but they still exhibit critical limitations caused by mRNA instability and degradation, which are major obstacles for the storage, distribution and efficacy of the vaccine products (ref. 4). Increasing secondary structure lengthens mRNA half-life, which, together with optimal codons, improves protein expression (ref. 5). Therefore, a principled mRNA design algorithm must optimize both structural stability and codon usage. However, owing to synonymous codons, the mRNA design space is prohibitively large; for example, there are around 2.4 × 10^632 candidate mRNA sequences for the SARS-CoV-2 spike protein. This poses insurmountable computational challenges. Here we provide a simple and unexpected solution using the classical concept of lattice parsing in computational linguistics, where finding the optimal mRNA sequence is analogous to identifying the most likely sentence among similar-sounding alternatives (ref. 6). Our algorithm LinearDesign finds an optimal mRNA design for the spike protein in just 11 minutes, and can concurrently optimize stability and codon usage. LinearDesign substantially improves mRNA half-life and protein expression, and profoundly increases antibody titre by up to 128 times in mice compared to the codon-optimization benchmark on mRNA vaccines for COVID-19 and varicella-zoster virus. This result reveals the great potential of principled mRNA design and enables the exploration of previously unreachable but highly stable and efficient designs. Our work is a timely tool for vaccines and other mRNA-based medicines encoding therapeutic proteins such as monoclonal antibodies and anti-cancer drugs (refs. 7, 8).
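The design-space size mentioned in this abstract arises because most amino acids are encoded by several synonymous codons, so the number of mRNA coding sequences for a protein is the product of the per-residue codon counts. The following is a minimal sketch of that arithmetic only, assuming the standard genetic code and ignoring the stop codon and untranslated regions. The names `CODONS_PER_AA` and `design_space_size` are illustrative and are not part of LinearDesign, and the sketch does not attempt to reproduce the figure quoted for the spike protein.

```python
from math import prod

# Number of synonymous codons per amino acid in the standard genetic code
# (61 sense codons in total; the 3 stop codons are not included).
CODONS_PER_AA = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def design_space_size(protein: str) -> int:
    """Count the distinct coding sequences that translate to `protein`."""
    return prod(CODONS_PER_AA[aa] for aa in protein.upper())


# A four-residue toy peptide: 1 (M) * 6 (L) * 6 (S) * 6 (R) = 216 sequences.
print(design_space_size("MLSR"))
```

Because the count is a product over residues, it grows exponentially with protein length, which is why exhaustive enumeration is infeasible for a protein of more than a thousand residues.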
Affiliation(s)
- He Zhang
- Baidu Research USA, Sunnyvale, CA, USA
- School of EECS, Oregon State University, Corvallis, OR, USA
| | - Liang Zhang
- Baidu Research USA, Sunnyvale, CA, USA
- School of EECS, Oregon State University, Corvallis, OR, USA
- Vaccine Center, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing, China
| | - Ang Lin
- StemiRNA Therapeutics, Shanghai, China
- Vaccine Center, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing, China
| | | | - Ziyu Li
- Baidu Research USA, Sunnyvale, CA, USA
| | - Kaibo Liu
- Baidu Research USA, Sunnyvale, CA, USA
- School of EECS, Oregon State University, Corvallis, OR, USA
| | - Boxiang Liu
- Baidu Research USA, Sunnyvale, CA, USA
- Department of Pharmacy, National University of Singapore, Singapore, Singapore
| | | | | | | | | | | | | | - David H Mathews
- Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, USA.
- Center for RNA Biology, University of Rochester Medical Center, Rochester, NY, USA.
- Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, NY, USA.
- Coderna.ai, Inc., Sunnyvale, CA, USA.
| | - Yujian Zhang
- StemiRNA Therapeutics, Shanghai, China.
- , Gaithersburg, MD, USA.
| | - Liang Huang
- Baidu Research USA, Sunnyvale, CA, USA.
- School of EECS, Oregon State University, Corvallis, OR, USA.
- Coderna.ai, Inc., Sunnyvale, CA, USA.
| |
|
180
|
Matthews CJ, Patrick WM. An enzyme-centric approach for constructing an amperometric L-malate biosensor with a long and programmable linear range. Protein Sci 2023; 32:e4743. [PMID: 37515423 PMCID: PMC10451018 DOI: 10.1002/pro.4743] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Revised: 07/22/2023] [Accepted: 07/26/2023] [Indexed: 07/30/2023]
Abstract
L-Malate is a key flavor enhancer and acidulant in the food and beverage industry, particularly winemaking. Enzyme-based amperometric biosensors offer convenience for monitoring its concentration. However, only a small number of off-the-shelf malate-oxidizing enzymes have been used in previous devices. These typically have linear ranges poorly suited for the L-malate concentrations found in fruit processing and winemaking, making it necessary to use precisely diluted samples. Here, we describe a pipeline of database-mining, gene synthesis, recombinant expression, and spectrophotometric assays to characterize previously untested enzymes for their suitability in biosensors. The pipeline yielded a bespoke biocatalyst: the Ascaris suum malic enzyme carrying mutation R181Q [AsME(R181Q)]. Our first prototype with AsME(R181Q) had an ultra-wide linear range of 50-200 mM L-malate, corresponding to concentrations found in undiluted fruit juices (including grape). Changing the dication from Mg²⁺ to Mn²⁺ increased sensitivity five-fold and adding citrate (100 mM) increased it another six-fold, albeit decreasing the linear range to 1-10 mM. To our knowledge, this is the first time an L-malate biosensor with a tuneable combination of sensitivity and linear range has been described. The sensor response was also tested in the presence of various molecules abundant in juices and wines, with ascorbate shown to be a potent interferent. Interference was mitigated by the addition of ascorbate oxidase, allowing for differential measurements on an undiluted, untreated wine sample that corresponded well with commercial L-malate testing kits. Overall, this work demonstrates the power of an enzyme-centric approach for designing electrochemical biosensors with improved operational parameters and novel functionality.
Affiliation(s)
- Christopher J. Matthews
- Centre for Biodiscovery, School of Biological Sciences, Victoria University of Wellington, Wellington, New Zealand
| | - Wayne M. Patrick
- Centre for Biodiscovery, School of Biological Sciences, Victoria University of Wellington, Wellington, New Zealand
| |
|
181
|
Scalzitti N, Miralavy I, Korenchan DE, Farrar CT, Gilad AA, Banzhaf W. Computational Peptide Discovery with a Genetic Programming Approach. RESEARCH SQUARE 2023:rs.3.rs-3307450. [PMID: 37693481 PMCID: PMC10491332 DOI: 10.21203/rs.3.rs-3307450/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/12/2023]
Abstract
Background: The development of peptides for therapeutic targets or biomarkers for disease diagnosis is a challenging task in protein engineering. Current approaches are tedious, often time-consuming and require complex laboratory data due to the vast search space. In silico methods can accelerate research and substantially reduce costs. Evolutionary algorithms are a promising approach for exploring large search spaces and facilitating the discovery of new peptides. Results: This study presents the development and use of a variant of the initial POET algorithm, called POETRegex, which is based on genetic programming, where individuals are represented by a list of regular expressions. The program was trained on a small curated dataset and employed to predict new peptides that can improve the problem of sensitivity in detecting peptides through magnetic resonance imaging using chemical exchange saturation transfer (CEST). The resulting model achieves a performance gain of 20% over the initial POET variant and is able to predict a candidate peptide with a 58% performance increase compared to the gold-standard peptide. Conclusions: By combining the power of genetic programming with the flexibility of regular expressions, new potential peptide targets were identified to improve the sensitivity of detection by CEST. This approach provides a promising research direction for the efficient identification of peptides with therapeutic or diagnostic potential.
Affiliation(s)
- Nicolas Scalzitti
- BEACON Center of Evolution in Action, Michigan State University, East Lansing, MI, USA
- Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
| | - Iliya Miralavy
- BEACON Center of Evolution in Action, Michigan State University, East Lansing, MI, USA
- Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
| | - David E. Korenchan
- Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
| | - Christian T. Farrar
- Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
| | - Assaf A. Gilad
- BEACON Center of Evolution in Action, Michigan State University, East Lansing, MI, USA
- Department of Chemical Engineering, Michigan State University, East Lansing, MI, USA
- Department of Radiology, Michigan State University, East Lansing, MI, USA
| | - Wolfgang Banzhaf
- BEACON Center of Evolution in Action, Michigan State University, East Lansing, MI, USA
- Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
| |
|
182
|
Jiang H, Jude KM, Wu K, Fallas J, Ueda G, Brunette TJ, Hicks D, Pyles H, Yang A, Carter L, Lamb M, Li X, Levine PM, Stewart L, Garcia KC, Baker D. De novo design of buttressed loops for sculpting protein functions. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.08.22.554384. [PMID: 37662224 PMCID: PMC10473674 DOI: 10.1101/2023.08.22.554384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2023]
Abstract
In natural proteins, structured loops play central roles in molecular recognition, signal transduction and enzyme catalysis. However, because of the intrinsic flexibility and irregularity of loop regions, organizing multiple structured loops at protein functional sites has been very difficult to achieve by de novo protein design. Here we describe a solution to this problem that generates structured loops buttressed by extensive hydrogen bonding interactions with two neighboring loops and with secondary structure elements. We use this approach to design tandem repeat proteins with buttressed loops ranging from 9 to 14 residues in length. Experimental characterization shows the designs are folded and monodisperse, highly soluble, and thermally stable. Crystal structures are in close agreement with the computational design models, with the loops structured and buttressed by their neighbors as designed. We demonstrate the functionality afforded by loop buttressing by designing and characterizing binders for extended peptides in which the loops form one side of an extended binding pocket. The ability to design multiple structured loops should contribute quite generally to efforts to design new protein functions.
Affiliation(s)
- Hanlun Jiang
- Department of Biochemistry, University of Washington
- Institute for Protein Design, University of Washington
| | - Kevin M Jude
- Howard Hughes Medical Institute, Stanford University School of Medicine
| | - Kejia Wu
- Department of Biochemistry, University of Washington
- Institute for Protein Design, University of Washington
- Biological Physics, Structure and Design Graduate Program, University of Washington
| | - Jorge Fallas
- Department of Biochemistry, University of Washington
- Institute for Protein Design, University of Washington
| | - George Ueda
- Department of Biochemistry, University of Washington
- Institute for Protein Design, University of Washington
| | - T J Brunette
- Department of Biochemistry, University of Washington
- Institute for Protein Design, University of Washington
| | - Derrick Hicks
- Department of Biochemistry, University of Washington
- Institute for Protein Design, University of Washington
| | - Harley Pyles
- Department of Biochemistry, University of Washington
- Institute for Protein Design, University of Washington
| | - Aerin Yang
- Department of Molecular and Cellular Physiology, Stanford University School of Medicine
| | - Lauren Carter
- Department of Biochemistry, University of Washington
- Institute for Protein Design, University of Washington
| | - Mila Lamb
- Department of Biochemistry, University of Washington
- Institute for Protein Design, University of Washington
| | - Xinting Li
- Department of Biochemistry, University of Washington
- Institute for Protein Design, University of Washington
| | - Paul M Levine
- Department of Biochemistry, University of Washington
- Institute for Protein Design, University of Washington
| | - Lance Stewart
- Department of Biochemistry, University of Washington
- Institute for Protein Design, University of Washington
| | - K Christopher Garcia
- Howard Hughes Medical Institute, Stanford University School of Medicine
- Department of Molecular and Cellular Physiology, Stanford University School of Medicine
- Department of Structural Biology, Stanford University School of Medicine
| | - David Baker
- Department of Biochemistry, University of Washington
- Institute for Protein Design, University of Washington
- Howard Hughes Medical Institute, Stanford University School of Medicine
- Howard Hughes Medical Institute, University of Washington
| |
|
183
|
Abstract
A survey of protein databases indicates that the majority of enzymes exist in oligomeric forms, with about half of those found in the UniProt database being homodimeric. Understanding why so many enzymes function as dimers is therefore important. Recent developments in experimental and computational techniques have allowed for a deeper comprehension of the cooperative interactions between the subunits of dimeric enzymes. This review aims to succinctly summarize these recent advancements by providing an overview of experimental and theoretical methods, as well as an understanding of cooperativity in substrate binding and the molecular mechanisms of cooperative catalysis within homodimeric enzymes. We focus on the beneficial effects of dimerization and cooperative catalysis. These advancements not only provide essential case studies and theoretical support for comprehending dimeric enzyme catalysis but also serve as a foundation for designing highly efficient catalysts, such as dimeric organic catalysts. Moreover, these developments have significant implications for drug design, as exemplified by Paxlovid, which was designed for the homodimeric main protease of SARS-CoV-2.
Affiliation(s)
- Ke-Wei Chen
- Lab of Computational Chemistry and Drug Design, State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China
| | - Tian-Yu Sun
- Shenzhen Bay Laboratory, Shenzhen 518132, China
| | - Yun-Dong Wu
- Lab of Computational Chemistry and Drug Design, State Key Laboratory of Chemical Oncogenomics, Peking University Shenzhen Graduate School, Shenzhen 518055, China
- Shenzhen Bay Laboratory, Shenzhen 518132, China
| |
|
184
|
Yu T, Boob AG, Singh N, Su Y, Zhao H. In vitro continuous protein evolution empowered by machine learning and automation. Cell Syst 2023; 14:633-644. [PMID: 37224814 DOI: 10.1016/j.cels.2023.04.006] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 09/04/2022] [Revised: 11/19/2022] [Accepted: 04/20/2023] [Indexed: 05/26/2023]
Abstract
Directed evolution has become one of the most successful and powerful tools for protein engineering. However, the efforts required for designing, constructing, and screening a large library of variants can be laborious, time-consuming, and costly. With the recent advent of machine learning (ML) in the directed evolution of proteins, researchers can now evaluate variants in silico and guide a more efficient directed evolution campaign. Furthermore, recent advancements in laboratory automation have enabled the rapid execution of long, complex experiments for high-throughput data acquisition in both industrial and academic settings, thus providing the means to collect a large quantity of data required to develop ML models for protein engineering. In this perspective, we propose a closed-loop in vitro continuous protein evolution framework that leverages the best of both worlds, ML and automation, and provide a brief overview of the recent developments in the field.
Affiliation(s)
- Tianhao Yu
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, Urbana, IL, USA; NSF Molecule Maker Lab Institute, Urbana, IL, USA
| | - Aashutosh Girish Boob
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, Urbana, IL, USA; DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Nilmani Singh
- DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Yufeng Su
- NSF Molecule Maker Lab Institute, Urbana, IL, USA; Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Huimin Zhao
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, Urbana, IL, USA; NSF Molecule Maker Lab Institute, Urbana, IL, USA; DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
| |
|
185
|
Ahdritz G, Bouatta N, Kadyan S, Jarosch L, Berenberg D, Fisk I, Watkins AM, Ra S, Bonneau R, AlQuraishi M. OpenProteinSet: Training data for structural biology at scale. ARXIV 2023:arXiv:2308.05326v1. [PMID: 37608940 PMCID: PMC10441447] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/24/2023]
Abstract
Multiple sequence alignments (MSAs) of proteins encode rich biological information and have been workhorses in bioinformatic methods for tasks like protein design and protein structure prediction for decades. Recent breakthroughs like AlphaFold2 that use transformers to attend directly over large quantities of raw MSAs have reaffirmed their importance. Generation of MSAs is highly computationally intensive, however, and no datasets comparable to those used to train AlphaFold2 have been made available to the research community, hindering progress in machine learning for proteins. To remedy this problem, we introduce OpenProteinSet, an open-source corpus of more than 16 million MSAs, associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. We have previously demonstrated the utility of OpenProteinSet by successfully retraining AlphaFold2 on it. We expect OpenProteinSet to be broadly useful as training and validation data for 1) diverse tasks focused on protein structure, function, and design and 2) large-scale multimodal machine learning research.
Affiliation(s)
| | - Nazim Bouatta
- Laboratory of Systems Pharmacology, Harvard Medical School
| | | | | | - Daniel Berenberg
- Prescient Design, Genentech & Department of Computer Science, New York University
| | | | | | | | | | | |
|
186
|
Sun Y, Shen Y. Structure-Informed Protein Language Models are Robust Predictors for Variant Effects. RESEARCH SQUARE 2023:rs.3.rs-3219092. [PMID: 37577664 PMCID: PMC10418537 DOI: 10.21203/rs.3.rs-3219092/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/15/2023]
Abstract
Predicting protein variant effects through machine learning is often challenged by the scarcity of experimentally measured effect labels. Recently, protein language models (pLMs) have emerged as zero-shot predictors that need no effect labels, by modeling the evolutionary distribution of functional protein sequences. However, biological contexts important to variant effects are implicitly modeled and effectively marginalized. By assessing the sequence awareness and the structure awareness of pLMs, we find that their improvements often correlate with better variant effect prediction but their tradeoff can present a barrier, as observed in over-finetuning to specific family sequences. We introduce a framework of structure-informed pLMs (SI-pLMs) to inject protein structural contexts purposely and controllably, by extending masked sequence denoising in conventional pLMs to cross-modality denoising. Our SI-pLMs are applicable to revising any sequence-only pLMs through model architecture and training objectives. They do not require structure data as model inputs for variant effect prediction and only use structures as context provider and model regularizer during training. Numerical results over deep mutagenesis scanning benchmarks show that our SI-pLMs, despite relatively compact sizes, are robustly top performers against competing methods including other pLMs, regardless of the target protein family's evolutionary information content or the tendency toward overfitting or over-finetuning. Learned distributions in structural contexts could enhance sequence distributions in predicting variant effects. Ablation studies reveal major contributing factors, and analyses of sequence embeddings provide further insights. The data and scripts are available at https://github.com/Stephen2526/Structure-informed_PLM.git.
Affiliation(s)
- Yuanfei Sun
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, 77843, Texas, USA
| | - Yang Shen
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, 77843, Texas, USA
- Department of Computer Science and Engineering, Texas A&M University, College Station, 77843, Texas, USA
- Institute of Biosciences and Technology and Department of Translational Medical Sciences, Texas A&M University, Houston, 77030, Texas, USA
| |
|
187
|
Ekins S, Brackmann M, Invernizzi C, Lentzos F. Generative Artificial Intelligence-Assisted Protein Design Must Consider Repurposing Potential. GEN BIOTECHNOLOGY 2023; 2:296-300. [PMID: 37928405 PMCID: PMC10623615 DOI: 10.1089/genbio.2023.0025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2023]
Abstract
Generative artificial intelligence software used for chemical and protein design has repurposing potential. We propose careful discussion in the biotech community on security considerations of such technologies and serious consideration of restrictions to control who can access the software and what applications it is used for.
Affiliation(s)
- Sean Ekins
- Collaborations Pharmaceuticals, Inc., Raleigh, North Carolina, USA
| | - Maximilian Brackmann
- Spiez Laboratory, Federal Department of Defence, Civil Protection and Sports, Spiez, Switzerland
| | - Cédric Invernizzi
- Spiez Laboratory, Federal Department of Defence, Civil Protection and Sports, Spiez, Switzerland
| | - Filippa Lentzos
- Department of War Studies, King's College London, London, United Kingdom
- Department of Global Health and Social Medicine, King's College London, London, United Kingdom
| |
|
188
|
Belanger D, Colwell LJ. Hallucinating functional protein sequences. Nat Biotechnol 2023; 41:1073-1074. [PMID: 36702894 DOI: 10.1038/s41587-022-01634-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/27/2023]
Affiliation(s)
| | - Lucy J Colwell
- Google Research, Mountain View, CA, USA.
- Department of Chemistry, Cambridge University, Cambridge, UK.
| |
|
189
|
Rappazzo CG, Fernández-Quintero ML, Mayer A, Wu NC, Greiff V, Guthmiller JJ. Defining and Studying B Cell Receptor and TCR Interactions. JOURNAL OF IMMUNOLOGY (BALTIMORE, MD. : 1950) 2023; 211:311-322. [PMID: 37459189 PMCID: PMC10495106 DOI: 10.4049/jimmunol.2300136] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 02/24/2023] [Accepted: 04/15/2023] [Indexed: 07/20/2023]
Abstract
BCRs (Abs) and TCRs (or adaptive immune receptors [AIRs]) are the means by which the adaptive immune system recognizes foreign and self-antigens, playing an integral part in host defense, as well as the emergence of autoimmunity. Importantly, the interaction between AIRs and their cognate Ags defies a simple key-in-lock paradigm and is instead a complex many-to-many mapping between an individual's massively diverse AIR repertoire, and a similarly diverse antigenic space. Understanding how adaptive immunity balances specificity with epitopic coverage is a key challenge for the field, and terms such as broad specificity, cross-reactivity, and polyreactivity remain ill-defined and are used inconsistently. In this Immunology Notes and Resources article, a group of experimental, structural, and computational immunologists define commonly used terms associated with AIR binding, describe methodologies to study these binding modes, as well as highlight the implications of these different binding modes for therapeutic design.
Affiliation(s)
| | | | - Andreas Mayer
- Division of Infection and Immunity, University College London, London WC1E 6BT, UK
| | - Nicholas C. Wu
- Department of Biochemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Victor Greiff
- Department of Immunology, University of Oslo and Oslo University Hospital, 0372 Oslo, Norway
| | - Jenna J. Guthmiller
- Department of Immunology and Microbiology, University of Colorado Anschutz Medical Campus, Aurora, CO 80045
| |
|
190
|
Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med 2023; 29:1930-1940. [PMID: 37460753 DOI: 10.1038/s41591-023-02448-8] [Citation(s) in RCA: 335] [Impact Index Per Article: 335.0] [Received: 03/24/2023] [Accepted: 06/08/2023] [Indexed: 08/17/2023]
Abstract
Large language models (LLMs) can respond to free-text queries without being specifically trained in the task in question, causing excitement and concern about their use in healthcare settings. ChatGPT is a generative artificial intelligence (AI) chatbot produced through sophisticated fine-tuning of an LLM, and other tools are emerging through similar developmental processes. Here we outline how LLM applications such as ChatGPT are developed, and we discuss how they are being leveraged in clinical settings. We consider the strengths and limitations of LLMs and their potential to improve the efficiency and effectiveness of clinical, educational and research work in medicine. LLM chatbots have already been deployed in a range of biomedical contexts, with impressive but mixed results. This review acts as a primer for interested clinicians, who will determine if and how LLM technology is used in healthcare for the benefit of patients and practitioners.
Affiliation(s)
- Arun James Thirunavukarasu
- University of Cambridge School of Clinical Medicine, Cambridge, UK
- Corpus Christi College, University of Cambridge, Cambridge, UK
| | - Darren Shu Jeng Ting
- Academic Unit of Ophthalmology, Institute of Inflammation and Ageing, University of Birmingham, Birmingham, UK
- Birmingham and Midland Eye Centre, Birmingham, UK
- Academic Ophthalmology, School of Medicine, University of Nottingham, Nottingham, UK
| | - Kabilan Elangovan
- Artificial Intelligence and Digital Innovation Research Group, Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
| | - Laura Gutierrez
- Artificial Intelligence and Digital Innovation Research Group, Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
| | - Ting Fang Tan
- Artificial Intelligence and Digital Innovation Research Group, Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
- Department of Ophthalmology and Visual Sciences, Duke-National University of Singapore Medical School, Singapore, Singapore
| | - Daniel Shu Wei Ting
- Artificial Intelligence and Digital Innovation Research Group, Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore.
- Department of Ophthalmology and Visual Sciences, Duke-National University of Singapore Medical School, Singapore, Singapore.
- Byers Eye Institute, Stanford University, Palo Alto, CA, USA.
| |
|
191
|
Qiu Y, Wei GW. Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models. arXiv 2023:arXiv:2307.14587v1. [PMID: 37547662 PMCID: PMC10402185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/08/2023]
Abstract
Protein engineering is an emerging field in biotechnology that has the potential to revolutionize various areas, such as antibody design, drug discovery, food security, ecology, and more. However, the mutational space involved is too vast to be handled through experimental means alone. Leveraging accumulative protein databases, machine learning (ML) models, particularly those based on natural language processing (NLP), have considerably expedited protein engineering. Moreover, advances in topological data analysis (TDA) and artificial intelligence-based protein structure prediction, such as AlphaFold2, have made more powerful structure-based ML-assisted protein engineering strategies possible. This review aims to offer a comprehensive, systematic, and indispensable set of methodological components, including TDA and NLP, for protein engineering and to facilitate their future development.
Affiliation(s)
- Yuchi Qiu
- Department of Mathematics, Michigan State University, East Lansing, 48824, MI, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, 48824, MI, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, 48824, MI, USA
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, 48824, MI, USA
| |
|
192
|
Bao C, Lu C, Lin J, Gough J, Fang H. The dcGO Domain-Centric Ontology Database in 2023: New Website and Extended Annotations for Protein Structural Domains. J Mol Biol 2023; 435:168093. [PMID: 37061086 PMCID: PMC7614987 DOI: 10.1016/j.jmb.2023.168093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/26/2022] [Revised: 03/24/2023] [Accepted: 04/06/2023] [Indexed: 04/17/2023]
Abstract
Protein structural domains have been less studied than full-length proteins in terms of ontology annotations. The dcGO database has filled this gap by providing mappings from protein domains to ontologies. The dcGO update in 2023 extends annotations for protein domains of multiple definitions (SCOP, Pfam, and InterPro) with commonly used ontologies that are categorised into functions, phenotypes, diseases, drugs, pathways, regulators, and hallmarks. This update adds new dimensions to the utility of both ontology and protein domain resources. A newly designed website at http://www.protdomainonto.pro/dcGO offers a more centralised and user-friendly way to access the dcGO database, with enhanced faceted search returning term- and domain-specific information pages. Users can navigate both ontology terms and annotated domains through improved ontology hierarchy browsing. A newly added facility enables domain-based ontology enrichment analysis.
Affiliation(s)
- Chaohui Bao
- Shanghai Institute of Hematology, State Key Laboratory of Medical Genomics, National Research Center for Translational Medicine at Shanghai, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
| | - Chang Lu
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge CB2 0QH, UK; MRC London Institute of Medical Sciences, Imperial College London, London W12 0HS, UK
| | - James Lin
- High Performance Computing Center, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Julian Gough
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge CB2 0QH, UK
| | - Hai Fang
- Shanghai Institute of Hematology, State Key Laboratory of Medical Genomics, National Research Center for Translational Medicine at Shanghai, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China.
| |
|
193
|
Tholen MME, Tas RP, Wang Y, Albertazzi L. Beyond DNA: new probes for PAINT super-resolution microscopy. Chem Commun (Camb) 2023; 59:8332-8342. [PMID: 37306078 PMCID: PMC10318573 DOI: 10.1039/d3cc00757j] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 02/17/2023] [Accepted: 05/26/2023] [Indexed: 06/13/2023]
Abstract
In the last decade, point accumulation for imaging in nanoscale topography (PAINT) has emerged as a versatile tool for single-molecule localization microscopy (SMLM). Currently, DNA-PAINT is the most widely used, in which a transient stochastically binding DNA docking-imaging pair is used to reconstruct specific characteristics of biological or synthetic materials on a single-molecule level. Slowly, the need for PAINT probes that are not dependent on DNA has emerged. These probes can be based on (i) endogenous interactions, (ii) engineered binders, (iii) fusion proteins, or (iv) synthetic molecules and provide complementary applications for SMLM. Therefore, researchers have been expanding the PAINT toolbox with new probes. In this review, we provide an overview of the currently existing probes that go beyond DNA and their applications and challenges.
Affiliation(s)
- Marrit M E Tholen
- Department of Biomedical Engineering, Institute of Complex Molecular Systems, Eindhoven University of Technology, Eindhoven, The Netherlands.
| | - Roderick P Tas
- Department of Chemical Engineering and Chemistry, Laboratory of Self-Organizing Soft Matter, Eindhoven University of Technology, Eindhoven, 5612 AP, The Netherlands
- Institute for Complex Molecular Systems, Eindhoven University of Technology, P.O. Box 513, 5600 MB Eindhoven, The Netherlands
| | - Yuyang Wang
- Institute for Complex Molecular Systems, Eindhoven University of Technology, P.O. Box 513, 5600 MB Eindhoven, The Netherlands
- Department of Applied Physics, Eindhoven University of Technology, P.O. Box 513, 5600 MB Eindhoven, The Netherlands
| | - Lorenzo Albertazzi
- Department of Biomedical Engineering, Institute of Complex Molecular Systems, Eindhoven University of Technology, Eindhoven, The Netherlands.
| |
|
194
|
Casadevall G, Duran C, Osuna S. AlphaFold2 and Deep Learning for Elucidating Enzyme Conformational Flexibility and Its Application for Design. JACS Au 2023; 3:1554-1562. [PMID: 37388680 PMCID: PMC10302747 DOI: 10.1021/jacsau.3c00188] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 04/14/2023] [Revised: 05/22/2023] [Accepted: 05/22/2023] [Indexed: 07/01/2023]
Abstract
The recent success of AlphaFold2 (AF2) and other deep learning (DL) tools in accurately predicting the folded three-dimensional (3D) structure of proteins and enzymes has revolutionized the structural biology and protein design fields. The 3D structure indeed reveals key information on the arrangement of the catalytic machinery of enzymes and which structural elements gate the active site pocket. However, comprehending enzymatic activity requires a detailed knowledge of the chemical steps involved along the catalytic cycle and the exploration of the multiple thermally accessible conformations that enzymes adopt when in solution. In this Perspective, some of the recent studies showing the potential of AF2 in elucidating the conformational landscape of enzymes are provided. Selected examples of the key developments of AF2-based and DL methods for protein design are discussed, as well as a few enzyme design cases. These studies show the potential of AF2 and DL for allowing the routine computational design of efficient enzymes.
Affiliation(s)
- Guillem Casadevall
- Institut de Química Computacional i Catàlisi (IQCC) and Departament de Química, Universitat de Girona, Maria Aurèlia Capmany 69, 17003 Girona, Spain
| | - Cristina Duran
- Institut de Química Computacional i Catàlisi (IQCC) and Departament de Química, Universitat de Girona, Maria Aurèlia Capmany 69, 17003 Girona, Spain
| | - Sílvia Osuna
- Institut de Química Computacional i Catàlisi (IQCC) and Departament de Química, Universitat de Girona, Maria Aurèlia Capmany 69, 17003 Girona, Spain
- ICREA, Passeig Lluís Companys 23, 08010 Barcelona, Spain
| |
|
195
|
Li P, Liu ZP. GeoBind: segmentation of nucleic acid binding interface on protein surface with geometric deep learning. Nucleic Acids Res 2023; 51:e60. [PMID: 37070217 PMCID: PMC10250245 DOI: 10.1093/nar/gkad288] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 01/09/2023] [Revised: 03/21/2023] [Accepted: 04/06/2023] [Indexed: 04/19/2023] Open
Abstract
Unveiling the nucleic acid binding sites of a protein helps reveal its regulatory functions in vivo. Current methods encode protein sites from the handcrafted features of their local neighbors and recognize them via a classification, which are limited in expressive ability. Here, we present GeoBind, a geometric deep learning method for predicting nucleic binding sites on protein surface in a segmentation manner. GeoBind takes the whole point clouds of protein surface as input and learns the high-level representation based on the aggregation of their neighbors in local reference frames. Testing GeoBind on benchmark datasets, we demonstrate GeoBind is superior to state-of-the-art predictors. Specific case studies are performed to show the powerful ability of GeoBind to explore molecular surfaces when deciphering proteins with multimer formation. To show the versatility of GeoBind, we further extend GeoBind to five other types of ligand binding sites prediction tasks and achieve competitive performances.
Affiliation(s)
- Pengpai Li
- Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, Shandong 250061, China
| | - Zhi-Ping Liu
- Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, Shandong 250061, China
| |
|
196
|
Alqahtani T, Badreldin HA, Alrashed M, Alshaya AI, Alghamdi SS, Bin Saleh K, Alowais SA, Alshaya OA, Rahman I, Al Yami MS, Albekairy AM. The emergent role of artificial intelligence, natural learning processing, and large language models in higher education and research. Res Social Adm Pharm 2023:S1551-7411(23)00280-2. [PMID: 37321925 DOI: 10.1016/j.sapharm.2023.05.016] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 04/06/2023] [Revised: 05/29/2023] [Accepted: 05/30/2023] [Indexed: 06/17/2023]
Abstract
Artificial Intelligence (AI) has revolutionized various domains, including education and research. Natural language processing (NLP) techniques and large language models (LLMs) such as GPT-4 and BARD have significantly advanced our comprehension and application of AI in these fields. This paper provides an in-depth introduction to AI, NLP, and LLMs, discussing their potential impact on education and research. By exploring the advantages, challenges, and innovative applications of these technologies, this review gives educators, researchers, students, and readers a comprehensive view of how AI could shape educational and research practices in the future, ultimately leading to improved outcomes. Key applications discussed in the field of research include text generation, data analysis and interpretation, literature review, formatting and editing, and peer review. AI applications in academics and education include educational support and constructive feedback, assessment, grading, tailored curricula, personalized career guidance, and mental health support. Addressing the challenges associated with these technologies, such as ethical concerns and algorithmic biases, is essential for maximizing their potential to improve education and research outcomes. Ultimately, the paper aims to contribute to the ongoing discussion about the role of AI in education and research and highlight its potential to lead to better outcomes for students, educators, and researchers.
Affiliation(s)
- Tariq Alqahtani
- Department of Pharmaceutical Sciences, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia.
| | - Hisham A Badreldin
- King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
| | - Mohammed Alrashed
- King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
| | - Abdulrahman I Alshaya
- King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
| | - Sahar S Alghamdi
- Department of Pharmaceutical Sciences, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, Saudi Arabia; King Abdullah International Medical Research Center, Riyadh, Saudi Arabia
| | - Khalid Bin Saleh
- King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
| | - Shuroug A Alowais
- King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
| | - Omar A Alshaya
- King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
| | - Ishrat Rahman
- Department of Basic Dental Sciences, College of Dentistry, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh, 11671, Saudi Arabia
| | - Majed S Al Yami
- King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
| | - Abdulkareem M Albekairy
- King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Department of Pharmacy Practice, College of Pharmacy, King Saud bin Abdulaziz University for Health Sciences, King Abdullah International Medical Research Center, Riyadh, Saudi Arabia; Pharmaceutical Care Department, King Abdulaziz Medical City, National Guard Health Affairs, Riyadh, Saudi Arabia
| |
|
197
|
Abstract
Debate has been simmering for some years regarding the importance of internal thermal motions of enzymes to catalysis. Recent developments in protein design may bring resolution of the more contentious points a little closer.
Affiliation(s)
- Jeremy R. H. Tame
- Protein Design Laboratory, Graduate School of Medical Life Science, Yokohama City University, Suehiro 1-7-29, Tsurumi, Yokohama, 230-0045 Japan
| |
|
198
|
Mészáros B, Park E, Malinverni D, Sejdiu BI, Immadisetty K, Sandhu M, Lang B, Babu MM. Recent breakthroughs in computational structural biology harnessing the power of sequences and structures. Curr Opin Struct Biol 2023; 80:102608. [PMID: 37182396 DOI: 10.1016/j.sbi.2023.102608] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2023] [Revised: 04/12/2023] [Accepted: 04/17/2023] [Indexed: 05/16/2023]
Abstract
Recent advances in computational approaches and their integration into structural biology enable tackling increasingly complex questions. Here, we discuss several key areas, highlighting breakthroughs and remaining challenges. Theoretical modeling has provided tools to accurately predict and design protein structures on a scale currently difficult to achieve using experimental approaches. Molecular Dynamics simulations have become faster and more precise, delivering actionable information inaccessible by current experimental methods. Virtual screening workflows allow a high-throughput approach to discover ligands that bind and modulate protein function, while Machine Learning methods enable the design of proteins with new functionalities. Integrative structural biology combines several of these approaches, pushing the frontiers of structural and functional characterization to ever larger systems, advancing towards a complete understanding of the living cell. These breakthroughs will accelerate and significantly impact diverse areas of science.
Affiliation(s)
- Bálint Mészáros
- Department of Structural Biology and Center of Excellence for Data Driven Discovery, St Jude Children's Research Hospital, 262 Danny Thomas Place, Memphis, TN, 38105, USA.
| | - Electa Park
- Department of Structural Biology and Center of Excellence for Data Driven Discovery, St Jude Children's Research Hospital, 262 Danny Thomas Place, Memphis, TN, 38105, USA.
| | - Duccio Malinverni
- Department of Structural Biology and Center of Excellence for Data Driven Discovery, St Jude Children's Research Hospital, 262 Danny Thomas Place, Memphis, TN, 38105, USA. https://twitter.com/DucMalinverni
| | - Besian I Sejdiu
- Department of Structural Biology and Center of Excellence for Data Driven Discovery, St Jude Children's Research Hospital, 262 Danny Thomas Place, Memphis, TN, 38105, USA. https://twitter.com/bisejdiu
| | - Kalyan Immadisetty
- Department of Bone Marrow Transplantation & Cellular Therapy, St Jude Children's Research Hospital, 262 Danny Thomas Place, Memphis, TN, 38105, USA. https://twitter.com/k_immadisetty
| | - Manbir Sandhu
- Department of Structural Biology and Center of Excellence for Data Driven Discovery, St Jude Children's Research Hospital, 262 Danny Thomas Place, Memphis, TN, 38105, USA. https://twitter.com/M5andhu
| | - Benjamin Lang
- Department of Structural Biology and Center of Excellence for Data Driven Discovery, St Jude Children's Research Hospital, 262 Danny Thomas Place, Memphis, TN, 38105, USA. https://twitter.com/langbnj
| | - M Madan Babu
- Department of Structural Biology and Center of Excellence for Data Driven Discovery, St Jude Children's Research Hospital, 262 Danny Thomas Place, Memphis, TN, 38105, USA.
| |
|
199
|
Hederman AP, Ackerman ME. Leveraging deep learning to improve vaccine design. Trends Immunol 2023; 44:333-344. [PMID: 37003949 PMCID: PMC10485910 DOI: 10.1016/j.it.2023.03.002] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 02/22/2023] [Revised: 03/05/2023] [Accepted: 03/05/2023] [Indexed: 04/03/2023]
Abstract
Deep learning has led to incredible breakthroughs in areas of research, from self-driving vehicles to solutions, to formal mathematical proofs. In the biomedical sciences, however, the revolutionary results seen in other fields are only now beginning to be realized. Vaccine research and development efforts represent an application with high public health significance. Protein structure prediction, immune repertoire analysis, and phylogenetics are three principal areas in which deep learning is poised to provide key advances. Here, we opine on some of the current challenges with deep learning and how they are being addressed. Despite the nascent stage of deep learning applications in immunological studies, there is ample opportunity to utilize this new technology to address the most challenging and burdensome infectious diseases confronting global populations.
Affiliation(s)
| | - Margaret E Ackerman
- Thayer School of Engineering, Dartmouth College, Hanover, NH, USA; Department of Microbiology and Immunology, Geisel School of Medicine, Hanover, NH, USA.
| |
|
200
|
Ruffolo JA, Chu LS, Mahajan SP, Gray JJ. Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Nat Commun 2023; 14:2389. [PMID: 37185622 PMCID: PMC10129313 DOI: 10.1038/s41467-023-38063-x] [Citation(s) in RCA: 59] [Impact Index Per Article: 59.0] [Received: 04/21/2022] [Accepted: 04/14/2023] [Indexed: 05/17/2023] Open
Abstract
Antibodies have the capacity to bind a diverse set of antigens, and they have become critical therapeutics and diagnostic molecules. The binding of antibodies is facilitated by a set of six hypervariable loops that are diversified through genetic recombination and mutation. Even with recent advances, accurate structural prediction of these loops remains a challenge. Here, we present IgFold, a fast deep learning method for antibody structure prediction. IgFold consists of a pre-trained language model trained on 558 million natural antibody sequences followed by graph networks that directly predict backbone atom coordinates. IgFold predicts structures of similar or better quality than alternative methods (including AlphaFold) in significantly less time (under 25 s). Accurate structure prediction on this timescale makes possible avenues of investigation that were previously infeasible. As a demonstration of IgFold's capabilities, we predicted structures for 1.4 million paired antibody sequences, providing structural insights to 500-fold more antibodies than have experimentally determined structures.
Affiliation(s)
- Jeffrey A Ruffolo
- Program in Molecular Biophysics, The Johns Hopkins University, Baltimore, MD, 21218, USA
| | - Lee-Shin Chu
- Department of Chemical and Biomolecular Engineering, The Johns Hopkins University, Baltimore, MD, 21218, USA
| | - Sai Pooja Mahajan
- Department of Chemical and Biomolecular Engineering, The Johns Hopkins University, Baltimore, MD, 21218, USA
| | - Jeffrey J Gray
- Program in Molecular Biophysics, The Johns Hopkins University, Baltimore, MD, 21218, USA.
- Department of Chemical and Biomolecular Engineering, The Johns Hopkins University, Baltimore, MD, 21218, USA.
| |
|