1
|
van der Flier F, Estell D, Pricelius S, Dankmeyer L, van Stigt Thans S, Mulder H, Otsuka R, Goedegebuur F, Lammerts L, Staphorst D, van Dijk AD, de Ridder D, Redestig H. Enzyme structure correlates with variant effect predictability. Comput Struct Biotechnol J 2024; 23:3489-3497. [PMID: 39435338 PMCID: PMC11491678 DOI: 10.1016/j.csbj.2024.09.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2024] [Revised: 09/03/2024] [Accepted: 09/12/2024] [Indexed: 10/23/2024] Open
Abstract
Protein engineering increasingly relies on machine learning models to computationally pre-screen promising novel candidates. Although machine learning approaches have proven effective, their performance on prospective screening data leaves room for improvement; prediction accuracy can vary greatly from one protein variant to the next. So far, it is unclear what characterizes variants that are associated with large prediction error. In order to establish whether structural characteristics influence predictability, we created a novel high-order combinatorial dataset for an enzyme spanning 3,706 variants, that can be partitioned into subsets of variants with mutations at positions exclusively belonging to a particular structural class. By training four different supervised variant effect prediction (VEP) models on structurally partitioned subsets of our data, we found that predictability strongly depended on all four structural characteristics we tested; buriedness, number of contact residues, proximity to the active site and presence of secondary structure elements. These dependencies were also found in several single mutation enzyme variant datasets, albeit with dataset specific directions. Most importantly, we found that these dependencies were similar for all four models we tested, indicating that there are specific structure and function determinants that are insufficiently accounted for by current machine learning algorithms. Overall, our findings suggest that improvements can be made to VEP models by exploring new inductive biases and by leveraging different data modalities of protein variants, and that stratified dataset design can highlight areas of improvement for machine learning guided protein engineering.
Collapse
Affiliation(s)
- Floris van der Flier
- Department of Plant Sciences, Wageningen University & Research, Wageningen, 6708 PB, the Netherlands
| | - Dave Estell
- Health & Biosciences, International Flavors and Fragrances, Palo Alto, 94304 CA, USA
| | - Sina Pricelius
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Lydia Dankmeyer
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Sander van Stigt Thans
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Harm Mulder
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Rei Otsuka
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Frits Goedegebuur
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Laurens Lammerts
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Diego Staphorst
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| | - Aalt D.J. van Dijk
- Department of Plant Sciences, Wageningen University & Research, Wageningen, 6708 PB, the Netherlands
| | - Dick de Ridder
- Department of Plant Sciences, Wageningen University & Research, Wageningen, 6708 PB, the Netherlands
| | - Henning Redestig
- Health & Biosciences, International Flavors and Fragrances, Oegstgeest, 2342 BG, the Netherlands
| |
Collapse
|
2
|
Lobzaev E, Stracquadanio G. Dirichlet latent modelling enables effective learning and sampling of the functional protein design space. Nat Commun 2024; 15:9309. [PMID: 39468034 PMCID: PMC11519351 DOI: 10.1038/s41467-024-53622-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/13/2023] [Accepted: 10/14/2024] [Indexed: 10/30/2024] Open
Abstract
Engineering proteins with desired functions and biochemical properties is pivotal for biotechnology and drug discovery. While computational methods based on evolutionary information are reducing the experimental burden by designing targeted libraries of functional variants, they still have a low success rate when the desired protein has few or very remote homologous sequences. Here we propose an autoregressive model, called Temporal Dirichlet Variational Autoencoder (TDVAE), which exploits the mathematical properties of the Dirichlet distribution and temporal convolution to efficiently learn high-order information from a functionally related, possibly remotely similar, set of sequences. TDVAE is highly accurate in predicting the effects of amino acid mutations, while being significantly 90% smaller than the other state-of-the-art models. We then use TDVAE to design variants of the human alpha galactosidase enzymes as potential treatment for Fabry disease. Our model builds a library of diverse variants which retain sequence, biochemical and structural properties of the wildtype protein, suggesting they could be suitable for enzyme replacement therapy. Taken together, our results show the importance of accurate sequence modelling and the potential of autoregressive models as protein engineering and analysis tools.
Collapse
Affiliation(s)
- Evgenii Lobzaev
- School of Biological Sciences, The University of Edinburgh, Edinburgh, United Kingdom
- School of Informatics, The University of Edinburgh, Edinburgh, United Kingdom
| | | |
Collapse
|
3
|
Muir DF, Asper GPR, Notin P, Posner JA, Marks DS, Keiser MJ, Pinney MM. Evolutionary-Scale Enzymology Enables Biochemical Constant Prediction Across a Multi-Peaked Catalytic Landscape. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.10.23.619915. [PMID: 39484523 PMCID: PMC11526920 DOI: 10.1101/2024.10.23.619915] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 11/03/2024]
Abstract
Quantitatively mapping enzyme sequence-catalysis landscapes remains a critical challenge in understanding enzyme function, evolution, and design. Here, we expand an emerging microfluidic platform to measure catalytic constants-k cat and K M-for hundreds of diverse naturally occurring sequences and mutants of the model enzyme Adenylate Kinase (ADK). This enables us to dissect the sequence-catalysis landscape's topology, navigability, and mechanistic underpinnings, revealing distinct catalytic peaks organized by structural motifs. These results challenge long-standing hypotheses in enzyme adaptation, demonstrating that thermophilic enzymes are not slower than their mesophilic counterparts. Combining the rich representations of protein sequences provided by deep-learning models with our custom high-throughput kinetic data yields semi-supervised models that significantly outperform existing models at predicting catalytic parameters of naturally occurring ADK sequences. Our work demonstrates a promising strategy for dissecting sequence-catalysis landscapes across enzymatic evolution and building family-specific models capable of accurately predicting catalytic constants, opening new avenues for enzyme engineering and functional prediction.
Collapse
Affiliation(s)
- Duncan F Muir
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, CA, USA
- Program in Biophysics, University of California, San Francisco, San Francisco, CA, USA
| | - Garrison P R Asper
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, CA, USA
| | - Pascal Notin
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Department of Computer Science, University of Oxford, Oxford, UK
| | - Jacob A Posner
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, CA, USA
- Department of Biology, San Francisco State University, San Francisco, CA, USA
| | - Debora S Marks
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
| | - Michael J Keiser
- Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, CA, USA
- Institute for Neurodegenerative Diseases, University of California, San Francisco, San Francisco, CA, USA
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
| | - Margaux M Pinney
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, CA, USA
- Valhalla Fellow, University of California San Francisco, San Francisco, CA, USA
| |
Collapse
|
4
|
Mohamed Abdul Cader J, Newton MAH, Rahman J, Mohamed Abdul Cader AJ, Sattar A. Ensembling methods for protein-ligand binding affinity prediction. Sci Rep 2024; 14:24447. [PMID: 39424851 PMCID: PMC11489440 DOI: 10.1038/s41598-024-72784-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2024] [Accepted: 09/10/2024] [Indexed: 10/21/2024] Open
Abstract
Protein-ligand binding affinity prediction is a key element of computer-aided drug discovery. Most of the existing deep learning methods for protein-ligand binding affinity prediction utilize single models and suffer from low accuracy and generalization capability. In this paper, we train 13 deep learning models from combinations of 5 input features. Then, we explore all possible ensembles of the trained models to find the best ensembles. Our deep learning models use cross-attention and self-attention layers to extract short and long-range interactions. Our method is named Ensemble Binding Affinity (EBA). EBA extracts information from various models using different combinations of input features, such as simple 1D sequential and structural features of the protein-ligand complexes rather than 3D complex features. EBA is implemented to accurately predict the binding affinity of a protein-ligand complex. One of our ensembles achieves the highest Pearson correlation coefficient (R) value of 0.914 and the lowest root mean square error (RMSE) value of 0.957 on the well-known benchmark test set CASF2016. Our ensembles show significant improvements of more than 15% in R-value and 19% in RMSE on both well-known benchmark CSAR-HiQ test sets over the second-best predictor named CAPLA. Furthermore, the superior performance of the ensembles across all metrics compared to existing state-of-the-art protein-ligand binding affinity prediction methods on all five benchmark test datasets demonstrates the effectiveness and robustness of our approach. Therefore, our approach to improving binding affinity prediction between proteins and ligands can contribute to improving the success rate of potential drugs and accelerate the drug development process.
Collapse
Affiliation(s)
- Jiffriya Mohamed Abdul Cader
- School of Information and Communication Technology, Griffith University, Nathan Campus, Australia.
- Department of IT, Sri Lanka Institute of Advanced Technological Education, Colombo, Sri Lanka.
| | - M A Hakim Newton
- Institute for Integrated and Intelligent Systems (IIIS), Griffith University, Nathan Campus, Australia
- School of Information and Physical Sciences, The University of Newcastle, Callaghan, Australia
| | - Julia Rahman
- School of Information and Communication Technology, Griffith University, Nathan Campus, Australia
- Department of Computer Science and Engineering, Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh
| | | | - Abdul Sattar
- School of Information and Communication Technology, Griffith University, Nathan Campus, Australia
- Institute for Integrated and Intelligent Systems (IIIS), Griffith University, Nathan Campus, Australia
| |
Collapse
|
5
|
Harding-Larsen D, Funk J, Madsen NG, Gharabli H, Acevedo-Rocha CG, Mazurenko S, Welner DH. Protein representations: Encoding biological information for machine learning in biocatalysis. Biotechnol Adv 2024; 77:108459. [PMID: 39366493 DOI: 10.1016/j.biotechadv.2024.108459] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2024] [Revised: 09/19/2024] [Accepted: 09/29/2024] [Indexed: 10/06/2024]
Abstract
Enzymes offer a more environmentally friendly and low-impact solution to conventional chemistry, but they often require additional engineering for their application in industrial settings, an endeavour that is challenging and laborious. To address this issue, the power of machine learning can be harnessed to produce predictive models that enable the in silico study and engineering of improved enzymatic properties. Such machine learning models, however, require the conversion of the complex biological information to a numerical input, also called protein representations. These inputs demand special attention to ensure the training of accurate and precise models, and, in this review, we therefore examine the critical step of encoding protein information to numeric representations for use in machine learning. We selected the most important approaches for encoding the three distinct biological protein representations - primary sequence, 3D structure, and dynamics - to explore their requirements for employment and inductive biases. Combined representations of proteins and substrates are also introduced as emergent tools in biocatalysis. We propose the division of fixed representations, a collection of rule-based encoding strategies, and learned representations extracted from the latent spaces of large neural networks. To select the most suitable protein representation, we propose two main factors to consider. The first one is the model setup, which is influenced by the size of the training dataset and the choice of architecture. The second factor is the model objectives such as consideration about the assayed property, the difference between wild-type models and mutant predictors, and requirements for explainability. This review is aimed at serving as a source of information and guidance for properly representing enzymes in future machine learning models for biocatalysis.
Collapse
Affiliation(s)
- David Harding-Larsen
- The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
| | - Jonathan Funk
- The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
| | - Niklas Gesmar Madsen
- The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
| | - Hani Gharabli
- The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
| | - Carlos G Acevedo-Rocha
- The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
| | - Stanislav Mazurenko
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic; International Clinical Research Center, St. Anne's University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| | - Ditte Hededam Welner
- The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark.
| |
Collapse
|
6
|
Valbuena R, Nigam A, Tycko J, Suzuki P, Spees K, Aradhana, Arana S, Du P, Patel RA, Bintu L, Kundaje A, Bassik MC. Prediction and design of transcriptional repressor domains with large-scale mutational scans and deep learning. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.09.21.614253. [PMID: 39386603 PMCID: PMC11463546 DOI: 10.1101/2024.09.21.614253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/12/2024]
Abstract
Regulatory proteins have evolved diverse repressor domains (RDs) to enable precise context-specific repression of transcription. However, our understanding of how sequence variation impacts the functional activity of RDs is limited. To address this gap, we generated a high-throughput mutational scanning dataset measuring the repressor activity of 115,000 variant sequences spanning more than 50 RDs in human cells. We identified thousands of clinical variants with loss or gain of repressor function, including TWIST1 HLH variants associated with Saethre-Chotzen syndrome and MECP2 domain variants associated with Rett syndrome. We also leveraged these data to annotate short linear interacting motifs (SLiMs) that are critical for repression in disordered RDs. Then, we designed a deep learning model called TENet ( T ranscriptional E ffector Net work) that integrates sequence, structure and biochemical representations of sequence variants to accurately predict repressor activity. We systematically tested generalization within and across domains with varying homology using the mutational scanning dataset. Finally, we employed TENet within a directed evolution sequence editing framework to tune the activity of both structured and disordered RDs and experimentally test thousands of designs. Our work highlights critical considerations for future dataset design and model training strategies to improve functional variant prioritization and precision design of synthetic regulatory proteins.
Collapse
|
7
|
Meiri R, Aharoni Lotati SL, Orenstein Y, Papo N. Deep neural networks for predicting the affinity landscape of protein-protein interactions. iScience 2024; 27:110772. [PMID: 39310756 PMCID: PMC11416218 DOI: 10.1016/j.isci.2024.110772] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2024] [Revised: 06/27/2024] [Accepted: 08/15/2024] [Indexed: 09/25/2024] Open
Abstract
Studies determining protein-protein interactions (PPIs) by deep mutational scanning have focused mainly on a narrow range of affinities within complexes and thus include only partial coverage of the mutation space of given proteins. By inserting an affinity-reducing N-terminal alanine in the N-terminal domain of the tissue inhibitor of metalloproteinases-2 (N-TIMP2), we overcame the limitation of its narrow affinity range for matrix metalloproteinase 9 (MMP9CAT). We trained deep neural networks (DNNs) to quantitatively predict the binding affinity of unobserved wild-type variants and variants carrying an N-terminal alanine. Good correlation was obtained between predicted and observed log2 enrichment ratio (ER) values, which also correlated with the affinity of N-TIMP2 variants to MMP9CAT. Our ability to predict affinities of unobserved N-TIMP2 variants was confirmed on an independent dataset of experimentally validated N-TIMP2 proteins. This ability is of significant importance in the field of PPI prediction and for developing therapies targeting these interactions.
Collapse
Affiliation(s)
- Reut Meiri
- School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel
| | - Shay-Lee Aharoni Lotati
- Avram and Stella Goldstein-Goren Department of Biotechnology Engineering and the National Institute of Biotechnology in the Negev, Ben-Gurion University of the Negev, Beer-Sheva, Israel
| | - Yaron Orenstein
- Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel
- The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat Gan, Israel
| | - Niv Papo
- Avram and Stella Goldstein-Goren Department of Biotechnology Engineering and the National Institute of Biotechnology in the Negev, Ben-Gurion University of the Negev, Beer-Sheva, Israel
| |
Collapse
|
8
|
Gantz M, Mathis SV, Nintzel FEH, Lio P, Hollfelder F. On synergy between ultrahigh throughput screening and machine learning in biocatalyst engineering. Faraday Discuss 2024; 252:89-114. [PMID: 39133073 PMCID: PMC11318516 DOI: 10.1039/d4fd00065j] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2024] [Accepted: 04/23/2024] [Indexed: 08/13/2024]
Abstract
Protein design and directed evolution have separately contributed enormously to protein engineering. Without being mutually exclusive, the former relies on computation from first principles, while the latter is a combinatorial approach based on chance. Advances in ultrahigh throughput (uHT) screening, next generation sequencing and machine learning may create alternative routes to engineered proteins, where functional information linked to specific sequences is interpreted and extrapolated in silico. In particular, the miniaturisation of functional tests in water-in-oil emulsion droplets with picoliter volumes and their rapid generation and analysis (>1 kHz) allows screening of >107-membered libraries in a day. Subsequently, decoding the selected clones by short or long-read sequencing methods leads to large sequence-function datasets that may allow extrapolation from experimental directed evolution to further improved mutants beyond the observed hits. In this work, we explore experimental strategies for how to draw up 'fitness landscapes' in sequence space with uHT droplet microfluidics, review the current state of AI/ML in enzyme engineering and discuss how uHT datasets may be combined with AI/ML to make meaningful predictions and accelerate biocatalyst engineering.
Collapse
Affiliation(s)
- Maximilian Gantz
- Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge, CB2 1GA, UK
| | - Simon V Mathis
- Department of Computer Science, University of Cambridge, 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK
| | - Friederike E H Nintzel
- Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge, CB2 1GA, UK
| | - Pietro Lio
- Department of Computer Science, University of Cambridge, 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK
| | - Florian Hollfelder
- Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge, CB2 1GA, UK
| |
Collapse
|
9
|
Chen Y, Xu Y, Liu D, Xing Y, Gong H. An end-to-end framework for the prediction of protein structure and fitness from single sequence. Nat Commun 2024; 15:7400. [PMID: 39191788 DOI: 10.1038/s41467-024-51776-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2024] [Accepted: 08/19/2024] [Indexed: 08/29/2024] Open
Abstract
Significant research progress has been made in the field of protein structure and fitness prediction. Particularly, single-sequence-based structure prediction methods like ESMFold and OmegaFold achieve a balance between inference speed and prediction accuracy, showing promise for many downstream prediction tasks. Here, we propose SPIRED, a single-sequence-based structure prediction model that exhibits comparable performance to the state-of-the-art methods but with approximately 5-fold acceleration in inference and at least one order of magnitude reduction in training consumption. By integrating SPIRED with downstream neural networks, we compose an end-to-end framework named SPIRED-Fitness for the rapid prediction of both protein structure and fitness from single sequence with satisfactory accuracy. Moreover, SPIRED-Stab, the derivative of SPIRED-Fitness, achieves state-of-the-art performance in predicting the mutational effects on protein stability.
Collapse
Affiliation(s)
- Yinghui Chen
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
- Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China
| | - Yunxin Xu
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
- Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China
| | - Di Liu
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
- Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China
| | - Yaoguang Xing
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
- Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China
| | - Haipeng Gong
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China.
- Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China.
| |
Collapse
|
10
|
Illig AM, Siedhoff NE, Davari MD, Schwaneberg U. Evolutionary Probability and Stacked Regressions Enable Data-Driven Protein Engineering with Minimized Experimental Effort. J Chem Inf Model 2024; 64:6350-6360. [PMID: 39088689 DOI: 10.1021/acs.jcim.4c00704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/03/2024]
Abstract
Protein engineering through directed evolution and (semi)rational approaches is routinely applied to optimize protein properties for a broad range of applications in industry and academia. The multitude of possible variants, combined with limited screening throughput, hampers efficient protein engineering. Data-driven strategies have emerged as a powerful tool to model the protein fitness landscape that can be explored in silico, significantly accelerating protein engineering campaigns. However, such methods require a certain amount of data, which often cannot be provided, to generate a reliable model of the fitness landscape. Here, we introduce MERGE, a method that combines direct coupling analysis (DCA) and machine learning (ML). MERGE enables data-driven protein engineering when only limited data are available for training, typically ranging from 50 to 500 labeled sequences. Our method demonstrates remarkable performance in predicting a protein's fitness value and rank based on its sequence across diverse proteins and properties. Notably, MERGE outperforms state-of-the-art methods when only small data sets are available for modeling, requiring fewer computational resources, and proving particularly promising for protein engineers who have access to limited amounts of data.
Collapse
Affiliation(s)
| | - Niklas E Siedhoff
- Institute of Biotechnology, RWTH Aachen University, Worringerweg 3, 52074 Aachen, Germany
| | - Mehdi D Davari
- Department of Bioorganic Chemistry, Leibniz Institute of Plant Biochemistry, Weinberg 3, 06120 Halle, Germany
| | - Ulrich Schwaneberg
- Institute of Biotechnology, RWTH Aachen University, Worringerweg 3, 52074 Aachen, Germany
| |
Collapse
|
11
|
Lipsh-Sokolik R, Fleishman SJ. Addressing epistasis in the design of protein function. Proc Natl Acad Sci U S A 2024; 121:e2314999121. [PMID: 39133844 PMCID: PMC11348311 DOI: 10.1073/pnas.2314999121] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/29/2024] Open
Abstract
Mutations in protein active sites can dramatically improve function. The active site, however, is densely packed and extremely sensitive to mutations. Therefore, some mutations may only be tolerated in combination with others in a phenomenon known as epistasis. Epistasis reduces the likelihood of obtaining improved functional variants and dramatically slows natural and lab evolutionary processes. Research has shed light on the molecular origins of epistasis and its role in shaping evolutionary trajectories and outcomes. In addition, sequence- and AI-based strategies that infer epistatic relationships from mutational patterns in natural or experimental evolution data have been used to design functional protein variants. In recent years, combinations of such approaches and atomistic design calculations have successfully predicted highly functional combinatorial mutations in active sites. These were used to design thousands of functional active-site variants, demonstrating that, while our understanding of epistasis remains incomplete, some of the determinants that are critical for accurate design are now sufficiently understood. We conclude that the space of active-site variants that has been explored by evolution may be expanded dramatically to enhance natural activities or discover new ones. Furthermore, design opens the way to systematically exploring sequence and structure space and mutational impacts on function, deepening our understanding and control over protein activity.
Collapse
Affiliation(s)
- Rosalie Lipsh-Sokolik
- Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot 7610001, Israel
| | - Sarel J Fleishman
- Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot 7610001, Israel
| |
Collapse
|
12
|
Johnston KE, Almhjell PJ, Watkins-Dulaney EJ, Liu G, Porter NJ, Yang J, Arnold FH. A combinatorially complete epistatic fitness landscape in an enzyme active site. Proc Natl Acad Sci U S A 2024; 121:e2400439121. [PMID: 39074291 PMCID: PMC11317637 DOI: 10.1073/pnas.2400439121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2024] [Accepted: 06/17/2024] [Indexed: 07/31/2024] Open
Abstract
Protein engineering often targets binding pockets or active sites which are enriched in epistasis-nonadditive interactions between amino acid substitutions-and where the combined effects of multiple single substitutions are difficult to predict. Few existing sequence-fitness datasets capture epistasis at large scale, especially for enzyme catalysis, limiting the development and assessment of model-guided enzyme engineering approaches. We present here a combinatorially complete, 160,000-variant fitness landscape across four residues in the active site of an enzyme. Assaying the native reaction of a thermostable β-subunit of tryptophan synthase (TrpB) in a nonnative environment yielded a landscape characterized by significant epistasis and many local optima. These effects prevent simulated directed evolution approaches from efficiently reaching the global optimum. There is nonetheless wide variability in the effectiveness of different directed evolution approaches, which together provide experimental benchmarks for computational and machine learning workflows. The most-fit TrpB variants contain a substitution that is nearly absent in natural TrpB sequences-a result that conservation-based predictions would not capture. Thus, although fitness prediction using evolutionary data can enrich in more-active variants, these approaches struggle to identify and differentiate among the most-active variants, even for this near-native function. Overall, this work presents a large-scale testing ground for model-guided enzyme engineering and suggests that efficient navigation of epistatic fitness landscapes can be improved by advances in both machine learning and physical modeling.
Collapse
Affiliation(s)
- Kadina E. Johnston
- Division of Biology and Bioengineering, California Institute of Technology, Pasadena, CA91125
| | - Patrick J. Almhjell
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA91125
| | - Ella J. Watkins-Dulaney
- Division of Biology and Bioengineering, California Institute of Technology, Pasadena, CA91125
| | - Grace Liu
- Division of Biology and Bioengineering, California Institute of Technology, Pasadena, CA91125
| | - Nicholas J. Porter
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA91125
| | - Jason Yang
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA91125
| | - Frances H. Arnold
- Division of Biology and Bioengineering, California Institute of Technology, Pasadena, CA91125
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA91125
| |
Collapse
|
13
|
Freschlin CR, Fahlberg SA, Heinzelman P, Romero PA. Neural network extrapolation to distant regions of the protein fitness landscape. Nat Commun 2024; 15:6405. [PMID: 39080282 PMCID: PMC11289474 DOI: 10.1038/s41467-024-50712-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2023] [Accepted: 07/13/2024] [Indexed: 08/02/2024] Open
Abstract
Machine learning (ML) has transformed protein engineering by constructing models of the underlying sequence-function landscape to accelerate the discovery of new biomolecules. ML-guided protein design requires models, trained on local sequence-function information, to accurately predict distant fitness peaks. In this work, we evaluate neural networks' capacity to extrapolate beyond their training data. We perform model-guided design using a panel of neural network architectures trained on protein G (GB1)-Immunoglobulin G (IgG) binding data and experimentally test thousands of GB1 designs to systematically evaluate the models' extrapolation. We find each model architecture infers markedly different landscapes from the same data, which give rise to unique design preferences. We find simpler models excel in local extrapolation to design high fitness proteins, while more sophisticated convolutional models can venture deep into sequence space to design proteins that fold but are no longer functional. We also find that implementing a simple ensemble of convolutional neural networks enables robust design of high-performing variants in the local landscape. Our findings highlight how each architecture's inductive biases prime them to learn different aspects of the protein fitness landscape and how a simple ensembling approach makes protein engineering more robust.
Collapse
Affiliation(s)
- Chase R Freschlin
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Sarah A Fahlberg
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Pete Heinzelman
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Philip A Romero
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA.
- Department of Chemical & Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA.
| |
Collapse
|
14
|
Madan S, Lentzen M, Brandt J, Rueckert D, Hofmann-Apitius M, Fröhlich H. Transformer models in biomedicine. BMC Med Inform Decis Mak 2024; 24:214. [PMID: 39075407 PMCID: PMC11287876 DOI: 10.1186/s12911-024-02600-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2023] [Accepted: 07/08/2024] [Indexed: 07/31/2024] Open
Abstract
Deep neural networks (DNN) have fundamentally revolutionized the artificial intelligence (AI) field. The transformer model is a type of DNN that was originally used for the natural language processing tasks and has since gained more and more attention for processing various kinds of sequential data, including biological sequences and structured electronic health records. Along with this development, transformer-based models such as BioBERT, MedBERT, and MassGenie have been trained and deployed by researchers to answer various scientific questions originating in the biomedical domain. In this paper, we review the development and application of transformer models for analyzing various biomedical-related datasets such as biomedical textual data, protein sequences, medical structured-longitudinal data, and biomedical images as well as graphs. Also, we look at explainable AI strategies that help to comprehend the predictions of transformer-based models. Finally, we discuss the limitations and challenges of current models, and point out emerging novel research directions.
Collapse
Affiliation(s)
- Sumit Madan
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany.
- Institute of Computer Science, University of Bonn, Bonn, 53115, Germany.
| | - Manuel Lentzen
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, 53115, Germany
| | - Johannes Brandt
- School of Medicine, Klinikum Rechts der Isar, Technical University Munich, Munich, Germany
| | - Daniel Rueckert
- School of Medicine, Klinikum Rechts der Isar, Technical University Munich, Munich, Germany
- School of Computation, Information and Technology, Technical University Munich, Munich, Germany
- Department of Computing, Imperial College London, London, UK
| | - Martin Hofmann-Apitius
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, 53115, Germany
| | - Holger Fröhlich
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany.
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, 53115, Germany.
| |
Collapse
|
15
|
Vornholt T, Mutný M, Schmidt GW, Schellhaas C, Tachibana R, Panke S, Ward TR, Krause A, Jeschek M. Enhanced Sequence-Activity Mapping and Evolution of Artificial Metalloenzymes by Active Learning. ACS CENTRAL SCIENCE 2024; 10:1357-1370. [PMID: 39071060 PMCID: PMC11273458 DOI: 10.1021/acscentsci.4c00258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 02/15/2024] [Revised: 04/22/2024] [Accepted: 05/02/2024] [Indexed: 07/30/2024]
Abstract
Tailored enzymes are crucial for the transition to a sustainable bioeconomy. However, enzyme engineering is laborious and failure-prone due to its reliance on serendipity. The efficiency and success rates of engineering campaigns may be improved by applying machine learning to map the sequence-activity landscape based on small experimental data sets. Yet, it often proves challenging to reliably model large sequence spaces while keeping the experimental effort tractable. To address this challenge, we present an integrated pipeline combining large-scale screening with active machine learning, which we applied to engineer an artificial metalloenzyme (ArM) catalyzing a new-to-nature hydroamination reaction. Combining lab automation and next-generation sequencing, we acquired sequence-activity data for several thousand ArM variants. We then used Gaussian process regression to model the activity landscape and guide further screening rounds. Critical characteristics of our pipeline include the cost-effective generation of information-rich data sets, the integration of an explorative round to improve the model's performance, and the inclusion of experimental noise. Our approach led to an order-of-magnitude boost in the hit rate while making efficient use of experimental resources. Search strategies like this should find broad utility in enzyme engineering and accelerate the development of novel biocatalysts.
Collapse
Affiliation(s)
- Tobias Vornholt
- Department
of Biosystems Science and Engineering, ETH
Zurich, Mattenstrasse 26, 4058 Basel, Switzerland
- National
Centre of Competence in Research (NCCR) Molecular Systems Engineering, 4056 Basel,Switzerland
| | - Mojmír Mutný
- Department
of Computer Science, ETH Zurich, Andreasstrasse 5, 8092 Zurich, Switzerland
| | - Gregor W. Schmidt
- Department
of Biosystems Science and Engineering, ETH
Zurich, Mattenstrasse 26, 4058 Basel, Switzerland
| | - Christian Schellhaas
- Department
of Biosystems Science and Engineering, ETH
Zurich, Mattenstrasse 26, 4058 Basel, Switzerland
| | - Ryo Tachibana
- Department
of Chemistry, University of Basel, Mattenstrasse 24a, 4058 Basel, Switzerland
| | - Sven Panke
- Department
of Biosystems Science and Engineering, ETH
Zurich, Mattenstrasse 26, 4058 Basel, Switzerland
- National
Centre of Competence in Research (NCCR) Molecular Systems Engineering, 4056 Basel,Switzerland
| | - Thomas R. Ward
- National
Centre of Competence in Research (NCCR) Molecular Systems Engineering, 4056 Basel,Switzerland
- Department
of Chemistry, University of Basel, Mattenstrasse 24a, 4058 Basel, Switzerland
| | - Andreas Krause
- Department
of Computer Science, ETH Zurich, Andreasstrasse 5, 8092 Zurich, Switzerland
| | - Markus Jeschek
- Department
of Biosystems Science and Engineering, ETH
Zurich, Mattenstrasse 26, 4058 Basel, Switzerland
- Institute
of Microbiology, University of Regensburg, Universitätsstraße 31, 93053 Regensburg, Germany
| |
Collapse
|
16
|
Zhou Z, Zhang L, Yu Y, Wu B, Li M, Hong L, Tan P. Enhancing efficiency of protein language models with minimal wet-lab data through few-shot learning. Nat Commun 2024; 15:5566. [PMID: 38956442 PMCID: PMC11219809 DOI: 10.1038/s41467-024-49798-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2024] [Accepted: 06/11/2024] [Indexed: 07/04/2024] Open
Abstract
Accurately modeling the protein fitness landscapes holds great importance for protein engineering. Pre-trained protein language models have achieved state-of-the-art performance in predicting protein fitness without wet-lab experimental data, but their accuracy and interpretability remain limited. On the other hand, traditional supervised deep learning models require abundant labeled training examples for performance improvements, posing a practical barrier. In this work, we introduce FSFP, a training strategy that can effectively optimize protein language models under extreme data scarcity for fitness prediction. By combining meta-transfer learning, learning to rank, and parameter-efficient fine-tuning, FSFP can significantly boost the performance of various protein language models using merely tens of labeled single-site mutants from the target protein. In silico benchmarks across 87 deep mutational scanning datasets demonstrate FSFP's superiority over both unsupervised and supervised baselines. Furthermore, we successfully apply FSFP to engineer the Phi29 DNA polymerase through wet-lab experiments, achieving a 25% increase in the positive rate. These results underscore the potential of our approach in aiding AI-guided protein engineering.
Collapse
Affiliation(s)
- Ziyi Zhou
- School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, China
- Shanghai National Center for Applied Mathematics (SJTU Center) & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Liang Zhang
- School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Yuanxi Yu
- School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Banghao Wu
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Mingchen Li
- Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
- School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
| | - Liang Hong
- School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, China.
- Shanghai National Center for Applied Mathematics (SJTU Center) & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China.
- Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China.
- Zhang Jiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai, 201203, China.
| | - Pan Tan
- School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, China.
- Shanghai National Center for Applied Mathematics (SJTU Center) & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China.
- Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China.
| |
Collapse
|
17
|
Wossnig L, Furtmann N, Buchanan A, Kumar S, Greiff V. Best practices for machine learning in antibody discovery and development. Drug Discov Today 2024; 29:104025. [PMID: 38762089 DOI: 10.1016/j.drudis.2024.104025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2023] [Revised: 04/25/2024] [Accepted: 05/13/2024] [Indexed: 05/20/2024]
Abstract
In the past 40 years, therapeutic antibody discovery and development have advanced considerably, with machine learning (ML) offering a promising way to speed up the process by reducing costs and the number of experiments required. Recent progress in ML-guided antibody design and development (D&D) has been hindered by the diversity of data sets and evaluation methods, which makes it difficult to conduct comparisons and assess utility. Establishing standards and guidelines will be crucial for the wider adoption of ML and the advancement of the field. This perspective critically reviews current practices, highlights common pitfalls and proposes method development and evaluation guidelines for various ML-based techniques in therapeutic antibody D&D. Addressing challenges across the ML process, best practices are recommended for each stage to enhance reproducibility and progress.
Collapse
Affiliation(s)
- Leonard Wossnig
- LabGenius Ltd, The Biscuit Factory, 100 Drummond Road, London SE16 4DG, UK; Department of Computer Science, University College London, 66-72 Gower St, London WC1E 6EA, UK.
| | - Norbert Furtmann
- R&D Large Molecules Research Platform, Sanofi Deutschland GmbH, Industriepark Höchst, Frankfurt Am Main, Germany
| | - Andrew Buchanan
- Biologics Engineering, R&D, AstraZeneca, Cambridge CB2 0AA, UK
| | - Sandeep Kumar
- Computational Protein Design and Modeling Group, Computational Science, Moderna Therapeutics, 200 Technology Square, Cambridge, MA 02139, USA
| | - Victor Greiff
- Department of Immunology and Oslo University Hospital, University of Oslo, Oslo, Norway
| |
Collapse
|
18
|
Cocco S, Posani L, Monasson R. Functional effects of mutations in proteins can be predicted and interpreted by guided selection of sequence covariation information. Proc Natl Acad Sci U S A 2024; 121:e2312335121. [PMID: 38889151 PMCID: PMC11214004 DOI: 10.1073/pnas.2312335121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2023] [Accepted: 04/21/2024] [Indexed: 06/20/2024] Open
Abstract
Predicting the effects of one or more mutations to the in vivo or in vitro properties of a wild-type protein is a major computational challenge, due to the presence of epistasis, that is, of interactions between amino acids in the sequence. We introduce a computationally efficient procedure to build minimal epistatic models to predict mutational effects by combining evolutionary (homologous sequence) and few mutational-scan data. Mutagenesis measurements guide the selection of links in a sparse graphical model, while the parameters on the nodes and the edges are inferred from sequence data. We show, on 10 mutational scans, that our pipeline exhibits performances comparable to state-of-the-art deep networks trained on many more data, while requiring much less parameters and being hence more interpretable. In particular, the identified interactions adapt to the wild-type protein and to the fitness or biochemical property experimentally measured, mostly focus on key functional sites, and are not necessarily related to structural contacts. Therefore, our method is able to extract information relevant for one mutational experiment from homologous sequence data reflecting the multitude of structural and functional constraints acting on proteins throughout evolution.
Collapse
Affiliation(s)
- Simona Cocco
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005Paris, France
| | - Lorenzo Posani
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005Paris, France
| | - Rémi Monasson
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005Paris, France
| |
Collapse
|
19
|
Posfai A, Zhou J, McCandlish DM, Kinney JB. Gauge fixing for sequence-function relationships. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.05.12.593772. [PMID: 38798671 PMCID: PMC11118547 DOI: 10.1101/2024.05.12.593772] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/29/2024]
Abstract
Quantitative models of sequence-function relationships are ubiquitous in computational biology, e.g., for modeling the DNA binding of transcription factors or the fitness landscapes of proteins. Interpreting these models, however, is complicated by the fact that the values of model parameters can often be changed without affecting model predictions. Before the values of model parameters can be meaningfully interpreted, one must remove these degrees of freedom (called "gauge freedoms" in physics) by imposing additional constraints (a process called "fixing the gauge"). However, strategies for fixing the gauge of sequence-function relationships have received little attention. Here we derive an analytically tractable family of gauges for a large class of sequence-function relationships. These gauges are derived in the context of models with all-order interactions, but an important subset of these gauges can be applied to diverse types of models, including additive models, pairwise-interaction models, and models with higher-order interactions. Many commonly used gauges are special cases of gauges within this family. We demonstrate the utility of this family of gauges by showing how different choices of gauge can be used both to explore complex activity landscapes and to reveal simplified models that are approximately correct within localized regions of sequence space. The results provide practical gauge-fixing strategies and demonstrate the utility of gauge-fixing for model exploration and interpretation.
Collapse
Affiliation(s)
- Anna Posfai
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
| | - Juannan Zhou
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
- Department of Biology, University of Florida, Gainesville, FL, 32611
| | - David M. McCandlish
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
| | - Justin B. Kinney
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
| |
Collapse
|
20
|
Posfai A, McCandlish DM, Kinney JB. Symmetry, gauge freedoms, and the interpretability of sequence-function relationships. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.05.12.593774. [PMID: 38798625 PMCID: PMC11118426 DOI: 10.1101/2024.05.12.593774] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/29/2024]
Abstract
Quantitative models that describe how biological sequences encode functional activities are ubiquitous in modern biology. One important aspect of these models is that they commonly exhibit gauge freedoms, i.e., directions in parameter space that do not affect model predictions. In physics, gauge freedoms arise when physical theories are formulated in ways that respect fundamental symmetries. However, the connections that gauge freedoms in models of sequence-function relationships have to the symmetries of sequence space have yet to be systematically studied. Here we study the gauge freedoms of models that respect a specific symmetry of sequence space: the group of position-specific character permutations. We find that gauge freedoms arise when model parameters transform under redundant irreducible matrix representations of this group. Based on this finding, we describe an "embedding distillation" procedure that enables analytic calculation of the number of independent gauge freedoms, as well as efficient computation of a sparse basis for the space of gauge freedoms. We also study how parameter transformation behavior affects parameter interpretability. We find that in many (and possibly all) nontrivial models, the ability to interpret individual model parameters as quantifying intrinsic allelic effects requires that gauge freedoms be present. This finding establishes an incompatibility between two distinct notions of parameter interpretability. Our work thus advances the understanding of symmetries, gauge freedoms, and parameter interpretability in sequence-function relationships.
Collapse
Affiliation(s)
- Anna Posfai
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
| | - David M. McCandlish
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
| | - Justin B. Kinney
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
| |
Collapse
|
21
|
Telenti A, Auli M, Hie BL, Maher C, Saria S, Ioannidis JPA. Large language models for science and medicine. Eur J Clin Invest 2024; 54:e14183. [PMID: 38381530 DOI: 10.1111/eci.14183] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 12/22/2023] [Revised: 02/06/2024] [Accepted: 02/10/2024] [Indexed: 02/23/2024]
Abstract
Large language models (LLMs) are a type of machine learning model that learn statistical patterns over text, such as predicting the next words in a sequence of text. Both general purpose and task-specific LLMs have demonstrated potential across diverse applications. Science and medicine have many data types that are highly suitable for LLMs, such as scientific texts (publications, patents and textbooks), electronic medical records, large databases of DNA and protein sequences and chemical compounds. Carefully validated systems that can understand and reason across all these modalities may maximize benefits. Despite the inevitable limitations and caveats of any new technology and some uncertainties specific to LLMs, LLMs have the potential to be transformative in science and medicine.
Collapse
Affiliation(s)
- Amalio Telenti
- Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, California, USA
- Vir Biotechnology, Inc., San Francisco, California, USA
| | | | - Brian L Hie
- FAIR, Meta, Menlo Park, California, USA
- Department of Chemical Engineering, Stanford University, Stanford, California, USA
| | - Cyrus Maher
- Vir Biotechnology, Inc., San Francisco, California, USA
| | - Suchi Saria
- Malone Center for Engineering and Healthcare, Johns Hopkins University, Baltimore, Maryland, USA
| | - John P A Ioannidis
- Department of Medicine, Stanford University, Stanford, California, USA
- Department of Epidemiology and Population Health, Stanford University, Stanford, California, USA
- Department of Biomedical Data Science, Stanford University, Stanford, California, USA
- Department of Statistics, Stanford University, Stanford, California, USA
- Meta-Research Innovation Center at Stanford (METRICS), Stanford University, Stanford, California, USA
| |
Collapse
|
22
|
Petersen BM, Kirby MB, Chrispens KM, Irvin OM, Strawn IK, Haas CM, Walker AM, Baumer ZT, Ulmer SA, Ayala E, Rhodes ER, Guthmiller JJ, Steiner PJ, Whitehead TA. An integrated technology for quantitative wide mutational scanning of human antibody Fab libraries. Nat Commun 2024; 15:3974. [PMID: 38730230 PMCID: PMC11087541 DOI: 10.1038/s41467-024-48072-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2023] [Accepted: 04/19/2024] [Indexed: 05/12/2024] Open
Abstract
Antibodies are engineerable quantities in medicine. Learning antibody molecular recognition would enable the in silico design of high affinity binders against nearly any proteinaceous surface. Yet, publicly available experiment antibody sequence-binding datasets may not contain the mutagenic, antigenic, or antibody sequence diversity necessary for deep learning approaches to capture molecular recognition. In part, this is because limited experimental platforms exist for assessing quantitative and simultaneous sequence-function relationships for multiple antibodies. Here we present MAGMA-seq, an integrated technology that combines multiple antigens and multiple antibodies and determines quantitative biophysical parameters using deep sequencing. We demonstrate MAGMA-seq on two pooled libraries comprising mutants of nine different human antibodies spanning light chain gene usage, CDR H3 length, and antigenic targets. We demonstrate the comprehensive mapping of potential antibody development pathways, sequence-binding relationships for multiple antibodies simultaneously, and identification of paratope sequence determinants for binding recognition for broadly neutralizing antibodies (bnAbs). MAGMA-seq enables rapid and scalable antibody engineering of multiple lead candidates because it can measure binding for mutants of many given parental antibodies in a single experiment.
Collapse
Affiliation(s)
- Brian M Petersen
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, USA
| | - Monica B Kirby
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, USA
| | - Karson M Chrispens
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, USA
| | - Olivia M Irvin
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, USA
| | - Isabell K Strawn
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, USA
| | - Cyrus M Haas
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, USA
| | - Alexis M Walker
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, USA
| | - Zachary T Baumer
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, USA
| | - Sophia A Ulmer
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, USA
| | - Edgardo Ayala
- Department of Immunology and Microbiology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
| | - Emily R Rhodes
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, USA
| | - Jenna J Guthmiller
- Department of Immunology and Microbiology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
| | - Paul J Steiner
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, USA
| | - Timothy A Whitehead
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, USA.
| |
Collapse
|
23
|
Zhong G, Zhao Y, Zhuang D, Chung WK, Shen Y. PreMode predicts mode-of-action of missense variants by deep graph representation learning of protein sequence and structural context. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.02.20.581321. [PMID: 38746140 PMCID: PMC11092447 DOI: 10.1101/2024.02.20.581321] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/16/2024]
Abstract
Accurate prediction of the functional impact of missense variants is important for disease gene discovery, clinical genetic diagnostics, therapeutic strategies, and protein engineering. Previous efforts have focused on predicting a binary pathogenicity classification, but the functional impact of missense variants is multi-dimensional. Pathogenic missense variants in the same gene may act through different modes of action (i.e., gain/loss-of-function) by affecting different aspects of protein function. They may result in distinct clinical conditions that require different treatments. We developed a new method, PreMode, to perform gene-specific mode-of-action predictions. PreMode models effects of coding sequence variants using SE(3)-equivariant graph neural networks on protein sequences and structures. Using the largest-to-date set of missense variants with known modes of action, we showed that PreMode reached state-of-the-art performance in multiple types of mode-of-action predictions by efficient transfer-learning. Additionally, PreMode's prediction of G/LoF variants in a kinase is consistent with inactive-active conformation transition energy changes. Finally, we show that PreMode enables efficient study design of deep mutational scans and optimization in protein engineering.
Collapse
|
24
|
Guo J, Lin LF, Oraskovich SV, Rivera de Jesús JA, Listgarten J, Schaffer DV. Computationally guided AAV engineering for enhanced gene delivery. Trends Biochem Sci 2024; 49:457-469. [PMID: 38531696 PMCID: PMC11456259 DOI: 10.1016/j.tibs.2024.03.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2023] [Revised: 02/22/2024] [Accepted: 03/01/2024] [Indexed: 03/28/2024]
Abstract
Gene delivery vehicles based on adeno-associated viruses (AAVs) are enabling increasing success in human clinical trials, and they offer the promise of treating a broad spectrum of both genetic and non-genetic disorders. However, delivery efficiency and targeting must be improved to enable safe and effective therapies. In recent years, considerable effort has been invested in creating AAV variants with improved delivery, and computational approaches have been increasingly harnessed for AAV engineering. In this review, we discuss how computationally designed AAV libraries are enabling directed evolution. Specifically, we highlight approaches that harness sequences outputted by next-generation sequencing (NGS) coupled with machine learning (ML) to generate new functional AAV capsids and related regulatory elements, pushing the frontier of what vector engineering and gene therapy may achieve.
Collapse
Affiliation(s)
- Jingxuan Guo
- California Institute for Quantitative Biosciences (QB3), University of California, Berkeley, CA 94720, USA
| | - Li F Lin
- Department of Chemical and Biomolecular Engineering, University of California, Berkeley, CA 94720, USA
| | - Sydney V Oraskovich
- Department of Bioengineering, University of California, Berkeley, CA 94720, USA; Graduate Program in Bioengineering, University of California, San Francisco and University of California, Berkeley, CA 94720, USA
| | - Julio A Rivera de Jesús
- Department of Bioengineering, University of California, Berkeley, CA 94720, USA; Graduate Program in Bioengineering, University of California, San Francisco and University of California, Berkeley, CA 94720, USA; Department of Neurological Surgery, University of California, San Francisco, CA 94143, USA
| | - Jennifer Listgarten
- Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720, USA
| | - David V Schaffer
- California Institute for Quantitative Biosciences (QB3), University of California, Berkeley, CA 94720, USA; Department of Chemical and Biomolecular Engineering, University of California, Berkeley, CA 94720, USA; Department of Bioengineering, University of California, Berkeley, CA 94720, USA; Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA.
| |
Collapse
|
25
|
Johnson SR, Fu X, Viknander S, Goldin C, Monaco S, Zelezniak A, Yang KK. Computational scoring and experimental evaluation of enzymes generated by neural networks. Nat Biotechnol 2024:10.1038/s41587-024-02214-2. [PMID: 38653796 DOI: 10.1038/s41587-024-02214-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/08/2023] [Accepted: 03/20/2024] [Indexed: 04/25/2024]
Abstract
In recent years, generative protein sequence models have been developed to sample novel sequences. However, predicting whether generated proteins will fold and function remains challenging. We evaluate a set of 20 diverse computational metrics to assess the quality of enzyme sequences produced by three contrasting generative models: ancestral sequence reconstruction, a generative adversarial network and a protein language model. Focusing on two enzyme families, we expressed and purified over 500 natural and generated sequences with 70-90% identity to the most similar natural sequences to benchmark computational metrics for predicting in vitro enzyme activity. Over three rounds of experiments, we developed a computational filter that improved the rate of experimental success by 50-150%. The proposed metrics and models will drive protein engineering research by serving as a benchmark for generative protein sequence models and helping to select active variants for experimental testing.
Collapse
Affiliation(s)
| | - Xiaozhi Fu
- Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden
| | - Sandra Viknander
- Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden
| | - Clara Goldin
- Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden
| | | | - Aleksej Zelezniak
- Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden.
- Institute of Biotechnology, Life Sciences Centre, Vilnius University, Vilnius, Lithuania.
- Randall Centre for Cell & Molecular Biophysics, King's College London, Guy's Campus, London, UK.
| | | |
Collapse
|
26
|
Gelman S, Johnson B, Freschlin C, D'Costa S, Gitter A, Romero PA. Biophysics-based protein language models for protein engineering. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.03.15.585128. [PMID: 38559182 PMCID: PMC10980077 DOI: 10.1101/2024.03.15.585128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/04/2024]
Abstract
Protein language models trained on evolutionary data have emerged as powerful tools for predictive problems involving protein sequence, structure, and function. However, these models overlook decades of research into biophysical factors governing protein function. We propose Mutational Effect Transfer Learning (METL), a protein language model framework that unites advanced machine learning and biophysical modeling. Using the METL framework, we pretrain transformer-based neural networks on biophysical simulation data to capture fundamental relationships between protein sequence, structure, and energetics. We finetune METL on experimental sequence-function data to harness these biophysical signals and apply them when predicting protein properties like thermostability, catalytic activity, and fluorescence. METL excels in challenging protein engineering tasks like generalizing from small training sets and position extrapolation, although existing methods that train on evolutionary signals remain powerful for many types of experimental assays. We demonstrate METL's ability to design functional green fluorescent protein variants when trained on only 64 examples, showcasing the potential of biophysics-based protein language models for protein engineering.
Collapse
Affiliation(s)
- Sam Gelman
- Department of Computer Sciences, University of Wisconsin-Madison
- Morgridge Institute for Research
| | - Bryce Johnson
- Department of Computer Sciences, University of Wisconsin-Madison
- Morgridge Institute for Research
| | | | - Sameer D'Costa
- Department of Biochemistry, University of Wisconsin-Madison
| | - Anthony Gitter
- Department of Computer Sciences, University of Wisconsin-Madison
- Morgridge Institute for Research
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison
| | | |
Collapse
|
27
|
Case M, Smith M, Vinh J, Thurber G. Machine learning to predict continuous protein properties from binary cell sorting data and map unseen sequence space. Proc Natl Acad Sci U S A 2024; 121:e2311726121. [PMID: 38451939 PMCID: PMC10945751 DOI: 10.1073/pnas.2311726121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2023] [Accepted: 12/27/2023] [Indexed: 03/09/2024] Open
Abstract
Proteins are a diverse class of biomolecules responsible for wide-ranging cellular functions, from catalyzing reactions to recognizing pathogens. The ability to evolve proteins rapidly and inexpensively toward improved properties is a common objective for protein engineers. Powerful high-throughput methods like fluorescent activated cell sorting and next-generation sequencing have dramatically improved directed evolution experiments. However, it is unclear how to best leverage these data to characterize protein fitness landscapes more completely and identify lead candidates. In this work, we develop a simple yet powerful framework to improve protein optimization by predicting continuous protein properties from simple directed evolution experiments using interpretable, linear machine learning models. Importantly, we find that these models, which use data from simple but imprecise experimental estimates of protein fitness, have predictive capabilities that approach more precise but expensive data. Evaluated across five diverse protein engineering tasks, continuous properties are consistently predicted from readily available deep sequencing data, demonstrating that protein fitness space can be reasonably well modeled by linear relationships among sequence mutations. To prospectively test the utility of this approach, we generated a library of stapled peptides and applied the framework to predict affinity and specificity from simple cell sorting data. We then coupled integer linear programming, a method to optimize protein fitness from linear weights, with mutation scores from machine learning to identify variants in unseen sequence space that have improved and co-optimal properties. This approach represents a versatile tool for improved analysis and identification of protein variants across many domains of protein engineering.
Collapse
Affiliation(s)
- Marshall Case
- Chemical Engineering, University of Michigan, Ann Arbor, MI48109
| | - Matthew Smith
- Chemical Engineering, University of Michigan, Ann Arbor, MI48109
- Biointerfaces Institute, University of Michigan, Ann Arbor, MI48109
| | - Jordan Vinh
- Biomedical Engineering, University of Michigan, Ann Arbor, MI48109
| | - Greg Thurber
- Chemical Engineering, University of Michigan, Ann Arbor, MI48109
- Biomedical Engineering, University of Michigan, Ann Arbor, MI48109
| |
Collapse
|
28
|
Whitehead TA. Deep screening of antibody-antigen affinities. Nat Biomed Eng 2024; 8:203-204. [PMID: 38151637 DOI: 10.1038/s41551-023-01169-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2023]
Affiliation(s)
- Timothy A Whitehead
- Department of Chemical and Biological Engineering, University of Colorado, Boulder, CO, USA.
| |
Collapse
|
29
|
Porebski BT, Balmforth M, Browne G, Riley A, Jamali K, Fürst MJLJ, Velic M, Buchanan A, Minter R, Vaughan T, Holliger P. Rapid discovery of high-affinity antibodies via massively parallel sequencing, ribosome display and affinity screening. Nat Biomed Eng 2024; 8:214-232. [PMID: 37814006 DOI: 10.1038/s41551-023-01093-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2022] [Accepted: 08/23/2023] [Indexed: 10/11/2023]
Abstract
Developing therapeutic antibodies is laborious and costly. Here we report a method for antibody discovery that leverages the Illumina HiSeq platform to, within 3 days, screen in the order of 108 antibody-antigen interactions. The method, which we named 'deep screening', involves the clustering and sequencing of antibody libraries, the conversion of the DNA clusters into complementary RNA clusters covalently linked to the instrument's flow-cell surface on the same location, the in situ translation of the clusters into antibodies tethered via ribosome display, and their screening via fluorescently labelled antigens. By using deep screening, we discovered low-nanomolar nanobodies to a model antigen using 4 × 106 unique variants from yeast-display-enriched libraries, and high-picomolar single-chain antibody fragment leads for human interleukin-7 directly from unselected synthetic repertoires. We also leveraged deep screening of a library of 2.4 × 105 sequences of the third complementarity-determining region of the heavy chain of an anti-human epidermal growth factor receptor 2 (HER2) antibody as input for a large language model that generated new single-chain antibody fragment sequences with higher affinity for HER2 than those in the original library.
Collapse
Affiliation(s)
| | | | | | - Aidan Riley
- Biologics Engineering, AstraZeneca, Cambridge, UK
| | | | - Maximillian J L J Fürst
- MRC Laboratory of Molecular Biology, Cambridge, UK
- Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Groningen, the Netherlands
| | | | | | - Ralph Minter
- Biologics Engineering, AstraZeneca, Cambridge, UK
- Alchemab Therapeutics, London, UK
| | | | | |
Collapse
|
30
|
Yang J, Li FZ, Arnold FH. Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering. ACS CENTRAL SCIENCE 2024; 10:226-241. [PMID: 38435522 PMCID: PMC10906252 DOI: 10.1021/acscentsci.3c01275] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 10/17/2023] [Revised: 12/26/2023] [Accepted: 01/16/2024] [Indexed: 03/05/2024]
Abstract
Enzymes can be engineered at the level of their amino acid sequences to optimize key properties such as expression, stability, substrate range, and catalytic efficiency-or even to unlock new catalytic activities not found in nature. Because the search space of possible proteins is vast, enzyme engineering usually involves discovering an enzyme starting point that has some level of the desired activity followed by directed evolution to improve its "fitness" for a desired application. Recently, machine learning (ML) has emerged as a powerful tool to complement this empirical process. ML models can contribute to (1) starting point discovery by functional annotation of known protein sequences or generating novel protein sequences with desired functions and (2) navigating protein fitness landscapes for fitness optimization by learning mappings between protein sequences and their associated fitness values. In this Outlook, we explain how ML complements enzyme engineering and discuss its future potential to unlock improved engineering outcomes.
Collapse
Affiliation(s)
- Jason Yang
- Division
of Chemistry and Chemical Engineering, California
Institute of Technology, Pasadena, California 91125, United States
| | - Francesca-Zhoufan Li
- Division
of Biology and Biological Engineering, California
Institute of Technology, Pasadena, California 91125, United States
| | - Frances H. Arnold
- Division
of Chemistry and Chemical Engineering, California
Institute of Technology, Pasadena, California 91125, United States
- Division
of Biology and Biological Engineering, California
Institute of Technology, Pasadena, California 91125, United States
| |
Collapse
|
31
|
Chu HY, Fong JHC, Thean DGL, Zhou P, Fung FKC, Huang Y, Wong ASL. Accurate top protein variant discovery via low-N pick-and-validate machine learning. Cell Syst 2024; 15:193-203.e6. [PMID: 38340729 DOI: 10.1016/j.cels.2024.01.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2023] [Revised: 10/11/2023] [Accepted: 01/18/2024] [Indexed: 02/12/2024]
Abstract
A strategy to obtain the greatest number of best-performing variants with least amount of experimental effort over the vast combinatorial mutational landscape would have enormous utility in boosting resource producibility for protein engineering. Toward this goal, we present a simple and effective machine learning-based strategy that outperforms other state-of-the-art methods. Our strategy integrates zero-shot prediction and multi-round sampling to direct active learning via experimenting with only a few predicted top variants. We find that four rounds of low-N pick-and-validate sampling of 12 variants for machine learning yielded the best accuracy of up to 92.6% in selecting the true top 1% variants in combinatorial mutant libraries, whereas two rounds of 24 variants can also be used. We demonstrate our strategy in successfully discovering high-performance protein variants from diverse families including the CRISPR-based genome editors, supporting its generalizable application for solving protein engineering tasks. A record of this paper's transparent peer review process is included in the supplemental information.
Collapse
Affiliation(s)
- Hoi Yee Chu
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
| | - John H C Fong
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
| | - Dawn G L Thean
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
| | - Peng Zhou
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
| | - Frederic K C Fung
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
| | - Yuanhua Huang
- School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Department of Statistics and Actuarial Science, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
| | - Alan S L Wong
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China.
| |
Collapse
|
32
|
Ao YF, Dörr M, Menke MJ, Born S, Heuson E, Bornscheuer UT. Data-Driven Protein Engineering for Improving Catalytic Activity and Selectivity. Chembiochem 2024; 25:e202300754. [PMID: 38029350 DOI: 10.1002/cbic.202300754] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2023] [Revised: 11/28/2023] [Accepted: 11/29/2023] [Indexed: 12/01/2023]
Abstract
Protein engineering is essential for altering the substrate scope, catalytic activity and selectivity of enzymes for applications in biocatalysis. However, traditional approaches, such as directed evolution and rational design, encounter the challenge in dealing with the experimental screening process of a large protein mutation space. Machine learning methods allow the approximation of protein fitness landscapes and the identification of catalytic patterns using limited experimental data, thus providing a new avenue to guide protein engineering campaigns. In this concept article, we review machine learning models that have been developed to assess enzyme-substrate-catalysis performance relationships aiming to improve enzymes through data-driven protein engineering. Furthermore, we prospect the future development of this field to provide additional strategies and tools for achieving desired activities and selectivities.
Collapse
Affiliation(s)
- Yu-Fei Ao
- Department of Biotechnology and Enzyme Catalysis, Institute of Biochemistry, University of Greifswald, Felix-Hausdorff-Str. 4, 17487, Greifswald, Germany
- Beijing National Laboratory for Molecular Sciences, CAS Key Laboratory of Molecular Recognition and Function, Institute of Chemistry, Chinese Academy of Sciences, Zhongguancun North First Street 2, Beijing, 100190, China
- University of Chinese Academy of Sciences, Yuquan Road 19(A), Beijing, 100049, China
| | - Mark Dörr
- Department of Biotechnology and Enzyme Catalysis, Institute of Biochemistry, University of Greifswald, Felix-Hausdorff-Str. 4, 17487, Greifswald, Germany
| | - Marian J Menke
- Department of Biotechnology and Enzyme Catalysis, Institute of Biochemistry, University of Greifswald, Felix-Hausdorff-Str. 4, 17487, Greifswald, Germany
| | - Stefan Born
- Technische Universität Berlin, Chair of Bioprocess Engineering, Ackerstraße 76, 13355, Berlin, Germany
| | - Egon Heuson
- Univ. Lille, CNRS, Centrale Lille, Univ. Artois, UMR 8181 UCCS, Unité de Catalyse et Chimie du Solide, 59000, Lille, France
| | - Uwe T Bornscheuer
- Department of Biotechnology and Enzyme Catalysis, Institute of Biochemistry, University of Greifswald, Felix-Hausdorff-Str. 4, 17487, Greifswald, Germany
| |
Collapse
|
33
|
Abstract
Machine learning-based design has gained traction in the sciences, most notably in the design of small molecules, materials, and proteins, with societal applications ranging from drug development and plastic degradation to carbon sequestration. When designing objects to achieve novel property values with machine learning, one faces a fundamental challenge: how to push past the frontier of current knowledge, distilled from the training data into the model, in a manner that rationally controls the risk of failure. If one trusts learned models too much in extrapolation, one is likely to design rubbish. In contrast, if one does not extrapolate, one cannot find novelty. Herein, we ponder how one might strike a useful balance between these two extremes. We focus in particular on designing proteins with novel property values, although much of our discussion is relevant to machine learning-based design more broadly.
Collapse
Affiliation(s)
- Clara Fannjiang
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720, USA
| | - Jennifer Listgarten
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720, USA
| |
Collapse
|
34
|
Petersen BM, Kirby MB, Chrispens KM, Irvin OM, Strawn IK, Haas CM, Walker AM, Baumer ZT, Ulmer SA, Ayala E, Rhodes ER, Guthmiller JJ, Steiner PJ, Whitehead TA. An integrated technology for quantitative wide mutational scanning of human antibody Fab libraries. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.01.16.575852. [PMID: 38293170 PMCID: PMC10827193 DOI: 10.1101/2024.01.16.575852] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/01/2024]
Abstract
Antibodies are engineerable quantities in medicine. Learning antibody molecular recognition would enable the in silico design of high affinity binders against nearly any proteinaceous surface. Yet, publicly available experiment antibody sequence-binding datasets may not contain the mutagenic, antigenic, or antibody sequence diversity necessary for deep learning approaches to capture molecular recognition. In part, this is because limited experimental platforms exist for assessing quantitative and simultaneous sequence-function relationships for multiple antibodies. Here we present MAGMA-seq, an integrated technology that combines multiple antigens and multiple antibodies and determines quantitative biophysical parameters using deep sequencing. We demonstrate MAGMA-seq on two pooled libraries comprising mutants of ten different human antibodies spanning light chain gene usage, CDR H3 length, and antigenic targets. We demonstrate the comprehensive mapping of potential antibody development pathways, sequence-binding relationships for multiple antibodies simultaneously, and identification of paratope sequence determinants for binding recognition for broadly neutralizing antibodies (bnAbs). MAGMA-seq enables rapid and scalable antibody engineering of multiple lead candidates because it can measure binding for mutants of many given parental antibodies in a single experiment.
Collapse
Affiliation(s)
- Brian M. Petersen
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, 80305, USA
| | - Monica B. Kirby
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, 80305, USA
| | - Karson M. Chrispens
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, 80305, USA
| | - Olivia M. Irvin
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, 80305, USA
| | - Isabell K. Strawn
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, 80305, USA
| | - Cyrus M. Haas
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, 80305, USA
| | - Alexis M. Walker
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, 80305, USA
| | - Zachary T. Baumer
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, 80305, USA
| | - Sophia A. Ulmer
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, 80305, USA
| | - Edgardo Ayala
- Department of Immunology and Microbiology, University of Colorado Anschutz Medical Campus, Aurora, CO 80045
| | - Emily R. Rhodes
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, 80305, USA
| | - Jenna J. Guthmiller
- Department of Immunology and Microbiology, University of Colorado Anschutz Medical Campus, Aurora, CO 80045
| | - Paul J. Steiner
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, 80305, USA
| | - Timothy A. Whitehead
- Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO, 80305, USA
| |
Collapse
|
35
|
Ramachandran A, Lumetta SS, Chen D. PandoGen: Generating complete instances of future SARS-CoV-2 sequences using Deep Learning. PLoS Comput Biol 2024; 20:e1011790. [PMID: 38241392 PMCID: PMC10829978 DOI: 10.1371/journal.pcbi.1011790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/22/2023] [Revised: 01/31/2024] [Accepted: 12/27/2023] [Indexed: 01/21/2024] Open
Abstract
One of the challenges in a viral pandemic is the emergence of novel variants with different phenotypical characteristics. An ability to forecast future viral individuals at the sequence level enables advance preparation by characterizing the sequences and closing vulnerabilities in current preventative and therapeutic methods. In this article, we explore, in the context of a viral pandemic, the problem of generating complete instances of undiscovered viral protein sequences, which have a high likelihood of being discovered in the future using protein language models. Current approaches to training these models fit model parameters to a known sequence set, which does not suit pandemic forecasting as future sequences differ from known sequences in some respects. To address this, we develop a novel method, called PandoGen, to train protein language models towards the pandemic protein forecasting task. PandoGen combines techniques such as synthetic data generation, conditional sequence generation, and reward-based learning, enabling the model to forecast future sequences, with a high propensity to spread. Applying our method to modeling the SARS-CoV-2 Spike protein sequence, we find empirically that our model forecasts twice as many novel sequences with five times the case counts compared to a model that is 30× larger. Our method forecasts unseen lineages months in advance, whereas models 4× and 30× larger forecast almost no new lineages. When trained on data available up to a month before the onset of important Variants of Concern, our method consistently forecasts sequences belonging to those variants within tight sequence budgets.
Collapse
Affiliation(s)
- Anand Ramachandran
- University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
| | - Steven S. Lumetta
- University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
| | - Deming Chen
- University of Illinois at Urbana-Champaign, Urbana, Illinois, United States of America
| |
Collapse
|
36
|
Marchal D, Schulz L, Schuster I, Ivanovska J, Paczia N, Prinz S, Zarzycki J, Erb TJ. Machine Learning-Supported Enzyme Engineering toward Improved CO 2-Fixation of Glycolyl-CoA Carboxylase. ACS Synth Biol 2023; 12:3521-3530. [PMID: 37983631 PMCID: PMC10729300 DOI: 10.1021/acssynbio.3c00403] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2023] [Revised: 11/01/2023] [Accepted: 11/07/2023] [Indexed: 11/22/2023]
Abstract
Glycolyl-CoA carboxylase (GCC) is a new-to-nature enzyme that catalyzes the key reaction in the tartronyl-CoA (TaCo) pathway, a synthetic photorespiration bypass that was recently designed to improve photosynthetic CO2 fixation. GCC was created from propionyl-CoA carboxylase (PCC) through five mutations. However, despite reaching activities of naturally evolved biotin-dependent carboxylases, the quintuple substitution variant GCC M5 still lags behind 4-fold in catalytic efficiency compared to its template PCC and suffers from futile ATP hydrolysis during CO2 fixation. To further improve upon GCC M5, we developed a machine learning-supported workflow that reduces screening efforts for identifying improved enzymes. Using this workflow, we present two novel GCC variants with 2-fold increased carboxylation rate and 60% reduced energy demand, respectively, which are able to address kinetic and thermodynamic limitations of the TaCo pathway. Our work highlights the potential of combining machine learning and directed evolution strategies to reduce screening efforts in enzyme engineering.
Collapse
Affiliation(s)
- Daniel
G. Marchal
- Department
of Biochemistry and Synthetic Metabolism, Max-Planck-Institute for Terrestrial Microbiology, Marburg 35043, Germany
| | - Luca Schulz
- Department
of Biochemistry and Synthetic Metabolism, Max-Planck-Institute for Terrestrial Microbiology, Marburg 35043, Germany
| | | | | | - Nicole Paczia
- Core
Facility for Metabolomics and Small Molecule Mass Spectrometry, Max-Planck-Institute for Terrestrial Microbiology, Marburg 35043, Germany
| | - Simone Prinz
- Central
Electron Microscopy Facility, Max-Planck-Institute
of Biophysics, Frankfurt 60438, Germany
| | - Jan Zarzycki
- Department
of Biochemistry and Synthetic Metabolism, Max-Planck-Institute for Terrestrial Microbiology, Marburg 35043, Germany
| | - Tobias J. Erb
- Department
of Biochemistry and Synthetic Metabolism, Max-Planck-Institute for Terrestrial Microbiology, Marburg 35043, Germany
- SYNMIKRO
Center for Synthetic Microbiology, Marburg 35032, Germany
| |
Collapse
|
37
|
Notin P, Kollasch AW, Ritter D, van Niekerk L, Paul S, Spinner H, Rollins N, Shaw A, Weitzman R, Frazer J, Dias M, Franceschi D, Orenbuch R, Gal Y, Marks DS. ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.12.07.570727. [PMID: 38106144 PMCID: PMC10723403 DOI: 10.1101/2023.12.07.570727] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/19/2023]
Abstract
Predicting the effects of mutations in proteins is critical to many applications, from understanding genetic disease to designing novel proteins that can address our most pressing challenges in climate, agriculture and healthcare. Despite a surge in machine learning-based protein models to tackle these questions, an assessment of their respective benefits is challenging due to the use of distinct, often contrived, experimental datasets, and the variable performance of models across different protein families. Addressing these challenges requires scale. To that end we introduce ProteinGym, a large-scale and holistic set of benchmarks specifically designed for protein fitness prediction and design. It encompasses both a broad collection of over 250 standardized deep mutational scanning assays, spanning millions of mutated sequences, as well as curated clinical datasets providing high-quality expert annotations about mutation effects. We devise a robust evaluation framework that combines metrics for both fitness prediction and design, factors in known limitations of the underlying experimental methods, and covers both zero-shot and supervised settings. We report the performance of a diverse set of over 70 high-performing models from various subfields (eg., alignment-based, inverse folding) into a unified benchmark suite. We open source the corresponding codebase, datasets, MSAs, structures, model predictions and develop a user-friendly website that facilitates data access and analysis.
Collapse
Affiliation(s)
| | | | | | | | | | | | | | - Ada Shaw
- Applied Mathematics, Harvard University
| | | | | | - Mafalda Dias
- Centre for Genomic Regulation, Universitat Pompeu Fabra
| | | | | | - Yarin Gal
- Computer Science, University of Oxford
| | | |
Collapse
|
38
|
Notin P, Marks DS, Weitzman R, Gal Y. ProteinNPT: Improving Protein Property Prediction and Design with Non-Parametric Transformers. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.12.06.570473. [PMID: 38106034 PMCID: PMC10723423 DOI: 10.1101/2023.12.06.570473] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/19/2023]
Abstract
Protein design holds immense potential for optimizing naturally occurring proteins, with broad applications in drug discovery, material design, and sustainability. However, computational methods for protein engineering are confronted with significant challenges, such as an expansive design space, sparse functional regions, and a scarcity of available labels. These issues are further exacerbated in practice by the fact most real-life design scenarios necessitate the simultaneous optimization of multiple properties. In this work, we introduce ProteinNPT, a non-parametric transformer variant tailored to protein sequences and particularly suited to label-scarce and multi-task learning settings. We first focus on the supervised fitness prediction setting and develop several cross-validation schemes which support robust performance assessment. We subsequently reimplement prior top-performing baselines, introduce several extensions of these baselines by integrating diverse branches of the protein engineering literature, and demonstrate that ProteinNPT consistently outperforms all of them across a diverse set of protein property prediction tasks. Finally, we demonstrate the value of our approach for iterative protein design across extensive in silico Bayesian optimization and conditional sampling experiments.
Collapse
Affiliation(s)
| | | | | | - Yarin Gal
- Computer Science, University of Oxford
| |
Collapse
|
39
|
James JK, Norland K, Johar AS, Kullo IJ. Deep generative models of LDLR protein structure to predict variant pathogenicity. J Lipid Res 2023; 64:100455. [PMID: 37821076 PMCID: PMC10696256 DOI: 10.1016/j.jlr.2023.100455] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2023] [Revised: 09/16/2023] [Accepted: 10/05/2023] [Indexed: 10/13/2023] Open
Abstract
The complex structure and function of low density lipoprotein receptor (LDLR) makes classification of protein-coding missense variants challenging. Deep generative models, including Evolutionary model of Variant Effect (EVE), Evolutionary Scale Modeling (ESM), and AlphaFold 2 (AF2), have enabled significant progress in the prediction of protein structure and function. ESM and EVE directly estimate the likelihood of a variant sequence but are purely data-driven and challenging to interpret. AF2 predicts LDLR structures, but variant effects are explicitly modeled by estimating changes in stability. We tested the effectiveness of these models for predicting variant pathogenicity compared to established methods. AF2 produced two distinct conformations based on a novel hinge mechanism. Within ESM's hidden space, benign and pathogenic variants had different distributions. In EVE, these distributions were similar. EVE and ESM were comparable to Polyphen-2, SIFT, REVEL, and Primate AI for predicting binary classifications in ClinVar. However, they were more strongly correlated with experimental measures of LDL uptake. AF2 poorly performed in these tasks. Using the UK Biobank to compare association with clinical phenotypes, ESM and EVE were more strongly associated with serum LDL-C than Polyphen-2. ESM was able to identify variants with more extreme LDL-C levels than EVE and had a significantly stronger association with atherosclerotic cardiovascular disease. In conclusion, AF2 predicted LDLR structures do not accurately model variant pathogenicity. ESM and EVE are competitive with prior scoring methods for prediction based on binary classifications in ClinVar but are superior based on correlations with experimental assays and clinical phenotypes.
Collapse
Affiliation(s)
- Jose K James
- Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN, USA
| | - Kristjan Norland
- Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN, USA
| | - Angad S Johar
- Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN, USA
| | - Iftikhar J Kullo
- Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN, USA; Gonda Vascular Center, Mayo Clinic, Rochester, MN, USA.
| |
Collapse
|
40
|
Xie WJ, Warshel A. Harnessing generative AI to decode enzyme catalysis and evolution for enhanced engineering. Natl Sci Rev 2023; 10:nwad331. [PMID: 38299119 PMCID: PMC10829072 DOI: 10.1093/nsr/nwad331] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2023] [Revised: 09/27/2023] [Accepted: 10/13/2023] [Indexed: 02/02/2024] Open
Abstract
Enzymes, as paramount protein catalysts, occupy a central role in fostering remarkable progress across numerous fields. However, the intricacy of sequence-function relationships continues to obscure our grasp of enzyme behaviors and curtails our capabilities in rational enzyme engineering. Generative artificial intelligence (AI), known for its proficiency in handling intricate data distributions, holds the potential to offer novel perspectives in enzyme research. Generative models could discern elusive patterns within the vast sequence space and uncover new functional enzyme sequences. This review highlights the recent advancements in employing generative AI for enzyme sequence analysis. We delve into the impact of generative AI in predicting mutation effects on enzyme fitness, catalytic activity and stability, rationalizing the laboratory evolution of de novo enzymes, and decoding protein sequence semantics and their application in enzyme engineering. Notably, the prediction of catalytic activity and stability of enzymes using natural protein sequences serves as a vital link, indicating how enzyme catalysis shapes enzyme evolution. Overall, we foresee that the integration of generative AI into enzyme studies will remarkably enhance our knowledge of enzymes and expedite the creation of superior biocatalysts.
Collapse
Affiliation(s)
- Wen Jun Xie
- Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development, Genetics Institute, University of Florida, Gainesville, FL 32610, USA
| | - Arieh Warshel
- Department of Chemistry, University of Southern California, Los Angeles, CA 90089, USA
| |
Collapse
|
41
|
Nisonoff H, Wang Y, Listgarten J. Coherent Blending of Biophysics-Based Knowledge with Bayesian Neural Networks for Robust Protein Property Prediction. ACS Synth Biol 2023; 12:3242-3251. [PMID: 37888887 DOI: 10.1021/acssynbio.3c00217] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/28/2023]
Abstract
Predicting properties of proteins is of interest for basic biological understanding and protein engineering alike. Increasingly, machine learning (ML) approaches are being used for this task. However, the accuracy of such ML models typically degrades as test proteins stray further from the training data distribution. On the other hand, models that are more data-free, such as biophysics-based models, are typically uniformly accurate over all of the protein space, even if inferior for test points close to the training distribution. Consequently, being able to cohesively blend these two types of information within one model, as appropriate in different parts of the protein space, will improve overall importance. Herein, we tackle just this problem to yield a simple, practical, and scalable approach that can be easily implemented. In particular, we use a Bayesian formulation to integrate biophysical knowledge into neural networks. However, in doing so, a technical challenge arises: Bayesian neural networks (BNNs) enable the user to specify prior information only on the neural network weight parameters, rather than on the function values given to us from a typical biophysics-based model. Consequently, we devise a principled probabilistic method to overcome this challenge. Our approach yields intuitively pleasing results: predictions rely more heavily on the biophysical prior information when the BNN epistemic uncertainty─uncertainty arising from a lack of training data rather than sensor noise─is large and more heavily on the neural network when the epistemic uncertainty is small. We demonstrate this approach on an illustrative synthetic example, on two examples of protein property prediction (fluorescence and binding), and for generality on one small molecule property prediction problem.
Collapse
Affiliation(s)
- Hunter Nisonoff
- Center for Computational Biology, University of California, Berkeley, Berkeley, California 94720-3220, United States
| | - Yixin Wang
- Department of Statistics, University of Michigan, Ann Arbor, Michigan 48109-1107, United States
| | - Jennifer Listgarten
- Center for Computational Biology, University of California, Berkeley, Berkeley, California 94720-3220, United States
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, California 94720-1776, United States
| |
Collapse
|
42
|
Khakzad H, Igashov I, Schneuing A, Goverde C, Bronstein M, Correia B. A new age in protein design empowered by deep learning. Cell Syst 2023; 14:925-939. [PMID: 37972559 DOI: 10.1016/j.cels.2023.10.006] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2023] [Revised: 06/22/2023] [Accepted: 10/11/2023] [Indexed: 11/19/2023]
Abstract
The rapid progress in the field of deep learning has had a significant impact on protein design. Deep learning methods have recently produced a breakthrough in protein structure prediction, leading to the availability of high-quality models for millions of proteins. Along with novel architectures for generative modeling and sequence analysis, they have revolutionized the protein design field in the past few years remarkably by improving the accuracy and ability to identify novel protein sequences and structures. Deep neural networks can now learn and extract the fundamental features of protein structures, predict how they interact with other biomolecules, and have the potential to create new effective drugs for treating disease. As their applicability in protein design is rapidly growing, we review the recent developments and technology in deep learning methods and provide examples of their performance to generate novel functional proteins.
Collapse
Affiliation(s)
- Hamed Khakzad
- Université de Lorraine, CNRS, Inria, LORIA, 54000 Nancy, France; École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| | - Ilia Igashov
- École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| | - Arne Schneuing
- École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| | - Casper Goverde
- École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| | | | - Bruno Correia
- École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland.
| |
Collapse
|
43
|
Fahlberg SA, Freschlin CR, Heinzelman P, Romero PA. Neural network extrapolation to distant regions of the protein fitness landscape. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.11.08.566287. [PMID: 37987009 PMCID: PMC10659313 DOI: 10.1101/2023.11.08.566287] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/22/2023]
Abstract
Machine learning (ML) has transformed protein engineering by constructing models of the underlying sequence-function landscape to accelerate the discovery of new biomolecules. ML-guided protein design requires models, trained on local sequence-function information, to accurately predict distant fitness peaks. In this work, we evaluate neural networks' capacity to extrapolate beyond their training data. We perform model-guided design using a panel of neural network architectures trained on protein G (GB1)-Immunoglobulin G (IgG) binding data and experimentally test thousands of GB1 designs to systematically evaluate the models' extrapolation. We find each model architecture infers markedly different landscapes from the same data, which give rise to unique design preferences. We find simpler models excel in local extrapolation to design high fitness proteins, while more sophisticated convolutional models can venture deep into sequence space to design proteins that fold but are no longer functional. Our findings highlight how each architecture's inductive biases prime them to learn different aspects of the protein fitness landscape.
Collapse
Affiliation(s)
- Sarah A Fahlberg
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Chase R Freschlin
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Pete Heinzelman
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Philip A Romero
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Department of Chemical & Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA
| |
Collapse
|
44
|
Kouba P, Kohout P, Haddadi F, Bushuiev A, Samusevich R, Sedlar J, Damborsky J, Pluskal T, Sivic J, Mazurenko S. Machine Learning-Guided Protein Engineering. ACS Catal 2023; 13:13863-13895. [PMID: 37942269 PMCID: PMC10629210 DOI: 10.1021/acscatal.3c02743] [Citation(s) in RCA: 26] [Impact Index Per Article: 26.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2023] [Revised: 09/20/2023] [Indexed: 11/10/2023]
Abstract
Recent progress in engineering highly promising biocatalysts has increasingly involved machine learning methods. These methods leverage existing experimental and simulation data to aid in the discovery and annotation of promising enzymes, as well as in suggesting beneficial mutations for improving known targets. The field of machine learning for protein engineering is gathering steam, driven by recent success stories and notable progress in other areas. It already encompasses ambitious tasks such as understanding and predicting protein structure and function, catalytic efficiency, enantioselectivity, protein dynamics, stability, solubility, aggregation, and more. Nonetheless, the field is still evolving, with many challenges to overcome and questions to address. In this Perspective, we provide an overview of ongoing trends in this domain, highlight recent case studies, and examine the current limitations of machine learning-based methods. We emphasize the crucial importance of thorough experimental validation of emerging models before their use for rational protein design. We present our opinions on the fundamental problems and outline the potential directions for future research.
Collapse
Affiliation(s)
- Petr Kouba
- Loschmidt
Laboratories, Department of Experimental Biology and RECETOX, Faculty
of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech
Republic
- Czech Institute
of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
- Faculty of
Electrical Engineering, Czech Technical
University in Prague, Technicka 2, 166 27 Prague 6, Czech Republic
| | - Pavel Kohout
- Loschmidt
Laboratories, Department of Experimental Biology and RECETOX, Faculty
of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech
Republic
- International
Clinical Research Center, St. Anne’s
University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| | - Faraneh Haddadi
- Loschmidt
Laboratories, Department of Experimental Biology and RECETOX, Faculty
of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech
Republic
- International
Clinical Research Center, St. Anne’s
University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| | - Anton Bushuiev
- Czech Institute
of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
| | - Raman Samusevich
- Czech Institute
of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
- Institute
of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo nám. 2, 160 00 Prague 6, Czech Republic
| | - Jiri Sedlar
- Czech Institute
of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
| | - Jiri Damborsky
- Loschmidt
Laboratories, Department of Experimental Biology and RECETOX, Faculty
of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech
Republic
- International
Clinical Research Center, St. Anne’s
University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| | - Tomas Pluskal
- Institute
of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo nám. 2, 160 00 Prague 6, Czech Republic
| | - Josef Sivic
- Czech Institute
of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
| | - Stanislav Mazurenko
- Loschmidt
Laboratories, Department of Experimental Biology and RECETOX, Faculty
of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech
Republic
- International
Clinical Research Center, St. Anne’s
University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| |
Collapse
|
45
|
Romero-Romero S, Lindner S, Ferruz N. Exploring the Protein Sequence Space with Global Generative Models. Cold Spring Harb Perspect Biol 2023; 15:a041471. [PMID: 37848247 PMCID: PMC10626256 DOI: 10.1101/cshperspect.a041471] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2023]
Abstract
Recent advancements in specialized large-scale architectures for training images and language have profoundly impacted the field of computer vision and natural language processing (NLP). Language models, such as the recent ChatGPT and GPT-4, have demonstrated exceptional capabilities in processing, translating, and generating human language. These breakthroughs have also been reflected in protein research, leading to the rapid development of numerous new methods in a short time, with unprecedented performance. Several of these models have been developed with the goal of generating sequences in novel regions of the protein space. In this work, we provide an overview of the use of protein generative models, reviewing (1) language models for the design of novel artificial proteins, (2) works that use non-transformer architectures, and (3) applications in directed evolution approaches.
Collapse
Affiliation(s)
| | | | - Noelia Ferruz
- Barcelona Institute of Molecular Biology, 08028 Barcelona, Spain
| |
Collapse
|
46
|
Xie WJ, Warshel A. Harnessing Generative AI to Decode Enzyme Catalysis and Evolution for Enhanced Engineering. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.10.10.561808. [PMID: 37873334 PMCID: PMC10592750 DOI: 10.1101/2023.10.10.561808] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/25/2023]
Abstract
Enzymes, as paramount protein catalysts, occupy a central role in fostering remarkable progress across numerous fields. However, the intricacy of sequence-function relationships continues to obscure our grasp of enzyme behaviors and curtails our capabilities in rational enzyme engineering. Generative artificial intelligence (AI), known for its proficiency in handling intricate data distributions, holds the potential to offer novel perspectives in enzyme research. By applying generative models, we could discern elusive patterns within the vast sequence space and uncover new functional enzyme sequences. This review highlights the recent advancements in employing generative AI for enzyme sequence analysis. We delve into the impact of generative AI in predicting mutation effects on enzyme fitness, activity, and stability, rationalizing the laboratory evolution of de novo enzymes, decoding protein sequence semantics, and its applications in enzyme engineering. Notably, the prediction of enzyme activity and stability using natural enzyme sequences serves as a vital link, indicating how enzyme catalysis shapes enzyme evolution. Overall, we foresee that the integration of generative AI into enzyme studies will remarkably enhance our knowledge of enzymes and expedite the creation of superior biocatalysts.
Collapse
Affiliation(s)
- Wen Jun Xie
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
- Departmet of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development (CNPD3), Genetics Institute, University of Florida, Gainesville, FL, USA
| | - Arieh Warshel
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
| |
Collapse
|
47
|
Posani L, Rizzato F, Monasson R, Cocco S. Infer global, predict local: Quantity-relevance trade-off in protein fitness predictions from sequence data. PLoS Comput Biol 2023; 19:e1011521. [PMID: 37883593 PMCID: PMC10645369 DOI: 10.1371/journal.pcbi.1011521] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2023] [Revised: 11/14/2023] [Accepted: 09/15/2023] [Indexed: 10/28/2023] Open
Abstract
Predicting the effects of mutations on protein function is an important issue in evolutionary biology and biomedical applications. Computational approaches, ranging from graphical models to deep-learning architectures, can capture the statistical properties of sequence data and predict the outcome of high-throughput mutagenesis experiments probing the fitness landscape around some wild-type protein. However, how the complexity of the models and the characteristics of the data combine to determine the predictive performance remains unclear. Here, based on a theoretical analysis of the prediction error, we propose descriptors of the sequence data, characterizing their quantity and relevance relative to the model. Our theoretical framework identifies a trade-off between these two quantities, and determines the optimal subset of data for the prediction task, showing that simple models can outperform complex ones when inferred from adequately-selected sequences. We also show how repeated subsampling of the sequence data is informative about how much epistasis in the fitness landscape is not captured by the computational model. Our approach is illustrated on several protein families, as well as on in silico solvable protein models.
Collapse
Affiliation(s)
- Lorenzo Posani
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 & PSL Research, Sorbonne Université, Paris, France
| | - Francesca Rizzato
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 & PSL Research, Sorbonne Université, Paris, France
| | - Rémi Monasson
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 & PSL Research, Sorbonne Université, Paris, France
| | - Simona Cocco
- Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 & PSL Research, Sorbonne Université, Paris, France
| |
Collapse
|
48
|
Qiu Y, Wei GW. Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models. Brief Bioinform 2023; 24:bbad289. [PMID: 37580175 PMCID: PMC10516362 DOI: 10.1093/bib/bbad289] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2023] [Revised: 07/14/2023] [Accepted: 07/26/2023] [Indexed: 08/16/2023] Open
Abstract
Protein engineering is an emerging field in biotechnology that has the potential to revolutionize various areas, such as antibody design, drug discovery, food security, ecology, and more. However, the mutational space involved is too vast to be handled through experimental means alone. Leveraging accumulative protein databases, machine learning (ML) models, particularly those based on natural language processing (NLP), have considerably expedited protein engineering. Moreover, advances in topological data analysis (TDA) and artificial intelligence-based protein structure prediction, such as AlphaFold2, have made more powerful structure-based ML-assisted protein engineering strategies possible. This review aims to offer a comprehensive, systematic, and indispensable set of methodological components, including TDA and NLP, for protein engineering and to facilitate their future development.
Collapse
Affiliation(s)
- Yuchi Qiu
- Department of Mathematics, Michigan State University, East Lansing, 48824 MI, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, 48824 MI, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, 48824 MI, USA
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, 48824 MI, USA
| |
Collapse
|
49
|
Golinski AW, Schmitz ZD, Nielsen GH, Johnson B, Saha D, Appiah S, Hackel BJ, Martiniani S. Predicting and Interpreting Protein Developability Via Transfer of Convolutional Sequence Representation. ACS Synth Biol 2023; 12:2600-2615. [PMID: 37642646 PMCID: PMC10829850 DOI: 10.1021/acssynbio.3c00196] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/31/2023]
Abstract
Engineered proteins have emerged as novel diagnostics, therapeutics, and catalysts. Often, poor protein developability─quantified by expression, solubility, and stability─hinders utility. The ability to predict protein developability from amino acid sequence would reduce the experimental burden when selecting candidates. Recent advances in screening technologies enabled a high-throughput (HT) developability dataset for 105 of 1020 possible variants of protein ligand scaffold Gp2. In this work, we evaluate the ability of neural networks to learn a developability representation from a HT dataset and transfer this knowledge to predict recombinant expression beyond observed sequences. The model convolves learned amino acid properties to predict expression levels 44% closer to the experimental variance compared to a non-embedded control. Analysis of learned amino acid embeddings highlights the uniqueness of cysteine, the importance of hydrophobicity and charge, and the unimportance of aromaticity, when aiming to improve the developability of small proteins. We identify clusters of similar sequences with increased recombinant expression through nonlinear dimensionality reduction and we explore the inferred expression landscape via nested sampling. The analysis enables the first direct visualization of the fitness landscape and highlights the existence of evolutionary bottlenecks in sequence space giving rise to competing subpopulations of sequences with different developability. The work advances applied protein engineering efforts by predicting and interpreting protein scaffold expression from a limited dataset. Furthermore, our statistical mechanical treatment of the problem advances foundational efforts to characterize the structure of the protein fitness landscape and the amino acid characteristics that influence protein developability.
Collapse
Affiliation(s)
- Alexander W. Golinski
- Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
| | - Zachary D. Schmitz
- Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
| | - Gregory H. Nielsen
- Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
| | - Bryce Johnson
- Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
| | - Diya Saha
- Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
| | - Sandhya Appiah
- Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
| | - Benjamin J. Hackel
- Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
| | - Stefano Martiniani
- Department of Chemical Engineering and Materials Science, University of Minnesota, Minneapolis, MN 55455
- Center for Soft Matter Research, Department of Physics, New York University, New York, NY 10003
- Simons Center for Computational Physical Chemistry, Departments of Chemistry, New York University, New York, NY 10003
- Courant Institute of Mathematical Sciences, New York University, New York, NY 10003
| |
Collapse
|
50
|
Chao TH, Rekhi S, Mittal J, Tabor DP. Data-Driven Models for Predicting Intrinsically Disordered Protein Polymer Physics Directly from Composition or Sequence. MOLECULAR SYSTEMS DESIGN & ENGINEERING 2023; 8:1146-1155. [PMID: 38222029 PMCID: PMC10786636 DOI: 10.1039/d3me00053b] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/16/2024]
Abstract
The molecular-level understanding of intrinsically disordered proteins is challenging due to experimental characterization difficulties. Computational understanding of IDPs also requires fundamental advances, as the leading tools for predicting protein folding (e.g., AlphaFold), typically fail to describe the structural ensembles of IDPs. The focus of this paper is to 1) develop new representations for intrinsically disordered proteins and 2) pair these representations with classical machine learning and deep learning models to predict the radius of gyration and derived scaling exponent of IDPs. Here, we build a new physically-motivated feature called the bag of amino acid interactions representation, which encodes pairwise interactions explicitly into the representation. This feature essentially counts and weights all possible non-bonded interactions in a sequence and thus is, in principle, compatible with arbitrary sequence lengths. To see how well this new feature performs, both categorical and physically-motivated featurization techniques are tested on a computational dataset containing 10,000 sequences simulated at the coarse-grained level. The results indicate that this new feature outperforms the other purely categorical and physically-motivated features and possesses solid extrapolation capabilities. For future use, this feature can potentially provide physical insights into amino acid interactions, including their temperature dependence, and be applied to other protein spaces.
Collapse
Affiliation(s)
- Tzu-Hsuan Chao
- Department of Chemistry, Texas A&M University, PO Box 30012, College Station, TX 77842-3012, USA
| | - Shiv Rekhi
- Department of Chemistry, Texas A&M University, PO Box 30012, College Station, TX 77842-3012, USA
| | - Jeetain Mittal
- Department of Chemistry, Texas A&M University, PO Box 30012, College Station, TX 77842-3012, USA
| | - Daniel P Tabor
- Department of Chemistry, Texas A&M University, PO Box 30012, College Station, TX 77842-3012, USA
| |
Collapse
|