Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Greener JG, Moffat L, Jones DT. Design of metalloproteins and novel protein folds using variational autoencoders. Sci Rep 2018;8:16189. [PMID: 30385875 PMCID: PMC6212568 DOI: 10.1038/s41598-018-34533-1] [Citation(s) in RCA: 53] [Impact Index Per Article: 8.8] [Received: 06/28/2018] [Accepted: 10/19/2018] [Indexed: 12/26/2022]

Cited by Other Article(s)

Ghafarollahi A, Buehler MJ. ProtAgents: protein discovery via large language model multi-agent collaborations combining physics and machine learning. Digital Discovery 2024;3:1389-1409. [PMID: 38993729 PMCID: PMC11235180 DOI: 10.1039/d4dd00013g] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2024] [Accepted: 05/13/2024] [Indexed: 07/13/2024]

Abstract

Designing de novo proteins beyond those found in nature holds significant promise for advancements in both scientific and engineering applications. Current methodologies for protein design often rely on AI-based models, such as surrogate models that address end-to-end problems by linking protein structure to material properties or vice versa. However, these models frequently focus on specific material objectives or structural properties, limiting their flexibility when out-of-domain knowledge must be incorporated into the design process or when comprehensive data analysis is required. In this study, we introduce ProtAgents, a platform for de novo protein design based on Large Language Models (LLMs), where multiple AI agents with distinct capabilities collaboratively address complex tasks within a dynamic environment. The versatility in agent development allows for expertise in diverse domains, including knowledge retrieval, protein structure analysis, physics-based simulations, and results analysis. The dynamic collaboration between agents, empowered by LLMs, provides a versatile approach to tackling protein design and analysis problems, as demonstrated through diverse examples in this study. The problems of interest encompass designing new proteins, analyzing protein structures, and obtaining new first-principles data (natural vibrational frequencies) via physics simulations. The concerted effort of the system allows for powerful automated and synergistic design of de novo proteins with targeted mechanical properties. The flexibility in designing the agents, on the one hand, and their capacity for autonomous collaboration through the dynamic LLM-based multi-agent environment, on the other hand, unleash the great potential of LLMs in addressing multi-objective materials problems and open up new avenues for autonomous materials discovery and design.

Wang X, Li A, Li X, Cui H. Empowering Protein Engineering through Recombination of Beneficial Substitutions. Chemistry 2024;30:e202303889. [PMID: 38288640 DOI: 10.1002/chem.202303889] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Indexed: 02/24/2024]

Chu AE, Lu T, Huang PS. Sparks of function by de novo protein design. Nat Biotechnol 2024;42:203-215. [PMID: 38361073 DOI: 10.1038/s41587-024-02133-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/05/2023] [Accepted: 01/09/2024] [Indexed: 02/17/2024]

Guo Z, Liu J, Wang Y, Chen M, Wang D, Xu D, Cheng J. Diffusion models in bioinformatics and computational biology. Nature Reviews Bioengineering 2024;2:136-154. [PMID: 38576453 PMCID: PMC10994218 DOI: 10.1038/s44222-023-00114-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Accepted: 08/25/2023] [Indexed: 04/06/2024]

Praljak N, Lian X, Ranganathan R, Ferguson AL. ProtWave-VAE: Integrating Autoregressive Sampling with Latent-Based Inference for Data-Driven Protein Design. ACS Synth Biol 2023;12:3544-3561. [PMID: 37988083 PMCID: PMC10911954 DOI: 10.1021/acssynbio.3c00261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2023]

Abstract

Deep generative models (DGMs) have shown great success in the understanding and data-driven design of proteins. Variational autoencoders (VAEs) are a popular DGM approach that can learn the correlated patterns of amino acid mutations within a multiple sequence alignment (MSA) of protein sequences and distill this information into a low-dimensional latent space to expose phylogenetic and functional relationships and guide generative protein design. Autoregressive (AR) models are another popular DGM approach; they typically lack a low-dimensional latent embedding but do not require training sequences to be aligned into an MSA, and they enable the design of variable-length proteins. In this work, we propose ProtWave-VAE as a novel and lightweight DGM, employing an information-maximizing VAE with a dilated convolution encoder and an autoregressive WaveNet decoder. This architecture blends the strengths of the VAE and AR paradigms in enabling training over unaligned sequence data and the conditional generative design of variable-length sequences from an interpretable, low-dimensional learned latent space. We evaluated the model's ability to infer patterns and design rules within alignment-free homologous protein family sequences and to design novel synthetic proteins in four diverse protein families. We show that our model can infer meaningful functional and phylogenetic embeddings within latent spaces and make highly accurate predictions within semisupervised downstream fitness prediction tasks. In an application to the C-terminal SH3 domain in the Sho1 transmembrane osmosensing receptor in baker's yeast, we subject ProtWave-VAE-designed sequences to experimental gene synthesis and select-seq assays for the osmosensing function to show that the model enables synthetic protein design, conditional C-terminus diversification, and engineering of the osmosensing function into SH3 paralogues.

Wang J, Chen C, Yao G, Ding J, Wang L, Jiang H. Intelligent Protein Design and Molecular Characterization Techniques: A Comprehensive Review. Molecules 2023;28:7865. [PMID: 38067593 PMCID: PMC10707872 DOI: 10.3390/molecules28237865] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2023] [Revised: 11/13/2023] [Accepted: 11/23/2023] [Indexed: 12/18/2023]

Ingraham JB, Baranov M, Costello Z, Barber KW, Wang W, Ismail A, Frappier V, Lord DM, Ng-Thow-Hing C, Van Vlack ER, Tie S, Xue V, Cowles SC, Leung A, Rodrigues JV, Morales-Perez CL, Ayoub AM, Green R, Puentes K, Oplinger F, Panwar NV, Obermeyer F, Root AR, Beam AL, Poelwijk FJ, Grigoryan G. Illuminating protein space with a programmable generative model. Nature 2023;623:1070-1078. [PMID: 37968394 PMCID: PMC10686827 DOI: 10.1038/s41586-023-06728-8] [Citation(s) in RCA: 23] [Impact Index Per Article: 23.0] [Received: 12/20/2022] [Accepted: 10/06/2023] [Indexed: 11/17/2023]

Mardikoraem M, Wang Z, Pascual N, Woldring D. Generative models for protein sequence modeling: recent advances and future directions. Brief Bioinform 2023;24:bbad358. [PMID: 37864295 PMCID: PMC10589401 DOI: 10.1093/bib/bbad358] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Revised: 09/08/2023] [Accepted: 09/12/2023] [Indexed: 10/22/2023]

Li Y, Yao Y, Xia Y, Tang M. Searching for protein variants with desired properties using deep generative models. BMC Bioinformatics 2023;24:297. [PMID: 37480001 PMCID: PMC10362698 DOI: 10.1186/s12859-023-05415-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2022] [Accepted: 07/17/2023] [Indexed: 07/23/2023]

Abstract

BACKGROUND

Protein engineering aims to improve the functional properties of existing proteins to meet people's needs. Current deep learning-based models have captured evolutionary, functional, and biochemical features contained in amino acid sequences. However, existing generative models still need improvement in capturing the relationships between amino acid sites in longer sequences. At the same time, the protein sequences of a homologous family are distributed in the latent space with a specific positional relationship. We want to use this relationship to search for new variants directly in the vicinity of better-performing variants.

RESULTS

To improve the representation learning ability of the model for longer sequences and the similarity between the generated sequences and the original sequences, we propose a temporal variational autoencoder (T-VAE) model. T-VAE consists of an encoder and a decoder. The encoder expands the receptive field of neurons in the network structure by dilated causal convolution, thereby improving the encoding representation ability of longer sequences. The decoder decodes the sampled data into variants closely resembling the original sequence.
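The way dilated causal convolution widens the receptive field can be illustrated with a minimal pure-Python sketch. This is not the T-VAE implementation: the kernel size, the dilation schedule (1, 2, 4, 8), and the function names are assumptions chosen for illustration only.

```python
def receptive_field(kernel_size, dilations):
    """Input positions visible to one output of a stack of causal convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)


def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution: out[t] uses only x[t], x[t - d], x[t - 2d], ..."""
    out = []
    for t in range(len(x)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # positions before the sequence start act as zero padding
                total += w * x[idx]
        out.append(total)
    return out


# Hypothetical stack: kernel size 2, dilation doubling at each layer.
dilations = [1, 2, 4, 8]
rf = receptive_field(kernel_size=2, dilations=dilations)

# Push a unit impulse through the stack and count how many positions it reaches.
signal = [1.0] + [0.0] * 19
for d in dilations:
    signal = causal_dilated_conv(signal, [1.0, 1.0], d)
reach = sum(1 for v in signal if v != 0.0)

print(rf, reach)
```

With these assumed settings, four layers cover 16 positions, whereas four undilated layers of the same kernel size would cover only 5. That growth per layer is what makes dilation attractive for long sequences.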

CONCLUSION

Compared to other models, the Pearson correlation coefficient between the predicted values of protein fitness obtained by T-VAE and the true values was higher, and the mean absolute deviation was lower. In addition, the T-VAE model has a better representation learning ability for longer sequences when comparing the encoding of protein sequences of different lengths. These results show that our model has more advantages in representation learning for longer sequences. To verify the model's generative effect, we also calculate the sequence identity between the generated data and the input data. The sequence identity obtained by T-VAE improved by 12.9% compared to the baseline model.
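The three evaluation quantities named in this abstract can be sketched as follows. The function names are hypothetical, the mean absolute deviation is assumed here to mean the average absolute difference between predicted and true fitness, and sequence identity is assumed to be computed over two equal-length, already aligned sequences. The abstract does not state the exact definitions used.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def mean_absolute_deviation(predicted, true):
    """Average absolute difference between predictions and true values (assumed definition)."""
    return sum(abs(p - t) for p, t in zip(predicted, true)) / len(predicted)


def sequence_identity(seq_a, seq_b):
    """Fraction of matching positions between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / len(seq_a)


# Made-up values for illustration only; they are not data from the paper.
print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(mean_absolute_deviation([1.0, 2.0], [2.0, 4.0]))
print(sequence_identity("ACDE", "ACDF"))
```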

Dürr SL, Levy A, Rothlisberger U. Metal3D: a general deep learning framework for accurate metal ion location prediction in proteins. Nat Commun 2023;14:2713. [PMID: 37169763 PMCID: PMC10175565 DOI: 10.1038/s41467-023-37870-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 08/30/2022] [Accepted: 03/29/2023] [Indexed: 05/13/2023]

Ziegler C, Martin J, Sinner C, Morcos F. Latent generative landscapes as maps of functional diversity in protein sequence space. Nat Commun 2023;14:2222. [PMID: 37076519 PMCID: PMC10113739 DOI: 10.1038/s41467-023-37958-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 08/19/2022] [Accepted: 04/05/2023] [Indexed: 04/21/2023]

Fatouros PR, Roy U, Sur S. Implications of SARS-CoV-2 spike protein interactions with Zn-bound form of ACE2: a computational structural study. Biometals 2023:10.1007/s10534-023-00491-z. [PMID: 36725769 PMCID: PMC9891659 DOI: 10.1007/s10534-023-00491-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2022] [Accepted: 01/13/2023] [Indexed: 02/03/2023]

Clifton BE, Kozome D, Laurino P. Efficient Exploration of Sequence Space by Sequence-Guided Protein Engineering and Design. Biochemistry 2023;62:210-220. [PMID: 35245020 DOI: 10.1021/acs.biochem.1c00757] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 02/02/2023]

Soleymani F, Paquet E, Viktor H, Michalowski W, Spinello D. Protein-protein interaction prediction with deep learning: A comprehensive review. Comput Struct Biotechnol J 2022;20:5316-5341. [PMID: 36212542 PMCID: PMC9520216 DOI: 10.1016/j.csbj.2022.08.070] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 06/22/2022] [Revised: 08/29/2022] [Accepted: 08/30/2022] [Indexed: 11/15/2022]

Chen JC, Chen JP, Shen MW, Wornow M, Bae M, Yeh WH, Hsu A, Liu DR. Generating experimentally unrelated target molecule-binding highly functionalized nucleic-acid polymers using machine learning. Nat Commun 2022;13:4541. [PMID: 35927274 PMCID: PMC9352670 DOI: 10.1038/s41467-022-31955-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/08/2021] [Accepted: 07/11/2022] [Indexed: 11/09/2022]

Affiliation(s)

Jonathan C. Chen: Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA, USA; Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA; Howard Hughes Medical Institute, Harvard University, Cambridge, MA, USA
Jonathan P. Chen: Work conducted at Uber AI Labs, Uber Technologies, Inc., San Francisco, CA, USA; Meta Platforms, Menlo Park, CA, USA
Max W. Shen: Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA, USA; Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA; Howard Hughes Medical Institute, Harvard University, Cambridge, MA, USA; Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, MA, USA
Michael Wornow: Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA, USA; Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA
Minwoo Bae: Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA, USA; Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA
Wei-Hsi Yeh: Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA, USA; Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA; Howard Hughes Medical Institute, Harvard University, Cambridge, MA, USA; Program in Speech and Hearing Bioscience and Technology, Harvard Medical School, Boston, MA, USA
Alvin Hsu: Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA, USA; Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA; Howard Hughes Medical Institute, Harvard University, Cambridge, MA, USA
David R. Liu: Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of Harvard and MIT, Cambridge, MA, USA; Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA; Howard Hughes Medical Institute, Harvard University, Cambridge, MA, USA

Andreini C, Rosato A. Structural Bioinformatics and Deep Learning of Metalloproteins: Recent Advances and Applications. Int J Mol Sci 2022;23:ijms23147684. [PMID: 35887033 PMCID: PMC9323969 DOI: 10.3390/ijms23147684] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/16/2022] [Revised: 07/04/2022] [Accepted: 07/06/2022] [Indexed: 02/04/2023]

Kucera T, Togninalli M, Meng-Papaxanthos L. Conditional generative modeling for de novo protein design with hierarchical functions. Bioinformatics 2022;38:3454-3461. [PMID: 35639661 PMCID: PMC9237736 DOI: 10.1093/bioinformatics/btac353] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 08/04/2021] [Revised: 04/20/2022] [Accepted: 05/20/2022] [Indexed: 11/18/2022]

Talluri S. Algorithms for protein design. Advances in Protein Chemistry and Structural Biology 2022;130:1-38. [PMID: 35534105 DOI: 10.1016/bs.apcsb.2022.01.003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022]

Ding W, Nakai K, Gong H. Protein design via deep learning. Brief Bioinform 2022;23:6554124. [PMID: 35348602 PMCID: PMC9116377 DOI: 10.1093/bib/bbac102] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 12/16/2021] [Revised: 02/26/2022] [Accepted: 03/01/2022] [Indexed: 12/11/2022]

Yu Y, Wang R, Teo RD. Machine Learning Approaches for Metalloproteins. Molecules 2022;27:molecules27041277. [PMID: 35209064 PMCID: PMC8878495 DOI: 10.3390/molecules27041277] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 02/02/2022] [Revised: 02/10/2022] [Accepted: 02/11/2022] [Indexed: 01/10/2023]

Lin E, Lin CH, Lane HY. De Novo Peptide and Protein Design Using Generative Adversarial Networks: An Update. J Chem Inf Model 2022;62:761-774. [DOI: 10.1021/acs.jcim.1c01361] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 12/11/2022]

Giessel A, Dousis A, Ravichandran K, Smith K, Sur S, McFadyen I, Zheng W, Licht S. Therapeutic enzyme engineering using a generative neural network. Sci Rep 2022;12:1536. [PMID: 35087131 PMCID: PMC8795449 DOI: 10.1038/s41598-022-05195-x] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 08/10/2021] [Accepted: 12/15/2021] [Indexed: 12/31/2022]

Greener JG, Kandathil SM, Moffat L, Jones DT. A guide to machine learning for biologists. Nat Rev Mol Cell Biol 2022;23:40-55. [PMID: 34518686 DOI: 10.1038/s41580-021-00407-0] [Citation(s) in RCA: 480] [Impact Index Per Article: 240.0] [Accepted: 07/23/2021] [Indexed: 02/08/2023]

Rudden LSP, Hijazi M, Barth P. Deep learning approaches for conformational flexibility and switching properties in protein design. Front Mol Biosci 2022;9:928534. [PMID: 36032687 PMCID: PMC9399439 DOI: 10.3389/fmolb.2022.928534] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/25/2022] [Accepted: 07/15/2022] [Indexed: 11/30/2022]

Deep generative modeling for protein design. Curr Opin Struct Biol 2021;72:226-236. [PMID: 34963082 DOI: 10.1016/j.sbi.2021.11.008] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Received: 08/31/2021] [Revised: 11/01/2021] [Accepted: 11/22/2021] [Indexed: 11/21/2022]

Thomas M, Boardman A, Garcia-Ortegon M, Yang H, de Graaf C, Bender A. Applications of Artificial Intelligence in Drug Design: Opportunities and Challenges. Methods in Molecular Biology (Clifton, N.J.) 2021;2390:1-59. [PMID: 34731463 DOI: 10.1007/978-1-0716-1787-8_1] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 10/19/2022]

Defresne M, Barbe S, Schiex T. Protein Design with Deep Learning. Int J Mol Sci 2021;22:11741. [PMID: 34769173 PMCID: PMC8584038 DOI: 10.3390/ijms222111741] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 09/30/2021] [Revised: 10/23/2021] [Accepted: 10/26/2021] [Indexed: 12/21/2022]

Linder J, Seelig G. Fast activation maximization for molecular sequence design. BMC Bioinformatics 2021;22:510. [PMID: 34670493 PMCID: PMC8527647 DOI: 10.1186/s12859-021-04437-5] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 01/07/2021] [Accepted: 10/11/2021] [Indexed: 12/26/2022]

Trinquier J, Uguzzoni G, Pagnani A, Zamponi F, Weigt M. Efficient generative modeling of protein sequences using simple autoregressive models. Nat Commun 2021;12:5800. [PMID: 34608136 PMCID: PMC8490405 DOI: 10.1038/s41467-021-25756-4] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Received: 02/24/2021] [Accepted: 08/23/2021] [Indexed: 02/08/2023]

Bitard-Feildel T. Navigating the amino acid sequence space between functional proteins using a deep learning framework. PeerJ Comput Sci 2021;7:e684. [PMID: 34616884 PMCID: PMC8459775 DOI: 10.7717/peerj-cs.684] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2021] [Accepted: 07/30/2021] [Indexed: 06/13/2023]

Abstract

MOTIVATION

Shedding light on the relationships between protein sequences and functions is a challenging task with many implications in protein evolution, disease understanding, and protein design. The mapping of protein sequence space to specific functions is, however, hard to comprehend due to its complexity. Generative models help to decipher complex systems thanks to their ability to learn and recreate data specificity. Applied to proteins, they can capture the sequence patterns associated with functions and point out important relationships between sequence positions. By learning these dependencies between sequences and functions, they can ultimately be used to generate new sequences and navigate through uncharted areas of molecular evolution.

RESULTS

This study presents an Adversarial Auto-Encoder (AAE) approach, an unsupervised generative model, to generate new protein sequences. AAEs are tested on three protein families known for their multiple functions: the sulfatase, the HUP, and the TPP families. Clustering results on the encoded sequences from the latent space computed by AAEs display a high level of homogeneity regarding the protein sequence functions. The study also reports and analyzes, for the first time, two sampling strategies based on latent space interpolation and latent space arithmetic to generate intermediate protein sequences sharing sequential properties of original sequences linked to known functional properties issued from different families and functions. Sequences generated by interpolation between latent space data points demonstrate the ability of the AAE to generalize and produce meaningful biological sequences from an evolutionarily uncharted area of the biological sequence space. Finally, 3D structure models computed by comparative modelling using generated sequences and templates of different sub-families point to the ability of latent space arithmetic to successfully transfer protein sequence properties linked to function between different sub-families. All in all, this study confirms the ability of deep learning frameworks to model biological complexity and bring new tools to explore amino acid sequence and functional spaces.
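The two sampling strategies mentioned in this abstract, latent space interpolation and latent space arithmetic, can be sketched on plain vectors. This is a minimal illustration under stated assumptions, not the paper's procedure: linear interpolation and the "query minus source mean plus target mean" form of arithmetic are assumed, the function names are hypothetical, and the trained encoder and decoder needed to map between sequences and latent vectors are not shown.

```python
def interpolate(z_start, z_end, steps):
    """Evenly spaced latent points on the straight line from z_start to z_end, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    return [
        [a + (b - a) * i / (steps - 1) for a, b in zip(z_start, z_end)]
        for i in range(steps)
    ]


def transfer(z_query, z_source_mean, z_target_mean):
    """Latent arithmetic (assumed form): shift a query point from a source group toward a target group."""
    return [q - s + t for q, s, t in zip(z_query, z_source_mean, z_target_mean)]


# Made-up 2-D latent vectors for illustration; real latent spaces have more dimensions.
path = interpolate([0.0, 0.0], [1.0, 2.0], steps=3)
shifted = transfer([1.0, 1.0], [0.0, 1.0], [2.0, 3.0])
print(path)
print(shifted)
```

In a full pipeline, each point in `path` or `shifted` would be passed through the trained decoder to obtain a candidate sequence.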

Takahashi T, Chikenji G, Tokita K. Lattice protein design using Bayesian learning. Phys Rev E 2021;104:014404. [PMID: 34412286 DOI: 10.1103/physreve.104.014404] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/17/2020] [Accepted: 06/11/2021] [Indexed: 01/01/2023]

Mowbray M, Savage T, Wu C, Song Z, Cho BA, Del Rio-Chanona EA, Zhang D. Machine learning for biochemical engineering: A review. Biochem Eng J 2021. [DOI: 10.1016/j.bej.2021.108054] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Indexed: 12/19/2022]

Principles and Methods in Computational Membrane Protein Design. J Mol Biol 2021;433:167154. [PMID: 34271008 DOI: 10.1016/j.jmb.2021.167154] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 05/02/2021] [Revised: 07/03/2021] [Accepted: 07/06/2021] [Indexed: 01/13/2023]

Cao Y, Das P, Chenthamarakshan V, Chen PY, Melnyk I, Shen Y. Fold2Seq: A Joint Sequence(1D)-Fold(3D) Embedding-based Generative Model for Protein Design. Proceedings of Machine Learning Research 2021;139:1261-1271. [PMID: 34423306 PMCID: PMC8375603] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]

Osadchy M, Kolodny R. How Deep Learning Tools Can Help Protein Engineers Find Good Sequences. J Phys Chem B 2021;125:6440-6450. [PMID: 34105961 DOI: 10.1021/acs.jpcb.1c02449] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/16/2022]

Ashraf C, Joshi N, Beck DAC, Pfaendtner J. Data Science in Chemical Engineering: Applications to Molecular Science. Annu Rev Chem Biomol Eng 2021;12:15-37. [PMID: 33710940 DOI: 10.1146/annurev-chembioeng-101220-102232] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/09/2022]

Pearce R, Zhang Y. Deep learning techniques have significantly impacted protein structure prediction and protein design. Curr Opin Struct Biol 2021;68:194-207. [PMID: 33639355 PMCID: PMC8222070 DOI: 10.1016/j.sbi.2021.01.007] [Citation(s) in RCA: 56] [Impact Index Per Article: 18.7] [Received: 10/05/2020] [Revised: 01/09/2021] [Accepted: 01/18/2021] [Indexed: 12/26/2022]

Wu Z, Johnston KE, Arnold FH, Yang KK. Protein sequence design with deep generative models. Curr Opin Chem Biol 2021;65:18-27. [PMID: 34051682 DOI: 10.1016/j.cbpa.2021.04.004] [Citation(s) in RCA: 52] [Impact Index Per Article: 17.3] [Received: 02/24/2021] [Revised: 04/02/2021] [Accepted: 04/07/2021] [Indexed: 12/20/2022]

Ward MD, Zimmerman MI, Meller A, Chung M, Swamidass SJ, Bowman GR. Deep learning the structural determinants of protein biochemical properties by comparing structural ensembles with DiffNets. Nat Commun 2021;12:3023. [PMID: 34021153 PMCID: PMC8140102 DOI: 10.1038/s41467-021-23246-1] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 09/29/2020] [Accepted: 04/16/2021] [Indexed: 12/05/2022]

Shin JE, Riesselman AJ, Kollasch AW, McMahon C, Simon E, Sander C, Manglik A, Kruse AC, Marks DS. Protein design and variant prediction using autoregressive generative models. Nat Commun 2021;12:2403. [PMID: 33893299 PMCID: PMC8065141 DOI: 10.1038/s41467-021-22732-w] [Citation(s) in RCA: 120] [Impact Index Per Article: 40.0] [Received: 02/28/2021] [Accepted: 03/26/2021] [Indexed: 12/11/2022]

DiPrimio DJ, Holland PL. Repurposing metalloproteins as mimics of natural metalloenzymes for small-molecule activation. J Inorg Biochem 2021;219:111430. [PMID: 33873051 DOI: 10.1016/j.jinorgbio.2021.111430] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 12/12/2020] [Revised: 03/11/2021] [Accepted: 03/15/2021] [Indexed: 12/12/2022]

Ferguson AL, Ranganathan R. 100th Anniversary of Macromolecular Science Viewpoint: Data-Driven Protein Design. ACS Macro Lett 2021;10:327-340. [PMID: 35549066 DOI: 10.1021/acsmacrolett.0c00885] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Indexed: 12/13/2022]

Norn C, Wicky BIM, Juergens D, Liu S, Kim D, Tischer D, Koepnick B, Anishchenko I, Baker D, Ovchinnikov S. Protein sequence design by conformational landscape optimization. Proc Natl Acad Sci U S A 2021;118:e2017228118. [PMID: 33712545 PMCID: PMC7980421 DOI: 10.1073/pnas.2017228118] [Citation(s) in RCA: 66] [Impact Index Per Article: 22.0] [Indexed: 12/17/2022]

Abstract

The protein design problem is to identify an amino acid sequence that folds to a desired structure. Given Anfinsen's thermodynamic hypothesis of folding, this can be recast as finding an amino acid sequence for which the desired structure is the lowest energy state. As this calculation involves not only all possible amino acid sequences but also, all possible structures, most current approaches focus instead on the more tractable problem of finding the lowest-energy amino acid sequence for the desired structure, often checking by protein structure prediction in a second step that the desired structure is indeed the lowest-energy conformation for the designed sequence, and typically discarding a large fraction of designed sequences for which this is not the case. Here, we show that by backpropagating gradients through the transform-restrained Rosetta (trRosetta) structure prediction network from the desired structure to the input amino acid sequence, we can directly optimize over all possible amino acid sequences and all possible structures in a single calculation. We find that trRosetta calculations, which consider the full conformational landscape, can be more effective than Rosetta single-point energy estimations in predicting folding and stability of de novo designed proteins. We compare sequence design by conformational landscape optimization with the standard energy-based sequence design methodology in Rosetta and show that the former can result in energy landscapes with fewer alternative energy minima. We show further that more funneled energy landscapes can be designed by combining the strengths of the two approaches: the low-resolution trRosetta model serves to disfavor alternative states, and the high-resolution Rosetta model serves to create a deep energy minimum at the design target structure.

Kopf A, Claassen M. Latent representation learning in biology and translational medicine. PATTERNS (NEW YORK, N.Y.) 2021;2:100198. [PMID: 33748792 PMCID: PMC7961186 DOI: 10.1016/j.patter.2021.100198] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 02/08/2023]

Repecka D, Jauniskis V, Karpus L, Rembeza E, Rokaitis I, Zrimec J, Poviloniene S, Laurynenas A, Viknander S, Abuajwa W, Savolainen O, Meskys R, Engqvist MKM, Zelezniak A. Expanding functional protein sequence spaces using generative adversarial networks. NAT MACH INTELL 2021. [DOI: 10.1038/s42256-021-00310-5] [Citation(s) in RCA: 63] [Impact Index Per Article: 21.0] [Indexed: 01/17/2023]

Narayanan H, Dingfelder F, Butté A, Lorenzen N, Sokolov M, Arosio P. Machine Learning for Biologics: Opportunities for Protein Engineering, Developability, and Formulation. Trends Pharmacol Sci 2021;42:151-165. [DOI: 10.1016/j.tips.2020.12.004] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Received: 08/19/2020] [Revised: 12/10/2020] [Accepted: 12/16/2020] [Indexed: 12/19/2022]

Wittmann BJ, Johnston KE, Wu Z, Arnold FH. Advances in machine learning for directed evolution. Curr Opin Struct Biol 2021;69:11-18. [PMID: 33647531 DOI: 10.1016/j.sbi.2021.01.008] [Citation(s) in RCA: 54] [Impact Index Per Article: 18.0] [Received: 11/30/2020] [Revised: 01/09/2021] [Accepted: 01/26/2021] [Indexed: 01/11/2023]

Hawkins-Hooker A, Depardieu F, Baur S, Couairon G, Chen A, Bikard D. Generating functional protein variants with variational autoencoders. PLoS Comput Biol 2021;17:e1008736. [PMID: 33635868 PMCID: PMC7946179 DOI: 10.1371/journal.pcbi.1008736] [Citation(s) in RCA: 70] [Impact Index Per Article: 23.3] [Received: 08/19/2020] [Revised: 03/10/2021] [Accepted: 01/25/2021] [Indexed: 11/20/2022] Open

Abstract

The vast expansion of protein sequence databases provides an opportunity for new protein design approaches which seek to learn the sequence-function relationship directly from natural sequence variation. Deep generative models trained on protein sequence data have been shown to learn biologically meaningful representations helpful for a variety of downstream tasks, but their potential for direct use in the design of novel proteins remains largely unexplored. Here we show that variational autoencoders trained on a dataset of almost 70000 luciferase-like oxidoreductases can be used to generate novel, functional variants of the luxA bacterial luciferase. We propose separate VAE models to work with aligned sequence input (MSA VAE) and raw sequence input (AR-VAE), and offer evidence that while both are able to reproduce patterns of amino acid usage characteristic of the family, the MSA VAE is better able to capture long-distance dependencies reflecting the influence of 3D structure. To confirm the practical utility of the models, we used them to generate variants of luxA whose luminescence activity was validated experimentally. We further showed that conditional variants of both models could be used to increase the solubility of luxA without disrupting function. Altogether 6/12 of the variants generated using the unconditional AR-VAE and 9/11 generated using the unconditional MSA VAE retained measurable luminescence, together with all 23 of the less distant variants generated by conditional versions of the models; the most distant functional variant contained 35 differences relative to the nearest training set sequence. These results demonstrate the feasibility of using deep generative models to explore the space of possible protein sequences and generate useful variants, providing a method complementary to rational design and directed evolution approaches.

Karimi M, Zhu S, Cao Y, Shen Y. De Novo Protein Design for Novel Folds Using Guided Conditional Wasserstein Generative Adversarial Networks. J Chem Inf Model 2020;60:5667-5681. [PMID: 32945673 PMCID: PMC7775287 DOI: 10.1021/acs.jcim.0c00593] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Indexed: 01/04/2023]

Abstract

Although massive data is quickly accumulating on protein sequence and structure, there is a small and limited number of protein architectural types (or structural folds). This study is addressing the following question: how well could one reveal underlying sequence-structure relationships and design protein sequences for an arbitrary, potentially novel, structural fold? In response to the question, we have developed novel deep generative models, namely, semisupervised gcWGAN (guided, conditional, Wasserstein Generative Adversarial Networks). To overcome training difficulties and improve design qualities, we build our models on conditional Wasserstein GAN (WGAN) that uses Wasserstein distance in the loss function. Our major contributions include (1) constructing a low-dimensional and generalizable representation of the fold space for the conditional input, (2) developing an ultrafast sequence-to-fold predictor (or oracle) and incorporating its feedback into WGAN as a loss to guide model training, and (3) exploiting sequence data with and without paired structures to enable a semisupervised training strategy. Assessed by the oracle over 100 novel folds not in the training set, gcWGAN generates more successful designs and covers 3.5 times more target folds compared to a competing data-driven method (cVAE). Assessed by sequence- and structure-based predictors, gcWGAN designs are physically and biologically sound. Assessed by a structure predictor over representative novel folds, including one not even part of basis folds, gcWGAN designs have comparable or better fold accuracy yet much more sequence diversity and novelty than cVAE. The ultrafast data-driven model is further shown to boost the success of a principle-driven de novo method (RosettaDesign), through generating design seeds and tailoring design space. In conclusion, gcWGAN explores uncharted sequence space to design proteins by learning generalizable principles from current sequence-structure data. 
Data, source codes, and trained models are available at https://github.com/Shen-Lab/gcWGAN.

Gao W, Mahajan SP, Sulam J, Gray JJ. Deep Learning in Protein Structural Modeling and Design. PATTERNS (NEW YORK, N.Y.) 2020;1:100142. [PMID: 33336200 PMCID: PMC7733882 DOI: 10.1016/j.patter.2020.100142] [Citation(s) in RCA: 82] [Impact Index Per Article: 20.5] [Indexed: 12/13/2022]