1
Reidenbach D, Krishnapriyan AS. CoarsenConf: Equivariant Coarsening with Aggregated Attention for Molecular Conformer Generation. J Chem Inf Model 2024. [PMID: 39688534 DOI: 10.1021/acs.jcim.4c01001] [Indexed: 12/18/2024]
Abstract
Molecular conformer generation (MCG) is an important task in cheminformatics and drug discovery. The ability to efficiently generate low-energy 3D structures can avoid expensive quantum mechanical simulations, leading to accelerated virtual screenings and enhanced structural exploration. Several generative models have been developed for MCG, but many struggle to consistently produce high-quality conformers for meaningful downstream applications. To address these issues, we introduce CoarsenConf, which coarse-grains molecular graphs based on torsional angles and integrates them into an SE(3)-equivariant hierarchical variational autoencoder. Through equivariant coarse-graining, we aggregate the fine-grained atomic coordinates of subgraphs connected via rotatable bonds, creating a variable-length coarse-grained latent representation. Our model uses a novel aggregated attention mechanism to restore fine-grained coordinates from the coarse-grained latent representation, enabling efficient generation of accurate conformers. Furthermore, we evaluate the chemical and biochemical quality of our generated conformers on multiple downstream applications, including property prediction and large-scale oracle-based protein docking. Overall, CoarsenConf generates more accurate conformer ensembles compared to prior generative models.
Affiliation(s)
- Danny Reidenbach
  - Department of Chemical Engineering, Department of Computer Science, University of California Berkeley, Berkeley, California 94720, United States
  - NVIDIA, Santa Clara, California 95051, United States
- Aditi S Krishnapriyan
  - Department of Chemical Engineering, Department of Computer Science, University of California Berkeley, Berkeley, California 94720, United States

2
Fan J, Li Z, Alcaide E, Ke G, Huang H, E W. Accurate Conformation Sampling via Protein Structural Diffusion. J Chem Inf Model 2024; 64:8414-8426. [PMID: 39340358 DOI: 10.1021/acs.jcim.4c00928] [Indexed: 09/30/2024]
Abstract
Accurate sampling of protein conformations is pivotal for advances in biology and medicine. Although there has been tremendous progress in protein structure prediction in recent years due to deep learning, models that can predict the different stable conformations of proteins with high accuracy and structural validity are still lacking. Here, we introduce UFConf, a cutting-edge approach designed for robust sampling of diverse protein conformations based solely on amino acid sequences. This method transforms AlphaFold2 into a diffusion model by implementing a conformation-based diffusion process and adapting the architecture to process diffused inputs effectively. To counteract the inherent conformational bias in the Protein Data Bank, we developed a novel hierarchical reweighting protocol based on structural clustering. Our evaluations demonstrate that UFConf outperforms existing methods in terms of successful sampling and structural validity. The comparisons with long-time molecular dynamics show that UFConf can overcome the energy barrier existing in molecular dynamics simulations and perform more efficient sampling. Furthermore, we showcase UFConf's utility in drug discovery through its application in neural protein-ligand docking. In a blind test, it accurately predicted a novel protein-ligand complex, underscoring its potential to impact real-world biological research. Additionally, we present other modes of sampling using UFConf, including partial sampling with fixed motif, Langevin dynamics, and structural interpolation.
Affiliation(s)
- Jiahao Fan
  - School of Physics, Peking University, Beijing 100871, China
  - DP Technology, Beijing 100080, China
- Ziyao Li
  - DP Technology, Beijing 100080, China
  - Center for Data Science, Peking University, Beijing 100871, China
- Eric Alcaide
  - DP Technology, Beijing 100080, China
  - University of Barcelona, Barcelona 08007, Spain
- Guolin Ke
  - DP Technology, Beijing 100080, China
- Huaqing Huang
  - School of Physics, Peking University, Beijing 100871, China
- Weinan E
  - School of Mathematical Sciences, Peking University, Beijing 100871, China

3
Máté B, Fleuret F, Bereau T. Neural Thermodynamic Integration: Free Energies from Energy-Based Diffusion Models. J Phys Chem Lett 2024; 15:11395-11404. [PMID: 39503734 DOI: 10.1021/acs.jpclett.4c01958] [Indexed: 11/15/2024]
Abstract
Thermodynamic integration (TI) offers a rigorous method for estimating free-energy differences by integrating over a sequence of interpolating conformational ensembles. However, TI calculations are computationally expensive and typically limited to coupling a small number of degrees of freedom due to the need to sample numerous intermediate ensembles with sufficient conformational-space overlap. In this work, we propose to perform TI along an alchemical pathway represented by a trainable neural network, which we term Neural TI. Critically, we parametrize a time-dependent Hamiltonian interpolating between the interacting and noninteracting systems and optimize its gradient using a score matching objective. The ability of the resulting energy-based diffusion model to sample all intermediate ensembles allows us to perform TI from a single reference calculation. We apply our method to Lennard-Jones fluids, where we report accurate calculations of the excess chemical potential, demonstrating that Neural TI reproduces the underlying changes in free energy without the need for simulations at interpolating Hamiltonians.
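For orientation, the textbook TI estimator that this abstract starts from is ΔF = ∫₀¹ ⟨∂U_λ/∂λ⟩_λ dλ, evaluated by sampling each intermediate ensemble separately. The sketch below illustrates only that conventional estimator, not the neural method of the paper. It assumes a 1D harmonic oscillator whose spring constant is interpolated linearly in λ, a case where the Boltzmann distribution is Gaussian and the exact answer ΔF = (kT/2) ln(k1/k0) is known. All names and parameter values (`ti_free_energy`, `k0`, `k1`, `kT`, the number of λ points) are illustrative choices.

```python
import math
import random

def ti_free_energy(k0=1.0, k1=4.0, kT=1.0, n_lambda=11, n_samples=5000, seed=0):
    """Estimate F(k1) - F(k0) for U_lam(x) = 0.5 * k(lam) * x**2 by TI."""
    rng = random.Random(seed)
    lambdas = [i / (n_lambda - 1) for i in range(n_lambda)]
    means = []
    for lam in lambdas:
        k = (1.0 - lam) * k0 + lam * k1      # interpolated spring constant
        sigma = math.sqrt(kT / k)            # Boltzmann distribution is Gaussian here
        total = 0.0
        for _ in range(n_samples):
            x = rng.gauss(0.0, sigma)
            total += 0.5 * (k1 - k0) * x * x  # dU/dlam evaluated at the sample
        means.append(total / n_samples)
    # Trapezoidal rule over the lambda grid
    h = lambdas[1] - lambdas[0]
    return h * (0.5 * means[0] + sum(means[1:-1]) + 0.5 * means[-1])

def exact_free_energy(k0=1.0, k1=4.0, kT=1.0):
    """Analytic reference: F = -kT ln Z with Z proportional to sqrt(kT / k)."""
    return 0.5 * kT * math.log(k1 / k0)

if __name__ == "__main__":
    print(ti_free_energy(), exact_free_energy())
```

Each λ point needs its own sampling run in this conventional scheme, which is the cost the abstract describes avoiding by sampling all intermediate ensembles from one trained model.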
Affiliation(s)
- Bálint Máté
  - Institute for Theoretical Physics, Heidelberg University, 69120 Heidelberg, Germany
  - Department of Computer Science, University of Geneva, 1227 Carouge, Switzerland
  - Department of Physics, University of Geneva, 1211 Geneva, Switzerland
- François Fleuret
  - Department of Computer Science, University of Geneva, 1227 Carouge, Switzerland
- Tristan Bereau
  - Institute for Theoretical Physics, Heidelberg University, 69120 Heidelberg, Germany

4
Yue Y, Li S, Cheng Y, Wang L, Hou T, Zhu Z, He S. Integration of molecular coarse-grained model into geometric representation learning framework for protein-protein complex property prediction. Nat Commun 2024; 15:9629. [PMID: 39511202 PMCID: PMC11544137 DOI: 10.1038/s41467-024-53583-w] [Received: 03/14/2024] [Accepted: 10/16/2024] [Indexed: 11/15/2024]
Abstract
Structure-based machine learning algorithms have been utilized to predict the properties of protein-protein interaction (PPI) complexes, such as binding affinity, which is critical for understanding biological mechanisms and disease treatments. While most existing algorithms represent PPI complex graph structures at the atom-scale or residue-scale, these representations can be computationally expensive or may not sufficiently integrate finer chemical-plausible interaction details for improving predictions. Here, we introduce MCGLPPI, a geometric representation learning framework that combines graph neural networks (GNNs) with MARTINI molecular coarse-grained (CG) models to predict PPI overall properties accurately and efficiently. Extensive experiments on three types of downstream PPI property prediction tasks demonstrate that at the CG-scale, MCGLPPI achieves competitive performance compared with the counterparts at the atom- and residue-scale, but with only a third of computational resource consumption. Furthermore, CG-scale pre-training on protein domain-domain interaction structures enhances its predictive capabilities for PPI tasks. MCGLPPI offers an effective and efficient solution for PPI overall property predictions, serving as a promising tool for the large-scale analysis of biomolecular interactions.
Affiliation(s)
- Yang Yue
  - School of Computer Science, The University of Birmingham, Edgbaston, Birmingham, UK
- Shu Li
  - Macao Polytechnic University, Macao, China
- Yihua Cheng
  - School of Computer Science, The University of Birmingham, Edgbaston, Birmingham, UK
- Lie Wang
  - Bone Marrow Transplantation Center of the First Affiliated Hospital, Institute of Immunology, Zhejiang University School of Medicine, Hangzhou, China
- Tingjun Hou
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Zexuan Zhu
  - National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, China
- Shan He
  - School of Computer Science, The University of Birmingham, Edgbaston, Birmingham, UK
  - Macao Polytechnic University, Macao, China

5
Ma S, Li D, Li X, Hu G. Neural network-assisted model of interfacial fluids with explicit coarse-grained molecular structures. J Chem Phys 2024; 161:174110. [PMID: 39494790 DOI: 10.1063/5.0230195] [Received: 07/22/2024] [Accepted: 10/17/2024] [Indexed: 11/05/2024]
Abstract
Interfacial fluids are ubiquitous in systems ranging from biological membranes to chemical droplets and exhibit a complex behavior due to their nonlinear, multiphase, and multicomponent nature. The development of accurate coarse-grained (CG) models for such systems poses significant challenges, as these models must effectively capture the intricate many-body interactions, both inter- and intramolecular, arising from atomic-level phenomena, and account for the diverse density distributions and fluctuations at the interface. In this study, we use advanced machine learning techniques incorporating force matching and diffusion probabilistic models to construct a robust CG model of interfacial fluids. We evaluate our model through simulations in various settings, including the water-air interface, bulk decane, and dipalmitoylphosphatidylcholine monolayer membranes. Our results show that our CG model accurately reproduces the essential many-body and interfacial properties of interfacial fluids and proves effective across different CG mapping strategies. This work not only validates the utility of our model for multiscale simulations, but also lays the groundwork for future improvements in the simulation of complex interfacial systems.
Affiliation(s)
- Shuhao Ma
  - Department of Engineering Mechanics, Zhejiang University, Hangzhou 310027, People's Republic of China
- Dechang Li
  - Department of Engineering Mechanics, Zhejiang University, Hangzhou 310027, People's Republic of China
- Xuejin Li
  - Department of Engineering Mechanics, Zhejiang University, Hangzhou 310027, People's Republic of China
- Guoqing Hu
  - Department of Engineering Mechanics, Zhejiang University, Hangzhou 310027, People's Republic of China

6
Cheng B. Response Matching for Generating Materials and Molecules. J Chem Theory Comput 2024; 20:9259-9266. [PMID: 39365029 PMCID: PMC11500275 DOI: 10.1021/acs.jctc.4c00998] [Received: 07/31/2024] [Revised: 09/22/2024] [Accepted: 09/24/2024] [Indexed: 10/05/2024]
Abstract
Diffusion models have recently emerged as powerful tools for the generation of new molecular and material structures. The key insight is that the noise in these models is related to the response of the atoms to displacement, and the denoising step is thus analogous to the geometry relaxation of atomistic systems starting from a random structure. Building on this, we present a generative method called Response Matching (RM), which leverages the fact that each stable material or molecule exists at the minimum of its potential energy surface. Any perturbation induces a response in energy and stress, driving the structure back to equilibrium. Matching this response is closely related to score matching in diffusion models. Another important aspect of state-of-the-art diffusion models is the incorporation of physical symmetries such as translation, rotation, and periodicity. RM employs a machine learning interatomic potential and random structure search as the denoising model, inherently respecting these symmetries and exploiting the locality of atomic interactions. RM handles both molecules and bulk materials under the same framework. Its efficiency and generalization are demonstrated on three systems: a small organic molecular data set, stable crystals from the Materials Project, and one-shot learning on a single diamond configuration.
Affiliation(s)
- Bingqing Cheng
  - Department of Chemistry, University of California, Berkeley, California 94720, United States
  - The Institute of Science and Technology Austria, Am Campus 1, 3400 Klosterneuburg, Austria

7
Alakhdar A, Poczos B, Washburn N. Diffusion Models in De Novo Drug Design. J Chem Inf Model 2024; 64:7238-7256. [PMID: 39322943 PMCID: PMC11481093 DOI: 10.1021/acs.jcim.4c01107] [Received: 06/25/2024] [Revised: 09/14/2024] [Accepted: 09/16/2024] [Indexed: 09/27/2024]
Abstract
Diffusion models have emerged as powerful tools for molecular generation, particularly in the context of 3D molecular structures. Inspired by nonequilibrium statistical physics, these models can generate 3D molecular structures with specific properties or requirements crucial to drug discovery. Diffusion models were particularly successful at learning the complex probability distributions of 3D molecular geometries and their corresponding chemical and physical properties through forward and reverse diffusion processes. This review focuses on the technical implementation of diffusion models tailored for 3D molecular generation. It compares the performance, evaluation methods, and implementation details of various diffusion models used for molecular generation tasks. We cover strategies for atom and bond representation, architectures of reverse diffusion denoising networks, and challenges associated with generating stable 3D molecular structures. This review also explores the applications of diffusion models in de novo drug design and related areas of computational chemistry, such as structure-based drug design, including target-specific molecular generation, molecular docking, and molecular dynamics of protein-ligand complexes. We also cover conditional generation on physical properties, conformation generation, and fragment-based drug design. By summarizing the state-of-the-art diffusion models for 3D molecular generation, this review sheds light on their role in advancing drug discovery and their current limitations.
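As a concrete reference for the "forward diffusion process" mentioned above, the following is a minimal sketch of the closed-form noising step used in denoising diffusion probabilistic models, x_t = sqrt(ᾱ_t) x_0 + sqrt(1 - ᾱ_t) ε with ᾱ_t the running product of (1 - β_s). It is not taken from the review. The linear schedule endpoints, the number of steps, the flat list of coordinates, and all function names are assumptions chosen for illustration.

```python
import math
import random

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Noise variances beta_t, increasing linearly over T steps (assumed schedule)."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot for a flat list of coordinates."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]

if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = cumulative_alphas(linear_beta_schedule())
    x0 = [1.0, -0.5, 0.25]          # toy stand-in for flattened atomic coordinates
    print(q_sample(x0, 10, alpha_bars, rng))    # still close to x0
    print(q_sample(x0, 999, alpha_bars, rng))   # essentially pure noise
```

The reverse process is a learned network that undoes these steps one at a time; that part depends on the architecture choices the review compares and is deliberately left out here.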
Affiliation(s)
- Amira Alakhdar
  - Department of Chemistry, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
- Barnabas Poczos
  - Machine Learning Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
- Newell Washburn
  - Department of Chemistry, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
  - Department of Biomedical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States

8
Cheng AH, Ser CT, Skreta M, Guzmán-Cordero A, Thiede L, Burger A, Aldossary A, Leong SX, Pablo-García S, Strieth-Kalthoff F, Aspuru-Guzik A. Spiers Memorial Lecture: How to do impactful research in artificial intelligence for chemistry and materials science. Faraday Discuss 2024. [PMID: 39400305 DOI: 10.1039/d4fd00153b] [Indexed: 10/15/2024]
Abstract
Machine learning has been pervasively touching many fields of science. Chemistry and materials science are no exception. While machine learning has been making a great impact, it is still not reaching its full potential or maturity. In this perspective, we first outline current applications across a diversity of problems in chemistry. Then, we discuss how machine learning researchers view and approach problems in the field. Finally, we provide our considerations for maximizing impact when researching machine learning for chemistry.
Affiliation(s)
- Austin H Cheng
  - Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
  - Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
- Cher Tian Ser
  - Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
  - Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
- Marta Skreta
  - Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
  - Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
- Andrés Guzmán-Cordero
  - Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
  - Tinbergen Institute, University of Amsterdam, Amsterdam, Netherlands
- Luca Thiede
  - Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
  - Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
- Andreas Burger
  - Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
  - Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
- Shi Xuan Leong
  - Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada
  - School of Chemistry, Chemical Engineering and Biotechnology, Nanyang Technological University, Singapore 63737, Singapore
- Alán Aspuru-Guzik
  - Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario M5S 2E4, Canada
  - Vector Institute for Artificial Intelligence, Toronto, Ontario M5G 1M1, Canada
  - Acceleration Consortium, Toronto, Ontario M5G 1X6, Canada
  - Department of Chemical Engineering and Applied Chemistry, University of Toronto, Canada
  - Department of Materials Science and Engineering, University of Toronto, Canada
  - Lebovic Fellow, Canadian Institute for Advanced Research (CIFAR), Canada

9
Norton T, Bhattacharya D. Sifting through the noise: A survey of diffusion probabilistic models and their applications to biomolecules. J Mol Biol 2024:168818. [PMID: 39389290 DOI: 10.1016/j.jmb.2024.168818] [Received: 05/31/2024] [Revised: 09/20/2024] [Accepted: 10/03/2024] [Indexed: 10/12/2024]
Abstract
Diffusion probabilistic models have made their way into a number of high-profile applications since their inception. In particular, there has been a wave of research into using diffusion models in the prediction and design of biomolecular structures and sequences. Their growing ubiquity makes it imperative for researchers in these fields to understand them. This paper serves as a general overview for the theory behind these models and the current state of research. We first introduce diffusion models and discuss common motifs used when applying them to biomolecules. We then present the significant outcomes achieved through the application of these models in generative and predictive tasks. This survey aims to provide readers with a comprehensive understanding of the increasingly critical role of diffusion models.
Affiliation(s)
- Trevor Norton
  - Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, United States

10
Chu LS, Sarma S, Gray JJ. Unified Sampling and Ranking for Protein Docking with DFMDock. bioRxiv 2024:2024.09.27.615401. [PMID: 39386449 PMCID: PMC11463455 DOI: 10.1101/2024.09.27.615401] [Indexed: 10/12/2024]
Abstract
Diffusion models have shown promise in addressing the protein docking problem. Traditionally, these models are used solely for sampling docked poses, with a separate confidence model for ranking. We introduce DFMDock (Denoising Force Matching Dock), a diffusion model that unifies sampling and ranking within a single framework. DFMDock features two output heads: one for predicting forces and the other for predicting energies. The forces are trained using a denoising force matching objective, while the energy gradients are trained to align with the forces. This design enables our model to sample using the predicted forces and rank poses using the predicted energies, thereby eliminating the need for an additional confidence model. Our approach outperforms the previous diffusion model for protein docking, DiffDock-PP, with a sampling success rate of 44% compared to its 8%, and a Top-1 ranking success rate of 16% compared to 0% on the Docking Benchmark 5.5 test set. In successful decoy cases, the DFMDock Energy forms a binding funnel similar to the physics-based Rosetta Energy, suggesting that DFMDock can capture the underlying energy landscape.
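The abstract names a "denoising force matching" objective without specifying it. As background only, the sketch below shows standard denoising score matching on a 1D toy problem, the generic idea of regressing a model onto the known score of the noising kernel. It makes no claim about how DFMDock defines its forces or energies. The data distribution N(0, 1), the linear model s(x) = a·x, the noise level, and the function names are all assumptions made so that the fitted slope can be checked against the exact value -1/(1 + σ²).

```python
import random

def fit_linear_score(sigma=0.5, n=50000, seed=0):
    """Fit s(x) = a * x by denoising score matching for data x0 ~ N(0, 1)."""
    rng = random.Random(seed)
    num = 0.0
    den = 0.0
    for _ in range(n):
        x0 = rng.gauss(0.0, 1.0)           # clean sample
        eps = rng.gauss(0.0, 1.0)
        x_noisy = x0 + sigma * eps         # perturbed sample
        target = -eps / sigma              # score of N(x_noisy; x0, sigma^2) at x_noisy
        num += x_noisy * target
        den += x_noisy * x_noisy
    # Closed-form least-squares minimizer of sum (a * x_noisy - target)^2
    return num / den

def exact_slope(sigma=0.5):
    """Score of the noisy marginal N(0, 1 + sigma^2) is -x / (1 + sigma^2)."""
    return -1.0 / (1.0 + sigma ** 2)

if __name__ == "__main__":
    print(fit_linear_score(), exact_slope())
```

The point of the toy is that regressing onto the per-sample noise target recovers the score of the noisy marginal, even though that marginal is never evaluated directly.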
Affiliation(s)
- Lee-Shin Chu
  - Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Sudeep Sarma
  - Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Jeffrey J Gray
  - Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA

11
Obi P, Gc JB, Mariasoosai C, Diyaolu A, Natesan S. Application of Generative Artificial Intelligence in Predicting Membrane Partitioning of Drugs: Combining Denoising Diffusion Probabilistic Models and MD Simulations Reduces the Computational Cost to One-Third. J Chem Theory Comput 2024; 20:5866-5881. [PMID: 38942732 DOI: 10.1021/acs.jctc.4c00315] [Indexed: 06/30/2024]
Abstract
The optimal interaction of drugs with plasma membranes and membranes of subcellular organelles is a prerequisite for desirable pharmacology. Importantly, for drugs targeting the transmembrane lipid-facing sites of integral membrane proteins, the relative affinity of a drug to the bilayer lipids compared to the surrounding aqueous phase affects the partitioning, access, and binding of the drug to the target site. Molecular dynamics (MD) simulations, including enhanced sampling techniques such as steered MD, umbrella sampling (US), and metadynamics, offer valuable insights into the interactions of drugs with the membrane lipids and water in atomistic detail. However, these methods are computationally prohibitive for the high-throughput screening of drug candidates. This study shows that applying denoising diffusion probabilistic models (DDPMs), a generative AI method, to US simulation data reduces the computational cost significantly. Specifically, the models used only partial (one-third) data from the US simulations and reproduced the complete potential of mean force (PMF) profiles for three FDA-approved drugs (β2-adrenergic agonists) and ∼20 biologically relevant chemicals with known experimentally characterized bilayer locations. Intriguingly, the model can predict the solvation-free energies for partitioning and crossing the bilayer, preferred bilayer locations (low-energy well), and orientations of the ligands with high accuracy. The results indicate that DDPMs can be used to characterize the complete membrane partitioning profile of drug molecules using fewer umbrella sampling simulations at select positions along the bilayer normal (z-axis), irrespective of their amphiphilic-lipophilic-cephalophilic characteristics.
Affiliation(s)
- Peter Obi
  - College of Pharmacy and Pharmaceutical Sciences, Washington State University, Spokane, Washington 99202, United States
- Jeevan B Gc
  - The Center for Protein Degradation, Dana-Farber Cancer Institute, Boston, Massachusetts 02115, United States
- Charles Mariasoosai
  - College of Pharmacy and Pharmaceutical Sciences, Washington State University, Spokane, Washington 99202, United States
- Ayobami Diyaolu
  - College of Pharmacy and Pharmaceutical Sciences, Washington State University, Spokane, Washington 99202, United States
- Senthil Natesan
  - College of Pharmacy and Pharmaceutical Sciences, Washington State University, Spokane, Washington 99202, United States

12
Duignan TT. The Potential of Neural Network Potentials. ACS Phys Chem Au 2024; 4:232-241. [PMID: 38800721 PMCID: PMC11117678 DOI: 10.1021/acsphyschemau.4c00004] [Received: 01/18/2024] [Revised: 03/04/2024] [Accepted: 03/05/2024] [Indexed: 05/29/2024]
Abstract
In the next half-century, physical chemistry will likely undergo a profound transformation, driven predominantly by the combination of recent advances in quantum chemistry and machine learning (ML). Specifically, equivariant neural network potentials (NNPs) are a breakthrough new tool that is already enabling us to simulate systems at the molecular scale with unprecedented accuracy and speed, relying on nothing but fundamental physical laws. The continued development of this approach will realize Paul Dirac's 80-year-old vision of using quantum mechanics to unify physics with chemistry and provide invaluable tools for understanding materials science, biology, earth sciences, and beyond. The era of highly accurate and efficient first-principles molecular simulations will provide a wealth of training data that can be used to build automated computational methodologies, using tools such as diffusion models, for the design and optimization of systems at the molecular scale. Large language models (LLMs) will also evolve into increasingly indispensable tools for literature review, coding, idea generation, and scientific writing.

13
Janson G, Feig M. Transferable deep generative modeling of intrinsically disordered protein conformations. PLoS Comput Biol 2024; 20:e1012144. [PMID: 38781245 PMCID: PMC11152266 DOI: 10.1371/journal.pcbi.1012144] [Received: 02/08/2024] [Revised: 06/05/2024] [Accepted: 05/07/2024] [Indexed: 05/25/2024]
Abstract
Intrinsically disordered proteins have dynamic structures through which they play key biological roles. The elucidation of their conformational ensembles is a challenging problem requiring an integrated use of computational and experimental methods. Molecular simulations are a valuable computational strategy for constructing structural ensembles of disordered proteins but are highly resource-intensive. Recently, machine learning approaches based on deep generative models that learn from simulation data have emerged as an efficient alternative for generating structural ensembles. However, such methods currently suffer from limited transferability when modeling sequences and conformations absent in the training data. Here, we develop a novel generative model that achieves high levels of transferability for intrinsically disordered protein ensembles. The approach, named idpSAM, is a latent diffusion model based on transformer neural networks. It combines an autoencoder to learn a representation of protein geometry and a diffusion model to sample novel conformations in the encoded space. IdpSAM was trained on a large dataset of simulations of disordered protein regions performed with the ABSINTH implicit solvent model. Thanks to the expressiveness of its neural networks and its training stability, idpSAM faithfully captures 3D structural ensembles of test sequences with no similarity to sequences in the training set. Our study also demonstrates the potential for generating full conformational ensembles from datasets with limited sampling and underscores the importance of training set size for generalization. We believe that idpSAM represents significant progress in transferable protein ensemble modeling through machine learning.
Affiliation(s)
- Giacomo Janson
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan, United States of America
- Michael Feig
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan, United States of America

14
Liu Y, Ghosh TK, Lin G, Chen M. Unbiasing Enhanced Sampling on a High-Dimensional Free Energy Surface with a Deep Generative Model. J Phys Chem Lett 2024; 15:3938-3945. [PMID: 38568182 DOI: 10.1021/acs.jpclett.3c03515] [Indexed: 04/12/2024]
Abstract
Biased enhanced sampling methods that utilize collective variables (CVs) are powerful tools for sampling conformational ensembles. Due to their large intrinsic dimensions, efficiently generating conformational ensembles for complex systems requires enhanced sampling on high-dimensional free energy surfaces. While temperature-accelerated molecular dynamics (TAMD) can trivially adopt many CVs in a simulation, unbiasing the simulation to generate unbiased conformational ensembles requires accurate modeling of a high-dimensional CV probability distribution, which is challenging for traditional density estimation techniques. Here we propose an unbiasing method based on the score-based diffusion model, a deep generative learning method that excels in density estimation across complex data landscapes. We demonstrate that this unbiasing approach, tested on multiple TAMD simulations, significantly outperforms traditional unbiasing methods and can generate accurate unbiased conformational ensembles. With the proposed approach, TAMD can adopt CVs that focus on improving sampling efficiency and the proposed unbiasing method enables accurate evaluation of ensemble averages of important chemical features.
Collapse
Affiliation(s)
- Yikai Liu
- Department of Mechanical Engineering, Purdue University, West Lafayette, Indiana 47906, United States
- Tushar K Ghosh
- Department of Chemistry, Purdue University, West Lafayette, Indiana 47906, United States
- Guang Lin
- Department of Mechanical Engineering, Purdue University, West Lafayette, Indiana 47906, United States
- Ming Chen
- Department of Chemistry, Purdue University, West Lafayette, Indiana 47906, United States
15
Hsu T, Sadigh B, Bulatov V, Zhou F. Score Dynamics: Scaling Molecular Dynamics with Picoseconds Time Steps via Conditional Diffusion Model. J Chem Theory Comput 2024; 20:2335-2348. [PMID: 38489243 DOI: 10.1021/acs.jctc.3c01361] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/17/2024]
Abstract
We propose score dynamics (SD), a general framework for learning accelerated evolution operators with large timesteps from molecular dynamics (MD) simulations. SD is centered around scores or derivatives of the transition log-probability with respect to the dynamical degrees of freedom. The latter play the same role as force fields in MD but are used in denoising diffusion probability models to generate discrete transitions of the dynamical variables in an SD time step, which can be orders of magnitude larger than a typical MD time step. In this work, we construct graph neural network-based SD models of realistic molecular systems that are evolved with 10 ps timesteps. We demonstrate the efficacy of SD with case studies of the alanine dipeptide and short alkanes in aqueous solution. Both equilibrium predictions derived from the stationary distributions of the conditional probability and kinetic predictions for the transition rates and transition paths are in good agreement with MD. Our current SD implementation is about 2 orders of magnitude faster than the MD counterpart for the systems studied in this work. Open challenges and possible future remedies to improve SD are also discussed.
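As a hedged restatement of the definition given in this abstract, the score can be written as the gradient of the transition log-probability with respect to the updated degrees of freedom. The symbols $x_t$ (configuration at time $t$), $\Delta t$ (the SD time step), and $s$ are notation assumed here for illustration; they are not taken from the abstract.

```latex
% Assumed notation: x_t is the configuration at time t, \Delta t the SD time step.
s\left(x_{t+\Delta t} \mid x_t\right)
  = \nabla_{x_{t+\Delta t}} \log p\left(x_{t+\Delta t} \mid x_t\right)
```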
Affiliation(s)
- Tim Hsu
- Lawrence Livermore National Laboratory, Livermore, California 94551, United States
- Babak Sadigh
- Lawrence Livermore National Laboratory, Livermore, California 94551, United States
- Vasily Bulatov
- Lawrence Livermore National Laboratory, Livermore, California 94551, United States
- Fei Zhou
- Lawrence Livermore National Laboratory, Livermore, California 94551, United States
16
Caparotta M, Perez A. Advancing Molecular Dynamics: Toward Standardization, Integration, and Data Accessibility in Structural Biology. J Phys Chem B 2024; 128:2219-2227. [PMID: 38418288 DOI: 10.1021/acs.jpcb.3c04823] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/01/2024]
Abstract
Molecular dynamics (MD) simulations have become a valuable tool in structural biology, offering insights into complex biological systems that are difficult to obtain through experimental techniques alone. The lack of available data sets and structures in most published computational work has limited other researchers' use of these models. In recent years, the emergence of online sharing platforms and MD database initiatives has favored the deposition of ensembles and structures to accompany publications, encouraging reuse of the data sets. However, the lack of uniform metadata collection, formats, and what data are deposited limits the impact and its use by different communities that are not necessarily experts in MD. This Perspective highlights the need for standardization and better resource sharing for processing and interpreting MD simulation results, akin to efforts in other areas of structural biology. As the field moves forward, we will see an increase in popularity and benefits of MD-based integrative approaches combining experimental data and simulations through probabilistic reasoning, but these too are limited by uniformity in experimental data availability and choices on how the data are modeled that are not trivial to decipher from papers. Other fields have addressed similar challenges comprehensively by establishing task forces with different degrees of success. The large scope and number of communities to represent the breadth of types of MD simulations complicate a parallel approach that would fit all. Thus, each group typically decides what data and which format to upload on servers like Zenodo. Uploading data with FAIR (findable, accessible, interoperable, reusable) principles in mind including optimal metadata collection will make the data more accessible and actionable by the community. Such a wealth of simulation data will foster method development and infrastructure advancements, thus propelling the field forward.
Affiliation(s)
- Marcelo Caparotta
- Department of Chemistry and Quantum Theory Project, University of Florida, Gainesville, Florida 32611, United States
- Alberto Perez
- Department of Chemistry and Quantum Theory Project, University of Florida, Gainesville, Florida 32611, United States
17
Matthies MC, Krueger R, Torda AE, Ward M. Differentiable partition function calculation for RNA. Nucleic Acids Res 2024; 52:e14. [PMID: 38038257 PMCID: PMC10853804 DOI: 10.1093/nar/gkad1168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2023] [Revised: 10/24/2023] [Accepted: 11/28/2023] [Indexed: 12/02/2023] Open
Abstract
Ribonucleic acid (RNA) is an essential molecule in a wide range of biological functions. In 1990, McCaskill introduced a dynamic programming algorithm for computing the partition function of an RNA sequence. McCaskill's algorithm is widely used today for understanding the thermodynamic properties of RNA. In this work, we introduce a generalization of McCaskill's algorithm that is well-defined over continuous inputs. Crucially, this enables us to implement an end-to-end differentiable partition function calculation. The derivative can be computed with respect to the input, or to any other fixed values, such as the parameters of the energy model. This builds a bridge between RNA thermodynamics and the tools of differentiable programming including deep learning as it enables the partition function to be incorporated directly into any end-to-end differentiable pipeline. To demonstrate the effectiveness of our new approach, we tackle the inverse folding problem directly using gradient optimization. We find that using the gradient to optimize the sequence directly is sufficient to arrive at sequences with a high probability of folding into the desired structure. This indicates that the gradients we compute are meaningful.
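To make the idea of a partition function that is well-defined over continuous inputs more concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a toy base-pair-only energy model (the weights in `PAIR_WEIGHT`, the `MIN_LOOP` constraint, and all function names are hypothetical) and relaxes the sequence to per-position nucleotide probabilities. With one-hot rows it reduces to an ordinary discrete McCaskill-style recursion over nested structures.

```python
# Toy Boltzmann weights exp(-E/RT) for canonical and wobble pairs.
# Illustrative values only; not taken from any published energy model.
BASES = "ACGU"
PAIR_WEIGHT = {("A", "U"): 2.0, ("U", "A"): 2.0,
               ("G", "C"): 3.0, ("C", "G"): 3.0,
               ("G", "U"): 1.5, ("U", "G"): 1.5}
MIN_LOOP = 3  # minimum number of unpaired bases enclosed by a pair


def one_hot(seq):
    """Encode a discrete sequence as rows of nucleotide probabilities."""
    return [[1.0 if b == c else 0.0 for c in BASES] for b in seq]


def expected_pair_weight(p_i, p_j):
    """Pair weight averaged over the nucleotide distributions at two positions."""
    return sum(p_i[a] * p_j[b] * PAIR_WEIGHT.get((BASES[a], BASES[b]), 0.0)
               for a in range(4) for b in range(4))


def partition_function(probs):
    """Sum of Boltzmann weights over all nested structures of a relaxed sequence."""
    n = len(probs)
    # z[i][j] covers bases i..j-1 (half-open interval); empty spans have Z = 1.
    z = [[1.0] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            total = z[i][j - 1]                      # last base (j-1) is unpaired
            for k in range(i, j - 1 - MIN_LOOP):     # last base pairs with base k
                w = expected_pair_weight(probs[k], probs[j - 1])
                total += z[i][k] * w * z[k + 1][j - 1]
            z[i][j] = total
    return z[0][n]
```

Because every step is a sum or product of the input probabilities, the result is a polynomial in those inputs, so an automatic differentiation framework could in principle propagate gradients through it. For example, `partition_function(one_hot("GAAAC"))` counts the open chain plus the single G-C pair permitted by the loop constraint.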
Affiliation(s)
- Marco C Matthies
- Centre for Bioinformatics, University of Hamburg, Bundesstr. 43, 20146 Hamburg, Germany
- Ryan Krueger
- Department of Applied Mathematics, Harvard University, 29 Oxford St, Cambridge, MA 02138, USA
- Andrew E Torda
- Centre for Bioinformatics, University of Hamburg, Bundesstr. 43, 20146 Hamburg, Germany
- Max Ward
- Department of Computer Science and Software Engineering, The University of Western Australia, 241, 35 Stirling Hwy, Crawley, WA 6009, Australia
18
Janson G, Feig M. Transferable deep generative modeling of intrinsically disordered protein conformations. bioRxiv 2024:2024.02.08.579522. [PMID: 38370653 PMCID: PMC10871340 DOI: 10.1101/2024.02.08.579522] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/20/2024]
Abstract
Intrinsically disordered proteins have dynamic structures through which they play key biological roles. The elucidation of their conformational ensembles is a challenging problem requiring an integrated use of computational and experimental methods. Molecular simulations are a valuable computational strategy for constructing structural ensembles of disordered proteins but are highly resource-intensive. Recently, machine learning approaches based on deep generative models that learn from simulation data have emerged as an efficient alternative for generating structural ensembles. However, such methods currently suffer from limited transferability when modeling sequences and conformations absent in the training data. Here, we develop a novel generative model that achieves high levels of transferability for intrinsically disordered protein ensembles. The approach, named idpSAM, is a latent diffusion model based on transformer neural networks. It combines an autoencoder to learn a representation of protein geometry and a diffusion model to sample novel conformations in the encoded space. idpSAM was trained on a large dataset of simulations of disordered protein regions performed with the ABSINTH implicit solvent model. Thanks to the expressiveness of its neural networks and its training stability, idpSAM faithfully captures 3D structural ensembles of test sequences with no similarity in the training set. Our study also demonstrates the potential for generating full conformational ensembles from datasets with limited sampling and underscores the importance of training set size for generalization. We believe that idpSAM represents significant progress in transferable protein ensemble modeling through machine learning.
Affiliation(s)
- Giacomo Janson
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan, USA
- Michael Feig
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan, USA
19
Jones MS, Shmilovich K, Ferguson AL. DiAMoNDBack: Diffusion-Denoising Autoregressive Model for Non-Deterministic Backmapping of Cα Protein Traces. J Chem Theory Comput 2023; 19:7908-7923. [PMID: 37906711 DOI: 10.1021/acs.jctc.3c00840] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 11/02/2023]
Abstract
Coarse-grained molecular models of proteins permit access to length and time scales unattainable by all-atom models and the simulation of processes that occur on long time scales, such as aggregation and folding. The reduced resolution realizes computational accelerations, but an atomistic representation can be vital for a complete understanding of mechanistic details. Backmapping is the process of restoring all-atom resolution to coarse-grained molecular models. In this work, we report DiAMoNDBack (Diffusion-denoising Autoregressive Model for Non-Deterministic Backmapping) as an autoregressive denoising diffusion probability model to restore all-atom details to coarse-grained protein representations retaining only Cα coordinates. The autoregressive generation process proceeds from the protein N-terminus to C-terminus in a residue-by-residue fashion conditioned on the Cα trace and previously backmapped backbone and side-chain atoms within the local neighborhood. The local and autoregressive nature of our model makes it transferable between proteins. The stochastic nature of the denoising diffusion process means that the model generates a realistic ensemble of backbone and side-chain all-atom configurations consistent with the coarse-grained Cα trace. We train DiAMoNDBack over 65k+ structures from the Protein Data Bank (PDB) and validate it in applications to a hold-out PDB test set, intrinsically disordered protein structures from the Protein Ensemble Database (PED), molecular dynamics simulations of fast-folding mini-proteins from DE Shaw Research, and coarse-grained simulation data. We achieve state-of-the-art reconstruction performance in terms of correct bond formation, avoidance of side-chain clashes, and the diversity of the generated side-chain configurational states. We make the DiAMoNDBack model publicly available as a free and open-source Python package.
Affiliation(s)
- Michael S Jones
- Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
- Kirill Shmilovich
- Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
- Andrew L Ferguson
- Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
20
Navarro C, Majewski M, De Fabritiis G. Top-Down Machine Learning of Coarse-Grained Protein Force Fields. J Chem Theory Comput 2023; 19:7518-7526. [PMID: 37874270 PMCID: PMC10777392 DOI: 10.1021/acs.jctc.3c00638] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Indexed: 10/25/2023]
Abstract
Developing accurate and efficient coarse-grained representations of proteins is crucial for understanding their folding, function, and interactions over extended time scales. Our methodology involves simulating proteins with molecular dynamics and utilizing the resulting trajectories to train a neural network potential through differentiable trajectory reweighting. Remarkably, this method requires only the native conformation of proteins, eliminating the need for labeled data derived from extensive simulations or memory-intensive end-to-end differentiable simulations. Once trained, the model can be employed to run parallel molecular dynamics simulations and sample folding events for proteins both within and beyond the training distribution, showcasing its extrapolation capabilities. By applying Markov state models, native-like conformations of the simulated proteins can be predicted from the coarse-grained simulations. Owing to its theoretical transferability and ability to use solely experimental static structures as training data, we anticipate that this approach will prove advantageous for developing new protein force fields and further advancing the study of protein dynamics, folding, and interactions.
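The reweighting step that trajectory-reweighting schemes of this kind build on can be sketched as follows. This is a generic thermodynamic-perturbation illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, energies are assumed to be in units consistent with `beta`, and `u_ref` denotes the potential that generated the samples while `u_new` denotes the candidate potential.

```python
import math


def reweight(u_ref, u_new, beta):
    """Normalized weights w_i proportional to exp(-beta * (u_new_i - u_ref_i))."""
    log_w = [-beta * (un - ur) for ur, un in zip(u_ref, u_new)]
    shift = max(log_w)  # subtract the maximum for numerical stability
    unnorm = [math.exp(lw - shift) for lw in log_w]
    total = sum(unnorm)
    return [w / total for w in unnorm]


def effective_sample_size(weights):
    """Kish effective sample size, 1 / sum(w_i^2), for normalized weights."""
    return 1.0 / sum(w * w for w in weights)


def reweighted_average(observable, weights):
    """Estimate of an ensemble average under the candidate potential."""
    return sum(o * w for o, w in zip(observable, weights))
```

When the two potentials differ only by a constant, the weights are uniform and the effective sample size equals the number of frames. A sharply reduced effective sample size signals that the stored trajectory no longer represents the candidate potential well.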
Affiliation(s)
- Carles Navarro
- Acellera Labs, Doctor Trueta 183, 08005 Barcelona, Spain
- Gianni De Fabritiis
- Computational Science Laboratory, Universitat Pompeu Fabra, Barcelona Biomedical Research Park (PRBB), Carrer Dr. Aiguader 88, 08003 Barcelona, Spain
- Acellera Ltd., Devonshire House 582, Middlesex HA7 1JS, United Kingdom
- Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig Lluis Companys 23, 08010 Barcelona, Spain