1. Shi J, Zhang X, Punyapu VR, Getman RB. Prediction of hydration energies of adsorbates at Pt(111) and liquid water interfaces using machine learning. J Chem Phys 2025; 162:084106. [PMID: 39998168 DOI: 10.1063/5.0248572] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2024] [Accepted: 02/06/2025] [Indexed: 02/26/2025] Open
Abstract
Aqueous phase heterogeneous catalysis is important to various industrial processes, including biomass conversion, Fischer-Tropsch synthesis, and electrocatalysis. Accurate calculation of solvation thermodynamic properties is essential for modeling the performance of catalysts for these processes. Explicit solvation methods employing multiscale modeling, e.g., those involving density functional theory and molecular dynamics, have emerged for this purpose. Although accurate, these methods are computationally intensive. This study introduces machine learning (ML) models to predict solvation thermodynamics for adsorbates on a Pt(111) surface, aiming to enhance computational efficiency without compromising accuracy. In particular, ML models are developed using a combination of molecular descriptors and fingerprints and trained on previously published water-adsorbate interaction energies, energies of solvation, and free energies of solvation of adsorbates bound to Pt(111). These models achieve root mean square error values of 0.09 eV for interaction energies, 0.04 eV for energies of solvation, and 0.06 eV for free energies of solvation, demonstrating accuracy within the standard error of multiscale modeling. Feature importance analysis reveals that hydrogen bonding, van der Waals interactions, and solvent density, together with the properties of the adsorbate, are critical factors influencing solvation thermodynamics. These findings suggest that ML models can provide rapid and reliable predictions of solvation properties. This approach not only reduces computational costs but also offers insights into the solvation characteristics of adsorbates at Pt(111)-water interfaces.
Affiliation(s)
- Jiexin Shi
  - Department of Chemical and Biomolecular Engineering, Clemson University, Clemson, South Carolina 29634-0909, USA
  - William G. Lowrie Department of Chemical and Biomolecular Engineering, The Ohio State University, Columbus, Ohio 43210, USA
- Xiaohong Zhang
  - Department of Chemical and Biomolecular Engineering, Clemson University, Clemson, South Carolina 29634-0909, USA
- Venkata Rohit Punyapu
  - William G. Lowrie Department of Chemical and Biomolecular Engineering, The Ohio State University, Columbus, Ohio 43210, USA
- Rachel B Getman
  - Department of Chemical and Biomolecular Engineering, Clemson University, Clemson, South Carolina 29634-0909, USA
  - William G. Lowrie Department of Chemical and Biomolecular Engineering, The Ohio State University, Columbus, Ohio 43210, USA
2. Yadav AK, Prakash MV, Bandyopadhyay P. Physics-Based Machine Learning to Predict Hydration Free Energies for Small Molecules with a Minimal Number of Descriptors: Interpretable and Accurate. J Phys Chem B 2025; 129:1640-1647. [PMID: 39841935 DOI: 10.1021/acs.jpcb.4c07090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2025]
Abstract
Hydration free energy (HFE) of molecules is a fundamental property having importance throughout chemistry and biology. Calculation of the HFE can be challenging and expensive with classical molecular dynamics simulation-based approaches. Machine learning (ML) models are increasingly being used to predict HFE. Although the accuracy of ML models for data sets for small molecules is impressive, these models suffer from a lack of interpretability. In this work, we have developed a physics-based ML model with only six descriptors, which is both accurate and fully interpretable, and applied it to a database for small molecule HFE, FreeSolv. We evaluated the electrostatic energy by an approximate closed form of the Generalized Born (GB) model and polar surface area. In addition, we use log P, the numbers of hydrogen bond acceptors and donors, and the number of rotatable bonds as descriptors. We have used different ML models, such as random forest and extreme gradient boosting. The best result from these models has a mean absolute error of only 0.74 kcal/mol. The main power of this model is that the descriptors have clear physical meaning, and the descriptors for the electrostatics and the polar surface area, followed by the hydrogen bond donors and acceptors, were found to be the most important factors for the calculation of hydration free energy.
Affiliation(s)
- Ajeet Kumar Yadav
  - School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India
- Marvin V Prakash
  - School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India
- Pradipta Bandyopadhyay
  - School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India
3. Isaev VV, Minenkov Y. Comparative study of various molecular feature representations for solvation free energy predictions of neutral species. J Mol Graph Model 2025; 134:108901. [PMID: 39515275 DOI: 10.1016/j.jmgm.2024.108901] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2023] [Revised: 10/13/2024] [Accepted: 10/31/2024] [Indexed: 11/16/2024]
Abstract
Predicting molecular properties with the help of Neural Networks is a common way to substitute or enhance comprehensive quantum-chemical calculations. One of the problems facing researchers is the choice of vectorization approach for representing the solvent and the solute for the estimator model. In this work, 10 different approaches have been investigated for both organic solute and solvent, including vectorizers that relied on macroscopic parameters, functional groups classification, molecular graphs, or atomic coordinates. A variation of the Bag of Bonds approach called JustBonds, trained on the MNSol database, showed the best overall performance, resulting in RMSD <2 kcal/mol for the blind dataset that contains the solutes not present in the training subset and <1 kcal/mol on records from the Solv@TUM database, which is close to contemporary continuum models. We have also demonstrated that the most important bags usually contain a heteroatom and play a key role in the solvation process. Furthermore, solvent vectorization was shown to play only a small role: approaches based on functional groups or macroscopic solvent parameters are often enough to efficiently represent solvent media.
Affiliation(s)
- Valerii V Isaev
  - Lomonosov Moscow State University, Leninskie gory 1 bld. 3, 119991, Moscow, Russia
  - N.N. Semenov Federal Research Center for Chemical Physics, Kosygina Street 4, 119991, Moscow, Russia
- Yury Minenkov
  - N.N. Semenov Federal Research Center for Chemical Physics, Kosygina Street 4, 119991, Moscow, Russia
4. Röcken S, Burnet AF, Zavadlav J. Predicting solvation free energies with an implicit solvent machine learning potential. J Chem Phys 2024; 161:234101. [PMID: 39679504 DOI: 10.1063/5.0235189] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2024] [Accepted: 11/29/2024] [Indexed: 12/17/2024] Open
Abstract
Machine learning (ML) potentials are a powerful tool in molecular modeling, enabling ab initio accuracy for comparably small computational costs. Nevertheless, all-atom simulations employing best-performing graph neural network architectures are still too expensive for applications requiring extensive sampling, such as free energy computations. Implicit solvent models could provide the necessary speed-up due to reduced degrees of freedom and faster dynamics. Here, we introduce a Solvation Free Energy Path Reweighting (ReSolv) framework to parameterize an implicit solvent ML potential for small organic molecules that accurately predicts the hydration free energy, an essential parameter in drug design and pollutant modeling. Learning on a combination of experimental hydration free energy data and ab initio data of molecules in vacuum, ReSolv bypasses the need for intractable ab initio data of molecules in an explicit bulk solvent and does not have to resort to less accurate data-generating models. On the FreeSolv dataset, ReSolv achieves a mean absolute error close to average experimental uncertainty, significantly outperforming standard explicit solvent force fields. Compared to the explicit solvent ML potential, ReSolv offers a computational speedup of four orders of magnitude and attains closer agreement with experiments. The presented framework paves the way for deep molecular models that are more accurate yet computationally more cost-effective than classical atomistic models.
Affiliation(s)
- Sebastien Röcken
  - Multiscale Modeling of Fluid Materials, Department of Engineering Physics and Computation, TUM School of Engineering and Design, Technical University of Munich, Munich, Germany
- Anton F Burnet
  - Multiscale Modeling of Fluid Materials, Department of Engineering Physics and Computation, TUM School of Engineering and Design, Technical University of Munich, Munich, Germany
- Julija Zavadlav
  - Multiscale Modeling of Fluid Materials, Department of Engineering Physics and Computation, TUM School of Engineering and Design, Technical University of Munich, Munich, Germany
5. Chung J, Li J, Saimon AI, Hong P, Kong Z. Predicting the stereoselectivity of chemical reactions by composite machine learning method. Sci Rep 2024; 14:12131. [PMID: 38802415 PMCID: PMC11130203 DOI: 10.1038/s41598-024-62158-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2024] [Accepted: 05/14/2024] [Indexed: 05/29/2024] Open
Abstract
Stereoselective reactions have played a vital role in the emergence of life, evolution, human biology, and medicine. However, for a long time, most industrial and academic efforts followed a trial-and-error approach for asymmetric synthesis in stereoselective reactions. In addition, most previous studies have been qualitatively focused on the influence of steric and electronic effects on stereoselective reactions. Therefore, quantitatively understanding the stereoselectivity of a given chemical reaction is extremely difficult. As proof of principle, this paper develops a novel composite machine learning method for quantitatively predicting the enantioselectivity representing the degree to which one enantiomer is preferentially produced from the reactions. Specifically, machine learning methods that are widely used in data analytics, including Random Forest, Support Vector Regression, and LASSO, are utilized. In addition, the Bayesian optimization and permutation importance tests are provided for an in-depth understanding of reactions and accurate prediction. Finally, the proposed composite method approximates the key features of the available reactions by using Gaussian mixture models, which provide suitable machine learning methods for new reactions. The case studies using the real stereoselective reactions show that the proposed method is effective and provides a solid foundation for further application to other chemical reactions.
Affiliation(s)
- Jihoon Chung
  - Department of Industrial Engineering, Pusan National University, Busan, Korea
- Justin Li
  - Management, Entrepreneurship, and Technology, University of California, Berkeley, CA, USA
- Amirul Islam Saimon
  - Grado Department of Industrial and Systems Engineering, Virginia Tech, Blacksburg, VA, USA
- Pengyu Hong
  - Department of Computer Science, Brandeis University, Waltham, MA, USA
- Zhenyu Kong
  - Grado Department of Industrial and Systems Engineering, Virginia Tech, Blacksburg, VA, USA
6. Si P, Jayanth A, Andreussi O. Soft-sphere continuum solvation models for nonaqueous solvents. J Comput Chem 2024; 45:719-737. [PMID: 38112395 DOI: 10.1002/jcc.27254] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2023] [Revised: 10/24/2023] [Accepted: 10/30/2023] [Indexed: 12/21/2023]
Abstract
Solvation effects profoundly influence the characteristics and behavior of chemical systems in liquid solutions. The interaction between solute and solvent molecules intricately impacts solubility, reactivity, stability, and various chemical processes. Continuum solvation models gained prominence in quantum chemistry by implicitly capturing these interactions and enabling efficient investigations of diverse chemical systems in solution. In comparison, continuum solvation models in condensed matter simulation are very recent. Among these, the self-consistent continuum solvation (SCCS) and the soft-sphere continuum solvation (SSCS) models have been among the first to be successfully parameterized and extended to model periodic systems in aqueous solutions and electrolytes. Like most continuum approaches, these models depend on a number of parameters that are linked to experimental or theoretical properties of the solvent, or that can be tuned based on reference data. Here, we present a systematic parameterization of the SSCS model for over 100 nonaqueous solvents. We validate the model's efficacy across diverse solvent environments by leveraging experimental solvation free energies and partition coefficients from comprehensive databases. The average root mean square error over all the solvents was calculated as 0.85 kcal/mol, which is below chemical accuracy (1 kcal/mol). Similarly to what has been reported by Hille et al. (J. Chem. Phys. 2019, 150, 041710) for the SCCS model, a single-parameter model accurately reproduces experimental solvation energies, showcasing the transferability and predictive power of these continuum approaches. Our findings underscore the potential for a unified approach to predict solvation properties, paving the way for enhanced computational studies across various chemical environments.
Affiliation(s)
- Pradip Si
  - Department of Chemistry, University of North Texas, Denton, Texas, USA
- Ajay Jayanth
  - Texas Academy of Math and Science, University of North Texas, Denton, Texas, USA
- Oliviero Andreussi
  - Department of Chemistry and Biochemistry, Boise State University, Boise, Idaho, USA
7. Ferraz-Caetano J, Teixeira F, Cordeiro MNDS. Explainable Supervised Machine Learning Model To Predict Solvation Gibbs Energy. J Chem Inf Model 2024; 64:2250-2262. [PMID: 37603608 PMCID: PMC11005042 DOI: 10.1021/acs.jcim.3c00544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2023] [Indexed: 08/23/2023]
Abstract
Many challenges persist in developing accurate computational models for predicting solvation free energy (ΔGsol). Despite recent developments in Machine Learning (ML) methodologies that outperformed traditional quantum mechanical models, several issues remain concerning explanatory insights for broad chemical predictions with an acceptable speed-accuracy trade-off. To overcome this, we present a novel supervised ML model to predict the ΔGsol for an array of solvent-solute pairs. Using two different ensemble regressor algorithms, we made fast and accurate property predictions using open-source chemical features, encoding complex electronic, structural, and surface area descriptors for every solvent and solute. By integrating molecular properties and chemical interaction features, we have analyzed individual descriptor importance and optimized our model through explanatory information from feature groups. On aqueous and organic solvent databases, ML models revealed the predictive relevance of solutes with increasing polar surface area and decreasing polarizability, yielding better results than state-of-the-art benchmark Neural Network methods (without complex quantum mechanical or molecular dynamics simulations). Both algorithms successfully outperformed previous ΔGsol prediction methods, with a maximum absolute error of 0.22 ± 0.02 kcal mol⁻¹, further validated in an external benchmark database and with solvent hold-out tests. These explanatory and statistical insights allow a thoughtful application of this method for predicting other thermodynamic properties, stressing the relevance of ML modeling for further complex computational chemistry problems.
Affiliation(s)
- José Ferraz-Caetano
  - Department of Chemistry and Biochemistry, Faculty of Sciences, University of Porto, Rua do Campo Alegre, S/N, 4169-007 Porto, Portugal
- Filipe Teixeira
  - Centre of Chemistry, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal
- M. Natália D. S. Cordeiro
  - Department of Chemistry and Biochemistry, Faculty of Sciences, University of Porto, Rua do Campo Alegre, S/N, 4169-007 Porto, Portugal
8. Chung Y, Green WH. Machine learning from quantum chemistry to predict experimental solvent effects on reaction rates. Chem Sci 2024; 15:2410-2424. [PMID: 38362410 PMCID: PMC10866337 DOI: 10.1039/d3sc05353a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Accepted: 01/04/2024] [Indexed: 02/17/2024] Open
Abstract
Fast and accurate prediction of solvent effects on reaction rates is crucial for kinetic modeling, chemical process design, and high-throughput solvent screening. Despite recent advances in machine learning, a scarcity of reliable data has hindered the development of predictive models that are generalizable for diverse reactions and solvents. In this work, we generate a large set of data with the COSMO-RS method for over 28 000 neutral reactions and 295 solvents and train a machine learning model to predict the solvation free energy and solvation enthalpy of activation (ΔΔG‡solv, ΔΔH‡solv) for a solution-phase reaction. On unseen reactions, the model achieves mean absolute errors of 0.71 and 1.03 kcal mol⁻¹ for ΔΔG‡solv and ΔΔH‡solv, respectively, relative to the COSMO-RS calculations. The model also provides reliable predictions of relative rate constants within a factor of 4 when tested on experimental data. The presented model can provide nearly instantaneous predictions of kinetic solvent effects or relative rate constants for a broad range of neutral closed-shell or free radical reactions and solvents based only on atom-mapped reaction SMILES and solvent SMILES strings.
Affiliation(s)
- Yunsie Chung
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- William H Green
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
9. Song Z, Chen J, Cheng J, Chen G, Qi Z. Computer-Aided Molecular Design of Ionic Liquids as Advanced Process Media: A Review from Fundamentals to Applications. Chem Rev 2024; 124:248-317. [PMID: 38108629 DOI: 10.1021/acs.chemrev.3c00223] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/19/2023]
Abstract
The unique physicochemical properties, flexible structural tunability, and giant chemical space of ionic liquids (ILs) provide them a great opportunity to match different target properties to work as advanced process media. The crux of the matter is how to efficiently and reliably tailor suitable ILs toward a specific application. In this regard, the computer-aided molecular design (CAMD) approach has been widely adapted to cover this family of high-profile chemicals, that is, to perform computer-aided IL design (CAILD). This review discusses the past developments that have contributed to the state-of-the-art of CAILD and provides a perspective about how future works could pursue the acceleration of the practical application of ILs. In a broad context of CAILD, key aspects related to the forward structure-property modeling and reverse molecular design of ILs are overviewed. For the former forward task, diverse IL molecular representations, modeling algorithms, as well as representative models on physical properties, thermodynamic properties, among others of ILs are introduced. For the latter reverse task, representative works formulating different molecular design scenarios are summarized. Beyond the substantial progress made, some future perspectives to move CAILD a step forward are finally provided.
Affiliation(s)
- Zhen Song
  - State Key Laboratory of Chemical Engineering, School of Chemical Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Jiahui Chen
  - State Key Laboratory of Chemical Engineering, School of Chemical Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Jie Cheng
  - State Key Laboratory of Chemical Engineering, School of Chemical Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Guzhong Chen
  - State Key Laboratory of Chemical Engineering, School of Chemical Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Zhiwen Qi
  - State Key Laboratory of Chemical Engineering, School of Chemical Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
10. Zou Y, Wang R, Du M, Wang X, Xu D. Identifying Protein-Ligand Interactions via a Novel Distance Self-Feedback Biomolecular Interaction Network. J Phys Chem B 2023; 127:899-911. [PMID: 36657025 DOI: 10.1021/acs.jpcb.2c07592] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 01/20/2023]
Abstract
Efficient and accurate characterizations of protein-ligand interactions are key to understanding biology at the molecular level. They are particularly useful in pharmaceutical industry applications. However, the widely applied dynamics-based methods for identifying important residues or calculating ligand binding free energy are usually computationally demanding. In this work, we proposed a graph deep learning (DL) framework, namely, the distance self-feedback biomolecular interaction network (DSBIN), in which the relationship between the complex structure and binding affinity can be established by means of a carefully designed distance self-feedback module and interaction layer. Our model can directly provide a quantitative evaluation of inhibitor binding affinities (pKd). More importantly, the DSBIN model efficiently identifies key interactions for inhibitor binding and is thus intrinsically interpretable. Its generalization performance was further verified using 1405 unseen structures. The deviations of the predicted binding free energies were calculated to be less than 1.37 kcal/mol for more than 55% of the structures. Moreover, we also compared the DSBIN model with MM/GBSA, a commonly used theoretical method for calculating the substrate binding free energy. Our results show that the current DL model has generally better performance in predicting the binding free energy. For a specific complex system, mannopentaose/TmCBM27, the DSBIN-predicted binding free energy is -8.21 kcal/mol, which is very close to the experimentally measured value of -7.76 kcal/mol and the MM/GBSA-calculated value of -7.16 kcal/mol. Meanwhile, all important aromatic residues around the binding pocket can be identified by our DL model. Considering the accuracy and efficiency of the newly developed DL model, it may be very helpful in the field of drug design and molecular recognition.
Affiliation(s)
- Yurong Zou
  - MOE Key Laboratory of Green Chemistry and Technology, College of Chemistry, Sichuan University, Chengdu, Sichuan 610064, PR China
- Ruihan Wang
  - MOE Key Laboratory of Green Chemistry and Technology, College of Chemistry, Sichuan University, Chengdu, Sichuan 610064, PR China
- Meng Du
  - MOE Key Laboratory of Green Chemistry and Technology, College of Chemistry, Sichuan University, Chengdu, Sichuan 610064, PR China
- Xin Wang
  - MOE Key Laboratory of Green Chemistry and Technology, College of Chemistry, Sichuan University, Chengdu, Sichuan 610064, PR China
- Dingguo Xu
  - MOE Key Laboratory of Green Chemistry and Technology, College of Chemistry, Sichuan University, Chengdu, Sichuan 610064, PR China
  - Research Center for Materials Genome Engineering, Sichuan University, Chengdu, Sichuan 610065, PR China
11. Low K, Coote ML, Izgorodina EI. Explainable Solvation Free Energy Prediction Combining Graph Neural Networks with Chemical Intuition. J Chem Inf Model 2022; 62:5457-5470. [PMID: 36317829 DOI: 10.1021/acs.jcim.2c01013] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/29/2022]
Abstract
The prediction of a molecule's solvation Gibbs free energy (ΔGsolv) in a given solvent is an important task which has traditionally been carried out via quantum chemical continuum methods or force field-based molecular simulations. Machine learning (ML) and graph neural networks in particular have emerged as powerful techniques for elucidating structure-property relationships. This work presents a graph neural network (GNN) for the prediction of ΔGsolv which, in addition to encoding typical atom and bond-level features, incorporates chemically intuitive, solvation-relevant parameters into the featurization process: semiempirical partial atomic charges and solvent dielectric constant. Solute-solvent interactions are included via an interaction map layer which can be visualized to examine solubility-enhancing or -decreasing interactions learnt by the model. On a test set of small organic molecules, our GNN predicts ΔGsolv in water and cyclohexane with an accuracy comparable to polarizable and ab initio generated force field methods [mean absolute error (MAE) = 0.4 and 0.2 kcal mol⁻¹, respectively], without the need for any molecular simulation. For the FreeSolv data set of hydration free energies, the test MAE is 0.7 kcal mol⁻¹. Interpretability and applicability of the model are highlighted through several examples, including rationalizing the increased solubility of modified diaminoanthraquinones in organic solvents. The clear explanations afforded by our GNN allow for easy understanding of the model's predictions, giving the experimental chemist confidence in employing ML models toward more optimized synthetic routes.
Affiliation(s)
- Kaycee Low
  - Monash Computational Chemistry Group, School of Chemistry, Monash University, Clayton, Victoria 3800, Australia
- Michelle L Coote
  - Institute for Nanoscale Science and Technology, College of Science and Engineering, Flinders University, Bedford Park, South Australia 5042, Australia
- Ekaterina I Izgorodina
  - Monash Computational Chemistry Group, School of Chemistry, Monash University, Clayton, Victoria 3800, Australia
12. Vermeire FH, Chung Y, Green WH. Predicting Solubility Limits of Organic Solutes for a Wide Range of Solvents and Temperatures. J Am Chem Soc 2022; 144:10785-10797. [PMID: 35687887 DOI: 10.1021/jacs.2c01768] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Indexed: 12/18/2022]
Abstract
The solubility of organic molecules is crucial in organic synthesis and industrial chemistry; it is important in the design of many phase separation and purification units, and it controls the migration of many species into the environment. To decide which solvents and temperatures can be used in the design of new processes, trial and error is often used, as the choice is restricted by unknown solid solubility limits. Here, we present a fast and convenient computational method for estimating the solubility of solid neutral organic molecules in water and many organic solvents for a broad range of temperatures. The model is developed by combining fundamental thermodynamic equations with machine learning models for solvation free energy, solvation enthalpy, Abraham solute parameters, and aqueous solid solubility at 298 K. We provide free open-source and online tools for the prediction of solid solubility limits and a curated data collection (SolProp) that includes more than 5000 experimental solid solubility values for validation of the model. The model predictions are accurate for aqueous systems and for a huge range of organic solvents up to 550 K or higher. Methods to further improve solid solubility predictions by providing experimental data on the solute of interest in another solvent, or on the solute's sublimation enthalpy, are also presented.
Affiliation(s)
- Florence H Vermeire
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Yunsie Chung
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- William H Green
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
13. Chung Y, Vermeire FH, Wu H, Walker PJ, Abraham MH, Green WH. Group Contribution and Machine Learning Approaches to Predict Abraham Solute Parameters, Solvation Free Energy, and Solvation Enthalpy. J Chem Inf Model 2022; 62:433-446. [PMID: 35044781 DOI: 10.1021/acs.jcim.1c01103] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Indexed: 01/16/2023]
Abstract
We present a group contribution method (SoluteGC) and a machine learning model (SoluteML) to predict the Abraham solute parameters, as well as a machine learning model (DirectML) to predict solvation free energy and enthalpy at 298 K. The proposed group contribution method uses atom-centered functional groups with corrections for ring and polycyclic strain while the machine learning models adopt a directed message passing neural network. The solute parameters predicted from SoluteGC and SoluteML are used to calculate solvation energy and enthalpy via linear free energy relationships. Extensive data sets containing 8366 solute parameters, 20,253 solvation free energies, and 6322 solvation enthalpies are compiled in this work to train the models. The three models are each evaluated on the same test sets using both random and substructure-based solute splits for solvation energy and enthalpy predictions. The results show that the DirectML model is superior to the SoluteML and SoluteGC models for both predictions and can provide accuracy comparable to that of advanced quantum chemistry methods. Yet, even though the DirectML model performs better in general, all three models are useful for various purposes. Uncertain predicted values can be identified by comparing the three models, and when the 3 models are combined together, they can provide even more accurate predictions than any one of them individually. Finally, we present our compiled solute parameter, solvation energy, and solvation enthalpy databases (SoluteDB, dGsolvDBx, dHsolvDB) and provide public access to our final prediction models through a simple web-based tool, software packages, and source code.
Affiliation(s)
- Yunsie Chung
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Florence H Vermeire
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Haoyang Wu
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Pierre J Walker
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
  - Department of Chemical Engineering, Imperial College London, London SW7 2AZ, United Kingdom
- Michael H Abraham
  - Department of Chemistry, University College London, 20 Gordon Street, London WC1H 0AJ, United Kingdom
- William H Green
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
|
14
|
Sridharan B, Goel M, Priyakumar UD. Modern Machine Learning for Tackling Inverse Problems in Chemistry: Molecular Design to Realization. Chem Commun (Camb) 2022; 58:5316-5331. [DOI: 10.1039/d1cc07035e] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/21/2022]
Abstract
The discovery of new molecules and materials helps expand the horizons of novel and innovative real-life applications. In the pursuit of finding molecules with desired properties, chemists have traditionally relied...
|
15
|
Goel M, Raghunathan S, Laghuvarapu S, Priyakumar UD. MoleGuLAR: Molecule Generation Using Reinforcement Learning with Alternating Rewards. J Chem Inf Model 2021; 61:5815-5826. [PMID: 34866384 DOI: 10.1021/acs.jcim.1c01341] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Indexed: 12/14/2022]
Abstract
The design of new inhibitors for novel targets is a very important problem, especially in the current scenario with the world being plagued by COVID-19. Conventional approaches such as high-throughput virtual screening require extensive combing through existing data sets in the hope of finding possible matches. In this study, we propose a computational strategy for de novo generation of molecules with high binding affinities to the specified target and other desirable properties for druglike molecules using reinforcement learning. A deep generative model built using a stack-augmented recurrent neural network, initially trained to generate druglike molecules, is optimized using reinforcement learning to generate molecules with desirable properties such as LogP, quantitative estimate of drug-likeness, topological polar surface area, and hydration free energy, along with the binding affinity. For multiobjective optimization, we have devised a novel strategy in which the property being used to calculate the reward is changed periodically. In comparison to the conventional approach of taking a weighted sum of all rewards, this strategy shows an enhanced ability to generate a significantly higher number of molecules with desirable properties.
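The contrast between a periodically switched reward and a weighted sum, as described in this abstract, can be sketched as below. This is only an illustration of the scheduling idea: the function names, the `period` parameter, and the toy property functions (which merely count characters in a SMILES string) are assumptions for the example, not the MoleGuLAR implementation or real property predictors.

```python
from typing import Callable, Sequence

RewardFn = Callable[[str], float]


def alternating_reward(step: int, molecule: str,
                       reward_fns: Sequence[RewardFn], period: int) -> float:
    """Score the molecule with one property only; the active property
    changes every `period` training steps and cycles through the list."""
    active = (step // period) % len(reward_fns)
    return reward_fns[active](molecule)


def weighted_sum_reward(molecule: str, reward_fns: Sequence[RewardFn],
                        weights: Sequence[float]) -> float:
    """Conventional baseline: a fixed weighted sum of all property rewards."""
    return sum(w * fn(molecule) for fn, w in zip(reward_fns, weights))


# Toy stand-ins for property rewards (hypothetical, for illustration only).
def toy_carbon_reward(smiles: str) -> float:
    return 0.5 * smiles.count("C")


def toy_oxygen_reward(smiles: str) -> float:
    return 20.0 * smiles.count("O")


rewards = [toy_carbon_reward, toy_oxygen_reward]

# With period=2, steps 0-1 use the first reward, steps 2-3 the second, and so on.
schedule = [alternating_reward(step, "CCO", rewards, period=2) for step in range(6)]
baseline = weighted_sum_reward("CCO", rewards, weights=[0.5, 0.5])
```

In the alternating scheme the generator sees one objective at a time, so no relative weights between properties need to be chosen; the switching period becomes the tunable quantity instead.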
Affiliation(s)
- Manan Goel
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad 500 032, India
- Shampa Raghunathan
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad 500 032, India
  - École Centrale School of Engineering, Mahindra University, Hyderabad 500 043, India
- Siddhartha Laghuvarapu
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad 500 032, India
- U Deva Priyakumar
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad 500 032, India
|
16
|
Karthikeyan A, Priyakumar UD. Artificial intelligence: machine learning for chemical sciences. J Chem Sci 2021; 134:2. [PMID: 34955617 PMCID: PMC8691161 DOI: 10.1007/s12039-021-01995-2] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 07/09/2021] [Revised: 09/08/2021] [Accepted: 09/14/2021] [Indexed: 12/05/2022]
Abstract
Research in molecular sciences witnessed the rise and fall of Artificial Intelligence (AI)/Machine Learning (ML) methods, especially artificial neural networks, a few decades ago. However, we have seen a major resurgence in the use of modern ML methods in scientific research during the last few years. These methods have had phenomenal success in the areas of computer vision, speech recognition, natural language processing (NLP), etc. This has inspired chemists and biologists to apply these algorithms to problems in natural sciences. The availability of high-performance Graphics Processing Unit (GPU) accelerators, large datasets, new algorithms, and libraries has enabled this surge. ML algorithms have been applied successfully to various domains in molecular sciences, providing much faster and sometimes more accurate solutions than traditional methods such as Quantum Mechanical (QM) calculations, Density Functional Theory (DFT), or Molecular Mechanics (MM) based methods. Some of the areas where ML methods have been shown to be effective are drug design, prediction of high-level quantum mechanical energies, molecular design, molecular dynamics materials, and retrosynthesis of organic compounds. This article intends to conceptually introduce various modern ML methods and their relevance and applications in computational natural sciences. Synopsis: The recent surge in the application of machine learning (ML) methods in fundamental sciences has led to a perspective that these methods may become important tools in chemical science. This perspective provides an overview of the modern ML methods and their successful applications in chemistry during the last few years.
Affiliation(s)
- Akshaya Karthikeyan
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500 032 India
- U Deva Priyakumar
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, 500 032 India
|
17
|
Sharma S, Arya A, Cruz R, Cleaves II HJ. Automated Exploration of Prebiotic Chemical Reaction Space: Progress and Perspectives. Life (Basel) 2021; 11:1140. [PMID: 34833016 PMCID: PMC8624352 DOI: 10.3390/life11111140] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/07/2021] [Revised: 10/15/2021] [Accepted: 10/18/2021] [Indexed: 12/12/2022] Open
Abstract
Prebiotic chemistry often involves the study of complex systems of chemical reactions that form large networks with a large number of diverse species. Such complex systems may have given rise to emergent phenomena that ultimately led to the origin of life on Earth. The environmental conditions and processes involved in this emergence may not be fully recapitulable, making it difficult for experimentalists to study prebiotic systems in laboratory simulations. Computational chemistry offers efficient ways to study such chemical systems and identify the ones most likely to display complex properties associated with life. Here, we review tools and techniques for modelling prebiotic chemical reaction networks and outline possible ways to identify self-replicating features that are central to many origin-of-life models.
Affiliation(s)
- Siddhant Sharma
  - Blue Marble Space Institute of Science, Seattle, WA 98154, USA
  - Department of Biochemistry, Deshbandhu College, University of Delhi, New Delhi 110019, India
  - Department of Chemistry and Chemical Engineering, Chalmers University of Technology, SE-412 96 Gothenburg, Sweden
- Aayush Arya
  - Blue Marble Space Institute of Science, Seattle, WA 98154, USA
  - Department of Physics, Lovely Professional University, Jalandhar-Delhi GT Road, Phagwara 144001, India
- Romulo Cruz
  - Blue Marble Space Institute of Science, Seattle, WA 98154, USA
  - Big Data Laboratory, Information and Communications Technology Center (CTIC), National University of Engineering, Amaru 210, Lima 15333, Peru
- Henderson James Cleaves II
  - Blue Marble Space Institute of Science, Seattle, WA 98154, USA
  - Earth-Life Science Institute, Tokyo Institute of Technology, Tokyo 152-8550, Japan
|
18
|
Samaga YBL, Raghunathan S, Priyakumar UD. SCONES: Self-Consistent Neural Network for Protein Stability Prediction Upon Mutation. J Phys Chem B 2021; 125:10657-10671. [PMID: 34546056 DOI: 10.1021/acs.jpcb.1c04913] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 01/17/2023]
Abstract
Engineering proteins to have desired properties by mutating amino acids at specific sites is commonplace. Such engineered proteins must be stable to function. Experimental methods used to determine stability at throughputs required to scan the protein sequence space thoroughly are laborious. To this end, many machine learning based methods have been developed to predict thermodynamic stability changes upon mutation. These methods have been evaluated for symmetric consistency by testing with hypothetical reverse mutations. In this work, we propose transitive data augmentation, evaluating transitive consistency with our new Stransitive data set, and a new machine learning based method, the first of its kind, that incorporates both symmetric and transitive properties into the architecture. Our method, called SCONES, is an interpretable neural network that predicts small relative protein stability changes for missense mutations that do not significantly alter the structure. It estimates a residue's contributions toward protein stability (ΔG) in its local structural environment, and the difference between independently predicted contributions of the reference and mutant residues is reported as ΔΔG. We show that this self-consistent machine learning architecture is immune to many common biases in data sets, relies less on data than existing methods, is robust to overfitting, and can explain a substantial portion of the variance in experimental data.
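To see why reporting ΔΔG as a difference of independently predicted residue contributions yields symmetric and transitive consistency by construction, consider the minimal sketch below. A toy lookup table stands in for the neural network, the sign convention (mutant minus reference) is an assumption, and all names and numbers are hypothetical; this is not the SCONES architecture itself.

```python
import math
from typing import Callable

# Hypothetical per-residue stability contributions (arbitrary units) in one
# fixed local environment; in a real model a learned network would supply these.
TOY_CONTRIBUTIONS = {"A": -1.2, "G": -0.4, "V": -2.0}


def toy_contribution(environment: str, residue: str) -> float:
    """Stand-in for a model that scores one residue in its local environment.
    The environment argument is ignored in this toy version."""
    return TOY_CONTRIBUTIONS[residue]


def predict_ddg(contribution: Callable[[str, str], float],
                environment: str, reference: str, mutant: str) -> float:
    """ddG as the difference of two independent per-residue predictions."""
    return contribution(environment, mutant) - contribution(environment, reference)


env = "site-42"  # hypothetical label for a local structural environment

forward = predict_ddg(toy_contribution, env, "A", "G")   # A -> G
reverse = predict_ddg(toy_contribution, env, "G", "A")   # G -> A
second = predict_ddg(toy_contribution, env, "G", "V")    # G -> V
direct = predict_ddg(toy_contribution, env, "A", "V")    # A -> V

# Symmetric consistency: the reverse mutation has the opposite sign.
antisymmetric = math.isclose(forward, -reverse)
# Transitive consistency: A -> G followed by G -> V equals A -> V.
transitive = math.isclose(forward + second, direct)
```

Both properties hold for any contribution function, because the intermediate terms cancel in the subtraction; nothing about the toy values is special.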
Affiliation(s)
- Yashas B L Samaga
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad 500 032, India
- Shampa Raghunathan
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad 500 032, India
- U Deva Priyakumar
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad 500 032, India
|
19
|
Aggarwal R, Gupta A, Chelur V, Jawahar CV, Priyakumar UD. DeepPocket: Ligand Binding Site Detection and Segmentation using 3D Convolutional Neural Networks. J Chem Inf Model 2021; 62:5069-5079. [PMID: 34374539 DOI: 10.1021/acs.jcim.1c00799] [Citation(s) in RCA: 48] [Impact Index Per Article: 12.0] [Indexed: 02/06/2023]
Abstract
A structure-based drug design pipeline involves the development of potential drug molecules or ligands that form stable complexes with a given receptor at its binding site. A prerequisite to this is finding druggable and functionally relevant binding sites on the 3D structure of the protein. Although several methods for detecting binding sites have been developed, a majority of them, surprisingly, fail to identify and rank binding sites accurately. The rapid adoption and success of deep learning algorithms in various areas of structural biology invite the use of such algorithms for accurate binding site detection. As a combination of geometry-based software and deep learning, we report a novel framework, DeepPocket, that utilizes 3D convolutional neural networks to rescore pockets identified by Fpocket and to further segment these identified cavities on the protein surface. Apart from this, we also propose another data set, SC6K, containing protein structures submitted to the Protein Data Bank (PDB) from January 1st, 2018, until February 28th, 2020, for ligand binding site (LBS) detection. DeepPocket's results on various binding site data sets and SC6K highlight its better performance over current state-of-the-art methods and good generalization ability over novel structures.
Affiliation(s)
- Rishal Aggarwal
  - International Institute of Information Technology, Hyderabad 500 032, India
- Akash Gupta
  - International Institute of Information Technology, Hyderabad 500 032, India
- Vineeth Chelur
  - International Institute of Information Technology, Hyderabad 500 032, India
- C V Jawahar
  - International Institute of Information Technology, Hyderabad 500 032, India
- U Deva Priyakumar
  - International Institute of Information Technology, Hyderabad 500 032, India
|
20
|
Mulligan VK. Current directions in combining simulation-based macromolecular modeling approaches with deep learning. Expert Opin Drug Discov 2021; 16:1025-1044. [PMID: 33993816 DOI: 10.1080/17460441.2021.1918097] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/14/2022]
Abstract
Introduction: Structure-guided drug discovery relies on accurate computational methods for modeling macromolecules. Simulations provide means of predicting macromolecular folds, of discovering function from structure, and of designing macromolecules to serve as drugs. Success rates are limited for any of these tasks, however. Recently, deep neural network-based methods have greatly enhanced the accuracy of predictions of protein structure from sequence, generating excitement about the potential impact of deep learning. Areas covered: This review introduces biologists to deep neural network architecture, surveys recent successes of deep learning in structure prediction, and discusses emerging deep learning-based approaches for structure-function analysis and design. Particular focus is given to the interplay between simulation-based and neural network-based approaches. Expert opinion: As deep learning grows integral to macromolecular modeling, simulation- and neural network-based approaches must grow more tightly interconnected. Modular software architecture must emerge allowing both types of tools to be combined with maximal versatility. Open sharing of code under permissive licenses will be essential. Although experiments will remain the gold standard for reliable information to guide drug discovery, we may soon see successful drug development projects based on high-accuracy predictions from algorithms that combine simulation with deep learning - the ultimate validation of this combination's power.
|