1. Tolokh IS, Folescu DE, Onufriev AV. Inclusion of Water Multipoles into the Implicit Solvation Framework Leads to Accuracy Gains. J Phys Chem B 2024; 128:5855-5873. [PMID: 38860842 PMCID: PMC11194828 DOI: 10.1021/acs.jpcb.4c00254] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2024] [Revised: 05/28/2024] [Accepted: 05/29/2024] [Indexed: 06/12/2024]
Abstract
The current practical "workhorses" of the atomistic implicit solvation, the Poisson-Boltzmann (PB) and generalized Born (GB) models, face fundamental accuracy limitations. Here, we propose a computationally efficient implicit solvation framework, the Implicit Water Multipole GB (IWM-GB) model, that systematically incorporates the effects of multipole moments of water molecules in the first hydration shell of a solute, beyond the dipole water polarization already present at the PB/GB level. The framework explicitly accounts for coupling between polar and nonpolar contributions to the total solvation energy, which is missing from many implicit solvation models. An implementation of the framework, utilizing the GAFF force field and AM1-BCC atomic partial charges model, is parametrized and tested against the experimental hydration free energies of small molecules from the FreeSolv database. The resulting accuracy on the test set (RMSE ∼ 0.9 kcal/mol) is 12% better than that of the explicit solvation (TIP3P) treatment, which is orders of magnitude slower. We also find that the coupling between polar and nonpolar parts of the solvation free energy is essential to ensuring that several features of the IWM-GB model are physically meaningful, including the sign of the nonpolar contributions.
Affiliation(s)
- Igor S. Tolokh
  - Department of Computer Science, Virginia Tech, Blacksburg, Virginia 24061, United States
- Dan E. Folescu
  - Department of Computer Science, Virginia Tech, Blacksburg, Virginia 24061, United States
  - Department of Mathematics, Virginia Tech, Blacksburg, Virginia 24061, United States
- Alexey V. Onufriev
  - Department of Computer Science, Virginia Tech, Blacksburg, Virginia 24061, United States
  - Department of Physics, Virginia Tech, Blacksburg, Virginia 24061, United States
  - Center for Soft Matter and Biological Physics, Virginia Tech, Blacksburg, Virginia 24061, United States
2. Ferraz-Caetano J, Teixeira F, Cordeiro MNDS. Explainable Supervised Machine Learning Model To Predict Solvation Gibbs Energy. J Chem Inf Model 2024; 64:2250-2262. [PMID: 37603608 PMCID: PMC11005042 DOI: 10.1021/acs.jcim.3c00544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2023] [Indexed: 08/23/2023]
Abstract
Many challenges persist in developing accurate computational models for predicting solvation free energy (ΔGsol). Despite recent developments in Machine Learning (ML) methodologies that outperformed traditional quantum mechanical models, several issues remain concerning explanatory insights for broad chemical predictions with an acceptable speed-accuracy trade-off. To overcome this, we present a novel supervised ML model to predict the ΔGsol for an array of solvent-solute pairs. Using two different ensemble regressor algorithms, we made fast and accurate property predictions using open-source chemical features, encoding complex electronic, structural, and surface area descriptors for every solvent and solute. By integrating molecular properties and chemical interaction features, we have analyzed individual descriptor importance and optimized our model through explanatory information from feature groups. On aqueous and organic solvent databases, ML models revealed the predictive relevance of solutes with increasing polar surface area and decreasing polarizability, yielding better results than state-of-the-art benchmark Neural Network methods (without complex quantum mechanical or molecular dynamics simulations). Both algorithms successfully outperformed previous ΔGsol prediction methods, with a maximum absolute error of 0.22 ± 0.02 kcal mol⁻¹, further validated in an external benchmark database and with solvent hold-out tests. These explanatory and statistical insights allow a thoughtful application of this method for predicting other thermodynamic properties, stressing the relevance of ML modeling for further complex computational chemistry problems.
Affiliation(s)
- José Ferraz-Caetano
  - Department of Chemistry and Biochemistry, Faculty of Sciences, University of Porto, Rua do Campo Alegre, S/N, 4169-007 Porto, Portugal
- Filipe Teixeira
  - Centre of Chemistry, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal
- M. Natália D. S. Cordeiro
  - Department of Chemistry and Biochemistry, Faculty of Sciences, University of Porto, Rua do Campo Alegre, S/N, 4169-007 Porto, Portugal
3. Bass L, Elder LH, Folescu DE, Forouzesh N, Tolokh IS, Karpatne A, Onufriev AV. Improving the Accuracy of Physics-Based Hydration-Free Energy Predictions by Machine Learning the Remaining Error Relative to the Experiment. J Chem Theory Comput 2024; 20:396-410. [PMID: 38149593 PMCID: PMC10950260 DOI: 10.1021/acs.jctc.3c00981] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/28/2023]
Abstract
The accuracy of computational models of water is key to atomistic simulations of biomolecules. We propose a computationally efficient way to improve the accuracy of the prediction of hydration-free energies (HFEs) of small molecules: the remaining errors of the physics-based models relative to the experiment are predicted and mitigated by machine learning (ML) as a postprocessing step. Specifically, the trained graph convolutional neural network attempts to identify the "blind spots" in the physics-based model predictions, where the complex physics of aqueous solvation is poorly accounted for, and partially corrects for them. The strategy is explored for five classical solvent models representing various accuracy/speed trade-offs, from the fast analytical generalized Born (GB) to the popular TIP3P explicit solvent model; experimental HFEs of small neutral molecules from the FreeSolv set are used for the training and testing. For all of the models, the ML correction reduces the resulting root-mean-square error relative to the experiment for HFEs of small molecules, without significant overfitting and with negligible computational overhead. For example, on the test set, the relative accuracy improvement is 47% for the fast analytical GB, making it, after the ML correction, almost as accurate as uncorrected TIP3P. For the TIP3P model, the accuracy improvement is about 39%, bringing the ML-corrected model's accuracy below the 1 kcal/mol threshold. In general, the relative benefit of the ML corrections is smaller for more accurate physics-based models, reaching the lower limit of about 20% relative accuracy gain compared with that of the physics-based treatment alone. The proposed strategy of using ML to learn the remaining error of physics-based models offers a distinct advantage over training ML alone directly on reference HFEs: it preserves the correct overall trend, even well outside of the training set.
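The error-correction strategy described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a closed-form ridge regression stands in for the graph convolutional neural network, and the descriptors, "experimental" values, and physics-based predictions are all synthetic data invented for illustration. The only point is the structure of the workflow: learn the residual (experiment minus physics-based prediction) on a training split, then add the predicted residual back as a postprocessing step.

```python
import numpy as np

def rmse(pred, ref):
    """Root-mean-square error between two arrays."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(ref)) ** 2)))

def fit_ridge(X, y, alpha=1e-3):
    """Closed-form ridge regression with a bias column (stand-in for the GCNN)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def predict_ridge(w, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

# Synthetic data (assumption): 200 "molecules" with 4 hypothetical descriptors.
rng = np.random.default_rng(0)
n, d = 200, 4
X = rng.normal(size=(n, d))
hfe_exp = rng.normal(-4.0, 3.0, size=n)            # pretend experimental HFEs, kcal/mol
systematic = X @ np.array([0.8, -0.5, 0.3, 0.0]) + 0.4
hfe_phys = hfe_exp + systematic + rng.normal(0.0, 0.2, size=n)  # pretend physics model

train, test = slice(0, 150), slice(150, n)

# Learn the remaining error of the physics-based model on the training split.
residual_train = hfe_exp[train] - hfe_phys[train]
w = fit_ridge(X[train], residual_train)

# Postprocessing step: physics-based prediction plus the predicted residual.
hfe_corrected = hfe_phys[test] + predict_ridge(w, X[test])

rmse_phys = rmse(hfe_phys[test], hfe_exp[test])
rmse_corr = rmse(hfe_corrected, hfe_exp[test])
print(f"uncorrected RMSE: {rmse_phys:.2f}, corrected RMSE: {rmse_corr:.2f}")
```

Because the correction is added to the physics-based value rather than replacing it, a molecule for which the learned residual is near zero simply keeps the physics-based prediction, which is one way to read the abstract's remark about preserving the overall trend outside the training set.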
Affiliation(s)
- Lewis Bass
  - Department of Computer Engineering, Virginia Tech, Blacksburg, Virginia 24061, United States
- Luke H Elder
  - Department of Computer Science, Virginia Tech, Blacksburg, Virginia 24061, United States
- Dan E Folescu
  - Department of Computer Science, Virginia Tech, Blacksburg, Virginia 24061, United States
  - Department of Mathematics, Virginia Tech, Blacksburg, Virginia 24061, United States
- Negin Forouzesh
  - Department of Computer Science, California State University, Los Angeles, California 90032, United States
- Igor S Tolokh
  - Department of Computer Science, Virginia Tech, Blacksburg, Virginia 24061, United States
- Anuj Karpatne
  - Department of Computer Science, Virginia Tech, Blacksburg, Virginia 24061, United States
- Alexey V Onufriev
  - Department of Computer Science, Virginia Tech, Blacksburg, Virginia 24061, United States
  - Department of Physics, Virginia Tech, Blacksburg, Virginia 24061, United States
  - Center for Soft Matter and Biological Physics, Virginia Tech, Blacksburg, Virginia 24061, United States
4. Liao M, Wu F, Yu X, Zhao L, Wu H, Zhou J. Random Forest Algorithm-Based Prediction of Solvation Gibbs Energies. J Solution Chem 2023. [DOI: 10.1007/s10953-023-01247-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/29/2023]
5. Zhang ZY, Peng D, Liu L, Shen L, Fang WH. Machine Learning Prediction of Hydration Free Energy with Physically Inspired Descriptors. J Phys Chem Lett 2023; 14:1877-1884. [PMID: 36779933 DOI: 10.1021/acs.jpclett.2c03858] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2023]
Abstract
We present machine learning models for predicting experimental hydration free energies of molecules without any atom-, bond-, or geometry-specific input feature. Four types of physically inspired descriptors are adopted for predictions. The first type is composed of the total dipole moment, anisotropic polarizability, and vibrational analysis results of the solute molecule. The second and third types are derived from the electrostatic potential distribution of the solute. The last type includes the solvent accessible surface area and shape similarities. Several machine learning regression models are built on the basis of the FreeSolv database with ∼600 samples, showing a better performance in comparison with that of most traditional approaches and other prediction methods based on molecular fingerprints. In particular, the present descriptors are capable of predicting hydration free energies of new compounds with elements or fragments that are never seen in the training set. The importance of these descriptors, the impact of dissociation energies of specific covalent bonds, and the outliers with relatively large prediction errors are also discussed.
Affiliation(s)
- Zhan-Yun Zhang
  - Key Laboratory of Theoretical and Computational Photochemistry of Ministry of Education, College of Chemistry, Beijing Normal University, Beijing 100875, P. R. China
- Ding Peng
  - Key Laboratory of Theoretical and Computational Photochemistry of Ministry of Education, College of Chemistry, Beijing Normal University, Beijing 100875, P. R. China
- Lihong Liu
  - Key Laboratory of Theoretical and Computational Photochemistry of Ministry of Education, College of Chemistry, Beijing Normal University, Beijing 100875, P. R. China
- Lin Shen
  - Key Laboratory of Theoretical and Computational Photochemistry of Ministry of Education, College of Chemistry, Beijing Normal University, Beijing 100875, P. R. China
  - Yantai-Jingshi Institute of Material Genome Engineering, Yantai 265505, Shandong, P. R. China
- Wei-Hai Fang
  - Key Laboratory of Theoretical and Computational Photochemistry of Ministry of Education, College of Chemistry, Beijing Normal University, Beijing 100875, P. R. China
  - Shandong Laboratory of Yantai Advanced Materials and Green Manufacturing, Yantai 264006, Shandong, P. R. China
6. Low K, Coote ML, Izgorodina EI. Explainable Solvation Free Energy Prediction Combining Graph Neural Networks with Chemical Intuition. J Chem Inf Model 2022; 62:5457-5470. [PMID: 36317829 DOI: 10.1021/acs.jcim.2c01013] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 11/29/2022]
Abstract
The prediction of a molecule's solvation Gibbs free energy (ΔGsolv) in a given solvent is an important task which has traditionally been carried out via quantum chemical continuum methods or force field-based molecular simulations. Machine learning (ML) and graph neural networks in particular have emerged as powerful techniques for elucidating structure-property relationships. This work presents a graph neural network (GNN) for the prediction of ΔGsolv which, in addition to encoding typical atom and bond-level features, incorporates chemically intuitive, solvation-relevant parameters into the featurization process: semiempirical partial atomic charges and solvent dielectric constant. Solute-solvent interactions are included via an interaction map layer which can be visualized to examine solubility-enhancing or -decreasing interactions learnt by the model. On a test set of small organic molecules, our GNN predicts ΔGsolv in water and cyclohexane with an accuracy comparable to polarizable and ab initio generated force field methods [mean absolute error (MAE) = 0.4 and 0.2 kcal mol⁻¹, respectively], without the need for any molecular simulation. For the FreeSolv data set of hydration free energies, the test MAE is 0.7 kcal mol⁻¹. Interpretability and applicability of the model are highlighted through several examples including rationalizing the increased solubility of modified diaminoanthraquinones in organic solvents. The clear explanations afforded by our GNN allow for easy understanding of the model's predictions, giving the experimental chemist confidence in employing ML models toward more optimized synthetic routes.
Affiliation(s)
- Kaycee Low
  - Monash Computational Chemistry Group, School of Chemistry, Monash University, Clayton, Victoria 3800, Australia
- Michelle L Coote
  - Institute for Nanoscale Science and Technology, College of Science and Engineering, Flinders University, Bedford Park, South Australia 5042, Australia
- Ekaterina I Izgorodina
  - Monash Computational Chemistry Group, School of Chemistry, Monash University, Clayton, Victoria 3800, Australia
7. Alibakhshi A, Hartke B. Implicitly perturbed Hamiltonian as a class of versatile and general-purpose molecular representations for machine learning. Nat Commun 2022; 13:1245. [PMID: 35273170 PMCID: PMC8913769 DOI: 10.1038/s41467-022-28912-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2021] [Accepted: 02/01/2022] [Indexed: 11/28/2022]
Abstract
Unraveling challenging problems by machine learning has recently become a hot topic in many scientific disciplines. For developing rigorous machine-learning models to study problems of interest in molecular sciences, translating molecular structures to quantitative representations as suitable machine-learning inputs plays a central role. Many existing molecular representations, including the state-of-the-art ones, are efficient in studying numerous molecular features but remain suboptimal in many challenging cases, as discussed in the context of the present research. The main aim of the present study is to introduce the Implicitly Perturbed Hamiltonian (ImPerHam) as a class of versatile representations for more efficient machine learning of challenging problems in molecular sciences. ImPerHam representations are defined as energy attributes of the molecular Hamiltonian, implicitly perturbed by a number of hypothetic or real arbitrary solvents based on continuum solvation models. We demonstrate the outstanding performance of machine-learning models based on ImPerHam representations for three diverse and challenging cases: predicting inhibition of the CYP450 enzyme, high-precision and transferable evaluation of the non-covalent interaction energy of molecular systems, and accurately reproducing solvation free energies for large benchmark sets. Molecular representations are fundamental tools for machine-learning models. The current work introduces a new set of molecular representations demonstrated to enable accurate predictions of molecular conformational energy and solvation free energy.
Affiliation(s)
- Amin Alibakhshi
  - Theoretical Chemistry, Institute for Physical Chemistry, Christian-Albrechts-University, Olshausenstr. 40, Kiel, Germany
- Bernd Hartke
  - Theoretical Chemistry, Institute for Physical Chemistry, Christian-Albrechts-University, Olshausenstr. 40, Kiel, Germany
8. Gao P, Yang X, Tang YH, Zheng M, Andersen A, Murugesan V, Hollas A, Wang W. Graphical Gaussian process regression model for aqueous solvation free energy prediction of organic molecules in redox flow batteries. Phys Chem Chem Phys 2021; 23:24892-24904. [PMID: 34724700 DOI: 10.1039/d1cp04475c] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/21/2022]
Abstract
The solvation free energy of organic molecules is a critical parameter in determining emergent properties such as solubility, liquid-phase equilibrium constants, pKa and redox potentials in an organic redox flow battery. In this work, we present a machine learning (ML) model that can learn and predict the aqueous solvation free energy of an organic molecule using the Gaussian process regression method based on a new molecular graph kernel. To investigate the performance of the ML model for the electrostatic interaction, the nonpolar interaction contribution of the solvent, and the conformational entropy of the solute in the solvation free energy, we test three data sets with implicit or explicit water solvent models and with the contribution of the conformational entropy of the solute. We demonstrate that our ML model can predict the solvation free energy of molecules at chemical accuracy with a mean absolute error of less than 1 kcal mol⁻¹ for subsets of the QM9 dataset and the FreeSolv database. To solve the general data scarcity problem for a graph-based ML model, we propose a dimension reduction algorithm based on the distance between molecular graphs, which can be used to examine the diversity of the molecular data set. It provides a promising way to build a minimum training set to improve prediction for certain test sets where the space of molecular structures is predetermined.
Affiliation(s)
- Peiyuan Gao
  - Pacific Northwest National Laboratory, Richland 99352, USA
- Xiu Yang
  - Department of Industrial and Systems Engineering, Lehigh University, Bethlehem, PA 18015, USA
- Yu-Hang Tang
  - Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA
- Muqing Zheng
  - Department of Industrial and Systems Engineering, Lehigh University, Bethlehem, PA 18015, USA
- Amity Andersen
  - Pacific Northwest National Laboratory, Richland 99352, USA
- Aaron Hollas
  - Pacific Northwest National Laboratory, Richland 99352, USA
- Wei Wang
  - Pacific Northwest National Laboratory, Richland 99352, USA
9. Giannakoulias S, Shringari SR, Ferrie JJ, Petersson EJ. Biomolecular simulation based machine learning models accurately predict sites of tolerability to the unnatural amino acid acridonylalanine. Sci Rep 2021; 11:18406. [PMID: 34526629 PMCID: PMC8443755 DOI: 10.1038/s41598-021-97965-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/06/2021] [Accepted: 08/17/2021] [Indexed: 11/08/2022]
Abstract
The incorporation of unnatural amino acids (Uaas) has provided an avenue for novel chemistries to be explored in biological systems. However, the successful application of Uaas is often hampered by site-specific impacts on protein yield and solubility. Although previous efforts to identify features which accurately capture these site-specific effects have been unsuccessful, we have developed a set of novel Rosetta Custom Score Functions and alternative Empirical Score Functions that accurately predict the effects of acridon-2-yl-alanine (Acd) incorporation on protein yield and solubility. Acd-containing mutants were simulated in PyRosetta, and machine learning (ML) was performed using either the decomposed values of the Rosetta energy function, or changes in residue contacts and bioinformatics. Using these feature sets, which represent Rosetta score function specific and bioinformatics-derived terms, ML models were trained to predict highly abstract experimental parameters such as mutant protein yield and solubility and displayed robust performance on well-balanced holdouts. Model feature importance analyses demonstrated that terms corresponding to hydrophobic interactions, desolvation, and amino acid angle preferences played a pivotal role in predicting tolerance of mutation to Acd. Overall, this work provides evidence that the application of ML to features extracted from simulated structural models allows for the accurate prediction of diverse and abstract biological phenomena, beyond the predictivity of traditional modeling and simulation approaches.
Affiliation(s)
- Sam Giannakoulias
  - Department of Chemistry, University of Pennsylvania, 231 S. 34th St, Philadelphia, PA, 19104, USA
- Sumant R Shringari
  - Department of Chemistry, University of Pennsylvania, 231 S. 34th St, Philadelphia, PA, 19104, USA
- John J Ferrie
  - Department of Molecular & Cell Biology, University of California, Berkeley, 475B Li Ka Shing Center, Berkeley, CA, 94720, USA
- E James Petersson
  - Department of Chemistry, University of Pennsylvania, 231 S. 34th St, Philadelphia, PA, 19104, USA
10. Çaylak O, Baumeier B. Machine Learning of Quasiparticle Energies in Molecules and Clusters. J Chem Theory Comput 2021; 17:4891-4900. [PMID: 34314186 PMCID: PMC8359011 DOI: 10.1021/acs.jctc.1c00520] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/25/2021] [Indexed: 11/30/2022]
Abstract
We present a Δ-machine learning approach for the prediction of GW quasiparticle energies (ΔMLQP) and photoelectron spectra of molecules and clusters, using orbital-sensitive representations (OSRs) based on molecular Cartesian coordinates in kernel ridge regression-based supervised learning. Coulomb matrix, bag-of-bond, and bond-angle-torsion representations are made orbital-sensitive by augmenting them with atom-centered orbital charges and Kohn-Sham orbital energies, both of which are readily available from baseline calculations at the level of density functional theory (DFT). We first illustrate the effects of different constructions of the OSRs on the prediction of frontier orbital energies of 22k molecules of the QM8 data set and show that it is possible to predict the full photoelectron spectrum of molecules within the data set using a single model with a mean absolute error below 0.1 eV. We further demonstrate that the OSR-based ΔMLQP captures the effects of intra- and intermolecular conformations in application to water monomers and dimers. Finally, we show that the approach can be embedded in multiscale simulation workflows, by studying the solvatochromic shifts of quasiparticle and electron-hole excitation energies of solvated acetone in a setup combining molecular dynamics, DFT, the GW approximation, and the Bethe-Salpeter equation. Our findings suggest that the ΔMLQP model allows us to predict quasiparticle energies and photoelectron spectra of molecules and clusters with GW accuracy at DFT cost.
Affiliation(s)
- Onur Çaylak
  - Department of Mathematics and Computer Science, Eindhoven University of Technology, P.O. Box 513, 5600MB Eindhoven, The Netherlands
  - Institute for Complex Molecular Systems, Eindhoven University of Technology, P.O. Box 513, 5600MB Eindhoven, The Netherlands
11. Lim H, Jung Y. MLSolvA: solvation free energy prediction from pairwise atomistic interactions by machine learning. J Cheminform 2021; 13:56. [PMID: 34332634 PMCID: PMC8325294 DOI: 10.1186/s13321-021-00533-z] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 02/04/2021] [Accepted: 07/15/2021] [Indexed: 01/04/2023]
Abstract
Recent advances in machine learning technologies and their applications have led to the development of diverse structure-property relationship models for crucial chemical properties. The solvation free energy is one of them. Here, we introduce a novel ML-based solvation model, which calculates the solvation energy from pairwise atomistic interactions. The novelty of the proposed model consists of a simple architecture: two encoding functions extract atomic feature vectors from the given chemical structure, while the inner product between the two atomistic feature vectors calculates their interactions. The results of 6239 experimental measurements achieve outstanding performance and transferability for enlarging training data owing to its solvent-non-specific nature. An analysis of the interaction map shows that our model has significant potential for producing group contributions on the solvation energy, which indicates that the model provides not only predictions of target properties but also more detailed physicochemical insights.
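The architecture summarized in this abstract (two encoders producing per-atom feature vectors, with inner products giving pairwise interactions) can be sketched in a few lines. This is only an illustration under stated assumptions: the `encode` function is a hypothetical random linear map with a tanh nonlinearity rather than the trained encoders of the paper, the raw inputs are random numbers, and summing the interaction map into a single energy value is an assumption made here for the sake of the example.

```python
import numpy as np

def encode(atom_inputs, weights):
    """Hypothetical stand-in encoder: raw per-atom inputs -> feature vectors."""
    return np.tanh(atom_inputs @ weights)

rng = np.random.default_rng(0)
n_solute, n_solvent, n_raw, n_feat = 5, 3, 8, 16

# Random placeholder inputs and weights (assumptions, not real chemistry).
solute_raw = rng.normal(size=(n_solute, n_raw))
solvent_raw = rng.normal(size=(n_solvent, n_raw))
w_solute = rng.normal(size=(n_raw, n_feat))
w_solvent = rng.normal(size=(n_raw, n_feat))

P = encode(solute_raw, w_solute)      # one feature vector per solute atom
Q = encode(solvent_raw, w_solvent)    # one feature vector per solvent atom

# Entry (i, j) is the inner product of solute atom i with solvent atom j.
interaction_map = P @ Q.T

# Assumed readout: total energy as the sum over all atom pairs,
# with row sums serving as per-solute-atom contributions.
delta_g = float(interaction_map.sum())
per_solute_atom = interaction_map.sum(axis=1)
print(interaction_map.shape, delta_g)
```

Row sums of the map give one number per solute atom, which is the kind of quantity the abstract refers to when it mentions group contributions to the solvation energy.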
Affiliation(s)
- Hyuntae Lim
  - Department of Chemistry, Seoul National University, Seoul, 08826, South Korea
- YounJoon Jung
  - Department of Chemistry, Seoul National University, Seoul, 08826, South Korea
12. Ward L, Dandu N, Blaiszik B, Narayanan B, Assary RS, Redfern PC, Foster I, Curtiss LA. Graph-Based Approaches for Predicting Solvation Energy in Multiple Solvents: Open Datasets and Machine Learning Models. J Phys Chem A 2021; 125:5990-5998. [PMID: 34191512 DOI: 10.1021/acs.jpca.1c01960] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/28/2022]
Abstract
The solvation properties of molecules, often estimated using quantum chemical simulations, are important in the synthesis of energy storage materials, drugs, and industrial chemicals. Here, we develop machine learning models of solvation energies to replace expensive quantum chemistry calculations with inexpensive-to-compute message-passing neural network models that require only the molecular graph as inputs. Our models are trained on a new database of solvation energies for 130,258 molecules taken from the QM9 dataset computed in five solvents (acetone, ethanol, acetonitrile, dimethyl sulfoxide, and water) via an implicit solvent model. Our best model achieves a mean absolute error of 0.5 kcal/mol for molecules with nine or fewer non-hydrogen atoms and 1 kcal/mol for molecules with between 10 and 14 non-hydrogen atoms. We make the entire dataset of 651,290 computed entries openly available and provide simple web and programmatic interfaces to enable others to run our solvation energy model on new molecules. This model calculates the solvation energies for molecules using only the SMILES string and also provides an estimate of whether each molecule is within the domain of applicability of our model. We envision that the dataset and models will provide the functionality needed for the rapid screening of large chemical spaces to discover improved molecules for many applications.
Affiliation(s)
- Logan Ward
  - Data Science and Learning Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Naveen Dandu
  - Materials Science Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Ben Blaiszik
  - Data Science and Learning Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
  - Globus, University of Chicago, Chicago, Illinois 60637, United States
- Badri Narayanan
  - Materials Science Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
  - Department of Mechanical Engineering, University of Louisville, Louisville, Kentucky 40292, United States
- Rajeev S Assary
  - Materials Science Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Paul C Redfern
  - Materials Science Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
- Ian Foster
  - Data Science and Learning Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
  - Globus, University of Chicago, Chicago, Illinois 60637, United States
  - Department of Computer Science, University of Chicago, Chicago, Illinois 60637, United States
- Larry A Curtiss
  - Materials Science Division, Argonne National Laboratory, Lemont, Illinois 60439, United States
13. Alibakhshi A, Hartke B. Improved prediction of solvation free energies by machine-learning polarizable continuum solvation model. Nat Commun 2021; 12:3584. [PMID: 34145237 PMCID: PMC8213834 DOI: 10.1038/s41467-021-23724-6] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 08/11/2020] [Accepted: 05/12/2021] [Indexed: 11/30/2022]
Abstract
Theoretical estimation of solvation free energy by continuum solvation models, as a standard approach in computational chemistry, is extensively applied by a broad range of scientific disciplines. Nevertheless, the current widely accepted solvation models are either inaccurate in reproducing experimentally determined solvation free energies or require a number of macroscopic observables which are not always readily available. In the present study, we develop and introduce the Machine-Learning Polarizable Continuum solvation Model (ML-PCM) for a substantial improvement of the predictability of solvation free energy. The performance and reliability of the developed models are validated through a rigorous and demanding validation procedure. The ML-PCM models developed in the present study improve the accuracy of widely accepted continuum solvation models by almost one order of magnitude with almost no additional computational costs. A freely available software is developed and provided for a straightforward implementation of the new approach. Accurate theoretical evaluation of solvation free energy is challenging. Here the authors introduce a machine-learning based polarizable continuum solvation approach to improve the accuracy of widely accepted continuum solvation models by almost one order of magnitude without additional computational costs.
Affiliation(s)
- Amin Alibakhshi
  - Theoretical Chemistry, Institute for Physical Chemistry, Christian-Albrechts-University, Olshausenstr. 40, Kiel, Germany
- Bernd Hartke
  - Theoretical Chemistry, Institute for Physical Chemistry, Christian-Albrechts-University, Olshausenstr. 40, Kiel, Germany
14. Zeni C, Rossi K, Glielmo A, de Gironcoli S. Compact atomic descriptors enable accurate predictions via linear models. J Chem Phys 2021; 154:224112. [PMID: 34241204 DOI: 10.1063/5.0052961] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 12/16/2022]
Abstract
We probe the accuracy of linear ridge regression employing a three-body local density representation derived from the atomic cluster expansion. We benchmark the accuracy of this framework in the prediction of formation energies and atomic forces in molecules and solids. We find that such a simple regression framework performs on par with state-of-the-art machine learning methods which are, in most cases, more complex and more computationally demanding. Subsequently, we look for ways to sparsify the descriptor and further improve the computational efficiency of the method. To this aim, we use both principal component analysis and least absolute shrinkage operator regression for energy fitting on six single-element datasets. Both methods highlight the possibility of constructing a descriptor that is four times smaller than the original with a similar or even improved accuracy. Furthermore, we find that the reduced descriptors share a sizable fraction of their features across the six independent datasets, hinting at the possibility of designing material-agnostic, optimally compressed, and accurate descriptors.
Affiliation(s)
- Claudio Zeni
- Physics Area, International School for Advanced Studies, Trieste, Italy
- Kevin Rossi
- Laboratory of Nanochemistry, Institute of Chemistry and Chemical Engineering, Ecole Polytechnique Fédérale de Lausanne, Lausanne, CH, Switzerland
- Aldo Glielmo
- Physics Area, International School for Advanced Studies, Trieste, Italy
15
Abstract
Machine learning (ML) techniques applied to chemical reactions have a long history. The present contribution discusses applications ranging from small molecule reaction dynamics to computational platforms for reaction planning. ML-based techniques can be particularly relevant for problems involving both computation and experiments. For one, Bayesian inference is a powerful approach to develop models consistent with knowledge from experiments. Second, ML-based methods can also be used to handle problems that are formally intractable using conventional approaches, such as exhaustive characterization of state-to-state information in reactive collisions. Finally, the explicit simulation of reactive networks as they occur in combustion has become possible using machine-learned neural network potentials. This review provides an overview of the questions that can and have been addressed using machine learning techniques, and an outlook discusses challenges in this diverse and stimulating field. It is concluded that ML applied to chemistry problems as practiced and conceived today has the potential to transform the way in which the field approaches problems involving chemical reactions, in both research and academic teaching.
Affiliation(s)
- Markus Meuwly
- Department of Chemistry, University of Basel, Klingelbergstrasse 80, 4056 Basel, Switzerland; Department of Chemistry, Brown University, Providence, Rhode Island 02912, United States
16
Ceriotti M, Clementi C, Anatole von Lilienfeld O. Machine learning meets chemical physics. J Chem Phys 2021; 154:160401. [PMID: 33940847 DOI: 10.1063/5.0051418]
Abstract
Over recent years, the use of statistical learning techniques applied to chemical problems has gained substantial momentum. This is particularly apparent in the realm of physical chemistry, where the balance between empiricism and physics-based theory has traditionally been rather in favor of the latter. In this guest Editorial for the special topic issue on "Machine Learning Meets Chemical Physics," a brief rationale is provided, followed by an overview of the topics covered. We conclude by making some general remarks.
Affiliation(s)
- Michele Ceriotti
- Laboratory of Computational Science and Modeling, IMX, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
- Cecilia Clementi
- Department of Physics, Freie Universität Berlin, Arnimallee 14, 14195 Berlin, Germany
17
Weinreich J, Browning NJ, von Lilienfeld OA. Machine learning of free energies in chemical compound space using ensemble representations: Reaching experimental uncertainty for solvation. J Chem Phys 2021; 154:134113. [PMID: 33832231 DOI: 10.1063/5.0041548]
Abstract
Free energies govern the behavior of soft and liquid matter, and improving their predictions could have a large impact on the development of drugs, electrolytes, or homogeneous catalysts. Unfortunately, it is challenging to devise an accurate description of effects governing solvation such as hydrogen-bonding, van der Waals interactions, or conformational sampling. We present a Free energy Machine Learning (FML) model applicable throughout chemical compound space and based on a representation that employs Boltzmann averages to account for an approximated sampling of configurational space. Using the FreeSolv database, FML's out-of-sample prediction errors of experimental hydration free energies decay systematically with training set size, and experimental uncertainty (0.6 kcal/mol) is reached after training on 490 molecules (80% of FreeSolv). Corresponding FML model errors are on par with state-of-the-art physics-based approaches. To generate the input representation for a new query compound, FML requires approximate and short molecular dynamics runs. We showcase its usefulness through analysis of solvation free energies for 116k organic molecules (all force-field compatible molecules in the QM9 database), identifying the most and least solvated systems and rediscovering quasi-linear structure-property relationships in terms of simple descriptors such as hydrogen-bond donors, number of NH or OH groups, number of oxygen atoms in hydrocarbons, and number of heavy atoms. FML's accuracy is maximal when the temperature used for the molecular dynamics simulation to generate averaged input representation samples in training is the same as for the query compounds. The sampling time for the representation converges rapidly with respect to the prediction error.
Affiliation(s)
- Jan Weinreich
- University of Vienna, Faculty of Physics, Kolingasse 14-16, AT-1090 Wien, Austria
- Nicholas J Browning
- Institute of Physical Chemistry and National Center for Computational Design and Discovery of Novel Materials (MARVEL), Department of Chemistry, University of Basel, Klingelbergstrasse 80, CH-4056 Basel, Switzerland
18
Aggarwal A, Vinayak V, Bag S, Bhattacharyya C, Waghmare UV, Maiti PK. Predicting the DNA Conductance Using a Deep Feedforward Neural Network Model. J Chem Inf Model 2020; 61:106-114. [PMID: 33320660 DOI: 10.1021/acs.jcim.0c01072]
Abstract
Double-stranded DNA (dsDNA) has been established as an efficient medium for charge migration, bringing it to the forefront of the field of molecular electronics and biological research. The charge migration rate is controlled by the electronic couplings between the two nucleobases of DNA/RNA. These electronic couplings strongly depend on the intermolecular geometry and orientation. Estimating these electronic couplings for all the possible relative geometries of molecules using the computationally demanding first-principles calculations requires a lot of time and computational resources. In this article, we present a machine learning (ML)-based model to calculate the electronic coupling between any two bases of dsDNA/dsRNA and bypass the computationally expensive first-principles calculations. Using the Coulomb matrix representation, which encodes the atomic identities and coordinates of the DNA base pairs, to prepare the input dataset, we train a feedforward neural network model. Our neural network (NN) model can predict the electronic couplings between dsDNA base pairs with any structural orientation with a mean absolute error (MAE) of less than 0.014 eV. We further use the NN-predicted electronic coupling values to compute the dsDNA/dsRNA conductance.
Affiliation(s)
- Abhishek Aggarwal
- Center for Condensed Matter Theory, Department of Physics, Indian Institute of Science, Bangalore 560012, India
- Vinayak Vinayak
- Undergraduate Program, Indian Institute of Science, Bangalore 560012, India
- Saientan Bag
- Center for Condensed Matter Theory, Department of Physics, Indian Institute of Science, Bangalore 560012, India
- Chiranjib Bhattacharyya
- Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560012, India
- Umesh V Waghmare
- Theoretical Sciences Unit, Jawaharlal Nehru Center for Advanced Scientific Research, Jakkur P.O., Bangalore 560064, India
- Prabal K Maiti
- Center for Condensed Matter Theory, Department of Physics, Indian Institute of Science, Bangalore 560012, India