1
Gallegos M, Vassilev-Galindo V, Poltavsky I, Martín Pendás Á, Tkatchenko A. Explainable chemical artificial intelligence from accurate machine learning of real-space chemical descriptors. Nat Commun 2024; 15:4345. [PMID: 38773090 DOI: 10.1038/s41467-024-48567-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Accepted: 04/24/2024] [Indexed: 05/23/2024] Open
Abstract
Machine-learned computational chemistry has led to a paradoxical situation in which molecular properties can be accurately predicted, but they are difficult to interpret. Explainable AI (XAI) tools can be used to analyze complex models, but they are highly dependent on the AI technique and the origin of the reference data. Alternatively, interpretable real-space tools can be employed directly, but they are often expensive to compute. To address this dilemma between explainability and accuracy, we developed SchNet4AIM, a SchNet-based architecture capable of dealing with local one-body (atomic) and two-body (interatomic) descriptors. The performance of SchNet4AIM is tested by predicting a wide collection of real-space quantities ranging from atomic charges and delocalization indices to pairwise interaction energies. The accuracy and speed of SchNet4AIM break the bottleneck that has prevented the use of real-space chemical descriptors in complex systems. We show that the group delocalization indices, arising from our physically rigorous atomistic predictions, provide reliable indicators of supramolecular binding events, thus contributing to the development of Explainable Chemical Artificial Intelligence (XCAI) models.
Affiliation(s)
- Miguel Gallegos
  - Department of Analytical and Physical Chemistry, University of Oviedo, E-33006, Oviedo, Spain
- Igor Poltavsky
  - Department of Physics and Materials Science, University of Luxembourg, L-1511, Luxembourg City, Luxembourg
- Ángel Martín Pendás
  - Department of Analytical and Physical Chemistry, University of Oviedo, E-33006, Oviedo, Spain
- Alexandre Tkatchenko
  - Department of Physics and Materials Science, University of Luxembourg, L-1511, Luxembourg City, Luxembourg
2
Llompart P, Minoletti C, Baybekov S, Horvath D, Marcou G, Varnek A. Will we ever be able to accurately predict solubility? Sci Data 2024; 11:303. [PMID: 38499581 PMCID: PMC10948805 DOI: 10.1038/s41597-024-03105-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2023] [Accepted: 02/29/2024] [Indexed: 03/20/2024] Open
Abstract
Accurate prediction of thermodynamic solubility by machine learning remains a challenge. Recent models often display good performance, but their reliability may be deceiving when used prospectively. This study investigates the origins of these discrepancies, following three directions: a historical perspective, an analysis of the aqueous solubility dataverse, and data quality. We investigated over 20 years of published solubility datasets and models, highlighting overlooked datasets and the overlaps between popular sets. We benchmarked recently published models on a novel curated solubility dataset and report poor performance. We also propose a workflow to curate aqueous solubility data, aiming at producing useful models for bench chemists. Our results demonstrate that some state-of-the-art models are not ready for public usage because they lack a well-defined applicability domain and overlook historical data sources. We report the impact of factors influencing the utility of the models: interlaboratory standard deviation, ionic state of the solute, and data sources. The models obtained herein and the quality-assessed datasets are publicly available.
Affiliation(s)
- P Llompart
  - Laboratory of Chemoinformatics, UMR7140, University of Strasbourg, Strasbourg, France
  - IDD/CADD, Sanofi, Vitry-Sur-Seine, France
- S Baybekov
  - Laboratory of Chemoinformatics, UMR7140, University of Strasbourg, Strasbourg, France
- D Horvath
  - Laboratory of Chemoinformatics, UMR7140, University of Strasbourg, Strasbourg, France
- G Marcou
  - Laboratory of Chemoinformatics, UMR7140, University of Strasbourg, Strasbourg, France
- A Varnek
  - Laboratory of Chemoinformatics, UMR7140, University of Strasbourg, Strasbourg, France
3
Gao P, Zhang Q, Keely D, Cleveland DW, Ye Y, Zheng W, Shen M, Yu H. Molecular Graph-Based Deep Learning Algorithm Facilitates an Imaging-Based Strategy for Rapid Discovery of Small Molecules Modulating Biomolecular Condensates. J Med Chem 2023; 66:15084-15093. [PMID: 37937963 PMCID: PMC10810226 DOI: 10.1021/acs.jmedchem.3c00490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2023]
Abstract
Biomolecular condensates are proposed to cause diseases, such as cancer and neurodegeneration, by concentrating proteins at abnormal subcellular loci. Imaging-based compound screens have been used to identify small molecules that reverse or promote biomolecular condensates. However, limitations of conventional imaging-based methods restrict the screening scale. Here, we used a graph convolutional network (GCN)-based computational approach and identified small molecule candidates that reduce the nuclear liquid-liquid phase separation of TAR DNA-binding protein 43 (TDP-43), an essential protein that undergoes phase transition in neurodegenerative diseases. We demonstrated that the GCN-based deep learning algorithm is suitable for spatial information extraction from the molecular graph. Thus, this is a promising method to identify small molecule candidates with novel scaffolds. Furthermore, we validated that these candidates do not affect the normal splicing function of TDP-43. Taken together, a combination of an imaging-based screen and a GCN-based deep learning method dramatically improves the speed and accuracy of the compound screen for biomolecular condensates.
Affiliation(s)
- Peng Gao
  - The National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), MD 20850, USA
- Qi Zhang
  - The National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), MD 20850, USA
- Devin Keely
  - Center for Alzheimer’s and Neurodegenerative Diseases, Department of Molecular Biology, Peter O’Donnell Jr. Brain Institute, UT Southwestern Medical Center, TX, 75287, USA
- Don W. Cleveland
  - Department of Cellular and Molecular Medicine, UC San Diego, CA, 92093, USA
- Yihong Ye
  - National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Institutes of Health (NIH), MD 20850, USA
- Wei Zheng
  - The National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), MD 20850, USA
- Min Shen
  - The National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), MD 20850, USA
- Haiyang Yu
  - Center for Alzheimer’s and Neurodegenerative Diseases, Department of Molecular Biology, Peter O’Donnell Jr. Brain Institute, UT Southwestern Medical Center, TX, 75287, USA
4
Sprueill HW, Bilbrey JA, Pang Q, Sushko PV. Active sampling for neural network potentials: Accelerated simulations of shear-induced deformation in Cu-Ni multilayers. J Chem Phys 2023; 158:114103. [PMID: 36948793 DOI: 10.1063/5.0133023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/25/2023] Open
Abstract
Neural network potentials (NNPs) can greatly accelerate atomistic simulations relative to ab initio methods, allowing one to sample a broader range of structural outcomes and transformation pathways. In this work, we demonstrate an active sampling algorithm that trains an NNP that is able to produce microstructural evolutions with accuracy comparable to those obtained by density functional theory, exemplified during structure optimizations for a model Cu-Ni multilayer system. We then use the NNP, in conjunction with a perturbation scheme, to stochastically sample structural and energetic changes caused by shear-induced deformation, demonstrating the range of possible intermixing and vacancy migration pathways that can be obtained as a result of the speedups provided by the NNP. The code to implement our active learning strategy and NNP-driven stochastic shear simulations is openly available at https://github.com/pnnl/Active-Sampling-for-Atomistic-Potentials.
Affiliation(s)
- Henry W Sprueill
  - National Security Directorate, Pacific Northwest National Laboratory, Richland, Washington 99352, USA
- Jenna A Bilbrey
  - National Security Directorate, Pacific Northwest National Laboratory, Richland, Washington 99352, USA
- Qin Pang
  - Physical and Computational Sciences Directorate, Pacific Northwest National Laboratory, Richland, Washington 99352, USA
- Peter V Sushko
  - Physical and Computational Sciences Directorate, Pacific Northwest National Laboratory, Richland, Washington 99352, USA
5
Conn JM, Carter JW, Conn JJA, Subramanian V, Baxter A, Engkvist O, Llinas A, Ratkova EL, Pickett SD, McDonagh JL, Palmer DS. Blinded Predictions and Post Hoc Analysis of the Second Solubility Challenge Data: Exploring Training Data and Feature Set Selection for Machine and Deep Learning Models. J Chem Inf Model 2023; 63:1099-1113. [PMID: 36758178 PMCID: PMC9976279 DOI: 10.1021/acs.jcim.2c01189] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 02/11/2023]
Abstract
Accurate methods to predict solubility from molecular structure are highly sought after in the chemical sciences. To assess the state of the art, the American Chemical Society organized a "Second Solubility Challenge" in 2019, in which competitors were invited to submit blinded predictions of the solubilities of 132 drug-like molecules. In the first part of this article, we describe the development of two models that were submitted to the Blind Challenge in 2019 but which have not previously been reported. These models were based on computationally inexpensive molecular descriptors and traditional machine learning algorithms and were trained on a relatively small data set of 300 molecules. In the second part of the article, to test the hypothesis that predictions would improve with more advanced algorithms and higher volumes of training data, we compare these original predictions with those made after the deadline using deep learning models trained on larger solubility data sets consisting of 2999 and 5697 molecules. The results show that there are several algorithms that are able to obtain near state-of-the-art performance on the solubility challenge data sets, with the best model, a graph convolutional neural network, resulting in an RMSE of 0.86 log units. Critical analysis of the models reveals systematic differences between the performance of models using certain feature sets and training data sets. The results suggest that careful selection of high quality training data from relevant regions of chemical space is critical for prediction accuracy but that other methodological issues remain problematic for machine learning solubility models, such as the difficulty in modeling complex chemical spaces from sparse training data sets.
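As a reading aid for the error metric quoted in this abstract (an RMSE of 0.86 log units), the following is a minimal sketch of how a root-mean-square error over log-solubility values can be computed. The function name `rmse` and the four example values are hypothetical and chosen only for illustration; they are not data from the Solubility Challenge.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean of the squared residuals, then the square root.
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical log-solubility values (log units), for illustration only.
observed = [-3.0, -4.0, -2.0, -5.0]
predicted = [-2.0, -4.0, -3.0, -5.0]

print(f"RMSE = {rmse(predicted, observed):.2f} log units")
```

Because the values are logarithms, an RMSE of about 1 log unit corresponds to a typical error of roughly a factor of ten in the underlying solubility.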
Affiliation(s)
- Jonathan G. M. Conn
  - Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow G1 1XL, U.K.
- James W. Carter
  - Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow G1 1XL, U.K.
- Justin J. A. Conn
  - Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow G1 1XL, U.K.
- Vigneshwari Subramanian
  - Drug Metabolism and Pharmacokinetics, Research and Early Development, Respiratory & Immunology, BioPharmaceuticals R&D, AstraZeneca, Pepparedsleden 1, SE-431 83 Göteborg, Sweden
- Andrew Baxter
  - GSK Medicines Research Centre, Gunnels Wood Road, Stevenage SG1 2NY, U.K.
- Ola Engkvist
  - Medicinal Chemistry, Research and Early Development, Cardiovascular, Renal and Metabolism (CVRM), BioPharmaceuticals R&D, AstraZeneca, SE-431 50 Göteborg, Sweden
  - Department of Computer Science and Engineering, Chalmers University of Technology, SE-412 96 Göteborg, Sweden
- Antonio Llinas
  - Drug Metabolism and Pharmacokinetics, Research and Early Development, Respiratory & Immunology, BioPharmaceuticals R&D, AstraZeneca, Pepparedsleden 1, SE-431 83 Göteborg, Sweden
- Ekaterina L. Ratkova
  - Medicinal Chemistry, Research and Early Development, Cardiovascular, Renal and Metabolism (CVRM), BioPharmaceuticals R&D, AstraZeneca, SE-431 50 Göteborg, Sweden
- Stephen D. Pickett
  - Computational Sciences, GlaxoSmithKline R&D Pharmaceuticals, Stevenage SG1 2NY, U.K.
- James L. McDonagh
  - IBM Research Europe, Hartree Centre, SciTech Daresbury, Warrington, Cheshire WA4 4AD, U.K.
- David S. Palmer
  - Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow G1 1XL, U.K.
6
Wu J, Wang J, Wu Z, Zhang S, Deng Y, Kang Y, Cao D, Hsieh CY, Hou T. ALipSol: An Attention-Driven Mixture-of-Experts Model for Lipophilicity and Solubility Prediction. J Chem Inf Model 2022; 62:5975-5987. [PMID: 36417544 DOI: 10.1021/acs.jcim.2c01290] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 11/24/2022]
Abstract
Lipophilicity (logD) and aqueous solubility (logSw) play a central role in drug development. The accurate prediction of these properties remains to be solved due to data scarcity. Current methodologies neglect the intrinsic relationships between physicochemical properties and usually ignore the ionization effects. Here, we propose an attention-driven mixture-of-experts (MoE) model named ALipSol, which explicitly reproduces the hierarchy of task relationships. We adopt the principle of divide-and-conquer by breaking down the complex end point (logD or logSw) into simpler ones (acidic pKa, basic pKa, and logP) and allocating a specific expert network for each subproblem. Subsequently, we implement transfer learning to extract knowledge from related tasks, thus alleviating the dilemma of limited data. Additionally, we substitute the gating network with an attention mechanism to better capture the dynamic task relationships on a per-example basis. We adopt local fine-tuning and consensus prediction to further boost model performance. Extensive evaluation experiments verify the success of the ALipSol model, which achieves RMSE improvement of 8.04%, 2.49%, 8.57%, 12.8%, and 8.60% on the Lipop, ESOL, AqSolDB, external logD, and external logS data sets, respectively, compared with Attentive FP and the state-of-the-art in silico tools. In particular, our model yields more significant advantages (Welch's t-test) for small training data, implying its high robustness and generalizability. The interpretability analysis proves that the atom contributions learned by ALipSol are more reasonable compared with the vanilla Attentive FP, and the substitution effects in benzene derivatives agreed well with empirical constants, revealing the potential of our model to extract useful patterns from data and provide guidance for lead optimization.
Affiliation(s)
- Jialu Wu
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058 Zhejiang, P. R. China
  - CarbonSilicon AI Technology Co., Ltd, Hangzhou, 310018 Zhejiang, P. R. China
- Junmei Wang
  - Department of Pharmaceutical Sciences and Computational Chemical Genomics Screening Center, School of Pharmacy, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, United States
- Zhenxing Wu
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058 Zhejiang, P. R. China
  - CarbonSilicon AI Technology Co., Ltd, Hangzhou, 310018 Zhejiang, P. R. China
- Shengyu Zhang
  - Tencent Quantum Laboratory, Tencent, Shenzhen, 518057 Guangdong, P. R. China
- Yafeng Deng
  - CarbonSilicon AI Technology Co., Ltd, Hangzhou, 310018 Zhejiang, P. R. China
- Yu Kang
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058 Zhejiang, P. R. China
- Dongsheng Cao
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, 410004 Hunan, P. R. China
- Chang-Yu Hsieh
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058 Zhejiang, P. R. China
- Tingjun Hou
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058 Zhejiang, P. R. China
7
Gao P, Xu M, Zhang Q, Chen CZ, Guo H, Ye Y, Zheng W, Shen M. Graph Convolutional Network-Based Screening Strategy for Rapid Identification of SARS-CoV-2 Cell-Entry Inhibitors. J Chem Inf Model 2022; 62:1988-1997. [PMID: 35404596 PMCID: PMC9016773 DOI: 10.1021/acs.jcim.2c00222] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2022] [Indexed: 11/29/2022]
Abstract
The cell entry of SARS-CoV-2 has emerged as an attractive drug development target. We previously reported that the entry of SARS-CoV-2 depends on the cell surface heparan sulfate proteoglycan (HSPG) and the cortex actin, which can be targeted by therapeutic agents identified by conventional drug repurposing screens. However, this drug identification strategy requires laborious, time-consuming library screening, and often only a limited number of compounds can be screened. As an alternative approach, we developed and trained a graph convolutional network (GCN)-based classification model using information extracted from experimentally identified HSPG and actin inhibitors. This method allowed us to virtually screen 170,000 compounds, resulting in ∼2000 potential hits. A hit confirmation assay with the uptake of a fluorescently labeled HSPG cargo further shortlisted 256 active compounds. Among them, 16 compounds had modest to strong inhibitory activities against the entry of SARS-CoV-2 pseudotyped particles into Vero E6 cells. These results establish a GCN-based virtual screen workflow for rapid identification of new small molecule inhibitors against validated drug targets.
Affiliation(s)
- Peng Gao
  - The National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Bethesda, Maryland 20850, United States
- Miao Xu
  - The National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Bethesda, Maryland 20850, United States
- Qi Zhang
  - The National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Bethesda, Maryland 20850, United States
  - National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Institutes of Health (NIH), Bethesda, Maryland 20892, United States
- Catherine Z Chen
  - The National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Bethesda, Maryland 20850, United States
- Hui Guo
  - The National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Bethesda, Maryland 20850, United States
- Yihong Ye
  - National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Institutes of Health (NIH), Bethesda, Maryland 20892, United States
- Wei Zheng
  - The National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Bethesda, Maryland 20850, United States
- Min Shen
  - The National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Bethesda, Maryland 20850, United States
8
Lee S, Lee M, Gyak KW, Kim SD, Kim MJ, Min K. Novel Solubility Prediction Models: Molecular Fingerprints and Physicochemical Features vs Graph Convolutional Neural Networks. ACS Omega 2022; 7:12268-12277. [PMID: 35449985 PMCID: PMC9016862 DOI: 10.1021/acsomega.2c00697] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 02/03/2022] [Accepted: 03/18/2022] [Indexed: 05/27/2023]
Abstract
Predicting solubility values that are both accurate and reliable has long been a crucial but challenging task. In this work, surrogate model-based methods were developed to accurately predict the solubility of two molecules (solute and solvent) through machine learning and deep learning. The current study employed two methods: (1) converting molecules into molecular fingerprints and adding optimal physicochemical properties as descriptors and (2) using graph convolutional network (GCN) models to convert molecules into a graph representation and deal with prediction tasks. Then, two prediction tasks were conducted with each method: (1) the solubility value (regression) and (2) the solubility class (classification). The fingerprint-based method clearly demonstrates that high performance is possible by adding simple but significant physicochemical descriptors to molecular fingerprints, while the GCN method shows that it is possible to predict various properties of chemical compounds with relatively simplified features from the graph representation. The developed methodologies provide a comprehensive understanding of constructing a proper model for predicting solubility and can be employed to find suitable solutes and solvents.
Affiliation(s)
- Sumin Lee
  - Department of Industrial and Information Systems Engineering, School of Systems Biomedical Science, School of Mechanical Engineering, Soongsil University, 369 Sangdo-ro, Dongjak-gu, Seoul 06978, Republic of Korea
- Myeonghun Lee
  - Department of Industrial and Information Systems Engineering, School of Systems Biomedical Science, School of Mechanical Engineering, Soongsil University, 369 Sangdo-ro, Dongjak-gu, Seoul 06978, Republic of Korea
- Ki-Won Gyak
  - Polymer Research Lab, Samsung Advanced Institute of Technology, 130 Samsung-ro, Suwon, Gyeonggi-do 16678, Republic of Korea
- Sung Dug Kim
  - Polymer Research Lab, Samsung Advanced Institute of Technology, 130 Samsung-ro, Suwon, Gyeonggi-do 16678, Republic of Korea
- Mi-Jeong Kim
  - Polymer Research Lab, Samsung Advanced Institute of Technology, 130 Samsung-ro, Suwon, Gyeonggi-do 16678, Republic of Korea
- Kyoungmin Min
  - Department of Industrial and Information Systems Engineering, School of Systems Biomedical Science, School of Mechanical Engineering, Soongsil University, 369 Sangdo-ro, Dongjak-gu, Seoul 06978, Republic of Korea
9
Accurate predictions of drugs aqueous solubility via deep learning tools. J Mol Struct 2022. [DOI: 10.1016/j.molstruc.2021.131562] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
10
Donyapour N, Dickson A. Predicting partition coefficients for the SAMPL7 physical property challenge using the ClassicalGSG method. J Comput Aided Mol Des 2021; 35:819-830. [PMID: 34181200 PMCID: PMC8295205 DOI: 10.1007/s10822-021-00400-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/25/2021] [Accepted: 06/17/2021] [Indexed: 02/02/2023]
Abstract
The prediction of [Formula: see text] values is one part of the statistical assessment of the modeling of proteins and ligands (SAMPL) blind challenges. Here, we use a molecular graph representation method called Geometric Scattering for Graphs (GSG) to transform atomic attributes to molecular features. The atomic attributes used here are parameters from classical molecular force fields including partial charges and Lennard-Jones interaction parameters. The molecular features from GSG are used as inputs to neural networks that are trained using a "master" dataset comprising over 41,000 unique [Formula: see text] values. The specific molecular targets in the SAMPL7 [Formula: see text] prediction challenge were unique in that they all contained a sulfonyl moiety. This motivated a set of ClassicalGSG submissions where predictors were trained on different subsets of the master dataset that are filtered according to chemical types and/or the presence of the sulfonyl moiety. We find that our ranked prediction obtained 5th place with an RMSE of 0.77 [Formula: see text] units and an MAE of 0.62, while one of our non-ranked predictions achieved first place among all submissions with an RMSE of 0.55 and an MAE of 0.44. After the conclusion of the challenge we also examined the performance of open-source force field parameters that allow for an end-to-end [Formula: see text] predictor model: General AMBER Force Field (GAFF), Universal Force Field (UFF), Merck Molecular Force Field 94 (MMFF94) and Ghemical. We find that ClassicalGSG models trained with atomic attributes from MMFF94 can yield more accurate predictions compared to those trained with CGenFF atomic attributes.
Affiliation(s)
- Nazanin Donyapour
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, USA
- Alex Dickson
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA
11
Gao P, Zhang J, Qiu H, Zhao S. A general QSPR protocol for the prediction of atomic/inter-atomic properties: a fragment based graph convolutional neural network (F-GCN). Phys Chem Chem Phys 2021; 23:13242-13249. [PMID: 34086015 DOI: 10.1039/d1cp00677k] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/21/2022]
Abstract
In this study, a general quantitative structure-property relationship (QSPR) protocol, the fragment-based graph convolutional neural network (F-GCN), was developed for the prediction of atomic/inter-atomic properties. We applied this novel artificial intelligence (AI) tool to the prediction of NMR chemical shifts and bond dissociation energies (BDEs). The results were comparable to experimental measurements, while the computational cost was substantially lower than that of pure density functional theory (DFT) calculations. F-GCN has two important features: first, it can use different levels of molecular fragments for atomic/inter-atomic information extraction; second, its architecture can include additional descriptors for a more accurate description of the local environment at the atomic level, making it more efficient for structural solutions. In our tests, the average prediction error for ¹H NMR chemical shifts was as small as 0.32 ppm, and the error of C-H BDE estimation was 2.7 kcal mol⁻¹. Moreover, we further demonstrated the applicability of the F-GCN model via several challenging structural assignments. The success of F-GCN in atomic and inter-atomic predictions also indicates that AI tools can bring an essential improvement to computational chemistry.
Affiliation(s)
- Peng Gao
  - School of Chemistry and Molecular Bioscience, University of Wollongong, NSW 2500, Australia
- Jie Zhang
  - Centre of Chemistry and Chemical Biology, Bioland Laboratory (Guangzhou Regenerative Medicine and Health-Guangdong Laboratory), Guangzhou 53000, China
  - School of Chemical Engineering, East China University of Science and Technology, Shanghai 200237, China
- Hongbo Qiu
  - Department of Chemical Engineering, Monash University, Clayton, VIC 3800, Australia
- Shuaifei Zhao
  - Institute for Frontier Materials (IFM), Deakin University, Perth, WA, Australia
12
Donyapour N, Hirn MJ, Dickson A. ClassicalGSG: Prediction of log P using classical molecular force fields and geometric scattering for graphs. J Comput Chem 2021; 42:1006-1017. [PMID: 33786857 PMCID: PMC8062296 DOI: 10.1002/jcc.26519] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/23/2020] [Revised: 02/11/2021] [Accepted: 02/21/2021] [Indexed: 12/15/2022]
Abstract
This work examines methods for predicting the partition coefficient (log P) for a dataset of small molecules. Here, we use atomic attributes such as radius and partial charge, which are typically used as force field parameters in classical molecular dynamics simulations. These atomic attributes are transformed into index-invariant molecular features using a recently developed method called geometric scattering for graphs (GSG). We call this approach "ClassicalGSG" and examine its performance under a broad range of conditions and hyperparameters. We train ClassicalGSG log P predictors with neural networks using 10,722 molecules from the OpenChem dataset and apply them to predict the log P values from four independent test sets. The ClassicalGSG method's performance is compared to a baseline model that employs graph convolutional networks. Our results show that the best prediction accuracies are obtained using atomic attributes generated with the CHARMM generalized force field and 2D molecular structures.
Affiliation(s)
- Nazanin Donyapour
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, Michigan, USA
- Matthew J. Hirn
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, Michigan, USA
  - Department of Mathematics, Michigan State University, East Lansing, Michigan, USA
  - Center for Quantum Computing, Science and Engineering, Michigan State University, East Lansing, Michigan, USA
- Alex Dickson
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, Michigan, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan, USA
13
Gao P, Zhang J, Sun Y, Yu J. Toward Accurate Predictions of Atomic Properties via Quantum Mechanics Descriptors Augmented Graph Convolutional Neural Network: Application of This Novel Approach in NMR Chemical Shifts Predictions. J Phys Chem Lett 2020; 11:9812-9818. [PMID: 33151693 DOI: 10.1021/acs.jpclett.0c02654] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 06/11/2023]
Abstract
In this study, a Graph Convolutional Network (GCN) augmented with quantum mechanics (QM) descriptors is reported to predict NMR chemical shifts accurately with respect to experimental values. The prediction errors of ¹³C/¹H NMR chemical shifts can be as small as 2.14/0.11 ppm. This modified GCN has two crucial characteristics: first, it can efficiently extract the overall molecular structure information; second, it can accurately describe the chemical environment of the target atom. As the linear regression between the experimental NMR chemical shifts (δ) and the density functional theory (DFT) calculated isotropic shielding constants (σ) is imperfect, the inclusion of QM descriptors within the GCN can largely improve its performance. Moreover, few-shot learning also becomes feasible with these descriptors. The success of this novel GCN in chemical shift predictions also indicates its potential applicability to other computational studies.
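The linear relation between δ and σ mentioned in this abstract can be written out as follows. This is only a sketch of the standard linear-scaling form; the fitted slope a, intercept b, and per-atom residual ε_i are symbols assumed here for illustration and do not appear in the abstract.

```latex
% Assumed notation: a (slope) and b (intercept) are fitted constants,
% sigma_i is the DFT isotropic shielding of atom i, and epsilon_i is
% the part of the experimental shift that the linear fit leaves unexplained.
\begin{align}
  \delta_i^{\mathrm{exp}} &= a\,\sigma_i^{\mathrm{DFT}} + b + \varepsilon_i, \\
  \varepsilon_i &= \delta_i^{\mathrm{exp}} - \bigl(a\,\sigma_i^{\mathrm{DFT}} + b\bigr).
\end{align}
```

In the ideal case δ = σ_ref − σ, so a would equal −1 and b would equal the shielding of the reference compound. Read this way, the "imperfect" regression corresponds to nonzero ε_i, which is plausibly the part a descriptor-augmented model has to learn.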
Affiliation(s)
- Peng Gao
  - School of Chemistry and Molecular Bioscience, University of Wollongong, Wollongong, NSW 2500, Australia
- Jie Zhang
  - Centre of Chemistry and Chemical Biology, Bioland Laboratory (Guangzhou Regenerative Medicine and Health-Guangdong Laboratory), Guangzhou 53000, China
  - School of Chemical Engineering, East China University of Science and Technology, Shanghai 200237, China
- Yuzhu Sun
  - School of Chemical Engineering, East China University of Science and Technology, Shanghai 200237, China
- Jianguo Yu
  - School of Chemical Engineering, East China University of Science and Technology, Shanghai 200237, China