1
Ramos MC, White AD. Predicting small molecules solubility on endpoint devices using deep ensemble neural networks. Digital Discovery 2024; 3:786-795. [PMID: 38638648 PMCID: PMC11022985 DOI: 10.1039/d3dd00217a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2023] [Accepted: 03/07/2024] [Indexed: 04/20/2024]
Abstract
Aqueous solubility is a valuable yet challenging property to predict. Computing solubility using first-principles methods requires accounting for the competing effects of entropy and enthalpy, resulting in long computations for relatively poor accuracy. Data-driven approaches, such as deep learning, offer improved accuracy and computational efficiency but typically lack uncertainty quantification. Additionally, ease of use remains a concern for any computational technique, resulting in the sustained popularity of group-based contribution methods. In this work, we addressed these problems with a deep learning model with predictive uncertainty that runs on a static website (without a server). This approach moves computing needs onto the website visitor without requiring installation, removing the need to pay for and maintain servers. Our model achieves satisfactory results in solubility prediction. Furthermore, we demonstrate how to create molecular property prediction models that balance uncertainty and ease of use. The code is available at https://github.com/ur-whitelab/mol.dev, and the model is useable at https://mol.dev.
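The abstract does not spell out how the predictive uncertainty is computed. As a minimal sketch, assuming each ensemble member outputs a Gaussian mean and variance for log-solubility, a generic deep ensemble can be combined as a uniform mixture. The function name `combine_ensemble` and the example numbers are hypothetical and are not taken from the paper or its repository.

```python
import numpy as np


def combine_ensemble(means, variances):
    """Combine per-member Gaussian predictions into one mean and variance.

    Hypothetical helper: treats the ensemble as a uniform mixture of
    Gaussians, a common choice for deep ensembles (an assumption here,
    not necessarily the authors' exact procedure).
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    # Mixture mean: average of the member means.
    mean = means.mean()
    # Aleatoric part: average of the variances the members predict.
    aleatoric = variances.mean()
    # Epistemic part: disagreement between members (population variance).
    epistemic = means.var()
    return mean, aleatoric + epistemic


# Two hypothetical members predicting log-solubility for one molecule.
mean, var = combine_ensemble([-2.0, -3.0], [0.5, 0.5])
print(mean, var)
```

Splitting the total variance into these two terms shows whether the uncertainty comes from noisy data or from disagreement between the models.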
Affiliation(s)
- Mayk Caldas Ramos
  - Chemical Engineer Department, University of Rochester, Rochester, NY 14642, USA
- Andrew D White
  - Chemical Engineer Department, University of Rochester, Rochester, NY 14642, USA
2
Conn JM, Carter JW, Conn JJA, Subramanian V, Baxter A, Engkvist O, Llinas A, Ratkova EL, Pickett SD, McDonagh JL, Palmer DS. Blinded Predictions and Post Hoc Analysis of the Second Solubility Challenge Data: Exploring Training Data and Feature Set Selection for Machine and Deep Learning Models. J Chem Inf Model 2023; 63:1099-1113. [PMID: 36758178 PMCID: PMC9976279 DOI: 10.1021/acs.jcim.2c01189] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 02/11/2023]
Abstract
Accurate methods to predict solubility from molecular structure are highly sought after in the chemical sciences. To assess the state of the art, the American Chemical Society organized a "Second Solubility Challenge" in 2019, in which competitors were invited to submit blinded predictions of the solubilities of 132 drug-like molecules. In the first part of this article, we describe the development of two models that were submitted to the Blind Challenge in 2019 but which have not previously been reported. These models were based on computationally inexpensive molecular descriptors and traditional machine learning algorithms and were trained on a relatively small data set of 300 molecules. In the second part of the article, to test the hypothesis that predictions would improve with more advanced algorithms and higher volumes of training data, we compare these original predictions with those made after the deadline using deep learning models trained on larger solubility data sets consisting of 2999 and 5697 molecules. The results show that there are several algorithms that are able to obtain near state-of-the-art performance on the solubility challenge data sets, with the best model, a graph convolutional neural network, resulting in an RMSE of 0.86 log units. Critical analysis of the models reveals systematic differences between the performance of models using certain feature sets and training data sets. The results suggest that careful selection of high quality training data from relevant regions of chemical space is critical for prediction accuracy but that other methodological issues remain problematic for machine learning solubility models, such as the difficulty in modeling complex chemical spaces from sparse training data sets.
Affiliation(s)
- Jonathan G. M. Conn
  - Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow G1 1XL, U.K.
- James W. Carter
  - Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow G1 1XL, U.K.
- Justin J. A. Conn
  - Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow G1 1XL, U.K.
- Vigneshwari Subramanian
  - Drug Metabolism and Pharmacokinetics, Research and Early Development, Respiratory & Immunology, BioPharmaceuticals R&D, AstraZeneca, Pepparedsleden 1, SE-431 83 Göteborg, Sweden
- Andrew Baxter
  - GSK Medicines Research Centre, Gunnels Wood Road, Stevenage SG1 2NY, U.K.
- Ola Engkvist
  - Medicinal Chemistry, Research and Early Development, Cardiovascular, Renal and Metabolism (CVRM), BioPharmaceuticals R&D, AstraZeneca, SE-431 50 Göteborg, Sweden
  - Department of Computer Science and Engineering, Chalmers University of Technology, SE-412 96 Göteborg, Sweden
- Antonio Llinas
  - Drug Metabolism and Pharmacokinetics, Research and Early Development, Respiratory & Immunology, BioPharmaceuticals R&D, AstraZeneca, Pepparedsleden 1, SE-431 83 Göteborg, Sweden
- Ekaterina L. Ratkova
  - Medicinal Chemistry, Research and Early Development, Cardiovascular, Renal and Metabolism (CVRM), BioPharmaceuticals R&D, AstraZeneca, SE-431 50 Göteborg, Sweden
- Stephen D. Pickett
  - Computational Sciences, GlaxoSmithKline R&D Pharmaceuticals, Stevenage SG1 2NY, U.K.
- James L. McDonagh
  - IBM Research Europe, Hartree Centre, SciTech Daresbury, Warrington, Cheshire WA4 4AD, U.K.
- David S. Palmer
  - Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow G1 1XL, U.K.
3
Ge K, Ji Y. Novel Computational Approach by Combining Machine Learning with Molecular Thermodynamics for Predicting Drug Solubility in Solvents. Ind Eng Chem Res 2021. [DOI: 10.1021/acs.iecr.1c00998] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 12/13/2022]
Affiliation(s)
- Kai Ge
  - Jiangsu Province Hi-Tech Key Laboratory for Biomedical Research, School of Chemistry and Chemical Engineering, Southeast University, Nanjing 211189, People’s Republic of China
- Yuanhui Ji
  - Jiangsu Province Hi-Tech Key Laboratory for Biomedical Research, School of Chemistry and Chemical Engineering, Southeast University, Nanjing 211189, People’s Republic of China
4
Fowles DJ, Palmer DS, Guo R, Price SL, Mitchell JBO. Toward Physics-Based Solubility Computation for Pharmaceuticals to Rival Informatics. J Chem Theory Comput 2021; 17:3700-3709. [PMID: 33988381 PMCID: PMC8190954 DOI: 10.1021/acs.jctc.1c00130] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 01/11/2023]
Abstract
We demonstrate that physics-based calculations of intrinsic aqueous solubility can rival cheminformatics-based machine learning predictions. A proof-of-concept was developed for a physics-based approach via a sublimation thermodynamic cycle, building upon previous work that relied upon several thermodynamic approximations, notably the 2RT approximation, and limited conformational sampling. Here, we apply improvements to our sublimation free-energy model with the use of crystal phonon mode calculations to capture the contributions of the vibrational modes of the crystal. Including these improvements with lattice energies computed using the model-potential-based Ψmol method leads to accurate estimates of sublimation free energy. Combining these with hydration free energies obtained from either molecular dynamics free-energy perturbation simulations or density functional theory calculations, solubilities comparable to both experiment and informatics predictions are obtained. The application to coronene, succinic acid, and the pharmaceutical desloratadine shows how the methods must be adapted for the adoption of different conformations in different phases. The approach has the flexibility to extend to applications that cannot be covered by informatics methods.
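The abstract does not state the relation that links the two free energies to solubility. As a hedged sketch, the sublimation cycle is usually written as below. The standard-state convention marked by the asterisk and the crystal molar volume $V_m$ are assumptions here, not details given in the abstract.

```latex
\Delta G^{*}_{\mathrm{sol}}
  = \Delta G^{*}_{\mathrm{sub}} + \Delta G^{*}_{\mathrm{hyd}}
  = -RT \,\ln\!\left( S_{0} V_{m} \right)
```

Here $S_0$ is the intrinsic solubility, $R$ the gas constant, and $T$ the temperature. Because $S_0$ depends exponentially on the summed free energies, an error of about $RT \ln 10 \approx 5.7$ kJ/mol at 298 K shifts the predicted solubility by one log unit.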
Affiliation(s)
- Daniel J Fowles
  - Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow, Scotland G1 1XL, U.K.
- David S Palmer
  - Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow, Scotland G1 1XL, U.K.
- Rui Guo
  - Department of Chemistry, University College London, 20 Gordon Street, London WC1H 0AJ, U.K.
- Sarah L Price
  - Department of Chemistry, University College London, 20 Gordon Street, London WC1H 0AJ, U.K.
- John B O Mitchell
  - EaStCHEM School of Chemistry and Biomedical Sciences Research Complex, University of St Andrews, St Andrews, Scotland KY16 9ST, U.K.
5
Boobier S, Hose DRJ, Blacker AJ, Nguyen BN. Machine learning with physicochemical relationships: solubility prediction in organic solvents and water. Nat Commun 2020; 11:5753. [PMID: 33188226 PMCID: PMC7666209 DOI: 10.1038/s41467-020-19594-z] [Citation(s) in RCA: 77] [Impact Index Per Article: 19.3] [Received: 04/23/2020] [Accepted: 10/12/2020] [Indexed: 11/09/2022]
Abstract
Solubility prediction remains a critical challenge in drug development, synthetic route and chemical process design, extraction and crystallisation. Here we report a successful approach to solubility prediction in organic solvents and water using a combination of machine learning (ANN, SVM, RF, ExtraTrees, Bagging and GP) and computational chemistry. Rational interpretation of the dissolution process into a numerical problem led to a small set of selected descriptors and subsequent predictions which are independent of the applied machine learning method. These models gave significantly more accurate predictions compared to benchmarked open-access and commercial tools, achieving accuracy close to the expected level of noise in training data (LogS ± 0.7). Finally, they reproduced the physicochemical relationship between solubility and molecular properties in different solvents, which led to rational approaches to improve the accuracy of each model.
Affiliation(s)
- Samuel Boobier
  - Institute of Process Research & Development, School of Chemistry, University of Leeds, Woodhouse Lane, Leeds, LS2 9JT, UK
- David R J Hose
  - Chemical Development, Pharmaceutical Technology and Development, Operations, AstraZeneca, Macclesfield, SK10 2NA, UK
- A John Blacker
  - Institute of Process Research & Development, School of Chemistry, University of Leeds, Woodhouse Lane, Leeds, LS2 9JT, UK
- Bao N Nguyen
  - Institute of Process Research & Development, School of Chemistry, University of Leeds, Woodhouse Lane, Leeds, LS2 9JT, UK