1
Ahmad W, Chong KT, Tayara H. GGAS2SN: Gated Graph and SmilesToSeq Network for Solubility Prediction. J Chem Inf Model 2024; 64:7833-7843. [PMID: 39387596 DOI: 10.1021/acs.jcim.4c00792] [Indexed: 10/15/2024]
Abstract
Aqueous solubility is a critical physicochemical property of drug discovery. Solubility is a key issue in pharmaceutical development because it can limit a drug's absorption capacity. Accurate solubility prediction is crucial for pharmacological, environmental, and drug development studies. This research introduces a novel method for solubility prediction by combining gated graph neural networks (GGNNs) and graph attention neural networks (GATs) with Smiles2Seq encoding. Our methodology involves converting chemical compounds into graph structures with nodes representing atoms and edges indicating chemical bonds. These graphs are then processed by using a specialized graph neural network (GNN) architecture. Incorporating attention mechanisms into GNN allows for capturing subtle structural dependencies, fostering improved solubility predictions. Furthermore, we utilized the Smiles2Seq encoding technique to bridge the semantic gap between molecular structures and their textual representations. Smiles2Seq seamlessly converts chemical notations into numeric sequences, facilitating the efficient transfer of information into our model. We demonstrate the efficacy of our approach through comprehensive experiments on benchmark solubility data sets, showcasing superior predictive performance compared to traditional methods. Our model outperforms existing solubility prediction models and provides interpretable insights into the molecular features driving solubility behavior. This research signifies an important advancement in solubility prediction, offering potent tools for drug discovery, formulation development, and environmental assessments. The fusion of GGNN and Smiles2Seq encoding establishes a robust framework for accurately forecasting solubility across various chemical compounds, fostering innovation in various domains reliant on solubility data.
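The two input representations described in this abstract (a molecular graph with atoms as nodes and bonds as edges, and a SMILES string converted to a numeric sequence) can be illustrated with a minimal sketch. It is not the authors' Smiles2Seq implementation: the character-level vocabulary, the function names `build_vocab`, `encode_smiles`, and `adjacency_matrix`, and the hand-written ethanol graph are all assumptions chosen for illustration.

```python
# Hypothetical sketch: SMILES string -> integer sequence, and molecule -> graph.
# Character-level tokens are an assumption; a real tokenizer would treat
# multi-character atoms such as "Cl" or "Br" as single tokens.

def build_vocab(smiles_list):
    """Map every character seen in the corpus to a positive integer (0 = padding)."""
    chars = sorted({ch for smi in smiles_list for ch in smi})
    return {ch: idx + 1 for idx, ch in enumerate(chars)}

def encode_smiles(smiles, vocab, max_len):
    """Convert a SMILES string to a fixed-length list of token ids."""
    ids = [vocab[ch] for ch in smiles][:max_len]
    return ids + [0] * (max_len - len(ids))

def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from a list of (i, j) bonds."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i][j] = 1
        adj[j][i] = 1
    return adj

# Ethanol (SMILES "CCO"), heavy atoms only: nodes are atoms, edges are bonds.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

vocab = build_vocab(["CCO", "c1ccccc1"])
sequence = encode_smiles("CCO", vocab, max_len=5)
graph = adjacency_matrix(len(atoms), bonds)
```

In a model of the kind the abstract describes, `graph` (plus per-atom features) would feed the graph network branch and `sequence` the sequence branch; how the two are fused is not specified here.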
Affiliation(s)
- Waqar Ahmad
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Kil To Chong
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
  - Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, Korea
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Korea

2
Zhao J, Hermans E, Sepassi K, Tistaert C, Bergström CAS, Ahmad M, Larsson P. Effect of Data Quality and Data Quantity on the Estimation of Intrinsic Solubility: Analysis Based on a Single-Source Data Set. Mol Pharm 2024; 21:5261-5271. [PMID: 39267585 PMCID: PMC11462503 DOI: 10.1021/acs.molpharmaceut.4c00685] [Received: 06/20/2024] [Revised: 09/05/2024] [Accepted: 09/05/2024] [Indexed: 09/17/2024]
Abstract
Aqueous solubility is one of the most important physicochemical properties of drug molecules and a major driving force for oral drug absorption. To date, the performance of in silico models for the estimation of solubility for novel chemical space is limited. To investigate possible reasons and remedies for this, the Johnson and Johnson in-house aqueous solubility data with over 40,000 compounds was leveraged. All data were generated through the same high-throughput assay, providing a unique opportunity to explore the relationship between data quality, quantity, and model estimations. Six intrinsic solubility data sets with different sizes and noise levels were generated by making use of three different approaches: (i) inclusion or exclusion of amorphous solid residue, (ii) measured or experimental log D to identify the intrinsic solubility, and (iii) adopting or omitting a quality check process in the data processing workflow. A random forest regressor was trained on the data sets with three different sets of descriptors calculated from RDKit, ADMET predictor, or Mordred, and the performances were evaluated with nested cross-validation as well as ten refined test sets. The models confirm, as expected, that with the same data set size, high-quality data leads to better model performance; however, also, models trained with larger data sets containing analytical variability can give equally accurate estimations compared to models trained with small, clean, and diverse data sets. However, noise introduced by including the presence of amorphous solid postsolubility measurement in the training data set cannot be overcome by increasing data size, as they are introducing a biased systematic positive error in the data set, confirming the importance of critical data review. Finally, two top-performing models were tested on the first test set from the second solubility challenge, achieving RMSE values of 0.74 and 0.72 and log S ± 0.5 of 46 and 48%, respectively. 
These results demonstrated improved performance compared to those reported in the findings of the competition, highlighting that a single-source curated data set can enhance the prediction of intrinsic solubility.
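The two evaluation figures quoted in this abstract, RMSE and the percentage of predictions within ±0.5 log S units of the measured value, can be computed as in the following sketch. The function names `rmse` and `fraction_within` and the four log S values are illustrative assumptions, not data or code from the study.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted log S values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def fraction_within(y_true, y_pred, tol=0.5):
    """Fraction of predictions whose absolute error is at most `tol` log units."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tol)
    return hits / len(y_true)

# Made-up log S values, for illustration only.
measured = [-3.0, -2.0, -1.0, -4.0]
predicted = [-3.0, -2.5, -2.0, -4.0]

error = rmse(measured, predicted)               # sqrt(0.3125)
share = fraction_within(measured, predicted)    # 3 of 4 predictions within 0.5
```

Reporting both numbers is useful because RMSE is dominated by large misses, while the ±0.5 share says how often a prediction is close enough to be practically usable.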
Affiliation(s)
- Jiaxi Zhao
  - Department of Pharmacy, Uppsala University, 751 23 Uppsala, Sweden
- Eline Hermans
  - Pharmaceutical & Material Sciences, Janssen Pharmaceutica NV, B-2340 Beerse, Belgium
- Kia Sepassi
  - Discovery Pharmaceutics, Janssen Research & Development, LLC, La Jolla, California 92121, United States
- Christophe Tistaert
  - Pharmaceutical & Material Sciences, Janssen Pharmaceutica NV, B-2340 Beerse, Belgium
- Mazen Ahmad
  - In Silico Discovery, Janssen Pharmaceutica NV, B-2340 Beerse, Belgium
- Per Larsson
  - Department of Pharmacy, Uppsala University, 751 23 Uppsala, Sweden

3
Zheng T, Mitchell JBO, Dobson S. Revisiting the Application of Machine Learning Approaches in Predicting Aqueous Solubility. ACS OMEGA 2024; 9:35209-35222. [PMID: 39157153 PMCID: PMC11325511 DOI: 10.1021/acsomega.4c06163] [Received: 07/03/2024] [Revised: 07/19/2024] [Accepted: 07/22/2024] [Indexed: 08/20/2024]
Abstract
The solubility of chemical substances in water is a critical parameter in pharmaceutical development, environmental chemistry, agrochemistry, and other fields; however, accurately predicting it remains a challenge. This study aims to evaluate and compare the effectiveness of some of the most popular machine learning modeling methods and molecular featurization techniques in predicting aqueous solubility. Although these methods were not implemented in a competitive environment, some of their performance surpassed previous benchmarks, offering gradual but significant improvements. Our results show that methods based on graph convolution and graph attention mechanisms demonstrated exceptional predictive abilities with high-quality data sets, albeit with a sensitivity to data noise and errors. In contrast, models leveraging molecular descriptors not only provided better interpretability but also showed more resilience when dealing with inherent noise and errors in data. Our analysis of over 4000 molecular descriptors used in various models identified that approximately 800 of these descriptors make a significant contribution to solubility prediction. These insights offer guidance and direction for future developments in solubility prediction.
Affiliation(s)
- Tianyuan Zheng
  - School of Computer Science, University of St Andrews, St Andrews, Fife KY16 9SX, U.K.
- John B. O. Mitchell
  - EaStCHEM School of Chemistry, University of St Andrews, St Andrews, Fife KY16 9ST, U.K.
- Simon Dobson
  - School of Computer Science, University of St Andrews, St Andrews, Fife KY16 9SX, U.K.

4
Ramani V, Karmakar T. Graph Neural Networks for Predicting Solubility in Diverse Solvents Using MolMerger Incorporating Solute-Solvent Interactions. J Chem Theory Comput 2024. [PMID: 39041858 DOI: 10.1021/acs.jctc.4c00382] [Indexed: 07/24/2024]
Abstract
The prediction of solubility is a complex and challenging physicochemical problem that has tremendous implications for the chemical and pharmaceutical industry. Recent advancements in machine learning methods have provided a great scope for predicting the reliable solubility of a large number of molecular systems. However, most of these methods rely on using physical properties obtained from experiments and expensive quantum chemical calculations. Here, we developed a method that utilizes a graphical representation of solute-solvent interactions using "MolMerger," which captures the strongest polar interactions between molecules using Gasteiger charges and creates a graph incorporating the true nature of the system. Using these graphs as input, a neural network learns the correlation between the structural properties of a molecule in the form of node embedding and its physicochemical properties as the output. This approach has been used to calculate molecular solubility by predicting the Log solubility values of various organic molecules and pharmaceuticals in diverse sets of solvents.
Affiliation(s)
- Vansh Ramani
  - Department of Chemical Engineering, Indian Institute of Technology, Delhi, Hauz Khas, New Delhi 110016, India
- Tarak Karmakar
  - Department of Chemistry, Indian Institute of Technology, Delhi, Hauz Khas, New Delhi 110016, India

5
Li T, Huls NJ, Lu S, Hou P. Unsupervised manifold embedding to encode molecular quantum information for supervised learning of chemical data. Commun Chem 2024; 7:133. [PMID: 38862828 PMCID: PMC11166954 DOI: 10.1038/s42004-024-01217-z] [Received: 04/26/2024] [Accepted: 06/03/2024] [Indexed: 06/13/2024]
Abstract
Molecular representation is critical in chemical machine learning. It governs the complexity of model development and the fulfillment of training data to avoid either over- or under-fitting. As electronic structures and associated attributes are the root cause for molecular interactions and their manifested properties, we have sought to examine the local electron information on a molecular manifold to understand and predict molecular interactions. Our efforts led to the development of a lower-dimensional representation of a molecular manifold, Manifold Embedding of Molecular Surface (MEMS), to embody surface electronic quantities. By treating a molecular surface as a manifold and computing its embeddings, the embedded electronic attributes retain the chemical intuition of molecular interactions. MEMS can be further featurized as input for chemical learning. Our solubility prediction with MEMS demonstrated the feasibility of both shallow and deep learning by neural networks, suggesting that MEMS is expressive and robust against dimensionality reduction.
Affiliation(s)
- Tonglei Li
  - Department of Industrial and Molecular Pharmaceutics, Purdue University, West Lafayette, 47907, IN, USA
- Nicholas J Huls
  - Department of Industrial and Molecular Pharmaceutics, Purdue University, West Lafayette, 47907, IN, USA
- Shan Lu
  - Department of Industrial and Molecular Pharmaceutics, Purdue University, West Lafayette, 47907, IN, USA
- Peng Hou
  - Department of Industrial and Molecular Pharmaceutics, Purdue University, West Lafayette, 47907, IN, USA

6
Wang W, Tang J, Zaliani A. Outline and background for the EU-OS solubility prediction challenge. SLAS DISCOVERY : ADVANCING LIFE SCIENCES R & D 2024; 29:100155. [PMID: 38518955 DOI: 10.1016/j.slasd.2024.100155] [Received: 12/12/2023] [Revised: 02/27/2024] [Accepted: 03/19/2024] [Indexed: 03/24/2024]
Abstract
In June 2022, EU-OS decided to make public a solubility data set of 100+K compounds obtained from several of the EU-OS proprietary screening compound collections. Building on the interest of SLAS in the scientific development of screening, it was decided to launch a joint EUOS-SLAS competition within the chemoinformatics and machine learning (ML) communities. The competition was open to real-world computation experts and sought the best, most predictive classification model of compound solubility. The competition had several aims: from a practical side, the winning model should serve as a cornerstone for future solubility predictions, having used the largest training set publicly available so far. From a higher project perspective, the intent was to focus the energies and experiences of participants, even those not coming professionally from Pharma R&D, on the issue of how to predict compound solubility. Here we report how the competition was conceived and the practical aspects of conducting it within the Kaggle framework, leveraging the versatility and the open-source nature of this data science platform. Considerations on the results and the challenges encountered are also examined.
Affiliation(s)
- Wenyu Wang
  - Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki 00290, Finland; Institute for Molecular Medicine Finland-FIMM, Helsinki Institute of Life Science-HiLIFE, University of Helsinki, Helsinki 00290, Finland; iCAN Digital Precision Cancer Medicine Flagship, University of Helsinki, Helsinki 00290, Finland
- Jing Tang
  - Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki 00290, Finland; iCAN Digital Precision Cancer Medicine Flagship, University of Helsinki, Helsinki 00290, Finland
- Andrea Zaliani
  - Fraunhofer Institute for Translational Medicine and Pharmacology (ITMP), Schnackenburgallee 114, Hamburg 22525, Germany; Fraunhofer Cluster of Excellence for Immune-Mediated Diseases (CIMD), Theodor Stern Kai 7, Frankfurt 60590, Germany

7
Ramos MC, White AD. Predicting small molecules solubility on endpoint devices using deep ensemble neural networks. DIGITAL DISCOVERY 2024; 3:786-795. [PMID: 38638648 PMCID: PMC11022985 DOI: 10.1039/d3dd00217a] [Received: 11/03/2023] [Accepted: 03/07/2024] [Indexed: 04/20/2024]
Abstract
Aqueous solubility is a valuable yet challenging property to predict. Computing solubility using first-principles methods requires accounting for the competing effects of entropy and enthalpy, resulting in long computations for relatively poor accuracy. Data-driven approaches, such as deep learning, offer improved accuracy and computational efficiency but typically lack uncertainty quantification. Additionally, ease of use remains a concern for any computational technique, resulting in the sustained popularity of group-based contribution methods. In this work, we addressed these problems with a deep learning model with predictive uncertainty that runs on a static website (without a server). This approach moves computing needs onto the website visitor without requiring installation, removing the need to pay for and maintain servers. Our model achieves satisfactory results in solubility prediction. Furthermore, we demonstrate how to create molecular property prediction models that balance uncertainty and ease of use. The code is available at https://github.com/ur-whitelab/mol.dev, and the model is useable at https://mol.dev.
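One common way for a deep ensemble to produce a prediction with uncertainty is to have each member output a mean and a variance, then treat the ensemble as an equally weighted Gaussian mixture. The sketch below shows that aggregation step only. Whether the published model combines its members exactly this way is an assumption, and the function name `combine_ensemble` and the numbers are hypothetical.

```python
def combine_ensemble(means, variances):
    """Aggregate per-member (mean, variance) pairs as a uniform Gaussian mixture.

    mixture mean     = average of member means
    mixture variance = average of (variance_i + mean_i**2) - mixture_mean**2
    The second term grows when members disagree with each other.
    """
    n = len(means)
    mix_mean = sum(means) / n
    second_moment = sum(v + m ** 2 for m, v in zip(means, variances)) / n
    mix_var = second_moment - mix_mean ** 2
    return mix_mean, mix_var

# Two hypothetical ensemble members predicting log S for one molecule.
member_means = [-2.0, -3.0]
member_vars = [0.25, 0.25]

mean_logS, var_logS = combine_ensemble(member_means, member_vars)
```

Here the combined variance (0.5) exceeds each member's own variance (0.25) because the two members disagree by a full log unit, which is the behaviour that makes ensembles useful for flagging unreliable predictions.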
Affiliation(s)
- Mayk Caldas Ramos
  - Chemical Engineer Department, University of Rochester Rochester NY 14642 USA
- Andrew D White
  - Chemical Engineer Department, University of Rochester Rochester NY 14642 USA

8
Llompart P, Minoletti C, Baybekov S, Horvath D, Marcou G, Varnek A. Will we ever be able to accurately predict solubility? Sci Data 2024; 11:303. [PMID: 38499581 PMCID: PMC10948805 DOI: 10.1038/s41597-024-03105-6] [Received: 09/01/2023] [Accepted: 02/29/2024] [Indexed: 03/20/2024]
Abstract
Accurate prediction of thermodynamic solubility by machine learning remains a challenge. Recent models often display good performances, but their reliability may be deceiving when used prospectively. This study investigates the origins of these discrepancies, following three directions: a historical perspective, an analysis of the aqueous solubility dataverse and data quality. We investigated over 20 years of published solubility datasets and models, highlighting overlooked datasets and the overlaps between popular sets. We benchmarked recently published models on a novel curated solubility dataset and report poor performances. We also propose a workflow to cure aqueous solubility data aiming at producing useful models for bench chemist. Our results demonstrate that some state-of-the-art models are not ready for public usage because they lack a well-defined applicability domain and overlook historical data sources. We report the impact of factors influencing the utility of the models: interlaboratory standard deviation, ionic state of the solute and data sources. The herein obtained models, and quality-assessed datasets are publicly available.
Affiliation(s)
- P Llompart
  - Laboratory of Chemoinformatics, UMR7140, University of Strasbourg, Strasbourg, France
  - IDD/CADD, Sanofi, Vitry-Sur-Seine, France
- S Baybekov
  - Laboratory of Chemoinformatics, UMR7140, University of Strasbourg, Strasbourg, France
- D Horvath
  - Laboratory of Chemoinformatics, UMR7140, University of Strasbourg, Strasbourg, France
- G Marcou
  - Laboratory of Chemoinformatics, UMR7140, University of Strasbourg, Strasbourg, France
- A Varnek
  - Laboratory of Chemoinformatics, UMR7140, University of Strasbourg, Strasbourg, France

9
Hunklinger A, Hartog P, Šícho M, Godin G, Tetko IV. The openOCHEM consensus model is the best-performing open-source predictive model in the First EUOS/SLAS joint compound solubility challenge. SLAS DISCOVERY : ADVANCING LIFE SCIENCES R & D 2024; 29:100144. [PMID: 38316342 DOI: 10.1016/j.slasd.2024.01.005] [Received: 05/18/2023] [Revised: 01/06/2024] [Accepted: 01/22/2024] [Indexed: 02/07/2024]
Abstract
The EUOS/SLAS challenge aimed to facilitate the development of reliable algorithms to predict the aqueous solubility of small molecules using experimental data from 100 K compounds. In total, hundred teams took part in the challenge to predict low, medium and highly soluble compounds as measured by the nephelometry assay. This article describes the winning model, which was developed using the publicly available Online CHEmical database and Modeling environment (OCHEM) available on the website https://ochem.eu/article/27. We describe in detail the assumptions and steps used to select methods, descriptors and strategy which contributed to the winning solution. In particular we show that consensus based on 28 models calculated using descriptor-based and representation learning methods allowed us to obtain the best score, which was higher than those based on individual approaches or consensus models developed using each individual approach. A combination of diverse models allowed us to decrease both bias and variance of individual models and to calculate the highest score. The model based on Transformer CNN contributed the best individual score thus highlighting the power of Natural Language Processing (NLP) methods. The inclusion of information about aleatoric uncertainty would be important to better understand and use the challenge data by the contestants.
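The idea of a consensus over several classifiers can be sketched as an unweighted average of each model's class probabilities, followed by picking the most probable class. This is a simplified illustration, not the OCHEM consensus procedure: the function name `consensus_predict`, the three hypothetical models, and the probability values are assumptions.

```python
def consensus_predict(model_probs):
    """Average class probabilities over models and return (averages, classes).

    model_probs[m][i][c] is model m's probability that compound i is in class c.
    """
    n_models = len(model_probs)
    n_items = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    averaged = [
        [sum(model_probs[m][i][c] for m in range(n_models)) / n_models
         for c in range(n_classes)]
        for i in range(n_items)
    ]
    # Index of the largest averaged probability for each compound.
    classes = [max(range(n_classes), key=row.__getitem__) for row in averaged]
    return averaged, classes

# Three hypothetical models, two compounds, classes 0=low, 1=medium, 2=high.
probs = [
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
    [[0.7, 0.2, 0.1], [0.2, 0.3, 0.5]],
]

avg_probs, predicted = consensus_predict(probs)
```

For the second compound the first model votes "medium" while the other two vote "high"; averaging lets the majority view win, which is one way combining diverse models can reduce the variance of any single one.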
Affiliation(s)
- Andrea Hunklinger
  - Institute of Structural Biology, Molecular Targets and Therapeutics Center, Helmholtz Munich-Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), DE-85764 Neuherberg, Germany
- Peter Hartog
  - Institute of Structural Biology, Molecular Targets and Therapeutics Center, Helmholtz Munich-Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), DE-85764 Neuherberg, Germany
- Martin Šícho
  - Leiden Academic Centre for Drug Research, Leiden University, 55 Einsteinweg, 2333 CC Leiden, the Netherlands; CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Technická 5, 166 28, Prague, Czech Republic
- Guillaume Godin
  - dsm-firmenich SA, Rue de la Bergère 7, CH-1242 Satigny, Switzerland
- Igor V Tetko
  - Institute of Structural Biology, Molecular Targets and Therapeutics Center, Helmholtz Munich-Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), DE-85764 Neuherberg, Germany; BIGCHEM GmbH, Valerystr. 49, DE-85716 Unterschleißheim, Germany

10
Baybekov S, Llompart P, Marcou G, Gizzi P, Galzi JL, Ramos P, Saurel O, Bourban C, Minoletti C, Varnek A. Kinetic solubility: Experimental and machine-learning modeling perspectives. Mol Inform 2024; 43:e202300216. [PMID: 38149685 DOI: 10.1002/minf.202300216] [Received: 08/22/2023] [Revised: 11/25/2023] [Accepted: 12/23/2023] [Indexed: 12/28/2023]
Abstract
Kinetic aqueous or buffer solubility is an important parameter measuring the suitability of compounds for high-throughput assays in early drug discovery, while thermodynamic solubility is reserved for later stages of drug discovery and development. Kinetic solubility is also considered to have low inter-laboratory reproducibility because of its sensitivity to protocol parameters [1]. Presumably, this is why little effort has been put into building QSPR models for kinetic solubility in comparison to thermodynamic aqueous solubility. Here, we investigate the reproducibility and modelability of kinetic solubility assays. We first analyzed the relationship between kinetic and thermodynamic solubility data, and then examined the consistency of data from different kinetic assays. In this contribution, we report differences between kinetic and thermodynamic solubility data that are consistent with those reported by others [1, 2] and, in contrast to general expectations, good agreement between data from different kinetic solubility campaigns. The latter is confirmed by the high-performing QSPR models obtained when training on merged kinetic solubility datasets. The poor performance of a QSPR model trained on thermodynamic solubility when applied to a kinetic solubility dataset reinforces the conclusion that kinetic and thermodynamic solubilities do not correlate: one cannot be used as an ersatz for the other. This encourages building predictive models for kinetic solubility. The kinetic solubility QSPR model developed in this study is freely accessible through the Predictor web service of the Laboratory of Chemoinformatics (https://chematlas.chimie.unistra.fr/cgi-bin/predictor2.cgi).
Affiliation(s)
- Shamkhal Baybekov
  - Laboratoire de Chémoinformatique UMR 7140 CNRS, Institut Le Bel, University of Strasbourg, 4 Rue Blaise Pascal, 67081, Strasbourg, France
- Pierre Llompart
  - Laboratoire de Chémoinformatique UMR 7140 CNRS, Institut Le Bel, University of Strasbourg, 4 Rue Blaise Pascal, 67081, Strasbourg, France
  - IDD/CADD, Sanofi, Vitry-Sur-Seine, France
- Gilles Marcou
  - Laboratoire de Chémoinformatique UMR 7140 CNRS, Institut Le Bel, University of Strasbourg, 4 Rue Blaise Pascal, 67081, Strasbourg, France
- Patrick Gizzi
  - Plateforme de Chimie Biologique Intégrative de Strasbourg UAR 3286 CNRS, University of Strasbourg, 300 Boulevard Sébastien Brant, 67412, Illkirch, France
- Jean-Luc Galzi
  - Biotechnologie et signalisation cellulaire UMR 7242 CNRS, École supérieure de biotechnologie de Strasbourg, University of Strasbourg, 300 Boulevard Sébastien Brant, 67412, Illkirch, France
  - ChemBioFrance - Chimiothèque Nationale UAR 3035, ENSCM - 240, Avenue du Prof. E. Jeanbrau, CS 60297-34296, Montpellier Cedex 5, France
- Pascal Ramos
  - Institut de Pharmacologie et de Biologie Structurale (IPBS), Université de Toulouse, CNRS, Université Toulouse III - Paul Sabatier (UT3), Toulouse, France
- Olivier Saurel
  - Institut de Pharmacologie et de Biologie Structurale (IPBS), Université de Toulouse, CNRS, Université Toulouse III - Paul Sabatier (UT3), Toulouse, France
- Claire Bourban
  - Plateforme de Chimie Biologique Intégrative de Strasbourg UAR 3286 CNRS, University of Strasbourg, 300 Boulevard Sébastien Brant, 67412, Illkirch, France
- Alexandre Varnek
  - Laboratoire de Chémoinformatique UMR 7140 CNRS, Institut Le Bel, University of Strasbourg, 4 Rue Blaise Pascal, 67081, Strasbourg, France

11
Kim Y, Jung H, Kumar S, Paton RS, Kim S. Designing solvent systems using self-evolving solubility databases and graph neural networks. Chem Sci 2024; 15:923-939. [PMID: 38239675 PMCID: PMC10793204 DOI: 10.1039/d3sc03468b] [Received: 07/06/2023] [Accepted: 12/04/2023] [Indexed: 01/22/2024]
Abstract
Designing solvent systems is key to achieving the facile synthesis and separation of desired products from chemical processes, so many machine learning models have been developed to predict solubilities. However, breakthroughs are needed to address deficiencies in the model's predictive accuracy and generalizability; this can be addressed by expanding and integrating experimental and computational solubility databases. To maximize predictive accuracy, these two databases should not be trained separately, and they should not be simply combined without reconciling the discrepancies from different magnitudes of errors and uncertainties. Here, we introduce self-evolving solubility databases and graph neural networks developed through semi-supervised self-training approaches. Solubilities from quantum-mechanical calculations are referred to during semi-supervised learning, but they are not directly added to the experimental database. Dataset augmentation is performed from 11 637 experimental solubilities to >900 000 data points in the integrated database, while correcting for the discrepancies between experiment and computation. Our model was successfully applied to study solvent selection in organic reactions and separation processes. The accuracy (mean absolute error around 0.2 kcal mol⁻¹ for the test set) is quantitatively useful in exploring Linear Free Energy Relationships between reaction rates and solvation free energies for 11 organic reactions. Our model also accurately predicted the partition coefficients of lignin-derived monomers and drug-like molecules. While there is room for expanding solubility predictions to transition states, radicals, charged species, and organometallic complexes, this approach will be attractive to predictive chemistry areas where experimental, computational, and other heterogeneous data should be combined.
Affiliation(s)
- Yeonjoon Kim
  - Department of Chemistry, Colorado State University Fort Collins CO 80523 USA
  - Department of Chemistry, Pukyong National University Busan 48513 Republic of Korea
- Hojin Jung
  - Department of Chemistry, Colorado State University Fort Collins CO 80523 USA
- Sabari Kumar
  - Department of Chemistry, Colorado State University Fort Collins CO 80523 USA
- Robert S Paton
  - Department of Chemistry, Colorado State University Fort Collins CO 80523 USA
- Seonah Kim
  - Department of Chemistry, Colorado State University Fort Collins CO 80523 USA

12
Ahmad W, Tayara H, Shim H, Chong KT. SolPredictor: Predicting Solubility with Residual Gated Graph Neural Network. Int J Mol Sci 2024; 25:715. [PMID: 38255790 PMCID: PMC10815788 DOI: 10.3390/ijms25020715] [Received: 12/06/2023] [Revised: 12/26/2023] [Accepted: 01/04/2024] [Indexed: 01/24/2024]
Abstract
Computational methods play a pivotal role in the pursuit of efficient drug discovery, enabling the rapid assessment of compound properties before costly and time-consuming laboratory experiments. With the advent of technology and large data availability, machine and deep learning methods have proven efficient in predicting molecular solubility. High-precision in silico solubility prediction has revolutionized drug development by enhancing formulation design, guiding lead optimization, and predicting pharmacokinetic parameters. These benefits result in considerable cost and time savings, and thus in a more efficient and shortened drug development process. The proposed SolPredictor is designed with the aim of developing a computational model for solubility prediction. The model is based on residual graph neural network convolution (RGNN). The RGNNs were designed to capture long-range dependencies in graph-structured data. Residual connections enable information to be utilized over various layers, allowing the model to capture and preserve essential features and patterns scattered throughout the network. The two largest datasets available to date are compiled, and the model uses a simplified molecular-input line-entry system (SMILES) representation. In ten-fold cross-validation, SolPredictor achieved a Pearson correlation coefficient R² of 0.79 ± 0.02 and a root mean square error (RMSE) of 1.03 ± 0.04. The proposed model was evaluated using five independent datasets. Error analysis, hyperparameter optimization analysis, and model explainability were used to determine the molecular features that were most valuable for prediction.
Affiliation(s)
- Waqar Ahmad
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Republic of Korea
- HyunJoo Shim
  - School of Pharmacy, Jeonbuk National University, Jeonju 54896, Republic of Korea
- Kil To Chong
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea
  - Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, Republic of Korea

13
Hong RS, Rojas AV, Bhardwaj RM, Wang L, Mattei A, Abraham NS, Cusack KP, Pierce MO, Mondal S, Mehio N, Bordawekar S, Kym PR, Abel R, Sheikh AY. Free Energy Perturbation Approach for Accurate Crystalline Aqueous Solubility Predictions. J Med Chem 2023; 66:15883-15893. [PMID: 38016916 DOI: 10.1021/acs.jmedchem.3c01339] [Indexed: 11/30/2023]
Abstract
Early assessment of crystalline thermodynamic solubility continues to be elusive for drug discovery and development despite its critical importance, especially for the ever-increasing fraction of poorly soluble drug candidates. Here we present a detailed evaluation of a physics-based free energy perturbation (FEP+) approach for computing the thermodynamic aqueous solubility. The predictive power of this approach is assessed across diverse chemical spaces, spanning pharmaceutically relevant literature compounds and more complex AbbVie compounds. Our approach achieves predictive (RMSE = 0.86) and differentiating power (R2 = 0.69) and therefore provides notably improved correlations to experimental solubility compared to state-of-the-art machine learning approaches that utilize quantum mechanics-based descriptors. The importance of explicit considerations of crystalline packing in predicting solubility by the FEP+ approach is also highlighted in this study. Finally, we show how computed energetics, including hydration and sublimation free energies, can provide further insights into molecule design to feed the medicinal chemistry DMTA cycle.
Affiliation(s)
- Richard S Hong
  - AbbVie Inc., Research & Development, 1 N Waukegan Road, North Chicago, Illinois 60064, United States
- Ana V Rojas
  - Schrödinger Inc., 1540 Broadway 24th Floor, New York, New York 10036, United States
- Rajni Miglani Bhardwaj
  - AbbVie Inc., Research & Development, 1 N Waukegan Road, North Chicago, Illinois 60064, United States
- Lingle Wang
  - Schrödinger Inc., 1540 Broadway 24th Floor, New York, New York 10036, United States
- Alessandra Mattei
  - AbbVie Inc., Research & Development, 1 N Waukegan Road, North Chicago, Illinois 60064, United States
- Nathan S Abraham
  - Ventus Therapeutics, 100 Beaver St, Waltham, Massachusetts 02453, United States
- Kevin P Cusack
  - AbbVie Inc., Research & Development, 1 N Waukegan Road, North Chicago, Illinois 60064, United States
- M Olivia Pierce
  - Bristol Myers Squibb, 100 Binney Street, Cambridge, Massachusetts 02142, United States
- Sayan Mondal
  - Schrödinger Inc., 1540 Broadway 24th Floor, New York, New York 10036, United States
- Nada Mehio
  - AbbVie Inc., Research & Development, 1 N Waukegan Road, North Chicago, Illinois 60064, United States
- Shailendra Bordawekar
  - AbbVie Inc., Research & Development, 1 N Waukegan Road, North Chicago, Illinois 60064, United States
- Philip R Kym
  - AbbVie Inc., Research & Development, 1 N Waukegan Road, North Chicago, Illinois 60064, United States
- Robert Abel
  - Schrödinger Inc., 1540 Broadway 24th Floor, New York, New York 10036, United States
- Ahmad Y Sheikh
  - AbbVie Inc., Research & Development, 1 N Waukegan Road, North Chicago, Illinois 60064, United States
|
14
|
Ghahremanpour MM, Saar A, Tirado-Rives J, Jorgensen WL. Ensemble Geometric Deep Learning of Aqueous Solubility. J Chem Inf Model 2023; 63:7338-7349. [PMID: 37990484 DOI: 10.1021/acs.jcim.3c01536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2023]
Abstract
Geometric deep learning is one of the main workhorses for harnessing the power of big data to predict molecular properties such as aqueous solubility, which is key to the pharmacokinetic improvement of drug candidates. Two ensembles of graph neural network architectures were built, one based on spectral convolution and the other on spatial convolution. The pretrained models, denoted respectively as SolNet-GCN and SolNet-GAT, significantly outperformed the existing neural networks benchmarked on a validation set of 207 molecules. The SolNet-GCN model demonstrated the best performance on both the training and validation sets, with RMSE values of 0.53 and 0.72 log molar unit and Pearson r2 values of 0.95 and 0.75, respectively. Further, the ranking power of the SolNet models agreed well with a QM-based thermodynamic cycle approach at the PBE-vdW level of theory on a series of benzophenylurea derivatives and a series of benzodiazepine derivatives. Nevertheless, testing the resultant models on a set of inhibitors of the macrophage migration inhibitory factor (MIF) illustrated that the inclusion of atomic attributes to discriminate atoms with a higher tendency to form intermolecular hydrogen bonds in the crystalline state and to identify planar or nonplanar substructures can be beneficial for the prediction of aqueous solubility.
Affiliation(s)
- Anastasia Saar
  - Department of Chemistry, Yale University, New Haven, Connecticut 06520-8107, United States
- Julian Tirado-Rives
  - Department of Chemistry, Yale University, New Haven, Connecticut 06520-8107, United States
- William L Jorgensen
  - Department of Chemistry, Yale University, New Haven, Connecticut 06520-8107, United States
|
15
|
Gheta SKO, Bonin A, Gerlach T, Göller AH. Predicting absolute aqueous solubility by applying a machine learning model for an artificially liquid-state as proxy for the solid-state. J Comput Aided Mol Des 2023; 37:765-789. [PMID: 37878216 DOI: 10.1007/s10822-023-00538-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2023] [Accepted: 10/02/2023] [Indexed: 10/26/2023]
Abstract
In this study, we use machine learning algorithms with QM-derived COSMO-RS descriptors, along with Morgan fingerprints, to predict the absolute solubility of drug-like compounds. The QM-derived descriptors account for the molecular properties of the solute, i.e., the solute-solute interactions in an artificial-liquid-state (super-cooled liquid), and the solute-solvent interactions in solution. We employ two main approaches to predict solubility: (i) a hypothetical pathway that involves melting the solute at room temperature T = T¯ ([Formula: see text]) and mixing the artificially liquid solute into the solvent ([Formula: see text]). In this approach [Formula: see text] is predicted using machine learning models, and the [Formula: see text] is obtained from COSMO-RS calculations; (ii) direct solubility prediction using machine learning algorithms. The models were trained on a large number of Bayer in-house compounds for which water solubility data is available at physiological pH of 6.5 and ambient temperature. We also evaluated our models using external datasets from a solubility challenge. Our models present great improvements compared to the absolute solubility prediction with the QSAR model for the artificial liquid state as implemented in the COSMOtherm software, for both in-house and external datasets. We are furthermore able to demonstrate the superiority of QM-derived descriptors compared to cheminformatics descriptors. We finally present low-cost alternative models using fragment-based COSMOquick calculations with only marginal reduction in the quality of predicted solubility.
Affiliation(s)
- Sadra Kashef Ol Gheta
  - Bayer AG, Pharmaceuticals, R&D, Computational Molecular Design, 42096, Wuppertal, Germany
- Anne Bonin
  - Bayer AG, Pharmaceuticals, R&D, Computational Molecular Design, 42096, Wuppertal, Germany
- Thomas Gerlach
  - Bayer AG, Crop Science, R&D, Digital Transformation, 40789, Monheim, Germany
  - Bayer AG, Engineering & Technology, Thermal Separation Technologies, 51368, Leverkusen, Germany
- Andreas H Göller
  - Bayer AG, Pharmaceuticals, R&D, Computational Molecular Design, 42096, Wuppertal, Germany
|
16
|
Tran TTV, Tayara H, Chong KT. Recent Studies of Artificial Intelligence on In Silico Drug Absorption. J Chem Inf Model 2023; 63:6198-6211. [PMID: 37819031 DOI: 10.1021/acs.jcim.3c00960] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/13/2023]
Abstract
Absorption is an important area of research in pharmacochemistry and drug development, because a drug has to be absorbed before it can have any effect. Furthermore, the ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) profile of drugs can be directly and considerably altered by modulating factors affecting absorption. Many drugs in development fail because of poor absorption. Continuous research efforts in recent years have brought many successes and much promise in drug absorption property prediction, especially in silico, which helps to significantly reduce the time and cost of screening out undesirable drug candidates. In this report, we provide an overview of recent in silico studies on predicting absorption properties, especially from 2019 to the present, using artificial intelligence. Additionally, we have collected and investigated public databases that support absorption prediction research. On those grounds, we also outline the challenges and future development directions of absorption prediction. We hope this review can provide researchers with valuable guidelines on absorption prediction to facilitate the development of newer approaches in drug discovery.
Affiliation(s)
- Thi Tuyet Van Tran
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea
  - Faculty of Information Technology, An Giang University, Long Xuyen 880000, Vietnam
  - Vietnam National University, Ho Chi Minh City, Ho Chi Minh 700000, Vietnam
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Republic of Korea
- Kil To Chong
  - Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, Republic of Korea
|
17
|
Zhu X, Polyakov VR, Bajjuri K, Hu H, Maderna A, Tovee CA, Ward SC. Building Machine Learning Small Molecule Melting Points and Solubility Models Using CCDC Melting Points Dataset. J Chem Inf Model 2023; 63:2948-2959. [PMID: 37125691 DOI: 10.1021/acs.jcim.3c00308] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/02/2023]
Abstract
Predicting solubility of small molecules is a very difficult undertaking due to the lack of reliable and consistent experimental solubility data. It is well known that for a molecule in a crystal lattice to be dissolved, it must, first, dissociate from the lattice and then, second, be solvated. The melting point of a compound is proportional to the lattice energy, and the octanol-water partition coefficient (log P) is a measure of the compound's solvation efficiency. The CCDC's melting point dataset of almost one hundred thousand compounds was utilized to create widely applicable machine learning models of small molecule melting points. Using the general solubility equation, the aqueous thermodynamic solubilities of the same compounds can be predicted. The global model could be easily localized by adding additional melting point measurements for a chemical series of interest.
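As a rough illustration of the general solubility equation mentioned in this abstract, the sketch below uses the classic form usually attributed to Yalkowsky, log S = 0.5 - 0.01 (MP - 25) - log P, with the melting point MP in °C and S in mol/L. The exact form used by the cited authors is not given here, so this form, the function name `gse_log_s`, and the example inputs are assumptions for illustration only.

```python
def gse_log_s(melting_point_c: float, log_p: float) -> float:
    """Estimate log10 of the molar aqueous solubility with the classic GSE.

    Assumed form: log S = 0.5 - 0.01 * (MP - 25) - log P, where MP is the
    melting point in degrees Celsius and log P is the octanol-water
    partition coefficient.
    """
    # The melting (crystal lattice) term is dropped for compounds that are
    # already liquid at 25 degrees Celsius.
    melting_term = 0.01 * max(melting_point_c - 25.0, 0.0)
    return 0.5 - melting_term - log_p


if __name__ == "__main__":
    # Hypothetical compound: melting point 125 C, log P = 2.0.
    print(gse_log_s(125.0, 2.0))
```

In a workflow like the one described, the melting point passed to such a function would come from a trained melting point model rather than from measurement, so errors in the predicted melting point carry over to the solubility estimate at 0.01 log units per degree.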
Affiliation(s)
- Xiangwei Zhu
  - Sutro Biopharma, 111 Oyster Point Blvd, South San Francisco, California 94080, United States
- Valery R Polyakov
  - Sutro Biopharma, 111 Oyster Point Blvd, South San Francisco, California 94080, United States
- Krishna Bajjuri
  - Sutro Biopharma, 111 Oyster Point Blvd, South San Francisco, California 94080, United States
- Huiyong Hu
  - Sutro Biopharma, 111 Oyster Point Blvd, South San Francisco, California 94080, United States
- Andreas Maderna
  - Sutro Biopharma, 111 Oyster Point Blvd, South San Francisco, California 94080, United States
- Clare A Tovee
  - Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, U.K.
- Suzanna C Ward
  - Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, U.K.
|
18
|
Conn JM, Carter JW, Conn JJA, Subramanian V, Baxter A, Engkvist O, Llinas A, Ratkova EL, Pickett SD, McDonagh JL, Palmer DS. Blinded Predictions and Post Hoc Analysis of the Second Solubility Challenge Data: Exploring Training Data and Feature Set Selection for Machine and Deep Learning Models. J Chem Inf Model 2023; 63:1099-1113. [PMID: 36758178 PMCID: PMC9976279 DOI: 10.1021/acs.jcim.2c01189] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 02/11/2023]
Abstract
Accurate methods to predict solubility from molecular structure are highly sought after in the chemical sciences. To assess the state of the art, the American Chemical Society organized a "Second Solubility Challenge" in 2019, in which competitors were invited to submit blinded predictions of the solubilities of 132 drug-like molecules. In the first part of this article, we describe the development of two models that were submitted to the Blind Challenge in 2019 but which have not previously been reported. These models were based on computationally inexpensive molecular descriptors and traditional machine learning algorithms and were trained on a relatively small data set of 300 molecules. In the second part of the article, to test the hypothesis that predictions would improve with more advanced algorithms and higher volumes of training data, we compare these original predictions with those made after the deadline using deep learning models trained on larger solubility data sets consisting of 2999 and 5697 molecules. The results show that there are several algorithms that are able to obtain near state-of-the-art performance on the solubility challenge data sets, with the best model, a graph convolutional neural network, resulting in an RMSE of 0.86 log units. Critical analysis of the models reveals systematic differences between the performance of models using certain feature sets and training data sets. The results suggest that careful selection of high quality training data from relevant regions of chemical space is critical for prediction accuracy but that other methodological issues remain problematic for machine learning solubility models, such as the difficulty in modeling complex chemical spaces from sparse training data sets.
Affiliation(s)
- Jonathan G. M. Conn
  - Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow G1 1XL, U.K.
- James W. Carter
  - Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow G1 1XL, U.K.
- Justin J. A. Conn
  - Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow G1 1XL, U.K.
- Vigneshwari Subramanian
  - Drug Metabolism and Pharmacokinetics, Research and Early Development, Respiratory & Immunology, BioPharmaceuticals R&D, AstraZeneca, Pepparedsleden 1, SE-431 83 Göteborg, Sweden
- Andrew Baxter
  - GSK Medicines Research Centre, Gunnels Wood Road, Stevenage SG1 2NY, U.K.
- Ola Engkvist
  - Medicinal Chemistry, Research and Early Development, Cardiovascular, Renal and Metabolism (CVRM), BioPharmaceuticals R&D, AstraZeneca, SE-431 50 Göteborg, Sweden
  - Department of Computer Science and Engineering, Chalmers University of Technology, SE-412 96 Göteborg, Sweden
- Antonio Llinas
  - Drug Metabolism and Pharmacokinetics, Research and Early Development, Respiratory & Immunology, BioPharmaceuticals R&D, AstraZeneca, Pepparedsleden 1, SE-431 83 Göteborg, Sweden
- Ekaterina L. Ratkova
  - Medicinal Chemistry, Research and Early Development, Cardiovascular, Renal and Metabolism (CVRM), BioPharmaceuticals R&D, AstraZeneca, SE-431 50 Göteborg, Sweden
- Stephen D. Pickett
  - Computational Sciences, GlaxoSmithKline R&D Pharmaceuticals, Stevenage SG1 2NY, U.K.
- James L. McDonagh
  - IBM Research Europe, Hartree Centre, SciTech Daresbury, Warrington, Cheshire WA4 4AD, U.K.
- David S. Palmer
  - Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow G1 1XL, U.K.
|
19
|
Cysewski P, Jeliński T, Przybyłek M, Nowak W, Olczak M. Solubility Characteristics of Acetaminophen and Phenacetin in Binary Mixtures of Aqueous Organic Solvents: Experimental and Deep Machine Learning Screening of Green Dissolution Media. Pharmaceutics 2022; 14:pharmaceutics14122828. [PMID: 36559321 PMCID: PMC9781932 DOI: 10.3390/pharmaceutics14122828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/26/2022] [Revised: 12/10/2022] [Accepted: 12/12/2022] [Indexed: 12/23/2022] Open
Abstract
The solubility of active pharmaceutical ingredients is a mandatory physicochemical characteristic in pharmaceutical practice. However, the number of potential solvents and their mixtures prevents direct measurements of all possible combinations for finding environmentally friendly, operational and cost-effective solubilizers. That is why support from theoretical screening seems to be valuable. Here, a collection of acetaminophen and phenacetin solubility data in neat and binary solvent mixtures was used for the development of a nonlinear deep machine learning model using new intuitive molecular descriptors derived from COSMO-RS computations. The literature dataset was augmented with results of new measurements in aqueous binary mixtures of 4-formylmorpholine, DMSO and DMF. The solubility values back-computed with the developed ensemble of neural networks are in perfect agreement with the experimental data, which enables the extensive screening of many combinations of solvents not studied experimentally within the applicability domain of the trained model. The final predictions were presented not only in the form of the set of optimal hyperparameters but also in a more intuitive way by the set of parameters of the Jouyban-Acree equation often used in the co-solvency domain. This new and effective approach is easily extendible to other systems, enabling the fast and reliable selection of candidates for new solvents and directing the experimental solubility screening of active pharmaceutical ingredients.
|
20
|
Oja M, Sild S, Piir G, Maran U. Intrinsic Aqueous Solubility: Mechanistically Transparent Data-Driven Modeling of Drug Substances. Pharmaceutics 2022; 14:pharmaceutics14102248. [PMID: 36297685 PMCID: PMC9611068 DOI: 10.3390/pharmaceutics14102248] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 09/08/2022] [Revised: 10/12/2022] [Accepted: 10/18/2022] [Indexed: 11/07/2022] Open
Abstract
Intrinsic aqueous solubility is a foundational property for understanding the chemical, technological, pharmaceutical, and environmental behavior of drug substances. Despite years of solubility research, molecular structure-based prediction of the intrinsic aqueous solubility of drug substances is still under active investigation. This paper describes the authors’ systematic data-driven modelling in which two fit-for-purpose training data sets for intrinsic aqueous solubility were collected and curated, and three quantitative structure–property relationships were derived to make predictions for the most recent solubility challenge. All three models perform well individually, while being mechanistically transparent and easy to understand. Molecular descriptors involved in the models are related to the following key steps in the solubility process: dissociation of the molecule from the crystal, formation of a cavity in the solvent, and insertion of the molecule into the solvent. A consensus modeling approach with these models remarkably improved prediction capability and reduced the number of strong outliers by more than two times. The performance and outliers of the second solubility challenge predictions were analyzed retrospectively. All developed models have been published in the QsarDB.org repository according to FAIR principles and can be used without restrictions for exploring, downloading, and making predictions.
Affiliation(s)
- Uko Maran
  - Correspondence: Tel.: +372-7-375-254; Fax: +372-7-375-264
|
21
|
Avdeef A, Kansy M. Trends in PhysChem Properties of Newly Approved Drugs over the Last Six Years; Predicting Solubility of Drugs Approved in 2021. J SOLUTION CHEM 2022. [DOI: 10.1007/s10953-022-01199-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/01/2022]
|
22
|
Panapitiya G, Girard M, Hollas A, Sepulveda J, Murugesan V, Wang W, Saldanha E. Evaluation of Deep Learning Architectures for Aqueous Solubility Prediction. ACS OMEGA 2022; 7:15695-15710. [PMID: 35571767 PMCID: PMC9096921 DOI: 10.1021/acsomega.2c00642] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Received: 01/31/2022] [Accepted: 04/11/2022] [Indexed: 05/17/2023]
Abstract
Determining the aqueous solubility of molecules is a vital step in many pharmaceutical, environmental, and energy storage applications. Despite efforts made over decades, there are still challenges associated with developing a solubility prediction model with satisfactory accuracy for many of these applications. The goals of this study are to assess current deep learning methods for solubility prediction, develop a general model capable of predicting the solubility of a broad range of organic molecules, and to understand the impact of data properties, molecular representation, and modeling architecture on predictive performance. Using the largest currently available solubility data set, we implement deep learning-based models to predict solubility from the molecular structure and explore several different molecular representations including molecular descriptors, simplified molecular-input line-entry system strings, molecular graphs, and three-dimensional atomic coordinates using four different neural network architectures-fully connected neural networks, recurrent neural networks, graph neural networks (GNNs), and SchNet. We find that models using molecular descriptors achieve the best performance, with GNN models also achieving good performance. We perform extensive error analysis to understand the molecular properties that influence model performance, perform feature analysis to understand which information about the molecular structure is most valuable for prediction, and perform a transfer learning and data size study to understand the impact of data availability on model performance.
Affiliation(s)
- Gihan Panapitiya
  - Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Michael Girard
  - Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Aaron Hollas
  - Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Jonathan Sepulveda
  - Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Wei Wang
  - Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Emily Saldanha
  - Pacific Northwest National Laboratory, Richland, Washington 99352, United States
|
23
|
Wellawatte GP, Seshadri A, White AD. Model agnostic generation of counterfactual explanations for molecules. Chem Sci 2022; 13:3697-3705. [PMID: 35432902 PMCID: PMC8966631 DOI: 10.1039/d1sc05259d] [Citation(s) in RCA: 34] [Impact Index Per Article: 17.0] [Received: 09/22/2021] [Accepted: 02/06/2022] [Indexed: 11/25/2022] Open
Abstract
An outstanding challenge in deep learning in chemistry is its lack of interpretability. The inability to explain why a neural network makes a prediction is a major barrier to deployment of AI models. This not only dissuades chemists from using deep learning predictions, but also has led to neural networks learning spurious correlations that are difficult to notice. Counterfactuals are a category of explanations that provide a rationale behind a model prediction with satisfying properties like providing chemical structure insights. Yet, counterfactuals have been previously limited to specific model architectures or required reinforcement learning as a separate process. In this work, we show a universal model-agnostic approach that can explain any black-box model prediction. We demonstrate this method on random forest models, sequence models, and graph neural networks in both classification and regression.
Affiliation(s)
- Aditi Seshadri
  - Department of Chemical Engineering, University of Rochester, Rochester, NY, USA
- Andrew D White
  - Department of Chemical Engineering, University of Rochester, Rochester, NY, USA
|
24
|
Avdeef A, Kansy M. Predicting Solubility of Newly-Approved Drugs (2016–2020) with a Simple ABSOLV and GSE(Flexible-Acceptor) Consensus Model Outperforming Random Forest Regression. J SOLUTION CHEM 2022; 51:1020-1055. [PMID: 35153342 PMCID: PMC8818506 DOI: 10.1007/s10953-022-01141-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 09/08/2021] [Accepted: 11/10/2021] [Indexed: 11/24/2022]
Abstract
This study applies the ‘Flexible-Acceptor’ variant of the General Solubility Equation, GSE(Φ,B), to the prediction of the aqueous intrinsic solubility, log10 S0, of recently FDA-approved (2016–2020) ‘small-molecule’ new molecular entities (NMEs). The novel equation had been shown to predict the solubility of drugs beyond Lipinski’s ‘Rule of 5’ chemical space (bRo5) to a precision nearly matching that of the Random Forest Regression (RFR) machine learning method. Since then, it was found that the GSE(Φ,B) appears to work well not only for bRo5 NMEs, but also for Ro5 drugs. To put context to GSE(Φ,B), Yalkowsky’s GSE(classic), Abraham’s ABSOLV, and Breiman’s RFR models were also applied to predict log10 S0 of 72 newly-approved NMEs, for which usable reported solubility values could be accessed (nearly 60% from FDA New Drug Application published reports). Except for GSE(classic), the prediction models were retrained with an enlarged version of the Wiki-pS0 database (nearly 400 added log10 S0 entries since our recent previous study). Thus, these four models were further validated by the additional independent solubility measurements which the newly-approved drugs introduced. The prediction methods ranked RFR ~ GSE(Φ,B) > ABSOLV > GSE(classic) in performance. It was further demonstrated that the biases generated in the four separate models could be nearly eliminated in a consensus model based on the average of just two of the methods: GSE(Φ,B) and ABSOLV. The resulting consensus prediction equation is simple in form and can be easily incorporated into spreadsheet calculations. More significantly, it slightly outperformed the RFR method.
|
25
|
Avdeef A, Sugano K. Salt Solubility and Disproportionation - Uses and Limitations of Equations for pH max and the In-silico Prediction of pH max. J Pharm Sci 2021; 111:225-246. [PMID: 34863819 DOI: 10.1016/j.xphs.2021.11.017] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/09/2021] [Revised: 11/23/2021] [Accepted: 11/23/2021] [Indexed: 10/19/2022]
Abstract
A multiphasic mass action equilibrium model was used to study the phase properties near the critical pH ('pHmax') in an acid-base transformation of a solid drug salt into its corresponding solid free base form in pure water slurries. The goal of this study was to better define the characteristics of disproportionation of pharmaceutical salts, objectively (i) to classify salts as μ-type (microclimate stable) or δ-type (disproportionation prone) based on the relationship between the calculated pHmax and the calculated pH of the saturated salt solution, (ii) to compare the distribution of μ/δ-type salts to predictions from the disproportionation potential equation introduced by Merritt et al.,20 (iii) to determine if the intrinsic solubility of the free base, S0, can be predicted from the measured μ-type salt solubility as a means of estimating the value of pHmax, (iv) to determine S0 directly from the measured δ-type salt solubility, and (v) to address some of the limitations of the equations commonly used to calculate pHmax. When the salt solubility is measured for a basic API (pKa of which is known), but the experimental value of S0 is unavailable, a potentially useful simple screen for disproportionation is still possible, since pHmax can be estimated from a 'μ-predicted' (objective iii) or 'δ-measured' S0 (objective iv). Twelve model weak base API were selected in the study. For each API, 2-17 different salt forms with reported salt solubilities in distilled water were sourced from the literature. In all, 73 salt solubility values based on 29 different salt-forming acids comprise the studied set. All the corresponding free base solubility values were available. The pKa values for all the acids and bases studied are generally well known. For each API salt, an acid-base titration simulation was performed, anchored to the measured salt solubility value, using the general mass action analysis program pDISOL-X. 
The log S-pH profiles were drawn out by analytic continuity from pH 0 to 13, as described in detail previously.24 Potentially useful in-silico models were developed that correlate pS0 to linear functions of the salt solubility in water, pSw, the partition coefficient of the salt-forming acid (log POCTacid) and the melting point (mp) of the drug salt, thereby enabling the derivation of the approximate pHmax value from the predicted pS0.
Affiliation(s)
- Alex Avdeef
  - in-ADME Research, 1732 First Avenue, #102, New York, NY, 10128, USA
- Kiyohiko Sugano
  - Molecular Pharmaceutics Lab., College of Pharmaceutical Sciences, Ritsumeikan University, 1-1-1, Noji-higashi, Kusatsu, Shiga, 525-8577, Japan
|
26
|
Tosca EM, Bartolucci R, Magni P. Application of Artificial Neural Networks to Predict the Intrinsic Solubility of Drug-Like Molecules. Pharmaceutics 2021; 13:1101. [PMID: 34371792 PMCID: PMC8309152 DOI: 10.3390/pharmaceutics13071101] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 04/13/2021] [Revised: 07/15/2021] [Accepted: 07/16/2021] [Indexed: 11/25/2022] Open
Abstract
Machine learning (ML) approaches are receiving increasing attention from pharmaceutical companies and regulatory agencies, given their ability to mine knowledge from available data. In drug discovery, for example, they are employed in quantitative structure-property relationship (QSPR) models to predict biological properties from the chemical structure of a drug molecule. In this paper, following the Second Solubility Challenge (SC-2), a QSPR model based on artificial neural networks (ANNs) was built to predict the intrinsic solubility (logS0) of the 100-compound low-variance tight set and the 32-compound high-variance loose set provided by SC-2 as test datasets. First, a training dataset of 270 drug-like molecules with logS0 value experimentally determined was gathered from the literature. Then, a standard three-layer feed-forward neural network was defined by using 10 ChemGPS physico-chemical descriptors as input features. The developed ANN showed adequate predictive performances on both of the SC-2 test datasets. Benefits and limitations of ML approaches have been highlighted and discussed, starting from this case-study. The main findings confirmed that ML approaches are an attractive and promising tool to predict logS0; however, many aspects, such as data quality, molecular descriptor computation and selection, and assessment of applicability domain, are crucial but often neglected, and should be carefully considered to improve predictions based on ML.
Affiliation(s)
- Paolo Magni
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Via Ferrata 5, I-27100 Pavia, Italy; (E.M.T.); (R.B.)
|
27
|
Falcón-Cano G, Molina C, Cabrera-Pérez MÁ. ADME prediction with KNIME: A retrospective contribution to the second "Solubility Challenge". ADMET AND DMPK 2021; 9:209-218. [PMID: 35300359 PMCID: PMC8920098 DOI: 10.5599/admet.979] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/09/2021] [Revised: 06/21/2021] [Indexed: 12/12/2022] Open
Abstract
Computational models for predicting aqueous solubility from molecular structure represent a promising strategy for drug design and discovery. Since the first "Solubility Challenge", these initiatives have marked the state of the art of the modelling algorithms used to predict drug solubility. In this regard, the quality of the input experimental data and its influence on model performance have been frequently discussed. In our previous study, we developed a computational model for aqueous solubility based on recursive random forest approaches. The aim of the current commentary is to analyse the performance of this already trained predictive model on the molecules of the second "Solubility Challenge". Even though our training set has inconsistencies related to the pH, solid form and temperature conditions of the solubility measurements, the model was able to predict the two sets from the second "Solubility Challenge" with statistics comparable to those of the top-ranked models. Finally, to ensure the applicability and reproducibility of our model, we provide an automated KNIME workflow to predict the aqueous solubility of new drug candidates during the early stages of drug discovery and development.
Affiliation(s)
- Gabriela Falcón-Cano
- Unit of Modelling and Experimental Biopharmaceutics. Centro de Bioactivos Químicos. Universidad Central "Marta Abreu" de las Villas. Santa Clara 54830, Villa Clara, Cuba
- Miguel Ángel Cabrera-Pérez
- Unit of Modelling and Experimental Biopharmaceutics. Centro de Bioactivos Químicos. Universidad Central "Marta Abreu" de las Villas. Santa Clara 54830, Villa Clara, Cuba
- Department of Pharmacy and Pharmaceutical Technology, University of Valencia, Burjassot 46100, Valencia, Spain
- Department of Engineering, Area of Pharmacy and Pharmaceutical Technology, Miguel Hernández University, 03550 Sant Joan d'Alacant, Alicante, Spain
|
28
|
Francoeur PG, Koes DR. SolTranNet-A Machine Learning Tool for Fast Aqueous Solubility Prediction. J Chem Inf Model 2021; 61:2530-2536. [PMID: 34038123 DOI: 10.1021/acs.jcim.1c00331] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Indexed: 11/28/2022]
Abstract
While accurate prediction of aqueous solubility remains a challenge in drug discovery, machine learning (ML) approaches have become increasingly popular for this task. For instance, in the Second Challenge to Predict Aqueous Solubility (SC2), all groups utilized machine learning methods in their submissions. We present SolTranNet, a molecule attention transformer to predict aqueous solubility from a molecule's SMILES representation. Atypically, we demonstrate that larger models perform worse at this task, with SolTranNet's final architecture having 3,393 parameters while outperforming linear ML approaches. SolTranNet has a 3-fold scaffold split cross-validation root-mean-square error (RMSE) of 1.459 on AqSolDB and an RMSE of 1.711 on a withheld test set. We also demonstrate that, when used as a classifier to filter out insoluble compounds, SolTranNet achieves a sensitivity of 94.8% on the SC2 data set and is competitive with the other methods submitted to the competition. SolTranNet is distributed via pip, and its source code is available at https://github.com/gnina/SolTranNet.
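The two metrics quoted in this abstract, RMSE on logS values and sensitivity when the regressor is used as a filter, can be sketched as follows. This is an illustrative sketch, not SolTranNet code. The cutoff of -4 logS, the choice of "insoluble" as the positive class, the function names, and the toy values are all assumptions, since the abstract does not specify them.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted logS values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def sensitivity(y_true, y_pred, threshold):
    """Fraction of truly insoluble compounds that the model also flags.

    Assumption: a compound counts as insoluble when logS < threshold, and
    insoluble is treated as the positive class.
    """
    actual_positive = [(t, p) for t, p in zip(y_true, y_pred) if t < threshold]
    if not actual_positive:
        raise ValueError("no insoluble compounds in y_true")
    true_positive = sum(1 for _, p in actual_positive if p < threshold)
    return true_positive / len(actual_positive)


# Toy logS values, invented for illustration only.
measured = [-1.0, -2.5, -4.5, -5.0, -6.2]
predicted = [-1.5, -2.0, -3.8, -5.5, -6.0]

CUTOFF = -4.0  # hypothetical solubility cutoff in logS units
error = rmse(measured, predicted)
recall_insoluble = sensitivity(measured, predicted, CUTOFF)
```

Reporting both numbers is useful because a model with a fairly large RMSE can still act as a good filter, provided its errors rarely move a compound across the chosen cutoff.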
Affiliation(s)
- Paul G Francoeur
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
- David R Koes
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
|
29
|
Sorkun MC, Koelman JVA, Er S. Pushing the limits of solubility prediction via quality-oriented data selection. iScience 2021; 24:101961. [PMID: 33437941 PMCID: PMC7788089 DOI: 10.1016/j.isci.2020.101961] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 10/16/2020] [Revised: 11/18/2020] [Accepted: 12/15/2020] [Indexed: 01/19/2023] Open
Abstract
Accurate prediction of the solubility of chemical substances in solvents remains a challenge. The sparsity of high-quality solubility data is recognized as the biggest hurdle in the development of robust data-driven methods for practical use. Nonetheless, the effects of the quality and quantity of data on aqueous solubility predictions have not yet been scrutinized. In this study, the roles of the size and the quality of data sets in the performance of solubility prediction models are unraveled, and the concepts of actual and observed performance are introduced. To narrow the gap between actual and observed performance, a quality-oriented data selection method is designed, which evaluates the quality of data and extracts the most accurate part of it through statistical validation. By applying this method to the largest publicly available solubility database and using a consensus machine learning approach, a top-performing solubility prediction model is achieved.
Affiliation(s)
- Murat Cihan Sorkun
- DIFFER - Dutch Institute for Fundamental Energy Research, De Zaale 20, 5612 AJ Eindhoven, the Netherlands
- CCER - Center for Computational Energy Research, De Zaale 20, 5612 AJ Eindhoven, the Netherlands
- Department of Applied Physics, Eindhoven University of Technology, 5600 MB Eindhoven, the Netherlands
- J.M. Vianney A. Koelman
- DIFFER - Dutch Institute for Fundamental Energy Research, De Zaale 20, 5612 AJ Eindhoven, the Netherlands
- CCER - Center for Computational Energy Research, De Zaale 20, 5612 AJ Eindhoven, the Netherlands
- Department of Applied Physics, Eindhoven University of Technology, 5600 MB Eindhoven, the Netherlands
- Süleyman Er
- DIFFER - Dutch Institute for Fundamental Energy Research, De Zaale 20, 5612 AJ Eindhoven, the Netherlands
- CCER - Center for Computational Energy Research, De Zaale 20, 5612 AJ Eindhoven, the Netherlands
|