1
Mora JR, Marquez EA, Pérez-Pérez N, Contreras-Torres E, Perez-Castillo Y, Agüero-Chapin G, Martinez-Rios F, Marrero-Ponce Y, Barigye SJ. Rethinking the applicability domain analysis in QSAR models. J Comput Aided Mol Des 2024; 38:9. [PMID: 38351144 DOI: 10.1007/s10822-024-00550-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2023] [Accepted: 02/05/2024] [Indexed: 02/16/2024]
Abstract
Notwithstanding the wide adoption of the OECD principles (or best practices) for QSAR modeling, disparities between in silico predictions and experimental results are frequent, suggesting that model predictions are often too optimistic. Of these OECD principles, the applicability domain (AD) estimation has been recognized in several reports in the literature to be one of the most challenging, implying that the actual reliability measures of model predictions are often unreliable. Applying tree-based error analysis workflows on 5 QSAR models reported in the literature and available in the QsarDB repository, i.e., androgen receptor bioactivity (agonists, antagonists, and binders, respectively) and membrane permeability (highest membrane permeability and the intrinsic permeability), we demonstrate that predictions erroneously tagged as reliable (AD prediction errors) overwhelmingly correspond to instances in subspaces (cohorts) with the highest prediction error rates, highlighting the inhomogeneity of the AD space. In this sense, we call for more stringent AD analysis guidelines which require the incorporation of model error analysis schemes, to provide critical insight on the reliability of underlying AD algorithms. Additionally, any selected AD method should be rigorously validated to demonstrate its suitability for the model space over which it is applied. These steps will ultimately contribute to more accurate estimations of the reliability of model predictions. Finally, error analysis may also be useful in "rational" model refinement in that data expansion efforts and model retraining are focused on cohorts with the highest error rates.
Affiliation(s)
- Jose R Mora
- Departamento de Ingeniería Química, Universidad San Francisco de Quito (USFQ), Instituto de Simulación Computacional (ISC-USFQ), Diego de Robles y Vía Interoceánica, Quito, 170901, Ecuador
- Edgar A Marquez
- Grupo de Investigaciones en Química Y Biología, Departamento de Química Y Biología, Facultad de Ciencias Básicas, Universidad del Norte, Carrera 51B, Km 5, vía Puerto Colombia, Barranquilla, 081007, Colombia
- Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE), Cátedras Conacyt, Ensenada, Baja California, México
- Noel Pérez-Pérez
- Colegio de Ciencias e Ingenierías "El Politécnico", Universidad San Francisco de Quito (USFQ), Quito, Ecuador
- Ernesto Contreras-Torres
- Grupo de Medicina Molecular y Traslacional (MeM&T), Universidad San Francisco de Quito, Escuela de Medicina, Colegio de Ciencias de la Salud (COCSA), Av. Interoceánica Km 12 1/2 y Av. Florencia, 17, Quito, 1200-841, Ecuador
- Yunierkis Perez-Castillo
- Bio-Chemoinformatics Research Group, Escuela de Ciencias Físicas y Matemáticas, Universidad de Las Américas, Quito, 170504, Ecuador
- Guillermin Agüero-Chapin
- CIIMAR - Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos s/n, Porto, 4450-208, Portugal
- Department of Biology, Faculty of Sciences, University of Porto, Rua do Campo Alegre, Porto, 4169-007, Portugal
- Felix Martinez-Rios
- Facultad de Ingeniería, Universidad Panamericana, CDMX, Augusto Rodin No. 498, Insurgentes Mixcoac, Benito Juárez, Ciudad de México, 03920, México
- Yovani Marrero-Ponce
- Grupo de Medicina Molecular y Traslacional (MeM&T), Universidad San Francisco de Quito, Escuela de Medicina, Colegio de Ciencias de la Salud (COCSA), Av. Interoceánica Km 12 1/2 y Av. Florencia, 17, Quito, 1200-841, Ecuador
- Facultad de Ingeniería, Universidad Panamericana, CDMX, Augusto Rodin No. 498, Insurgentes Mixcoac, Benito Juárez, Ciudad de México, 03920, México
- Computer-Aided Molecular "Biosilico" Discovery and Bioinformatics Research International Network (CAMD-BIR IN), Cumbayá, Quito, Ecuador
- Stephen J Barigye
- Departamento de Química Física Aplicada, Facultad de Ciencias, Universidad Autónoma de Madrid (UAM), Madrid, 28049, Spain.
2
Samanipour S, O’Brien JW, Reid MJ, Thomas KV, Praetorius A. From Molecular Descriptors to Intrinsic Fish Toxicity of Chemicals: An Alternative Approach to Chemical Prioritization. Environmental Science & Technology 2023; 57:17950-17958. [PMID: 36480454 PMCID: PMC10666547 DOI: 10.1021/acs.est.2c07353] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 10/07/2022] [Revised: 11/27/2022] [Accepted: 11/28/2022] [Indexed: 06/17/2023]
Abstract
The European and U.S. chemical agencies have listed approximately 800k chemicals about which knowledge of potential risks to human health and the environment is lacking. Filling these data gaps experimentally is impossible, so in silico approaches and prediction are essential. Many existing models are however limited by assumptions (e.g., linearity and continuity) and small training sets. In this study, we present a supervised direct classification model that connects molecular descriptors to toxicity. Categories can be driven by either data (using k-means clustering) or defined by regulation. This was tested via 907 experimentally defined 96 h LC50 values for acute fish toxicity. Our classification model explained ≈90% of the variance in our data for the training set and ≈80% for the test set. This strategy gave a 5-fold decrease in the frequency of incorrect categorization compared to a quantitative structure-activity relationship (QSAR) regression model. Our model was subsequently employed to predict the toxicity categories of ≈32k chemicals. A comparison between the model-based applicability domain (AD) and the training set AD was performed, suggesting that the training set-based AD is a more adequate way to avoid extrapolation when using such models. The better performance of our direct classification model compared to that of QSAR methods makes this approach a viable tool for assessing the hazards and risks of chemicals.
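As an illustration of what a training set-based applicability domain can look like, the sketch below uses a simple descriptor-range (bounding box) check. This is an assumption made for illustration only, not the method used in the study: the function names, the two-descriptor layout, and the sample values are all hypothetical.

```python
def descriptor_ranges(training_set):
    """Return the (min, max) of each descriptor over the training molecules."""
    columns = list(zip(*training_set))
    return [(min(col), max(col)) for col in columns]


def in_domain(query, ranges):
    """True if every descriptor of the query lies inside the training range."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(query, ranges))


# Hypothetical training set: each row is one molecule, each column one descriptor.
training = [[1.2, 310.0], [2.5, 450.0], [0.8, 275.0]]
ranges = descriptor_ranges(training)

inside = in_domain([1.0, 300.0], ranges)   # both descriptors within range
outside = in_domain([3.0, 300.0], ranges)  # first descriptor exceeds the maximum
```

A query flagged as outside the box would be an extrapolation with respect to the training descriptors, which is the situation a training set-based AD is meant to catch.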
Affiliation(s)
- Saer Samanipour
- Van ’t Hoff Institute for Molecular Sciences (HIMS), University of Amsterdam (UvA), 1090 GD Amsterdam, The Netherlands
- UvA Data Science Center, University of Amsterdam, 1090 GD Amsterdam, The Netherlands
- Queensland Alliance for Environmental Health Sciences (QAEHS), The University of Queensland, Brisbane, QLD 4072, Australia
- Jake W. O’Brien
- Van ’t Hoff Institute for Molecular Sciences (HIMS), University of Amsterdam (UvA), 1090 GD Amsterdam, The Netherlands
- Queensland Alliance for Environmental Health Sciences (QAEHS), The University of Queensland, Brisbane, QLD 4072, Australia
- Malcolm J. Reid
- Norwegian Institute for Water Research (NIVA), NO-0579 Oslo, Norway
- Kevin V. Thomas
- Queensland Alliance for Environmental Health Sciences (QAEHS), The University of Queensland, Brisbane, QLD 4072, Australia
- Antonia Praetorius
- Institute for Biodiversity and Ecosystem Dynamics (IBED), University of Amsterdam, 1090 GD Amsterdam, The Netherlands
3
Wang J, Hu B, Liu W, Luo D, Peng J. Characterizing Soil Profile Salinization in Cotton Fields Using Landsat 8 Time-Series Data in Southern Xinjiang, China. Sensors (Basel, Switzerland) 2023; 23:7003. [PMID: 37571787 PMCID: PMC10422238 DOI: 10.3390/s23157003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2023] [Revised: 08/02/2023] [Accepted: 08/04/2023] [Indexed: 08/13/2023]
Abstract
Soil salinization is a major obstacle to land productivity, crop yield and crop quality in arid areas and directly affects food security. Soil profile salt data are key for accurately determining irrigation volumes. To explore the potential for using Landsat 8 time-series data to monitor soil salinization, 172 Landsat 8 images from 2013 to 2019 were obtained from the Alar Reclamation Area of Xinjiang, northwest China. The multiyear extreme dataset was synthesized from the annual maximum or minimum values of 16 vegetation indices, which were combined with the soil conductivity of 540 samples from soil profiles at 0~0.375 m, 0~0.75 m and 0~1.00 m depths in 30 cotton fields with varying degrees of salinization as investigated by EM38-MK2. Three remote sensing monitoring models for soil conductivity at different depths were constructed using the Cubist method, and digital mapping was carried out. The results showed that the Cubist model of soil profile electrical conductivity from 0 to 0.375 m, 0 to 0.75 m and 0 to 1.00 m showed high prediction accuracy, and the determination coefficients of the prediction set were 0.80, 0.74 and 0.72, respectively. Therefore, it is feasible to use a multiyear extreme value for the vegetation index combined with a Cubist modeling method to monitor soil profile salinization at a regional scale.
Affiliation(s)
- Jiaqiang Wang
- College of Agriculture, Tarim University, Alar 843300, China; (J.W.); (D.L.); (J.P.)
- Key Laboratory of Genetic Improvement and Efficient Production for Specialty Crops in Arid Southern Xinjiang of Xinjiang Corps, Tarim University, Alar 843300, China
- The Research Center of Oasis Agricultural Resources and Environment in Southern Xinjiang, Tarim University, Alar 843300, China
- Bifeng Hu
- Department of Land Resource Management, School of Tourism and Urban Management, Jiangxi University of Finance and Economics, Nanchang 330013, China;
- Weiyang Liu
- College of Agriculture, Tarim University, Alar 843300, China; (J.W.); (D.L.); (J.P.)
- Key Laboratory of Genetic Improvement and Efficient Production for Specialty Crops in Arid Southern Xinjiang of Xinjiang Corps, Tarim University, Alar 843300, China
- The Research Center of Oasis Agricultural Resources and Environment in Southern Xinjiang, Tarim University, Alar 843300, China
- Defang Luo
- College of Agriculture, Tarim University, Alar 843300, China; (J.W.); (D.L.); (J.P.)
- Key Laboratory of Genetic Improvement and Efficient Production for Specialty Crops in Arid Southern Xinjiang of Xinjiang Corps, Tarim University, Alar 843300, China
- The Research Center of Oasis Agricultural Resources and Environment in Southern Xinjiang, Tarim University, Alar 843300, China
- Jie Peng
- College of Agriculture, Tarim University, Alar 843300, China; (J.W.); (D.L.); (J.P.)
- Key Laboratory of Genetic Improvement and Efficient Production for Specialty Crops in Arid Southern Xinjiang of Xinjiang Corps, Tarim University, Alar 843300, China
- The Research Center of Oasis Agricultural Resources and Environment in Southern Xinjiang, Tarim University, Alar 843300, China
4
Abstract
The problem of human trust is one of the most fundamental problems in applied artificial intelligence in drug discovery. In silico models have been widely used to accelerate the process of drug discovery in recent years. However, most of these models can only give reliable predictions within a limited chemical space that the training set covers (applicability domain). Predictions of samples falling outside the applicability domain are unreliable and sometimes dangerous for the drug-design decision-making process. Uncertainty quantification accordingly has drawn great attention to enable autonomous drug designing. By quantifying the confidence level of model predictions, the reliability of the predictions can be quantitatively represented to assist researchers in their molecular reasoning and experimental design. Here we summarize the state-of-the-art approaches to uncertainty quantification and underline how they can be used for drug design and discovery projects. Furthermore, we also outline four representative application scenarios of uncertainty quantification in drug discovery.
Affiliation(s)
- Jie Yu
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Dingyan Wang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Mingyue Zheng
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
5
Sheridan RP. Stability of Prediction in Production ADMET Models as a Function of Version: Why and When Predictions Change. J Chem Inf Model 2022; 62:3477-3485. [PMID: 35849796 DOI: 10.1021/acs.jcim.2c00803] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 11/29/2022]
Abstract
As with other pharma companies, we maintain production QSAR models of ADMET end points and update them regularly. Here, for six ADMET end points, we examine the predictions of test set molecules on multiple versions of random forest models spanning a period of 10 years. For any given end point, the predictions for the majority of molecules are similar for all model versions. However, for a small minority of molecules, the prediction shifts substantially over the span of a few versions. For most molecules that shift, the prediction becomes more accurate at later times. This Perspective investigates metrics that can help indicate which molecules will shift substantially in prediction and when the shift will occur.
Affiliation(s)
- Robert P Sheridan
- Computational and Structural Chemistry, Merck & Co., Inc., Kenilworth, New Jersey 07033, United States
6
Sheridan RP, Culberson JC, Joshi E, Tudor M, Karnachi P. Prediction Accuracy of Production ADMET Models as a Function of Version: Activity Cliffs Rule. J Chem Inf Model 2022; 62:3275-3280. [PMID: 35796226 DOI: 10.1021/acs.jcim.2c00699] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/23/2022]
Abstract
As with many other institutions, our company maintains many quantitative structure-activity relationship (QSAR) models of absorption, distribution, metabolism, excretion, and toxicity (ADMET) end points and updates the models regularly. We recently examined version-to-version predictivity for these models over a period of 10 years. In this approach we monitor the goodness of prediction of new molecules relative to the training set of model version V before they are incorporated in the updated model V+1. Using a cell-based permeability assay (Papp) as an example, we illustrate how the QSAR models made from this data are generally predictive and can be utilized to enrich chemical designs and synthesis. Despite the obvious utility of these models, we turned up unexpected behavior in Papp and other ADMET activities for which the explanation is not obvious. One such behavior is that the apparent predictivity of the models as measured by root-mean-square-error can vary greatly from version to version and is sometimes very poor. One intuitively appealing explanation is that the observed activities of the new molecules fall outside the bulk of activities in the training set. Alternatively, one may think that the new molecules are exploring different regions of chemical space than the training set. However, the real explanation has to do with activity cliffs. If the observed activities of the new molecules are different than expected based on similar molecules in the training set, the predictions will be less accurate. This is true for all our ADMET end points.
Affiliation(s)
- Robert P Sheridan
- Computational and Structural Chemistry, Merck & Co., Inc., Kenilworth, New Jersey 07033, United States
- J Chris Culberson
- Computational and Structural Chemistry, Merck & Co., Inc., Kenilworth, New Jersey 07033, United States
- Elizabeth Joshi
- Computational and Structural Chemistry, Merck & Co., Inc., Kenilworth, New Jersey 07033, United States
- Matthew Tudor
- Computational and Structural Chemistry, Merck & Co., Inc., Kenilworth, New Jersey 07033, United States
- Prabha Karnachi
- Computational and Structural Chemistry, Merck & Co., Inc., Kenilworth, New Jersey 07033, United States
7
HobPre: accurate prediction of human oral bioavailability for small molecules. J Cheminform 2022; 14:1. [PMID: 34991690 PMCID: PMC8740492 DOI: 10.1186/s13321-021-00580-6] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 06/26/2021] [Accepted: 12/28/2021] [Indexed: 11/10/2022] Open
Abstract
Human oral bioavailability (HOB) is a key factor in determining the fate of new drugs in clinical trials. HOB is conventionally measured using expensive and time-consuming experimental tests. The use of computational models to evaluate HOB before the synthesis of new drugs will be beneficial to the drug development process. In this study, a total of 1588 drug molecules with HOB data were collected from the literature for the development of a classifying model that uses the consensus predictions of five random forest models. The consensus model shows excellent prediction accuracies on two independent test sets with two cutoffs of 20% and 50% for classification of molecules. The analysis of the importance of the input variables allowed the identification of the main molecular descriptors that affect the HOB class value. The model is available as a web server at www.icdrug.com/ICDrug/ADMET for quick assessment of oral bioavailability for small molecules. The results from this study provide an accurate and easy-to-use tool for screening of drug candidates based on HOB, which may be used to reduce the risk of failure in late stage of drug development.
8
Wu L, Gao Y, Ren WC, Su Y, Li J, Du YQ, Wang QH, Kuang HX. Rapid determination and origin identification of total polysaccharides contents in Schisandra chinensis by near-infrared spectroscopy. Spectrochimica Acta. Part A, Molecular and Biomolecular Spectroscopy 2022; 264:120327. [PMID: 34474220 DOI: 10.1016/j.saa.2021.120327] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 05/21/2021] [Revised: 07/16/2021] [Accepted: 08/24/2021] [Indexed: 06/13/2023]
Abstract
In this study, a classification model was established based on near-infrared spectroscopy and the random forest method to accurately distinguish three samples of Schisandra chinensis from different habitats. At the same time, the feasibility of fast and effective prediction of polysaccharide contents in Schisandra chinensis by near-infrared spectroscopy combined with chemometrics was evaluated. In this paper, the phenol-sulfuric acid method was used to determine the content of total polysaccharides in samples, and a partial least squares regression algorithm was used to link the spectral information with the reference value. Different spectral pretreatment methods were used to optimize the model to improve its predictability and stability. The results showed that random forest could distinguish these samples accurately, with an accuracy of 97.47%. In the established prediction model, the RMSEC of the optimal model calibration set is 0.0012, and the coefficient of determination R² is 0.9976. The RMSEP of the prediction set is 0.0024, the coefficient of determination R² is 0.9922, and the RPD is 11.36. In general, the method has good stability and applicability, which provides a new analytical method for the identification of Schisandra chinensis origin and quality evaluation.
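For readers unfamiliar with the figures of merit quoted above, the sketch below shows how RMSEP and RPD are conventionally computed from reference and predicted values. It assumes the common definition of RPD as the sample standard deviation of the reference values divided by RMSEP; the function names and the sample numbers are hypothetical and are not data from the study.

```python
import math
import statistics


def rmsep(reference, predicted):
    """Root-mean-square error of prediction over paired values."""
    n = len(reference)
    squared = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    return math.sqrt(squared / n)


def rpd(reference, predicted):
    """Ratio of performance to deviation: SD of reference values / RMSEP."""
    return statistics.stdev(reference) / rmsep(reference, predicted)


# Hypothetical reference and predicted contents for five prediction-set samples.
reference = [0.10, 0.12, 0.15, 0.18, 0.20]
predicted = [0.11, 0.12, 0.14, 0.18, 0.21]

error = rmsep(reference, predicted)
ratio = rpd(reference, predicted)
```

A larger RPD means the prediction error is small relative to the natural spread of the reference values.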
Affiliation(s)
- Lun Wu
- Institute of Traditional Chinese Medicine, Heilongjiang University of Chinese Medicine, Harbin 150040, China
- Yue Gao
- School of Pharmacy, Heilongjiang University of Chinese Medicine, Key Laboratory of Medicinal Materials, Chinese Academy of Sciences, Harbin 150040, China
- Wen-Chen Ren
- School of Pharmacy, Heilongjiang University of Chinese Medicine, Key Laboratory of Medicinal Materials, Chinese Academy of Sciences, Harbin 150040, China
- Yang Su
- School of Pharmacy, Heilongjiang University of Chinese Medicine, Key Laboratory of Medicinal Materials, Chinese Academy of Sciences, Harbin 150040, China; Faculty of Microbiology and Immunogenetics, University of California, Los Angeles, CA 90095, USA.
- Jing Li
- School of Pharmacy, Heilongjiang University of Chinese Medicine, Key Laboratory of Medicinal Materials, Chinese Academy of Sciences, Harbin 150040, China
- Ya-Qi Du
- School of Pharmacy, Heilongjiang University of Chinese Medicine, Key Laboratory of Medicinal Materials, Chinese Academy of Sciences, Harbin 150040, China
- Qiu-Hong Wang
- School of Traditional Chinese Medicine, Guangdong Pharmaceutical University, Guangzhou 510000, China
- Hai-Xue Kuang
- School of Pharmacy, Heilongjiang University of Chinese Medicine, Key Laboratory of Medicinal Materials, Chinese Academy of Sciences, Harbin 150040, China
9
Weber JK, Morrone JA, Bagchi S, Pabon JDE, Kang SG, Zhang L, Cornell WD. Simplified, interpretable graph convolutional neural networks for small molecule activity prediction. J Comput Aided Mol Des 2021; 36:391-404. [PMID: 34817762 PMCID: PMC9325818 DOI: 10.1007/s10822-021-00421-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 07/09/2021] [Accepted: 09/24/2021] [Indexed: 12/11/2022]
Abstract
We here present a streamlined, explainable graph convolutional neural network (gCNN) architecture for small molecule activity prediction. We first conduct a hyperparameter optimization across nearly 800 protein targets that produces a simplified gCNN QSAR architecture, and we observe that such a model can yield performance improvements over both standard gCNN and RF methods on difficult-to-classify test sets. Additionally, we discuss how reductions in convolutional layer dimensions potentially speak to the “anatomical” needs of gCNNs with respect to radial coarse graining of molecular substructure. We augment this simplified architecture with saliency map technology that highlights molecular substructures relevant to activity, and we perform saliency analysis on nearly 100 data-rich protein targets. We show that resultant substructural clusters are useful visualization tools for understanding substructure-activity relationships. We go on to highlight connections between our models’ saliency predictions and observations made in the medicinal chemistry literature, focusing on four case studies of past lead finding and lead optimization campaigns.
Affiliation(s)
- Jeffrey K Weber
- IBM Thomas J Watson Research Center, Yorktown Heights, NY, USA
- Sugato Bagchi
- IBM Thomas J Watson Research Center, Yorktown Heights, NY, USA
- Seung-Gu Kang
- IBM Thomas J Watson Research Center, Yorktown Heights, NY, USA
- Leili Zhang
- IBM Thomas J Watson Research Center, Yorktown Heights, NY, USA
- Wendy D Cornell
- IBM Thomas J Watson Research Center, Yorktown Heights, NY, USA.
10
Carrera GVSM, Inês J, Bernardes CES, Klimenko K, Shimizu K, Canongia Lopes JN. The Solubility of Gases in Ionic Liquids: A Chemoinformatic Predictive and Interpretable Approach. Chemphyschem 2021; 22:2190-2200. [PMID: 34464013 DOI: 10.1002/cphc.202100632] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/30/2021] [Indexed: 11/07/2022]
Abstract
This work comprises the study of solubilities of gases in ionic liquids (ILs) using a chemoinformatic approach. It is based on the codification of the atomic inter-component interactions, cation/gas and anion/gas, which are used to obtain a pattern of activation in a Kohonen Neural Network (MOLMAP descriptors). A robust predictive model has been obtained with the Random Forest algorithm, using the maximum proximity to the training set as a confidence measure for a given chemical system. The encoding method has been validated with molecular dynamics. This encoding approach is a valuable estimator of attractive/repulsive interactions of a generic chemical system IL+gas. This method has been used as a fast, visual means of identifying the reasons behind the differences observed between the solubility of CO2 and O2 in 1-butyl-3-methylimidazolium hexafluorophosphate (BMIM PF6) at identical temperature and pressure (TP) conditions. The effect of varying the cation and anion has been evaluated.
Affiliation(s)
- Gonçalo V S M Carrera
- Chemistry Department LAQV-REQUIMTE, NOVA School of Science and Technology, 2829-516, Caparica, Portugal
- João Inês
- Chemistry Department LAQV-REQUIMTE, NOVA School of Science and Technology, 2829-516, Caparica, Portugal
- Carlos E S Bernardes
- Centro de Química Estrutural, Faculdade de Ciências, Universidade de Lisboa, 1749-016, Lisboa, Portugal
- Kyrylo Klimenko
- Chemistry Department LAQV-REQUIMTE, NOVA School of Science and Technology, 2829-516, Caparica, Portugal
- Karina Shimizu
- Centro de Química Estrutural, Department of Chemical and Biological Engineering, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais, 1049-001, Lisboa, Portugal
- José N Canongia Lopes
- Centro de Química Estrutural, Department of Chemical and Biological Engineering, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais, 1049-001, Lisboa, Portugal
11
Machine Learning Applied to the Modeling of Pharmacological and ADMET Endpoints. Methods Mol Biol 2021. [PMID: 34731464 DOI: 10.1007/978-1-0716-1787-8_2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2023]
Abstract
The well-known concept of quantitative structure-activity relationships (QSAR) has been gaining significant interest in the recent years. Data, descriptors, and algorithms are the main pillars to build useful models that support more efficient drug discovery processes with in silico methods. Significant advances in all three areas are the reason for the regained interest in these models. In this book chapter we review various machine learning (ML) approaches that make use of measured in vitro/in vivo data of many compounds. We put these in context with other digital drug discovery methods and present some application examples.
12
Bennett S, Szczypiński FT, Turcani L, Briggs ME, Greenaway RL, Jelfs KE. Materials Precursor Score: Modeling Chemists' Intuition for the Synthetic Accessibility of Porous Organic Cage Precursors. J Chem Inf Model 2021; 61:4342-4356. [PMID: 34388347 PMCID: PMC8479809 DOI: 10.1021/acs.jcim.1c00375] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 03/31/2021] [Indexed: 11/30/2022]
Abstract
Computation is increasingly being used to try to accelerate the discovery of new materials. One specific example of this is porous molecular materials, specifically porous organic cages, where the porosity of the materials predominantly comes from the internal cavities of the molecules themselves. The computational discovery of novel structures with useful properties is currently hindered by the difficulty in transitioning from a computational prediction to synthetic realization. Attempts at experimental validation are often time-consuming, expensive, and frequently, the key bottleneck of material discovery. In this work, we developed a computational screening workflow for porous molecules that includes consideration of the synthetic difficulty of material precursors, aimed at easing the transition between computational prediction and experimental realization. We trained a machine learning model by first collecting data on 12,553 molecules categorized either as "easy-to-synthesize" or "difficult-to-synthesize" by expert chemists with years of experience in organic synthesis. We used an approach to address the class imbalance present in our data set, producing a binary classifier able to categorize easy-to-synthesize molecules with few false positives. We then used our model during computational screening for porous organic molecules to bias toward precursors whose easier synthesis requirements would make them promising candidates for experimental realization and material development. We found that even by limiting precursors to those that are easier-to-synthesize, we are still able to identify cages with favorable, and even some rare, properties.
Affiliation(s)
- Steven Bennett
- Department of Chemistry, Imperial College London, Molecular Sciences Research Hub, White City Campus, Wood Lane, London W12 0BZ, U.K.
- Filip T. Szczypiński
- Department of Chemistry, Imperial College London, Molecular Sciences Research Hub, White City Campus, Wood Lane, London W12 0BZ, U.K.
- Lukas Turcani
- Department of Chemistry, Imperial College London, Molecular Sciences Research Hub, White City Campus, Wood Lane, London W12 0BZ, U.K.
- Michael E. Briggs
- Materials Innovation Factory, University of Liverpool, 51 Oxford Street, Liverpool L7 3NY, U.K.
- Rebecca L. Greenaway
- Department of Chemistry, Imperial College London, Molecular Sciences Research Hub, White City Campus, Wood Lane, London W12 0BZ, U.K.
- Kim E. Jelfs
- Department of Chemistry, Imperial College London, Molecular Sciences Research Hub, White City Campus, Wood Lane, London W12 0BZ, U.K.
13
Wang D, Yu J, Chen L, Li X, Jiang H, Chen K, Zheng M, Luo X. A hybrid framework for improving uncertainty quantification in deep learning-based QSAR regression modeling. J Cheminform 2021; 13:69. [PMID: 34544485 PMCID: PMC8454160 DOI: 10.1186/s13321-021-00551-x] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 07/16/2021] [Accepted: 09/05/2021] [Indexed: 11/24/2022] Open
Abstract
Reliable uncertainty quantification for statistical models is crucial in various downstream applications, especially for drug design and discovery where mistakes may incur a large amount of cost. This topic has therefore absorbed much attention and a plethora of methods have been proposed over the past years. The approaches that have been reported so far can be mainly categorized into two classes: distance-based approaches and Bayesian approaches. Although these methods have been widely used in many scenarios and shown promising performance with their distinct superiorities, being overconfident on out-of-distribution examples still poses challenges for the deployment of these techniques in real-world applications. In this study we investigated a number of consensus strategies in order to combine both distance-based and Bayesian approaches together with post-hoc calibration for improved uncertainty quantification in QSAR (Quantitative Structure-Activity Relationship) regression modeling. We employed a set of criteria to quantitatively assess the ranking and calibration ability of these models. Experiments based on 24 bioactivity datasets were designed to make critical comparison between the model we proposed and other well-studied baseline models. Our findings indicate that the hybrid framework proposed by us can robustly enhance the model ability of ranking absolute errors. Together with post-hoc calibration on the validation set, we show that well-calibrated uncertainty quantification results can be obtained in domain shift settings. The complementarity between different methods is also conceptually analyzed.
Affiliation(s)
- Dingyan Wang
  - Shanghai Key Laboratory of Forensic Medicine, Academy of Forensic Science, Shanghai, 200063, China
  - University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Jie Yu
  - University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Lifan Chen
  - University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Xutong Li
  - University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Hualiang Jiang
  - University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Kaixian Chen
  - University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Mingyue Zheng
  - University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Xiaomin Luo
  - Shanghai Key Laboratory of Forensic Medicine, Academy of Forensic Science, Shanghai, 200063, China
  - University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
14
Yang X, Zhang Z, Li Q, Cai Y. Quantitative structure-activity relationship models for genotoxicity prediction based on combination evaluation strategies for toxicological alternative experiments. Sci Rep 2021; 11:8030. [PMID: 33850191 PMCID: PMC8044236 DOI: 10.1038/s41598-021-87035-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/25/2020] [Accepted: 03/23/2021] [Indexed: 11/10/2022] Open
Abstract
Mutagenicity exerts adverse effects on humans. Conventional methods cannot simultaneously predict the toxicity of a large number of compounds. Most mutagenicity prediction models are based on a single experimental type and lack supporting data from other experimental combinations, resulting in limited application scope and predictive ability. In this study, we partitioned data from GENE-TOX, CPDB, and the Chemical Carcinogenesis Research Information System according to the weight-of-evidence method for modelling. In accordance with the ICH guideline, our data set included grouped in vivo and in vitro experiments as well as prokaryotic and eukaryotic cell experiments. We compared the two experimental combinations mentioned in the weight-of-evidence method and reintegrated the experimental data into three groups. Nine sub-models and three fusion models were established using random forest (RF), support vector machine (SVM), and back propagation (BP) neural network algorithms. When base models built with the same algorithm were fused according to the ensemble rules, all models showed excellent predictive performance. The RF, SVM, and BP fusion models reached prediction accuracies of 83.4%, 80.5%, and 79.0%, respectively. The areas under the curve (AUC) reached 0.853, 0.897, and 0.865, respectively. Therefore, the established fusion QSAR models can serve as an early warning system for the mutagenicity of compounds.
Affiliation(s)
- Xiaotong Yang
  - School of Public Health, Guangdong Pharmaceutical University, Guangzhou, China
- Zhengbao Zhang
  - Guangdong Province Center for Disease Control and Prevention, Guangzhou, China
- Qing Li
  - Guangdong Province Center for Disease Control and Prevention, Guangzhou, China
- Yongming Cai
  - College of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, China
  - Guangdong Provincial TCM Precision Medicine Big Data Engineering Technology Research Center, Guangzhou, China
15
Bhagat SK, Tung TM, Yaseen ZM. Heavy metal contamination prediction using ensemble model: Case study of Bay sedimentation, Australia. J Hazard Mater 2021; 403:123492. [PMID: 32763636 DOI: 10.1016/j.jhazmat.2020.123492] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Received: 03/06/2020] [Revised: 07/11/2020] [Accepted: 07/13/2020] [Indexed: 06/11/2023]
Abstract
Lead (Pb) is a primary toxic heavy metal (HM) that is present throughout the entire ecosystem. Some commonly observed challenges in HM (Pb) prediction using artificial intelligence (AI) models include overfitting, normalization, validation against classical AI models, and a lack of learning/technology transfer. This study explores the extreme gradient boosting (XGBoost) model as a superior SuperLearning (SL) algorithm for Pb prediction. The proposed model was examined using historical data at the Bramble and Deception Bay (BB and DB) stations, Australia. The model was trained at one of the stations and transferred to a cross-station, and vice versa. XGBoost showed higher reliability, with a smaller decline in the coefficient of determination (R2), i.e., 0.97% over the testing phase, than the other models at BB. At the cross-station (DB), the performance of the XGBoost model decreased by 2.74% (R2) against random forests (RF). The mean absolute error (MAE) was 40% (XGBoost) and 47% (RF) lower than that of the artificial neural network (ANN). The XGBoost model performance declined by 3.44% (R2) over testing (DB), which is minor among the validated models. At the cross-station (BB), the XGBoost model showed the smallest decrement in terms of R2, i.e., 7.99%, against the ANN (8.31%), RF (10.26%), and support vector machine (SVM, 36.19%).
Affiliation(s)
- Suraj Kumar Bhagat
  - Faculty of Civil Engineering, Ton Duc Thang University, Ho Chi Minh City, Viet Nam
- Tran Minh Tung
  - Faculty of Civil Engineering, Ton Duc Thang University, Ho Chi Minh City, Viet Nam
- Zaher Mundher Yaseen
  - Faculty of Civil Engineering, Ton Duc Thang University, Ho Chi Minh City, Viet Nam
16
Čmelo I, Voršilák M, Svozil D. Profiling and analysis of chemical compounds using pointwise mutual information. J Cheminform 2021; 13:3. [PMID: 33423694 PMCID: PMC7798221 DOI: 10.1186/s13321-020-00483-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/05/2020] [Accepted: 12/24/2020] [Indexed: 12/21/2022] Open
Abstract
Pointwise mutual information (PMI) is a measure of association used in information theory. In this paper, PMI is used to characterize several publicly available databases (DrugBank, ChEMBL, PubChem and ZINC) in terms of the association strength between compound structural features, resulting in database PMI interrelation profiles. The structural features used are substructure fragments obtained by coding individual compounds as MACCS, PubChemKey and ECFP fingerprints. The analysis of publicly available databases reveals, in accord with other studies, unusual properties of DrugBank compounds, which further confirms the validity of the PMI profiling approach. Z-standardized relative feature tightness (ZRFT), a PMI-derived measure that quantifies how well a given compound's feature combinations fit those in a particular compound set, is applied to the analysis of compound synthetic accessibility (SA), as well as to the classification of compounds as easy (ES) or hard (HS) to synthesize. ZRFT value distributions are compared with those of SYBA and SAScore. The analysis of ZRFT values of structurally complex compounds in the SAVI database reveals oligopeptide structures that are mispredicted by SAScore as HS, while correctly predicted by ZRFT and SYBA as ES. Compared to SAScore, SYBA and random forest, ZRFT predictions are less accurate, though by a narrow margin (AccZRFT = 94.5%, AccSYBA = 98.8%, AccSAScore = 99.0%, AccRF = 97.3%). However, the ability of ZRFT to distinguish between ES and HS compounds is surprisingly high considering that, while SYBA, SAScore and random forest are dedicated SA models, ZRFT is a generic measurement that merely quantifies the strength of interrelations between structural feature pairs. The results presented in the current work indicate that structural feature co-occurrence, quantified by PMI or ZRFT, contains a significant amount of information relevant to the physico-chemical properties of organic compounds.
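To make the central quantity in the abstract above concrete, the sketch below estimates PMI for pairs of structural features from co-occurrence counts, using the standard definition PMI(a, b) = log[p(a, b) / (p(a) p(b))]. The base-2 logarithm, the function name `pmi_table`, and the toy feature sets are assumptions for illustration only; the paper's own profiling pipeline and the ZRFT measure are not reproduced here.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(compounds):
    """Estimate PMI for every feature pair that co-occurs at least once.

    `compounds` is a list of feature sets, one set per compound.
    Probabilities are plain relative frequencies over the compound list.
    """
    n = len(compounds)
    single = Counter()
    pair = Counter()
    for features in compounds:
        ordered = sorted(set(features))
        single.update(ordered)
        pair.update(combinations(ordered, 2))
    return {
        (a, b): math.log2((count / n) / ((single[a] / n) * (single[b] / n)))
        for (a, b), count in pair.items()
    }


# Hypothetical toy "database" of four compounds described by named fragments.
toy_database = [
    {"amide", "phenyl"},
    {"amide", "phenyl"},
    {"amide"},
    {"nitro"},
]
table = pmi_table(toy_database)
```

In this toy set, "amide" and "phenyl" co-occur more often than independence would predict, so their PMI is positive. Pairs that never co-occur are simply absent from the table, since their PMI would be negative infinity.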
Affiliation(s)
- I. Čmelo
  - CZ-OPENSCREEN National Infrastructure for Chemical Biology, Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Technická 5, 166 28 Prague, Czech Republic
- M. Voršilák
  - CZ-OPENSCREEN National Infrastructure for Chemical Biology, Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Technická 5, 166 28 Prague, Czech Republic
  - CZ-OPENSCREEN National Infrastructure for Chemical Biology, Institute of Molecular Genetics of the ASCR v. v. i., Vídeňská 1083, 142 20 Prague 4, Czech Republic
- D. Svozil
  - CZ-OPENSCREEN National Infrastructure for Chemical Biology, Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Technická 5, 166 28 Prague, Czech Republic
  - CZ-OPENSCREEN National Infrastructure for Chemical Biology, Institute of Molecular Genetics of the ASCR v. v. i., Vídeňská 1083, 142 20 Prague 4, Czech Republic
17
Mervin LH, Johansson S, Semenova E, Giblin KA, Engkvist O. Uncertainty quantification in drug design. Drug Discov Today 2020; 26:474-489. [PMID: 33253918 DOI: 10.1016/j.drudis.2020.11.027] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 05/21/2020] [Revised: 07/13/2020] [Accepted: 11/23/2020] [Indexed: 01/03/2023]
Abstract
Machine learning and artificial intelligence are increasingly being applied to the drug-design process as a result of the development of novel algorithms, growing access, the falling cost of computation and the development of novel technologies for generating chemically and biologically relevant data. There has been recent progress in fields such as molecular de novo generation, synthetic route prediction and, to some extent, property predictions. Despite this, most research in these fields has focused on improving the accuracy of the technologies, rather than on quantifying the uncertainty in the predictions. Uncertainty quantification will become a key component in autonomous decision making and will be crucial for integrating machine learning and chemistry automation to create an autonomous design-make-test-analyse cycle. This review covers the empirical, frequentist and Bayesian approaches to uncertainty quantification, and outlines how they can be used for drug design. We also outline the impact of uncertainty quantification on decision making.
Affiliation(s)
- Lewis H Mervin
  - Molecular AI, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK
- Simon Johansson
  - Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden
  - Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
- Elizaveta Semenova
  - Data Sciences and Quantitative Biology, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK
- Kathryn A Giblin
  - Medicinal Chemistry, Research and Early Development, Oncology R&D, AstraZeneca, Cambridge, UK
- Ola Engkvist
  - Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
18
Göller AH, Kuhnke L, Montanari F, Bonin A, Schneckener S, Ter Laak A, Wichard J, Lobell M, Hillisch A. Bayer's in silico ADMET platform: a journey of machine learning over the past two decades. Drug Discov Today 2020; 25:1702-1709. [PMID: 32652309 DOI: 10.1016/j.drudis.2020.07.001] [Citation(s) in RCA: 85] [Impact Index Per Article: 21.3] [Received: 03/30/2020] [Revised: 06/16/2020] [Accepted: 07/02/2020] [Indexed: 12/20/2022]
Abstract
Over the past two decades, an in silico absorption, distribution, metabolism, and excretion (ADMET) platform has been created at Bayer Pharma with the goal of generating models for a variety of pharmacokinetic and physicochemical endpoints in early drug discovery. These tools are accessible to all scientists within the company and can be useful in assisting with the selection and design of novel leads, as well as the process of lead optimization. Here, we discuss the development of machine-learning (ML) approaches with special emphasis on data, descriptors, and algorithms. We show that high company internal data quality and tailored descriptors, as well as a thorough understanding of the experimental endpoints, are essential to the utility of our models. We discuss the recent impact of deep neural networks and show selected application examples.
Affiliation(s)
- Andreas H Göller
  - Bayer AG, Pharmaceuticals, R&D, Computational Molecular Design, 42096 Wuppertal, Germany
- Lara Kuhnke
  - Bayer AG, Pharmaceuticals, R&D, Computational Molecular Design, 13342 Berlin, Germany
- Floriane Montanari
  - Bayer AG, Pharmaceuticals, R&D, Machine Learning Research, 13342 Berlin, Germany
- Anne Bonin
  - Bayer AG, Pharmaceuticals, R&D, Computational Molecular Design, 42096 Wuppertal, Germany
- Antonius Ter Laak
  - Bayer AG, Pharmaceuticals, R&D, Computational Molecular Design, 13342 Berlin, Germany
- Jörg Wichard
  - Bayer AG, Pharmaceuticals, R&D, Genetic Toxicology, 13342 Berlin, Germany
- Mario Lobell
  - Bayer AG, Pharmaceuticals, R&D, Computational Molecular Design, 42096 Wuppertal, Germany
- Alexander Hillisch
  - Bayer AG, Pharmaceuticals, R&D, Computational Molecular Design, 42096 Wuppertal, Germany
19
Shen J, Nicolaou CA. Molecular property prediction: recent trends in the era of artificial intelligence. Drug Discov Today Technol 2020; 32-33:29-36. [PMID: 33386091 DOI: 10.1016/j.ddtec.2020.05.001] [Citation(s) in RCA: 48] [Impact Index Per Article: 12.0] [Received: 12/11/2019] [Revised: 03/10/2020] [Accepted: 04/06/2020] [Indexed: 12/18/2022]
Abstract
Artificial intelligence (AI) has become a powerful tool in many fields, including drug discovery. Among the various AI applications, molecular property prediction can have a more significant immediate impact on the drug discovery process, since most algorithms and methods use predicted properties to evaluate, select, and generate molecules. Herein, we provide a brief review of state-of-the-art molecular property prediction methodologies and discuss recently reported examples. We highlight key techniques that have been applied to molecular property prediction, such as learned representation, multi-task learning, transfer learning, and federated learning. We also point out some critical but less discussed issues, such as data set quality, benchmarks, model performance evaluation, and prediction confidence quantification.
Affiliation(s)
- Jie Shen
  - Advanced Analytics and Data Sciences, Eli Lilly and Company, Indianapolis, IN 46285, United States
- Christos A Nicolaou
  - Discovery Chemistry Research & Technologies, Eli Lilly and Company, Indianapolis, IN 46285, United States
20
Muratov EN, Bajorath J, Sheridan RP, Tetko IV, Filimonov D, Poroikov V, Oprea TI, Baskin II, Varnek A, Roitberg A, Isayev O, Curtarolo S, Fourches D, Cohen Y, Aspuru-Guzik A, Winkler DA, Agrafiotis D, Cherkasov A, Tropsha A. QSAR without borders. Chem Soc Rev 2020; 49:3525-3564. [PMID: 32356548 PMCID: PMC8008490 DOI: 10.1039/d0cs00098a] [Citation(s) in RCA: 338] [Impact Index Per Article: 84.5] [Indexed: 12/15/2022]
Abstract
Prediction of chemical bioactivity and physical properties has been one of the most important applications of statistical and more recently, machine learning and artificial intelligence methods in chemical sciences. This field of research, broadly known as quantitative structure-activity relationships (QSAR) modeling, has developed many important algorithms and has found a broad range of applications in physical organic and medicinal chemistry in the past 55+ years. This Perspective summarizes recent technological advances in QSAR modeling but it also highlights the applicability of algorithms, modeling methods, and validation practices developed in QSAR to a wide range of research areas outside of traditional QSAR boundaries including synthesis planning, nanotechnology, materials science, biomaterials, and clinical informatics. As modern research methods generate rapidly increasing amounts of data, the knowledge of robust data-driven modelling methods professed within the QSAR field can become essential for scientists working both within and outside of chemical research. We hope that this contribution highlighting the generalizable components of QSAR modeling will serve to address this challenge.
Affiliation(s)
- Eugene N Muratov
  - UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC, USA
21
Cortés-Ciriano I, Škuta C, Bender A, Svozil D. QSAR-derived affinity fingerprints (part 2): modeling performance for potency prediction. J Cheminform 2020; 12:41. [PMID: 33431016 PMCID: PMC7339533 DOI: 10.1186/s13321-020-00444-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 08/07/2019] [Accepted: 05/16/2020] [Indexed: 01/22/2023] Open
Abstract
Affinity fingerprints report the activity of small molecules across a set of assays, and thus make it possible to gather information about the bioactivities of structurally dissimilar compounds, where models based on chemical structure alone are often limited, and to model complex biological endpoints, such as human toxicity and in vitro cancer cell line sensitivity. Here, we propose to model in vitro compound activity using computationally predicted bioactivity profiles as compound descriptors. To this end, we apply and validate a framework for the calculation of QSAR-derived affinity fingerprints (QAFFP) using a set of 1360 QSAR models generated using Ki, Kd, IC50 and EC50 data from the ChEMBL database. QAFFP thus represent a method to encode and relate compounds on the basis of their similarity in bioactivity space. To benchmark the predictive power of QAFFP, we assembled IC50 data from the ChEMBL database for 18 diverse cancer cell lines widely used in preclinical drug discovery, and 25 diverse protein target data sets. This study complements part 1, where the performance of QAFFP in similarity searching, scaffold hopping, and bioactivity classification is evaluated. Although QAFFP are inherently noisy, we show that using them as descriptors leads to errors in prediction on the test set in the ~0.65-0.95 pIC50 units range, which are comparable to the estimated uncertainty of bioactivity data in ChEMBL (0.76-1.00 pIC50 units). We find that the predictive power of QAFFP is slightly worse than that of Morgan2 fingerprints and 1D and 2D physicochemical descriptors, with an effect size in the 0.02-0.08 pIC50 units range. Including QSAR models with low predictive power in the generation of QAFFP does not lead to improved predictive power. Given that the QSAR models we used to compute the QAFFP were selected on the basis of data availability alone, we anticipate better modeling results for QAFFP generated using more diverse and biologically meaningful targets.
Data sets and Python code are publicly available at https://github.com/isidroc/QAFFP_regression.
Affiliation(s)
- Isidro Cortés-Ciriano
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK
- Ctibor Škuta
  - CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Institute of Molecular Genetics of the ASCR, v. v. i., Vídeňská 1083, 142 20, Prague, Czech Republic
- Andreas Bender
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
- Daniel Svozil
  - CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Institute of Molecular Genetics of the ASCR, v. v. i., Vídeňská 1083, 142 20, Prague, Czech Republic
  - CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Technická 5, 166 28, Prague, Czech Republic
22
Škuta C, Cortés-Ciriano I, Dehaen W, Kříž P, van Westen GJP, Tetko IV, Bender A, Svozil D. QSAR-derived affinity fingerprints (part 1): fingerprint construction and modeling performance for similarity searching, bioactivity classification and scaffold hopping. J Cheminform 2020; 12:39. [PMID: 33431038 PMCID: PMC7260783 DOI: 10.1186/s13321-020-00443-6] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 08/07/2019] [Accepted: 05/16/2020] [Indexed: 02/11/2023] Open
Abstract
An affinity fingerprint is a vector consisting of a compound's affinity or potency against a reference panel of protein targets. Here, we present the QAFFP fingerprint, a 440-element in silico QSAR-based affinity fingerprint whose components are predicted by Random Forest regression models trained on bioactivity data from the ChEMBL database. Both real-valued (rv-QAFFP) and binary (b-QAFFP) versions of the QAFFP fingerprint were implemented, and their performance in similarity searching, biological activity classification and scaffold hopping was assessed and compared to that of the 1024-bit Morgan2 fingerprint (the RDKit implementation of the ECFP4 fingerprint). In both similarity searching and biological activity classification, the QAFFP fingerprint yields retrieval rates, measured by AUC (~0.65 and ~0.70 for similarity searching depending on data sets, and ~0.85 for classification) and EF5 (~4.67 and ~5.82 for similarity searching depending on data sets, and ~2.10 for classification), comparable to those of the Morgan2 fingerprint (similarity searching AUC of ~0.57 and ~0.66, and EF5 of ~4.09 and ~6.41, depending on data sets, classification AUC of ~0.87, and EF5 of ~2.16). However, the QAFFP fingerprint outperforms the Morgan2 fingerprint in scaffold hopping, as it is able to retrieve 1146 of the 1749 existing scaffolds, while the Morgan2 fingerprint retrieves only 864 scaffolds.
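The EF5 figures quoted above can be read with the help of a small sketch. It assumes that EF5 denotes the enrichment factor in the top 5% of a ranked list, which the abstract does not spell out, and it uses the common definition (active rate in the top fraction divided by the active rate in the whole list). The function name and the toy ranking are hypothetical.

```python
def enrichment_factor(scores, is_active, fraction=0.05):
    """Enrichment factor: active rate in the top `fraction` over the overall active rate.

    `scores` are similarity or model scores (higher means ranked earlier);
    `is_active` holds 1 for actives and 0 for inactives, in the same order.
    """
    n = len(scores)
    n_top = max(1, round(n * fraction))
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    hits_in_top = sum(active for _, active in ranked[:n_top])
    total_actives = sum(is_active)
    return (hits_in_top / n_top) / (total_actives / n)


# Hypothetical toy ranking: 40 compounds, 4 actives, 2 of them ranked first.
toy_scores = list(range(40, 0, -1))
toy_labels = [0] * 40
for position in (0, 1, 10, 20):
    toy_labels[position] = 1
ef5 = enrichment_factor(toy_scores, toy_labels)
```

Under this definition a random ranking gives a value near 1, and the maximum possible value at 5% is 20, so values of roughly 4 to 6 would mean the top of the list holds several times more actives than chance.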
Affiliation(s)
- C Škuta
  - CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Institute of Molecular Genetics of the ASCR, v. v. i., Vídeňská 1083, 142 20, Prague 4, Czech Republic
- I Cortés-Ciriano
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
- W Dehaen
  - CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Institute of Molecular Genetics of the ASCR, v. v. i., Vídeňská 1083, 142 20, Prague 4, Czech Republic
  - CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Technická 5, 166 28, Prague, Czech Republic
- P Kříž
  - Department of Mathematics, Faculty of Chemical Engineering, University of Chemistry and Technology Prague, Technická 5, 166 28, Prague, Czech Republic
- G J P van Westen
  - Computational Drug Discovery, Drug Discovery and Safety, LACDR, Leiden University, Einsteinweg 55, 2333 CC, Leiden, The Netherlands
- I V Tetko
  - Helmholtz Zentrum Muenchen - German Research Center for Environmental Health (GmbH) and BIGCHEM GmbH, Ingolstaedter Landstrasse 1, 85764, Neuherberg, Germany
- A Bender
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
- D Svozil
  - CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Institute of Molecular Genetics of the ASCR, v. v. i., Vídeňská 1083, 142 20, Prague 4, Czech Republic
  - CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Technická 5, 166 28, Prague, Czech Republic
23
Griffen EJ, Dossetter AG, Leach AG. Chemists: AI Is Here; Unite To Get the Benefits. J Med Chem 2020; 63:8695-8704. [DOI: 10.1021/acs.jmedchem.0c00163] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Indexed: 12/16/2022]
Affiliation(s)
- Edward J. Griffen
  - MedChemica Ltd., Alderley Park, Macclesfield, Cheshire SK10 4TG, U.K.
- Andrew G. Leach
  - MedChemica Ltd., Alderley Park, Macclesfield, Cheshire SK10 4TG, U.K.
24
SYBA: Bayesian estimation of synthetic accessibility of organic compounds. J Cheminform 2020; 12:35. [PMID: 33431015 PMCID: PMC7238540 DOI: 10.1186/s13321-020-00439-2] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 01/31/2020] [Accepted: 05/09/2020] [Indexed: 12/11/2022] Open
Abstract
SYBA (SYnthetic Bayesian Accessibility) is a fragment-based method for the rapid classification of organic compounds as easy- (ES) or hard-to-synthesize (HS). It is based on a Bernoulli naïve Bayes classifier that is used to assign SYBA score contributions to individual fragments based on their frequencies in the database of ES and HS molecules. SYBA was trained on ES molecules available in the ZINC15 database and on HS molecules generated by the Nonpher methodology. SYBA was compared with a random forest, which was used as a baseline method, as well as with two other methods for synthetic accessibility assessment: SAScore and SCScore. When used with their suggested thresholds, SYBA improves over random forest classification, albeit marginally, and outperforms SAScore and SCScore. However, upon optimization of the SAScore threshold (which changes from 6.0 to ~4.5), SAScore yields results similar to those of SYBA. Because SYBA is based merely on fragment contributions, it can be used to analyze the contribution of individual molecular parts to compound synthetic accessibility. SYBA is publicly available at https://github.com/lich-uct/syba under the GNU General Public License.
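The fragment-contribution idea described above can be sketched with a generic Bernoulli naïve Bayes log-odds score. This is not the SYBA implementation: the Laplace smoothing, the equal class priors, the function names, and the toy fragment sets are all assumptions chosen to keep the example small.

```python
import math


def fit_fragment_probs(es_mols, hs_mols, alpha=1.0):
    """Estimate p(fragment present | class) for ES and HS with Laplace smoothing.

    Each molecule is a set of fragment labels. `alpha` is an assumed smoothing
    constant that keeps every probability strictly between 0 and 1.
    """
    vocabulary = set().union(*es_mols, *hs_mols)

    def presence_prob(fragment, mols):
        hits = sum(fragment in mol for mol in mols)
        return (hits + alpha) / (len(mols) + 2 * alpha)

    return {
        frag: (presence_prob(frag, es_mols), presence_prob(frag, hs_mols))
        for frag in vocabulary
    }


def log_odds_score(mol, probs):
    """Sum per-fragment log-odds contributions; positive values favour ES."""
    score = 0.0
    for frag, (p_es, p_hs) in probs.items():
        if frag in mol:
            score += math.log(p_es / p_hs)
        else:
            score += math.log((1 - p_es) / (1 - p_hs))
    return score


# Hypothetical toy training sets of fragment labels.
easy = [{"benzene", "amide"}, {"benzene"}, {"amide", "ester"}]
hard = [{"spiro", "bridged"}, {"bridged", "benzene"}, {"spiro"}]
probs = fit_fragment_probs(easy, hard)
score_easy_like = log_odds_score({"benzene", "amide"}, probs)
score_hard_like = log_odds_score({"spiro", "bridged"}, probs)
```

Because the score is a plain sum over fragments, each term can be inspected on its own, which is the property the abstract points to when it mentions analyzing the contribution of individual molecular parts.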
25
Zhang JD, Sach-Peltason L, Kramer C, Wang K, Ebeling M. Multiscale modelling of drug mechanism and safety. Drug Discov Today 2020; 25:519-534. [DOI: 10.1016/j.drudis.2019.12.009] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 08/20/2019] [Revised: 12/06/2019] [Accepted: 12/23/2019] [Indexed: 12/19/2022]
26
Schneckener S, Grimbs S, Hey J, Menz S, Osmers M, Schaper S, Hillisch A, Göller AH. Prediction of Oral Bioavailability in Rats: Transferring Insights from in Vitro Correlations to (Deep) Machine Learning Models Using in Silico Model Outputs and Chemical Structure Parameters. J Chem Inf Model 2019; 59:4893-4905. [PMID: 31714067 DOI: 10.1021/acs.jcim.9b00460] [Citation(s) in RCA: 47] [Impact Index Per Article: 9.4] [Indexed: 12/17/2022]
Abstract
Oral administration of drug products is a strict requirement in many medical indications. Therefore, bioavailability prediction models are of high importance for prioritization of compound candidates in the drug discovery process. However, oral exposure and bioavailability are difficult to predict, as they are the result of various highly complex factors and/or processes influenced by the physicochemical properties of a compound, such as solubility, lipophilicity, or charge state, as well as by interactions with the organism, for instance, metabolism or membrane permeation. In this study, we assess whether it is possible to predict intravenous (iv) or oral drug exposure and oral bioavailability in rats. As input parameters, we use (i) six experimentally determined in vitro and physicochemical endpoints, namely, membrane permeation, free fraction, metabolic stability, solubility, pKa value, and lipophilicity; (ii) the outputs of six in silico absorption, distribution, metabolism, and excretion models trained on the same endpoints, or (iii) the chemical structure encoded as fingerprints or simplified molecular input line entry system strings. The underlying data set for the models is an unprecedented collection of almost 1900 data points with high-quality in vivo experiments performed in rats. We find that drug exposure after iv administration can be predicted similarly well using hybrid models with in vitro- or in silico-predicted endpoints as inputs, with fold change errors (FCE) of 2.28 and 2.08, respectively. The FCEs for exposure after oral administration are higher, and here, the prediction from in vitro inputs performs significantly better in comparison to in silico-based models with FCEs of 3.49 and 2.40, respectively, most probably reflecting the higher complexity of oral bioavailability. Simplifying the prediction task to a binary alert for low oral bioavailability, based only on chemical structure, we achieve accuracy and precision close to 70%.
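For readers unfamiliar with fold change errors, the sketch below shows one common way to compute such a metric: the geometric mean of the fold difference between predicted and observed values. The abstract does not give the formula used in the paper, so this definition, the function name, and the toy numbers are assumptions.

```python
import math


def fold_change_error(predicted, observed):
    """Geometric-mean fold error: 10 ** mean(|log10(predicted / observed)|).

    Both inputs must contain strictly positive values, such as exposures.
    A result of 2.0 means predictions are off by a factor of 2 on average,
    counting over- and under-prediction alike.
    """
    log_ratios = [abs(math.log10(p / o)) for p, o in zip(predicted, observed)]
    return 10 ** (sum(log_ratios) / len(log_ratios))


# Hypothetical toy exposures: one 2-fold over-prediction, one 2-fold under-prediction.
fce = fold_change_error([2.0, 0.5], [1.0, 1.0])
```

On this reading, an FCE of 2.08 would correspond to predictions that are typically within about a factor of two of the measured exposure.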
Affiliation(s)
- Sebastian Schneckener
  - Bayer AG, Engineering & Technology, Applied Mathematics, 51368 Leverkusen, Germany
- Sergio Grimbs
  - Bayer AG, Engineering & Technology, Applied Mathematics, 51368 Leverkusen, Germany
- Jessica Hey
  - Bayer AG, Engineering & Technology, Applied Mathematics, 51368 Leverkusen, Germany
- Stephan Menz
  - Bayer AG, R&D, Pharmaceuticals, Research Pharmacokinetics, 13342 Berlin, Germany
- Maren Osmers
  - Bayer AG, R&D, Pharmaceuticals, Research Pharmacokinetics, 13342 Berlin, Germany
- Steffen Schaper
  - Bayer AG, Engineering & Technology, Applied Mathematics, 51368 Leverkusen, Germany
- Alexander Hillisch
  - Bayer AG, Pharmaceuticals, R&D, Computational Molecular Design, 42096 Wuppertal, Germany
- Andreas H Göller
  - Bayer AG, Pharmaceuticals, R&D, Computational Molecular Design, 42096 Wuppertal, Germany
|
27
|
Sidorov P, Naulaerts S, Ariey-Bonnet J, Pasquier E, Ballester PJ. Predicting Synergism of Cancer Drug Combinations Using NCI-ALMANAC Data. Front Chem 2019; 7:509. [PMID: 31380352 PMCID: PMC6646421 DOI: 10.3389/fchem.2019.00509] [Citation(s) in RCA: 73] [Impact Index Per Article: 14.6] [Received: 05/17/2019] [Accepted: 07/02/2019] [Indexed: 12/15/2022]
Abstract
Drug combinations are of great interest for cancer treatment. Unfortunately, the discovery of synergistic combinations by purely experimental means is only feasible on small sets of drugs. In silico modeling methods can substantially widen this search by providing tools able to predict which of all possible combinations in a large compound library are synergistic. Here we investigate to what extent drug combination synergy can be predicted by exploiting the largest available dataset to date (NCI-ALMANAC, with over 290,000 synergy determinations). Each cell line is modeled using primarily two machine learning techniques, Random Forest (RF) and Extreme Gradient Boosting (XGBoost), on the datasets provided by NCI-ALMANAC. This large-scale predictive modeling study comprises more than 5,000 pair-wise drug combinations, 60 cell lines, 4 types of models, and 5 types of chemical features. The application of a powerful, yet uncommonly used, RF-specific technique for reliability prediction is also investigated. The evaluation of these models shows that it is possible to predict the synergy of unseen drug combinations with high accuracy (Pearson correlations between 0.43 and 0.86 depending on the considered cell line, with XGBoost providing slightly better predictions than RF). We have also found that restricting to the most reliable synergy predictions results in at least a 2-fold decrease in error with respect to employing the best learning algorithm without any reliability estimation. Alkylating agents, tyrosine kinase inhibitors and topoisomerase inhibitors are the drugs whose synergy with partner drugs is best predicted by the models. Despite its leading size, NCI-ALMANAC comprises an extremely small part of all conceivable combinations. Given their accuracy and reliability estimation, the developed models should drastically reduce the number of required in vitro tests by predicting in silico which of the considered combinations are likely to be synergistic.
Affiliation(s)
- Pavel Sidorov
- CRCM, INSERM, Cancer Research Center of Marseille, Institut Paoli-Calmettes, Aix-Marseille Univ, CNRS, Marseille, France
| | - Stefan Naulaerts
- CRCM, INSERM, Cancer Research Center of Marseille, Institut Paoli-Calmettes, Aix-Marseille Univ, CNRS, Marseille, France
- Department of Tumor Immunology, Institut de Duve, Bruxelles, Belgium
| | - Jérémy Ariey-Bonnet
- CRCM, INSERM, Cancer Research Center of Marseille, Institut Paoli-Calmettes, Aix-Marseille Univ, CNRS, Marseille, France
| | - Eddy Pasquier
- CRCM, INSERM, Cancer Research Center of Marseille, Institut Paoli-Calmettes, Aix-Marseille Univ, CNRS, Marseille, France
| | - Pedro J. Ballester
- CRCM, INSERM, Cancer Research Center of Marseille, Institut Paoli-Calmettes, Aix-Marseille Univ, CNRS, Marseille, France
|
28
|
Yang S, Shen Y, Lu W, Yang Y, Wang H, Li L, Wu C, Du G. Evaluation and Identification of the Neuroprotective Compounds of Xiaoxuming Decoction by Machine Learning: A Novel Mode to Explore the Combination Rules in Traditional Chinese Medicine Prescription. BIOMED RESEARCH INTERNATIONAL 2019; 2019:6847685. [PMID: 31360720 PMCID: PMC6652039 DOI: 10.1155/2019/6847685] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 03/16/2019] [Revised: 05/13/2019] [Accepted: 05/26/2019] [Indexed: 12/18/2022]
Abstract
Xiaoxuming decoction (XXMD), a classic traditional Chinese medicine (TCM) prescription, has been used as a therapeutic in the treatment of stroke in clinical practice for over 1200 years. However, the pharmacological mechanisms of XXMD have not yet been elucidated. The purpose of this study was to develop neuroprotective models for identifying neuroprotective compounds in XXMD against hypoxia-induced and H2O2-induced brain cell damage. In this study, a phenotype-based classification method was designed by machine learning to identify neuroprotective compounds and to clarify the compatibility of XXMD components. Four different single classifiers (AB, kNN, CT, and RF) and molecular fingerprint descriptors were used to construct stacked naïve Bayesian models. Among them, the RF algorithm had a better performance with an average MCC value of 0.725±0.014 and 0.774±0.042 from 5-fold cross-validation and test set, respectively. The probability values calculated by four models were then integrated into a stacked Bayesian model. In total, two optimal models, s-NB-1-LPFP6 and s-NB-2-LPFP6, were obtained. The two validated optimal models revealed Matthews correlation coefficients (MCC) of 0.968 and 0.993 for 5-fold cross-validation and of 0.874 and 0.959 for the test set, respectively. Furthermore, the two models were used for virtual screening experiments to identify neuroprotective compounds in XXMD. Ten representative compounds with potential therapeutic effects against the two phenotypes were selected for further cell-based assays. Among the selected compounds, two compounds significantly inhibited H2O2-induced and Na2S2O4-induced neurotoxicity simultaneously. Together, our findings suggested that machine learning algorithms such as combination Bayesian models were feasible to predict neuroprotective compounds and to preliminarily demonstrate the pharmacological mechanisms of TCM.
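The models above are compared by their Matthews correlation coefficient (MCC). As a reminder of what that number measures, here is a minimal sketch of the standard MCC computed from the four cells of a binary confusion matrix; the function name and argument names are illustrative.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: return 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect classification), which makes it less sensitive to class imbalance than plain accuracy.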
Affiliation(s)
- Shilun Yang
- School of Life Science and Biopharmaceutics, Shenyang Pharmaceutical University, No. 103, Wen hua Road, Shenyang 110016, China
- Beijing Key Laboratory of Drug Targets Identification and Drug Screening, Institute of Materia Medica, Chinese Academy of Medical Sciences and Peking Union Medical College, No. 2, Nan wei Road, Beijing 100050, China
| | - Yanjia Shen
- Beijing Key Laboratory of Drug Targets Identification and Drug Screening, Institute of Materia Medica, Chinese Academy of Medical Sciences and Peking Union Medical College, No. 2, Nan wei Road, Beijing 100050, China
| | - Wendan Lu
- Beijing Key Laboratory of Drug Targets Identification and Drug Screening, Institute of Materia Medica, Chinese Academy of Medical Sciences and Peking Union Medical College, No. 2, Nan wei Road, Beijing 100050, China
| | - Yinglin Yang
- Beijing Key Laboratory of Drug Targets Identification and Drug Screening, Institute of Materia Medica, Chinese Academy of Medical Sciences and Peking Union Medical College, No. 2, Nan wei Road, Beijing 100050, China
| | - Haigang Wang
- Beijing Key Laboratory of Drug Targets Identification and Drug Screening, Institute of Materia Medica, Chinese Academy of Medical Sciences and Peking Union Medical College, No. 2, Nan wei Road, Beijing 100050, China
| | - Li Li
- Beijing Key Laboratory of Drug Targets Identification and Drug Screening, Institute of Materia Medica, Chinese Academy of Medical Sciences and Peking Union Medical College, No. 2, Nan wei Road, Beijing 100050, China
| | - Chunfu Wu
- School of Life Science and Biopharmaceutics, Shenyang Pharmaceutical University, No. 103, Wen hua Road, Shenyang 110016, China
| | - Guanhua Du
- School of Life Science and Biopharmaceutics, Shenyang Pharmaceutical University, No. 103, Wen hua Road, Shenyang 110016, China
- Beijing Key Laboratory of Drug Targets Identification and Drug Screening, Institute of Materia Medica, Chinese Academy of Medical Sciences and Peking Union Medical College, No. 2, Nan wei Road, Beijing 100050, China
|
29
|
Cortés-Ciriano I, Bender A. Reliable Prediction Errors for Deep Neural Networks Using Test-Time Dropout. J Chem Inf Model 2019; 59:3330-3339. [PMID: 31241929 DOI: 10.1021/acs.jcim.9b00297] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Indexed: 12/14/2022]
Abstract
While the use of deep learning in drug discovery is gaining increasing attention, the lack of methods to compute reliable errors in prediction for Neural Networks prevents their application to guide decision making in domains where identifying unreliable predictions is essential, e.g., precision medicine. Here, we present a framework to compute reliable errors in prediction for Neural Networks using Test-Time Dropout and Conformal Prediction. Specifically, the algorithm consists of training a single Neural Network using dropout, and then applying it N times to both the validation and test sets, also employing dropout in this step. Therefore, for each instance in the validation and test sets, an ensemble of predictions is generated. The residuals and absolute errors in prediction for the validation set are then used to compute prediction errors for the test set instances using Conformal Prediction. We show using 24 bioactivity data sets from ChEMBL 23 that Dropout Conformal Predictors are valid (i.e., the fraction of instances whose true value lies within the predicted interval strongly correlates with the confidence level) and efficient, as the predicted confidence intervals span a narrower set of values than those computed with Conformal Predictors generated using Random Forest (RF) models. Lastly, we show in retrospective virtual screening experiments that dropout and RF-based Conformal Predictors lead to comparable retrieval rates of active compounds. Overall, we propose a computationally efficient framework (as only N extra forward passes are required in addition to training a single network) to harness Test-Time Dropout and the Conformal Prediction framework, which is generally applicable to generate reliable prediction errors for Deep Neural Networks in drug discovery and beyond.
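The workflow described above (N stochastic forward passes per instance, then conformal calibration on a validation set) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the network is left out, each instance is represented only by its list of N dropout predictions, and scaling the residual by the ensemble standard deviation is an assumed choice of nonconformity score. All function names are hypothetical.

```python
import math
import statistics

def ensemble_stats(passes):
    """Mean and standard deviation over N stochastic forward passes."""
    return statistics.fmean(passes), statistics.pstdev(passes)

def calibrate(cal_passes, cal_targets, confidence=0.8, eps=1e-8):
    """Return the conformal quantile of normalized validation residuals."""
    scores = []
    for passes, y in zip(cal_passes, cal_targets):
        mu, sigma = ensemble_stats(passes)
        # Assumed nonconformity score: residual scaled by ensemble spread.
        scores.append(abs(y - mu) / (sigma + eps))
    scores.sort()
    # Finite-sample rule: take the ceil((n + 1) * confidence)-th smallest score.
    k = math.ceil((len(scores) + 1) * confidence)
    if k > len(scores):
        return math.inf  # too few calibration points for this confidence
    return scores[k - 1]

def predict_interval(test_passes, q, eps=1e-8):
    """Prediction interval for one test instance from its dropout ensemble."""
    mu, sigma = ensemble_stats(test_passes)
    half_width = q * (sigma + eps)
    return mu - half_width, mu + half_width
```

In use, `cal_passes` would hold the N dropout predictions for each validation compound and `test_passes` those for one test compound; instances on which the dropout ensemble disagrees more receive wider intervals.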
Affiliation(s)
- Isidro Cortés-Ciriano
- Centre for Molecular Informatics, Department of Chemistry , University of Cambridge , Lensfield Road , Cambridge CB2 1EW , United Kingdom
| | - Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry , University of Cambridge , Lensfield Road , Cambridge CB2 1EW , United Kingdom
|
30
|
Luque Ruiz I, Gómez-Nieto MÁ. Building of Robust and Interpretable QSAR Classification Models by Means of the Rivality Index. J Chem Inf Model 2019; 59:2785-2804. [DOI: 10.1021/acs.jcim.9b00264] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 12/13/2022]
Affiliation(s)
- Irene Luque Ruiz
- Department of Computing and Numerical Analysis, University of Córdoba, Albert Einstein Building, Campus de Rabanales, E-14071, Córdoba, Spain
| | - Miguel Ángel Gómez-Nieto
- Department of Computing and Numerical Analysis, University of Córdoba, Albert Einstein Building, Campus de Rabanales, E-14071, Córdoba, Spain
|
31
|
Bhhatarai B, Walters WP, Hop CECA, Lanza G, Ekins S. Opportunities and challenges using artificial intelligence in ADME/Tox. NATURE MATERIALS 2019; 18:418-422. [PMID: 31000801 PMCID: PMC6594826 DOI: 10.1038/s41563-019-0332-5] [Citation(s) in RCA: 57] [Impact Index Per Article: 11.4] [Indexed: 05/17/2023]
Abstract
A recent conference organized a panel of scientists representing small and big pharma companies, who work at the interface of machine learning (ML) and absorption, distribution, metabolism, excretion, and toxicology (ADME/Tox). With the recent rebirth of AI related to pharma, it is timely to present this collaborative commentary to capture the diverging opinions on the past, present and future role of AI for ADME/Tox and how it can be applied in newer areas such as nanomaterials.
Affiliation(s)
- Barun Bhhatarai
- Novartis Institutes for Biomedical Research, Cambridge, MA, USA
| | - Sean Ekins
- Collaborations Pharmaceuticals Inc., Raleigh, NC, USA.
|
32
|
Fusani L, Cabrera AC. Active learning strategies with COMBINE analysis: new tricks for an old dog. J Comput Aided Mol Des 2019; 33:287-294. [PMID: 30564994 PMCID: PMC7087723 DOI: 10.1007/s10822-018-0181-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 09/25/2018] [Accepted: 12/14/2018] [Indexed: 11/09/2022]
Abstract
The COMBINE method was designed to study congeneric series of compounds including structural information of ligand-protein complexes. Although very successful, the method has not received the same level of attention as other approaches to Quantitative Structure-Activity Relationships (QSAR), mainly because of the lack of ways to measure the uncertainty of the predictions and the need for large datasets. Active learning, a semi-supervised learning approach that makes use of uncertainty to enhance model performance while reducing the size of the training sets, has been used in this work to address both problems. We propose two estimators of uncertainty: the pool of regressors and the distance to the training set. The performance of the methods has been evaluated by testing the resulting active learning workflows on 3 diverse datasets: HIV-1 protease inhibitors, Taxol derivatives and BRD4 inhibitors. The proposed strategies were successful in 80% of the cases for the Taxol derivatives and BRD4 inhibitors, and outperformed random selection in the case of the HIV-1 protease inhibitors time-split. Our results suggest that AL-COMBINE might be an effective way of producing consistently superior QSAR models with a limited number of samples.
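One of the two uncertainty estimators named above is the distance to the training set. A minimal sketch of how such an estimator could drive sample selection in an active learning loop, assuming Euclidean distance in descriptor space and a "pick the farthest candidates" rule; neither assumption is confirmed by the abstract, and the function names are hypothetical.

```python
import math

def min_distance_to_training(x, training):
    """Distance from a candidate to its nearest training sample."""
    return min(math.dist(x, t) for t in training)

def select_most_uncertain(pool, training, batch_size=1):
    """Indices of the pool candidates farthest from the training set."""
    ranked = sorted(
        range(len(pool)),
        key=lambda i: min_distance_to_training(pool[i], training),
        reverse=True,
    )
    return ranked[:batch_size]
```

In each round the selected candidates would be measured, moved from the pool into the training set, and the model refitted.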
Affiliation(s)
- Lucia Fusani
- Molecular Design UK. GSK Medicines Research Centre, Gunnels Wood Road, Stevenage, Hertfordshire, SG1 2NY, UK
| | - Alvaro Cortes Cabrera
- Data Science and Computational Chemistry, Galchimia S.A. Severo Ochoa 2, Tres Cantos, 28760, Spain.
|
33
|
Berenger F, Yamanishi Y. A Distance-Based Boolean Applicability Domain for Classification of High Throughput Screening Data. J Chem Inf Model 2019; 59:463-476. [PMID: 30567434 DOI: 10.1021/acs.jcim.8b00499] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 01/19/2023]
Abstract
In Quantitative Structure-Activity Relationship (QSAR) modeling, one must come up with an activity model but also with an applicability domain for that model. Some existing methods to create an applicability domain are complex, hard to implement, and/or difficult to interpret. Also, they often require the user to select a threshold value, or they embed an empirical constant. In this work, we propose a trivial to interpret and fully automatic Distance-Based Boolean Applicability Domain (DBBAD) algorithm for category QSAR. In retrospective experiments on High Throughput Screening data sets, this applicability domain improves the classification performance and early retrieval of support vector machine and random forest based classifiers, while improving the scaffold diversity among top-ranked active molecules.
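To make the idea of a threshold-free, distance-based Boolean applicability domain concrete, here is a generic sketch. It is not the published DBBAD algorithm, whose details the abstract does not give. It assumes fingerprints stored as sets of on-bit indices, Tanimoto distance, and a cutoff derived automatically as the largest nearest-neighbour distance within the training set; all names are hypothetical.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity between two sets of fingerprint bits."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union

def fit_threshold(train):
    """Largest nearest-neighbour distance within the training set.

    Requires at least two training fingerprints.
    """
    return max(
        min(tanimoto_distance(a, b) for j, b in enumerate(train) if j != i)
        for i, a in enumerate(train)
    )

def in_domain(query, train, threshold):
    """True if the query is as close to the training set as its members are to each other."""
    return min(tanimoto_distance(query, t) for t in train) <= threshold
```

Because the cutoff is computed from the training data, the user supplies no threshold, and the output for each query molecule is a plain inside/outside flag.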
Affiliation(s)
- Francois Berenger
- Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering , Kyushu Institute of Technology , 680-4 Kawazu , Iizuka , Japan
| | - Yoshihiro Yamanishi
- Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering , Kyushu Institute of Technology , 680-4 Kawazu , Iizuka , Japan.,PRESTO, Japan Science and Technology Agency , Kawaguchi , Saitama 332-0012 , Japan
|
34
|
Liu R, Wallqvist A. Molecular Similarity-Based Domain Applicability Metric Efficiently Identifies Out-of-Domain Compounds. J Chem Inf Model 2018; 59:181-189. [DOI: 10.1021/acs.jcim.8b00597] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Indexed: 11/28/2022]
Affiliation(s)
- Ruifeng Liu
- Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, Maryland 21702, United States
| | - Anders Wallqvist
- Department of Defense Biotechnology High Performance Computing Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, Maryland 21702, United States
|
35
|
Ruiz IL, Gómez-Nieto MÁ. Study of the Applicability Domain of the QSAR Classification Models by Means of the Rivality and Modelability Indexes. Molecules 2018; 23:2756. [PMID: 30356020 PMCID: PMC6278359 DOI: 10.3390/molecules23112756] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 09/26/2018] [Revised: 10/14/2018] [Accepted: 10/22/2018] [Indexed: 11/30/2022]
Abstract
The reliability of a QSAR classification model depends on its capacity to achieve confident predictions of new compounds not considered in the building of the model. The results of this external validation process show the applicability domain (AD) of the QSAR model and, therefore, the robustness of the model to predict the property/activity of new molecules. In this paper we propose the use of the rivality and modelability indexes to study whether a dataset can be correctly modeled by a QSAR algorithm and to predict the reliability of the built model when predicting the property/activity of new molecules. The calculation of these indexes has a very low computational cost and does not require building a model, which makes them good tools for analyzing datasets in the first stages of building QSAR classification models. In our study, we selected two benchmark datasets with a similar number of molecules but very different modelability, and we corroborated the predictive capacity of the rivality and modelability indexes against classification models built using Support Vector Machine and Random Forest algorithms with 5-fold cross-validation and leave-one-out techniques. The results have shown the excellent ability of both indexes to predict outliers and the applicability domain of the QSAR classification models. In all cases, these values accurately predicted the statistical parameters of the QSAR models generated by the algorithms.
Affiliation(s)
- Irene Luque Ruiz
- Department of Computing and Numerical Analysis, Campus Universitario de Rabanales, Albert Einstein Building, University of Córdoba, E-14071 Córdoba, Spain.
| | - Miguel Ángel Gómez-Nieto
- Department of Computing and Numerical Analysis, Campus Universitario de Rabanales, Albert Einstein Building, University of Córdoba, E-14071 Córdoba, Spain.
|
36
|
Abstract
Selective binding of a drug for its target vs other proteins is an important consideration in designing compounds with minimal risk of toxicity and therefore a key aspect of lead optimization. Screening all compounds against all known off-target proteins would be prohibitively expensive, so project teams must decide which compounds to screen in which assays. This chapter describes informatics-based methods that help prioritize testing, including screening minipanels and prioritization using predictive models (e-counterscreening).
Affiliation(s)
- Daniel R McMasters
- Computational Chemistry, Vertex Pharmaceuticals, Boston, MA, United States.
|
37
|
Xie Y, Zhou RR, Xie HL, Yu Y, Zhang SH, Zhao CX, Huang JH, Huang LQ. Application of near infrared spectroscopy for rapid determination the geographical regions and polysaccharides contents of Lentinula edodes. Int J Biol Macromol 2018; 122:1115-1119. [PMID: 30218733 DOI: 10.1016/j.ijbiomac.2018.09.060] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 05/13/2018] [Revised: 08/28/2018] [Accepted: 09/11/2018] [Indexed: 01/09/2023]
Abstract
In this study, a calibration model based on near-infrared (NIR) spectroscopy and chemometrics was developed for the rapid and non-destructive determination of the polysaccharide content of Lentinula edodes samples collected from different regions. The polysaccharide contents of these samples were first determined by the standard phenol-sulfuric acid method. Then, NIR spectra of these samples were collected using Fourier transform infrared spectrometry. Based on these experimental data, a random forest method was used to distinguish the regions of origin of these samples, with a classification accuracy of 96.6%. After that, a rapid, accurate, and quantitative model was established for predicting the polysaccharide contents of these samples. During model building, several signal pre-treatment methods were optimized; the best validation results, with the highest determination coefficient (R2) and a low root mean square error of prediction (RMSEP), were 0.925 and 0.720, respectively. These results show that combining the NIR technique with chemometrics is an effective and green method for Lentinula edodes quality control.
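The two validation statistics quoted above have standard definitions. A minimal sketch of both, computed on a held-out prediction set; the function names are illustrative and the abstract does not say which software was used.

```python
import math

def rmsep(y_true, y_pred):
    """Root mean square error of prediction on a held-out set."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

RMSEP carries the units of the measured property, so it is only interpretable relative to the range of the reference values, whereas R2 is unitless.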
Affiliation(s)
- Yi Xie
- Hunan Academy of Chinese Medicine, Changsha, 410013, PR China
| | - Rong-Rong Zhou
- School of Pharmacy, Changchun University of Chinese Medicine, Changchun, 130117, PR China; National Resource Center for Chinese Materia Medica, China Academy of Chinese Medical Sciences, State Key Laboratory Breeding Base of Dao-di Herbs, Beijing 100700, PR China
| | - Hua-Lin Xie
- Hunan Academy of Chinese Medicine, Changsha, 410013, PR China
| | - Yi Yu
- Infinitus (China) Company Ltd, Guangzhou, 510663, PR China
| | - Shui-Han Zhang
- Hunan Academy of Chinese Medicine, Changsha, 410013, PR China
| | - Chen-Xi Zhao
- College of Biological and Environmental Engineering, Changsha University, Changsha, 410022, PR China
| | - Jian-Hua Huang
- Hunan Academy of Chinese Medicine, Changsha, 410013, PR China.
| | - Lu-Qi Huang
- National Resource Center for Chinese Materia Medica, China Academy of Chinese Medical Sciences, State Key Laboratory Breeding Base of Dao-di Herbs, Beijing 100700, PR China.
|
38
|
Clark RD. Predicting mammalian metabolism and toxicity of pesticides in silico. PEST MANAGEMENT SCIENCE 2018; 74:1992-2003. [PMID: 29762898 PMCID: PMC6099302 DOI: 10.1002/ps.4935] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 12/21/2017] [Revised: 04/02/2018] [Accepted: 04/03/2018] [Indexed: 05/05/2023]
Abstract
Pesticides must be effective to be commercially viable but they must also be reasonably safe for those who manufacture them, apply them, or consume the food they are used to produce. Animal testing is key to ensuring safety, but it comes late in the agrochemical development process, is expensive, and requires relatively large amounts of material. Surrogate assays used as in vitro models require less material and shift identification of potential mammalian toxicity back to earlier stages in development. Modern in silico methods are cost-effective complements to such in vitro models that make it possible to predict mammalian metabolism, toxicity and exposure for a pesticide, crop residue or other metabolite before it has been synthesized. Their broader use could substantially reduce the amount of time and effort wasted in pesticide development. This contribution reviews the kind of in silico models that are currently available for vetting ideas about what to synthesize and how to focus development efforts; the limitations of those models; and the practical considerations that have slowed development in the area. Detailed discussions are provided of how bacterial mutagenicity, human cytochrome P450 (CYP) metabolism, and bioavailability in humans and rats can be predicted. © 2018 The Authors. Pest Management Science published by John Wiley & Sons Ltd on behalf of Society of Chemical Industry.
|
39
|
Rakers C, Najnin RA, Polash AH, Takeda S, Brown J. Chemogenomic Active Learning's Domain of Applicability on Small, Sparse qHTS Matrices: A Study Using Cytochrome P450 and Nuclear Hormone Receptor Families. ChemMedChem 2018; 13:511-521. [DOI: 10.1002/cmdc.201700677] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 10/31/2017] [Revised: 12/04/2017] [Indexed: 01/21/2023]
Affiliation(s)
- Christin Rakers
- Institute of Transformative bio-Molecules, WPI-ITbM; Nagoya University; Furo-cho Chikusa-ku Nagoya 464-8602 Japan
| | - Rifat Ara Najnin
- Department of Radiation Genetics; Kyoto University Graduate School of Medicine; Sakyo, Yoshida-konoemachi Building D, 3F Kyoto 606-8501 Japan
| | - Ahsan Habib Polash
- Department of Radiation Genetics; Kyoto University Graduate School of Medicine; Sakyo, Yoshida-konoemachi Building D, 3F Kyoto 606-8501 Japan
| | - Shunichi Takeda
- Department of Radiation Genetics; Kyoto University Graduate School of Medicine; Sakyo, Yoshida-konoemachi Building D, 3F Kyoto 606-8501 Japan
| | - J.B. Brown
- Laboratory for Molecular Biosciences; Kyoto University Graduate School of Medicine; Yoshida-konoemachi Building E 606-8501 Kyoto Sakyo Japan
|
40
|
Lei T, Sun H, Kang Y, Zhu F, Liu H, Zhou W, Wang Z, Li D, Li Y, Hou T. ADMET Evaluation in Drug Discovery. 18. Reliable Prediction of Chemical-Induced Urinary Tract Toxicity by Boosting Machine Learning Approaches. Mol Pharm 2017; 14:3935-3953. [DOI: 10.1021/acs.molpharmaceut.7b00631] [Citation(s) in RCA: 50] [Impact Index Per Article: 7.1] [Indexed: 01/01/2023]
Affiliation(s)
- Tailong Lei
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
| | - Huiyong Sun
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
| | - Yu Kang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
| | - Feng Zhu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
| | - Hui Liu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
| | - Wenfang Zhou
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
| | - Zhe Wang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
| | - Dan Li
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
| | - Youyong Li
- Institute of Functional Nano and Soft Materials (FUNSOM), Soochow University, Suzhou, Jiangsu 215123, P. R. China
| | - Tingjun Hou
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
- State Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
|
41
|
Perspectives from the NanoSafety Modelling Cluster on the validation criteria for (Q)SAR models used in nanotechnology. Food Chem Toxicol 2017; 112:478-494. [PMID: 28943385 DOI: 10.1016/j.fct.2017.09.037] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 11/23/2016] [Revised: 08/31/2017] [Accepted: 09/19/2017] [Indexed: 11/20/2022]
Abstract
Nanotechnology and the production of nanomaterials have been expanding rapidly in recent years. Since many types of engineered nanoparticles are suspected to be toxic to living organisms and to have a negative impact on the environment, the process of designing new nanoparticles and their applications must be accompanied by a thorough risk analysis. (Quantitative) Structure-Activity Relationship ([Q]SAR) modelling creates promising options among the available methods for the risk assessment. These in silico models can be used to predict a variety of properties, including the toxicity of newly designed nanoparticles. However, (Q)SAR models must be appropriately validated to ensure the clarity, consistency and reliability of predictions. This paper is a joint initiative from recently completed European research projects focused on developing (Q)SAR methodology for nanomaterials. The aim was to interpret and expand the guidance for the well-known "OECD Principles for the Validation, for Regulatory Purposes, of (Q)SAR Models", with reference to nano-(Q)SAR, and present our opinions on the criteria to be fulfilled for models developed for nanoparticles.
|
42
|
Klingspohn W, Mathea M, ter Laak A, Heinrich N, Baumann K. Efficiency of different measures for defining the applicability domain of classification models. J Cheminform 2017; 9:44. [PMID: 29086213 PMCID: PMC5543028 DOI: 10.1186/s13321-017-0230-2] [Citation(s) in RCA: 42] [Impact Index Per Article: 6.0] [Received: 11/30/2016] [Accepted: 07/13/2017] [Indexed: 01/13/2023]
Abstract
The goal of defining an applicability domain for a predictive classification model is to identify the region in chemical space where the model's predictions are reliable. The boundary of the applicability domain is defined with the help of a measure that shall reflect the reliability of an individual prediction. Here, the available measures are differentiated into those that flag unusual objects and which are independent of the original classifier and those that use information of the trained classifier. The former set of techniques is referred to as novelty detection while the latter is designated as confidence estimation. A review of the available confidence estimators shows that most of these measures estimate the probability of class membership of the predicted objects which is inversely related to the error probability. Thus, class probability estimates are natural candidates for defining the applicability domain but were not comprehensively included in previous benchmark studies. The focus of the present study is to find the best measure for defining the applicability domain for a given binary classification technique and to determine the performance of novelty detection versus confidence estimation. Six different binary classification techniques in combination with ten data sets were studied to benchmark the various measures. The area under the receiver operating characteristic curve (AUC ROC) was employed as main benchmark criterion. It is shown that class probability estimates constantly perform best to differentiate between reliable and unreliable predictions. Previously proposed alternatives to class probability estimates do not perform better than the latter and are inferior in most cases. Interestingly, the impact of defining an applicability domain depends on the observed area under the receiver operator characteristic curve. That means that it depends on the level of difficulty of the classification problem (expressed as AUC ROC) and will be largest for intermediately difficult problems (range AUC ROC 0.7-0.9). In the ranking of classifiers, classification random forests performed best on average. Hence, classification random forests in combination with the respective class probability estimate are a good starting point for predictive binary chemoinformatic classifiers with applicability domain.
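As an illustration of the benchmark idea described in this abstract, the following minimal sketch scores a class probability estimate as an applicability domain measure: it computes the AUC ROC for separating correct from incorrect predictions. It is not the authors' code. The confidence definition max(p, 1 - p), the 0.5 decision threshold, the function names, and the toy data are all assumptions made for the example.

```python
def confidence_from_probability(p_pos):
    """Assumed confidence measure: probability of the predicted class."""
    return max(p_pos, 1.0 - p_pos)


def auc_roc(scores_pos, scores_neg):
    """AUC as the chance that a random 'positive' outranks a random 'negative'.

    Ties count as half a win (Mann-Whitney formulation).
    """
    if not scores_pos or not scores_neg:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def ad_benchmark(p_pos, y_true, threshold=0.5):
    """AUC ROC of the confidence for telling correct from wrong predictions."""
    conf_correct, conf_wrong = [], []
    for p, y in zip(p_pos, y_true):
        predicted = 1 if p >= threshold else 0
        conf = confidence_from_probability(p)
        if predicted == y:
            conf_correct.append(conf)
        else:
            conf_wrong.append(conf)
    return auc_roc(conf_correct, conf_wrong)


# Hypothetical toy data: predicted probability of class 1 and true labels.
p_pos = [0.95, 0.9, 0.1, 0.55, 0.45, 0.6, 0.05, 0.8]
y_true = [1, 1, 0, 0, 1, 1, 0, 1]

auc = ad_benchmark(p_pos, y_true)
print(auc)
```

In this toy set the two misclassified objects are also the two least confident ones, so the confidence ranks every correct prediction above every wrong one. On real data the value would normally lie between 0.5 and 1.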
Affiliation(s)
- Waldemar Klingspohn
- Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, Beethovenstrasse 55, 38106 Brunswick, Germany
| | - Miriam Mathea
- Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, Beethovenstrasse 55, 38106 Brunswick, Germany
| | - Antonius ter Laak
- Bayer Pharma Aktiengesellschaft, Computational Chemistry, Müllerstrasse 178, 13353 Berlin, Germany
| | - Nikolaus Heinrich
- Bayer Pharma Aktiengesellschaft, Computational Chemistry, Müllerstrasse 178, 13353 Berlin, Germany
| | - Knut Baumann
- Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, Beethovenstrasse 55, 38106 Brunswick, Germany
| |
|
43
|
Sun J, Carlsson L, Ahlberg E, Norinder U, Engkvist O, Chen H. Applying Mondrian Cross-Conformal Prediction To Estimate Prediction Confidence on Large Imbalanced Bioactivity Data Sets. J Chem Inf Model 2017. [PMID: 28628322 DOI: 10.1021/acs.jcim.7b00159] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.9] [Indexed: 01/22/2023]
Abstract
Conformal prediction has been proposed as a more rigorous way to define prediction confidence compared to other application domain concepts that have earlier been used for QSAR modeling. One main advantage of such a method is that it provides a prediction region potentially with multiple predicted labels, which contrasts to the single valued (regression) or single label (classification) output predictions by standard QSAR modeling algorithms. Standard conformal prediction might not be suitable for imbalanced data sets. Therefore, Mondrian cross-conformal prediction (MCCP) which combines the Mondrian inductive conformal prediction with cross-fold calibration sets has been introduced. In this study, the MCCP method was applied to 18 publicly available data sets that have various imbalance levels varying from 1:10 to 1:1000 (ratio of active/inactive compounds). Our results show that MCCP in general performed well on bioactivity data sets with various imbalance levels. More importantly, the method not only provides confidence of prediction and prediction regions compared to standard machine learning methods but also produces valid predictions for the minority class. In addition, a compound similarity based nonconformity measure was investigated. Our results demonstrate that although it gives valid predictions, its efficiency is much worse than that of model dependent metrics.
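To make the class-conditional (Mondrian) idea in this abstract concrete, here is a minimal sketch of Mondrian inductive conformal prediction for a binary classifier. It is not the authors' implementation and it leaves out the cross-fold aggregation step of MCCP. The nonconformity score (1 minus the predicted probability of the label), the function names, and the toy calibration data are assumptions made for the example.

```python
def calibrate(cal_probs, cal_labels):
    """Collect nonconformity scores separately for each true class.

    cal_probs: list of dicts mapping label -> predicted probability.
    cal_labels: list of true labels for the calibration objects.
    """
    scores_by_class = {}
    for probs, y in zip(cal_probs, cal_labels):
        scores_by_class.setdefault(y, []).append(1.0 - probs[y])
    return scores_by_class


def p_values(scores_by_class, test_probs):
    """Class-wise p-values: each label is compared only with its own class."""
    result = {}
    for label, cal_scores in scores_by_class.items():
        alpha = 1.0 - test_probs[label]
        n_at_least = sum(1 for s in cal_scores if s >= alpha)
        result[label] = (n_at_least + 1) / (len(cal_scores) + 1)
    return result


def prediction_set(pvals, significance):
    """Labels that cannot be rejected at the chosen significance level."""
    return {label for label, p in pvals.items() if p > significance}


# Hypothetical imbalanced calibration set: 7 inactives (0), 3 actives (1).
cal_probs = (
    [{0: p0, 1: 1.0 - p0} for p0 in (0.9, 0.9, 0.8, 0.7, 0.95, 0.6, 0.85)]
    + [{1: p1, 0: 1.0 - p1} for p1 in (0.8, 0.6, 0.3)]
)
cal_labels = [0] * 7 + [1] * 3

scores = calibrate(cal_probs, cal_labels)
pvals = p_values(scores, {0: 0.5, 1: 0.5})
print(pvals, prediction_set(pvals, 0.2), prediction_set(pvals, 0.1))
```

Because each label is calibrated only against objects of its own class, the minority class gets its own p-value scale. In the toy data an ambiguous object keeps only the active label at significance 0.2 and receives both labels at 0.1, which shows how a prediction region can hold more than one label.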
Affiliation(s)
| | | | | | - Ulf Norinder
- Swetox, Karolinska Institutet, Unit of Toxicology Sciences, Södertälje 15136, Sweden
| | | | | |
|
44
|
Lei T, Chen F, Liu H, Sun H, Kang Y, Li D, Li Y, Hou T. ADMET Evaluation in Drug Discovery. Part 17: Development of Quantitative and Qualitative Prediction Models for Chemical-Induced Respiratory Toxicity. Mol Pharm 2017; 14:2407-2421. [PMID: 28595388 DOI: 10.1021/acs.molpharmaceut.7b00317] [Citation(s) in RCA: 43] [Impact Index Per Article: 6.1] [Indexed: 02/06/2023]
Abstract
As a dangerous end point, respiratory toxicity can cause serious adverse health effects and even death. Meanwhile, it is a common and traditional issue in occupational and environmental protection. Pharmaceutical and chemical industries have a strong urge to develop precise and convenient computational tools to evaluate the respiratory toxicity of compounds as early as possible. Most of the reported theoretical models were developed based on the respiratory toxicity data sets with one single symptom, such as respiratory sensitization, and therefore these models may not afford reliable predictions for toxic compounds with other respiratory symptoms, such as pneumonia or rhinitis. Here, based on a diverse data set of mouse intraperitoneal respiratory toxicity characterized by multiple symptoms, a number of quantitative and qualitative predictions models with high reliability were developed by machine learning approaches. First, a four-tier dimension reduction strategy was employed to find an optimal set of 20 molecular descriptors for model building. Then, six machine learning approaches were used to develop the prediction models, including relevance vector machine (RVM), support vector machine (SVM), regularized random forest (RRF), extreme gradient boosting (XGBoost), naïve Bayes (NB), and linear discriminant analysis (LDA). Among all of the models, the SVM regression model shows the most accurate quantitative predictions for the test set (q2ext = 0.707), and the XGBoost classification model achieves the most accurate qualitative predictions for the test set (MCC of 0.644, AUC of 0.893, and global accuracy of 82.62%). The application domains were analyzed, and all of the tested compounds fall within the application domain coverage. We also examined the structural features of the compounds and important fragments with large prediction errors. In conclusion, the SVM regression model and the XGBoost classification model can be employed as accurate prediction tools for respiratory toxicity.
Affiliation(s)
- Tailong Lei
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
| | - Fu Chen
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
| | - Hui Liu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
| | - Huiyong Sun
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
| | - Yu Kang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
| | - Dan Li
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
| | - Youyong Li
- Institute of Functional Nano and Soft Materials (FUNSOM), Soochow University, Suzhou, Jiangsu 215123, P. R. China
| | - Tingjun Hou
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China; State Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
| |
|
45
|
Sheridan RP, Wang WM, Liaw A, Ma J, Gifford EM. Extreme Gradient Boosting as a Method for Quantitative Structure-Activity Relationships. J Chem Inf Model 2016; 56:2353-2360. [PMID: 27958738 DOI: 10.1021/acs.jcim.6b00591] [Citation(s) in RCA: 204] [Impact Index Per Article: 25.5] [Indexed: 01/15/2023]
Abstract
In the pharmaceutical industry it is common to generate many QSAR models from training sets containing a large number of molecules and a large number of descriptors. The best QSAR methods are those that can generate the most accurate predictions but that are not overly expensive computationally. In this paper we compare eXtreme Gradient Boosting (XGBoost) to random forest and single-task deep neural nets on 30 in-house data sets. While XGBoost has many adjustable parameters, we can define a set of standard parameters at which XGBoost makes predictions, on the average, better than those of random forest and almost as good as those of deep neural nets. The biggest strength of XGBoost is its speed. Whereas efficient use of random forest requires generating each tree in parallel on a cluster, and deep neural nets are usually run on GPUs, XGBoost can be run on a single CPU in less than a third of the wall-clock time of either of the other methods.
Affiliation(s)
- Robert P Sheridan
- Modeling and Informatics Department, Merck & Co. Inc., 126 E. Lincoln Ave., Rahway, New Jersey 07065, United States
| | - Wei Min Wang
- Data Science Department, MSD International GmbH (Singapore Branch), 1 Fusionopolis Place, #06-10/07-18, Galaxis, Singapore 138522
| | - Andy Liaw
- Biometrics Research Department, Merck & Co. Inc., 126 E. Lincoln Ave., Rahway, New Jersey 07065, United States
| | - Junshui Ma
- Biometrics Research Department, Merck & Co. Inc., 126 E. Lincoln Ave., Rahway, New Jersey 07065, United States
| | - Eric M Gifford
- Bioinformatics Department, MSD International GmbH (Singapore Branch), 1 Fusionopolis Place, #06-10/07-18, Galaxis, Singapore 138522
| |
|
46
|
Automatically updating predictive modeling workflows support decision-making in drug design. Future Med Chem 2016; 8:1779-96. [DOI: 10.4155/fmc-2016-0070] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 12/31/2022] Open
Abstract
Using predictive models for early decision-making in drug discovery has become standard practice. We suggest that model building needs to be automated with minimum input and low technical maintenance requirements. Models perform best when tailored to answering specific compound optimization related questions. If qualitative answers are required, 2-bin classification models are preferred. Integrating predictive modeling results with structural information stimulates better decision making. For in silico models supporting rapid structure–activity relationship cycles the performance deteriorates within weeks. Frequent automated updates of predictive models ensure best predictions. Consensus between multiple modeling approaches increases the prediction confidence. Combining qualified and nonqualified data optimally uses all available information. Dose predictions provide a holistic alternative to multiple individual property predictions for reaching complex decisions.
|
47
|
Cortes-Ciriano I. Benchmarking the Predictive Power of Ligand Efficiency Indices in QSAR. J Chem Inf Model 2016; 56:1576-87. [DOI: 10.1021/acs.jcim.6b00136] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Indexed: 01/29/2023]
Affiliation(s)
- Isidro Cortes-Ciriano
- Département de Biologie Structurale et Chimie, Institut Pasteur, Unité de Bioinformatique Structurale, CNRS UMR 3825, 25, rue du Dr Roux, 75015 Paris, France
| |
|
48
|
Kaneko H, Funatsu K. Applicability Domains and Consistent Structure Generation. Mol Inform 2016; 36. [DOI: 10.1002/minf.201600032] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 03/02/2016] [Accepted: 04/25/2016] [Indexed: 11/08/2022]
Affiliation(s)
- Hiromasa Kaneko
- Department of Chemical System Engineering The University of Tokyo 7-3-1 Hongo Bunkyo-ku, Tokyo 113-8656 Japan
| | - Kimito Funatsu
- Department of Chemical System Engineering The University of Tokyo 7-3-1 Hongo Bunkyo-ku, Tokyo 113-8656 Japan
| |
|
49
|
Norinder U, Boyer S. Conformal Prediction Classification of a Large Data Set of Environmental Chemicals from ToxCast and Tox21 Estrogen Receptor Assays. Chem Res Toxicol 2016; 29:1003-10. [DOI: 10.1021/acs.chemrestox.6b00037] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Indexed: 01/07/2023]
Affiliation(s)
- Ulf Norinder
- Swedish Toxicology Sciences Research Center, SE-151 36 Södertälje, Sweden
| | - Scott Boyer
- Swedish Toxicology Sciences Research Center, SE-151 36 Södertälje, Sweden
| |
|
50
|
Norinder U, Rybacka A, Andersson PL. Conformal prediction to define applicability domain - A case study on predicting ER and AR binding. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2016; 27:303-316. [PMID: 27088868 DOI: 10.1080/1062936x.2016.1172665] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Indexed: 06/05/2023]
Abstract
A fundamental element when deriving a robust and predictive in silico model is not only the statistical quality of the model in question but, equally important, the estimate of its predictive boundaries. This work presents a new method, conformal prediction, for applicability domain estimation in the field of endocrine disruptors. The method is applied to binders and non-binders related to the oestrogen and androgen receptors. Ensembles of decision trees are used as statistical method and three different sets (dragon, rdkit and signature fingerprints) are investigated as chemical descriptors. The conformal prediction method results in valid models where there is an excellent balance in quality between the internally validated training set and the corresponding external test set, both in terms of validity and with respect to sensitivity and specificity. With this method the level of confidence can be readily altered by the user and the consequences thereof immediately inspected. Furthermore, the predictive boundaries for the derived models are rigorously defined by using the conformal prediction framework, thus no ambiguity exists as to the level of similarity needed for new compounds to be in or out of the predictive boundaries of the derived models where reliable predictions can be expected.
Affiliation(s)
- U Norinder
- Swedish Toxicology Sciences Research Center, Södertälje, Sweden
- Department of Computer and Systems Sciences, Stockholm University, Kista, Sweden
| | - A Rybacka
- Department of Chemistry, Umeå University, Umeå, Sweden
| | - P L Andersson
- Department of Chemistry, Umeå University, Umeå, Sweden
| |
|