1
Carracedo-Reboredo P, Aranzamendi E, He S, Arrasate S, Munteanu CR, Fernandez-Lozano C, Sotomayor N, Lete E, González-Díaz H. MATEO: intermolecular α-amidoalkylation theoretical enantioselectivity optimization. Online tool for selection and design of chiral catalysts and products. J Cheminform 2024; 16:9. [PMID: 38254200 PMCID: PMC10804835 DOI: 10.1186/s13321-024-00802-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2023] [Accepted: 01/11/2024] [Indexed: 01/24/2024] Open
Abstract
The enantioselective Brønsted acid-catalyzed α-amidoalkylation reaction is a useful procedure for the production of new drugs and natural products. In this context, Chiral Phosphoric Acid (CPA) catalysts are versatile catalysts for this type of reaction. The selection and design of new CPA catalysts for different enantioselective reactions has a dual interest because new CPA catalysts (tools) and chiral drugs or materials (products) can be obtained. However, this process is difficult and time-consuming if approached from an experimental trial-and-error perspective. In this work, a Heuristic Perturbation-Theory and Machine Learning (HPTML) algorithm was used to seek a predictive model for CPA catalyst performance in terms of enantioselectivity in α-amidoalkylation reactions, with R2 = 0.96 overall for training and validation series. It involved a Monte Carlo sampling of > 100,000 pairs of query and reference reactions. In addition, the computational and experimental investigation of a new set of intermolecular α-amidoalkylation reactions using BINOL-derived N-triflylphosphoramides as CPA catalysts is reported as a case study. The model was implemented in a web server called MATEO: InterMolecular Amidoalkylation Theoretical Enantioselectivity Optimization, available online at: https://cptmltool.rnasa-imedir.com/CPTMLTools-Web/mateo . This new user-friendly online computational tool would enable sustainable optimization of reaction conditions that could lead to the design of new CPA catalysts along with new organic synthesis products.
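The abstract above reports model quality as a coefficient of determination (assuming "R2" denotes R²). As a minimal sketch of how such a value is computed from observed and predicted responses, the snippet below uses the standard definition 1 - SS_res/SS_tot. The function name and the enantiomeric-excess numbers are hypothetical and are not taken from the MATEO paper.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: squared error of the model's predictions.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: squared deviation of observations from their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical enantiomeric-excess values (%), for illustration only.
observed_ee = [90.0, 75.0, 60.0, 85.0]
predicted_ee = [88.0, 78.0, 63.0, 83.0]
print(round(r_squared(observed_ee, predicted_ee), 3))
```

A value close to 1 means the predictions explain most of the variance in the observations; it says nothing by itself about how the model behaves outside the sampled reactions.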
Affiliation(s)
- Paula Carracedo-Reboredo
  - Department of Organic and Inorganic Chemistry, Faculty of Science and Technology, University of The Basque Country (UPV/EHU), P.O. Box 644, 48080, Bilbao, Spain
  - Department of Computer Science and Information Technologies, Faculty of Computer Science, CITIC-Research Center of Information and Communication Technologies, University of A Coruña, Campus Elviña s/n, 15071, A Coruña, Spain
- Eider Aranzamendi
  - Department of Organic and Inorganic Chemistry, Faculty of Science and Technology, University of The Basque Country (UPV/EHU), P.O. Box 644, 48080, Bilbao, Spain
- Shan He
  - Department of Organic and Inorganic Chemistry, Faculty of Science and Technology, University of The Basque Country (UPV/EHU), P.O. Box 644, 48080, Bilbao, Spain
  - IKERDATA S.L., ZITEK, University of Basque Country UPVEHU, Rectorate Building, 48940, Leioa, Spain
- Sonia Arrasate
  - Department of Organic and Inorganic Chemistry, Faculty of Science and Technology, University of The Basque Country (UPV/EHU), P.O. Box 644, 48080, Bilbao, Spain
- Cristian R Munteanu
  - Department of Computer Science and Information Technologies, Faculty of Computer Science, CITIC-Research Center of Information and Communication Technologies, University of A Coruña, Campus Elviña s/n, 15071, A Coruña, Spain
- Carlos Fernandez-Lozano
  - Department of Computer Science and Information Technologies, Faculty of Computer Science, CITIC-Research Center of Information and Communication Technologies, University of A Coruña, Campus Elviña s/n, 15071, A Coruña, Spain
- Nuria Sotomayor
  - Department of Organic and Inorganic Chemistry, Faculty of Science and Technology, University of The Basque Country (UPV/EHU), P.O. Box 644, 48080, Bilbao, Spain.
- Esther Lete
  - Department of Organic and Inorganic Chemistry, Faculty of Science and Technology, University of The Basque Country (UPV/EHU), P.O. Box 644, 48080, Bilbao, Spain.
- Humberto González-Díaz
  - Department of Organic and Inorganic Chemistry, Faculty of Science and Technology, University of The Basque Country (UPV/EHU), P.O. Box 644, 48080, Bilbao, Spain.
  - IKERBASQUE, Basque Foundation for Science, 48011, Bilbao, Spain.
2
Durai P, Lee SJ, Lee JW, Pan CH, Park K. Iterative machine learning-based chemical similarity search to identify novel chemical inhibitors. J Cheminform 2023; 15:86. [PMID: 37742003 PMCID: PMC10517535 DOI: 10.1186/s13321-023-00760-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2023] [Accepted: 09/12/2023] [Indexed: 09/25/2023] Open
Abstract
Machine learning-based chemical screening has made substantial progress in recent years. However, these predictions often have low accuracy and high uncertainty when identifying new active chemical scaffolds. Hence, a high proportion of retrieved compounds are not structurally novel. In this study, we proposed a strategy to address this issue by iteratively optimizing an evolutionary chemical binding similarity (ECBS) model using experimental validation data. Various data update and model retraining schemes were tested to efficiently incorporate new experimental data into ECBS models, resulting in a fine-tuned ECBS model with improved accuracy and coverage. To demonstrate the effectiveness of our approach, we identified the novel hit molecules for the mitogen-activated protein kinase kinase 1 (MEK1). These molecules showed sub-micromolar affinity (Kd 0.1-5.3 μM) to MEKs and were distinct from previously-known MEK1 inhibitors. We also determined the binding specificity of different MEK isoforms and proposed potential docking models. Furthermore, using de novo drug design tools, we utilized one of the new MEK inhibitors to generate additional drug-like molecules with improved binding scores. This resulted in the identification of several potential MEK1 inhibitors with better binding affinity scores. Our results demonstrated the potential of this approach for identifying novel hit molecules and optimizing their binding affinities.
Affiliation(s)
- Prasannavenkatesh Durai
  - Natural Product Informatics Research Center, Korea Institute of Science and Technology, Gangneung, 25451, Republic of Korea
- Sue Jung Lee
  - Natural Product Research Center, Korea Institute of Science and Technology, Gangneung, 25451, Republic of Korea
- Jae Wook Lee
  - Natural Product Research Center, Korea Institute of Science and Technology, Gangneung, 25451, Republic of Korea
- Cheol-Ho Pan
  - Natural Product Informatics Research Center, Korea Institute of Science and Technology, Gangneung, 25451, Republic of Korea
- Keunwan Park
  - Natural Product Informatics Research Center, Korea Institute of Science and Technology, Gangneung, 25451, Republic of Korea.
  - Department of YM-KIST Bio-Health Convergence, Yonsei University, Wonju, 26493, Republic of Korea.
3
Bajusz D, Keserű GM. Maximizing the integration of virtual and experimental screening in hit discovery. Expert Opin Drug Discov 2022; 17:629-640. [PMID: 35671403 DOI: 10.1080/17460441.2022.2085685] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/04/2022]
Abstract
INTRODUCTION: Experimental and virtual screening contribute to the discovery of more than 50% of clinical candidates. Considering their similar concepts and goals, early-phase drug discovery would benefit from the effective integration of these approaches. AREAS COVERED: After reviewing the recent trends in both experimental and virtual screening, the authors discuss different integration strategies, including parallel, focused, sequential, and iterative screening. Strategic considerations are demonstrated in a number of real-life case studies. EXPERT OPINION: Experimental and virtual screening are complementary approaches that should be integrated in lead discovery settings. Virtual screening can access extremely large synthetically feasible chemical space that can be effectively searched on GPU clusters or cloud architectures. Experimental screening provides reliable datasets by quantitative HTS applications, and DNA-encoded libraries (DEL) have enlarged the chemical space covered by these technologies. These developments, together with the use of artificial intelligence methods, represent new options for their efficient integration. The case studies discussed here demonstrate the benefits of complementary strategies, such as focused and iterative screening.
Affiliation(s)
- Dávid Bajusz
  - Medicinal Chemistry Research Group, Research Centre for Natural Sciences, Budapest, Hungary
- György M Keserű
  - Medicinal Chemistry Research Group, Research Centre for Natural Sciences, Budapest, Hungary
4
von Korff M, Sander T. Limits of Prediction for Machine Learning in Drug Discovery. Front Pharmacol 2022; 13:832120. [PMID: 35359835 PMCID: PMC8960959 DOI: 10.3389/fphar.2022.832120] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/09/2021] [Accepted: 02/10/2022] [Indexed: 11/13/2022] Open
Abstract
In drug discovery, molecules are optimized towards desired properties. In this context, machine learning is used for extrapolation in drug discovery projects. The limits of extrapolation for regression models are known. However, a systematic analysis of the effectiveness of extrapolation in drug discovery has not yet been performed. In response, this study examined the capabilities of six machine learning algorithms to extrapolate from 243 datasets. The response values calculated from the molecules in the datasets were molecular weight, cLogP, and the number of sp3-atoms. Three experimental set ups were chosen for response values. Shuffled data were used for interpolation, whereas data for extrapolation were sorted from high to low values, and the reverse. Extrapolation with sorted data resulted in much larger prediction errors than extrapolation with shuffled data. Additionally, this study demonstrated that linear machine learning methods are preferable for extrapolation.
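The contrast between shuffled (interpolation) and sorted (extrapolation) splits described above can be sketched on toy data. This is an illustrative example only, not the study's protocol: the synthetic linear response, the 80/20 split, and the choice of a 1-nearest-neighbour model as the non-linear comparator are all assumptions made here.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def nearest_neighbor_predict(train_x, train_y, x):
    """1-nearest-neighbour regression: copy the response of the closest training point."""
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]


def mean_abs_error(truth, preds):
    return sum(abs(t - p) for t, p in zip(truth, preds)) / len(truth)


def evaluate(train, test):
    """Return (linear MAE, 1-NN MAE) on the test part of a split."""
    tx, ty = zip(*train)
    qx, qy = zip(*test)
    a, b = fit_line(tx, ty)
    linear = mean_abs_error(qy, [a * x + b for x in qx])
    knn = mean_abs_error(qy, [nearest_neighbor_predict(tx, ty, x) for x in qx])
    return linear, knn


# Synthetic data (hypothetical): the response depends linearly on one descriptor.
data = [(float(x), 3.0 * x + 5.0) for x in range(100)]

# Interpolation: shuffle, then hold out 20% scattered across the response range.
shuffled = data[:]
random.Random(0).shuffle(shuffled)
interp = evaluate(shuffled[:80], shuffled[80:])

# Extrapolation: sort by response, train on the lowest 80%, test on the highest 20%.
ordered = sorted(data, key=lambda p: p[1])
extrap = evaluate(ordered[:80], ordered[80:])

print("shuffled split (linear MAE, 1-NN MAE):", interp)
print("sorted split   (linear MAE, 1-NN MAE):", extrap)
```

On this toy problem the nearest-neighbour model cannot predict responses beyond the largest value it has seen, so its error grows on the sorted split, while the linear fit carries over unchanged. Real datasets are noisier, so the gap should not be read as a general guarantee.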
5
Gong Y, Xue D, Chuai G, Yu J, Liu Q. DeepReac+: deep active learning for quantitative modeling of organic chemical reactions. Chem Sci 2021; 12:14459-14472. [PMID: 34880997 PMCID: PMC8580052 DOI: 10.1039/d1sc02087k] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/14/2021] [Accepted: 10/08/2021] [Indexed: 11/21/2022] Open
Abstract
Various computational methods have been developed for quantitative modeling of organic chemical reactions; however, the lack of universality as well as the requirement of large amounts of experimental data limit their broad applications. Here, we present DeepReac+, an efficient and universal computational framework for prediction of chemical reaction outcomes and identification of optimal reaction conditions based on deep active learning. Under this framework, DeepReac is designed as a graph-neural-network-based model, which directly takes 2D molecular structures as inputs and automatically adapts to different prediction tasks. In addition, carefully-designed active learning strategies are incorporated to substantially reduce the number of necessary experiments for model training. We demonstrate the universality and high efficiency of DeepReac+ by achieving the state-of-the-art results with a minimum of labeled data on three diverse chemical reaction datasets in several scenarios. Collectively, DeepReac+ has great potential and utility in the development of AI-aided chemical synthesis. DeepReac+ is freely accessible at https://github.com/bm2-lab/DeepReac.
Affiliation(s)
- Yukang Gong
  - Department of Ophthalmology, Shanghai Tenth People's Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University Shanghai 200072 China
- Dongyu Xue
  - Department of Ophthalmology, Shanghai Tenth People's Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University Shanghai 200072 China
- Guohui Chuai
  - Department of Ophthalmology, Shanghai Tenth People's Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University Shanghai 200072 China
- Jing Yu
  - Department of Ophthalmology, Shanghai Tenth People's Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University Shanghai 200072 China
- Qi Liu
  - Department of Ophthalmology, Shanghai Tenth People's Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University Shanghai 200072 China
6
Wang D, Yu J, Chen L, Li X, Jiang H, Chen K, Zheng M, Luo X. A hybrid framework for improving uncertainty quantification in deep learning-based QSAR regression modeling. J Cheminform 2021; 13:69. [PMID: 34544485 PMCID: PMC8454160 DOI: 10.1186/s13321-021-00551-x] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 07/16/2021] [Accepted: 09/05/2021] [Indexed: 11/24/2022] Open
Abstract
Reliable uncertainty quantification for statistical models is crucial in various downstream applications, especially for drug design and discovery where mistakes may incur a large amount of cost. This topic has therefore absorbed much attention and a plethora of methods have been proposed over the past years. The approaches that have been reported so far can be mainly categorized into two classes: distance-based approaches and Bayesian approaches. Although these methods have been widely used in many scenarios and shown promising performance with their distinct superiorities, being overconfident on out-of-distribution examples still poses challenges for the deployment of these techniques in real-world applications. In this study we investigated a number of consensus strategies in order to combine both distance-based and Bayesian approaches together with post-hoc calibration for improved uncertainty quantification in QSAR (Quantitative Structure-Activity Relationship) regression modeling. We employed a set of criteria to quantitatively assess the ranking and calibration ability of these models. Experiments based on 24 bioactivity datasets were designed to make critical comparison between the model we proposed and other well-studied baseline models. Our findings indicate that the hybrid framework proposed by us can robustly enhance the model ability of ranking absolute errors. Together with post-hoc calibration on the validation set, we show that well-calibrated uncertainty quantification results can be obtained in domain shift settings. The complementarity between different methods is also conceptually analyzed.
Affiliation(s)
- Dingyan Wang
  - Shanghai Key Laboratory of Forensic Medicine, Academy of Forensic Science, Shanghai, 200063, China
  - University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Jie Yu
  - University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Lifan Chen
  - University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Xutong Li
  - University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Hualiang Jiang
  - University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Kaixian Chen
  - University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- Mingyue Zheng
  - University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China.
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China.
- Xiaomin Luo
  - Shanghai Key Laboratory of Forensic Medicine, Academy of Forensic Science, Shanghai, 200063, China.
  - University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing, 100049, China.
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China.
7
Rácz A, Bajusz D, Miranda-Quintana RA, Héberger K. Machine learning models for classification tasks related to drug safety. Mol Divers 2021; 25:1409-1424. [PMID: 34110577 PMCID: PMC8342376 DOI: 10.1007/s11030-021-10239-x] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 04/01/2021] [Accepted: 05/27/2021] [Indexed: 12/23/2022]
Abstract
In this review, we outline the current trends in the field of machine learning-driven classification studies related to ADME (absorption, distribution, metabolism and excretion) and toxicity endpoints from the past six years (2015-2021). The study focuses only on classification models with large datasets (i.e. more than a thousand compounds). A comprehensive literature search and meta-analysis was carried out for nine different targets: hERG-mediated cardiotoxicity, blood-brain barrier penetration, permeability glycoprotein (P-gp) substrate/inhibitor, cytochrome P450 enzyme family, acute oral toxicity, mutagenicity, carcinogenicity, respiratory toxicity and irritation/corrosion. The comparison of the best classification models was targeted to reveal the differences between machine learning algorithms and modeling types, endpoint-specific performances, dataset sizes and the different validation protocols. Based on the evaluation of the data, we can say that tree-based algorithms are (still) dominating the field, with consensus modeling being an increasing trend in drug safety predictions. Although one can already find classification models with great performances to hERG-mediated cardiotoxicity and the isoenzymes of the cytochrome P450 enzyme family, these targets are still central to ADMET-related research efforts.
Affiliation(s)
- Anita Rácz
  - Plasma Chemistry Research Group, Research Centre for Natural Sciences, Magyar tudósok krt. 2, Budapest, 1117, Hungary.
- Dávid Bajusz
  - Medicinal Chemistry Research Group, Research Centre for Natural Sciences, Magyar tudósok krt. 2, Budapest, 1117, Hungary
- Károly Héberger
  - Plasma Chemistry Research Group, Research Centre for Natural Sciences, Magyar tudósok krt. 2, Budapest, 1117, Hungary.
8
Kleandrova VV, Speck-Planche A. The QSAR Paradigm in Fragment-Based Drug Discovery: From the Virtual Generation of Target Inhibitors to Multi-Scale Modeling. Mini Rev Med Chem 2021; 20:1357-1374. [PMID: 32013845 DOI: 10.2174/1389557520666200204123156] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/10/2019] [Revised: 10/21/2019] [Accepted: 10/28/2019] [Indexed: 12/24/2022]
Abstract
Fragment-Based Drug Design (FBDD) has established itself as a promising approach in modern drug discovery, accelerating and improving lead optimization, while playing a crucial role in diminishing the high attrition rates at all stages in the drug development process. On the other hand, FBDD has benefited from the application of computational methodologies, where the models derived from the Quantitative Structure-Activity Relationships (QSAR) have become consolidated tools. This mini-review focuses on the evolution and main applications of the QSAR paradigm in the context of FBDD in the last five years. This report places particular emphasis on the QSAR models derived from fragment-based topological approaches to extract physicochemical and/or structural information, allowing to design potentially novel mono- or multi-target inhibitors from relatively large and heterogeneous databases. Here, we also discuss the need to apply multi-scale modeling, to exemplify how different datasets based on target inhibition can be simultaneously integrated and predicted together with other relevant endpoints such as the biological activity against non-biomolecular targets, as well as in vitro and in vivo toxicity and pharmacokinetic properties. In this context, seminal papers are briefly analyzed. As huge amounts of data continue to accumulate in the domains of the chemical, biological and biomedical sciences, it has become clear that drug discovery must be viewed as a multi-scale optimization process. An ideal multi-scale approach should integrate diverse chemical and biological data and also serve as a knowledge generator, enabling the design of potentially optimal chemicals that may become therapeutic agents.
Affiliation(s)
- Valeria V Kleandrova
  - Laboratory of Fundamental and Applied Research of Quality and Technology of Food Production, Moscow State University of Food Production, Volokolamskoe Shosse 11, 125080, Moscow, Russian Federation
- Alejandro Speck-Planche
  - Department of Chemistry, Institute of Pharmacy, I.M. Sechenov First Moscow State Medical University, Trubetskaya Str., 8, b. 2, 119992, Moscow, Russian Federation
9
Watson O, Cortes-Ciriano I, Watson JA. A semi-supervised learning framework for quantitative structure-activity regression modelling. Bioinformatics 2021; 37:342-350. [PMID: 32777821 PMCID: PMC8058768 DOI: 10.1093/bioinformatics/btaa711] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 12/15/2019] [Revised: 07/14/2020] [Accepted: 08/03/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Quantitative structure-activity relationship (QSAR) methods are increasingly used in assisting the process of preclinical, small molecule drug discovery. Regression models are trained on data consisting of a finite-dimensional representation of molecular structures and their corresponding target-specific activities. These supervised learning models can then be used to predict the activity of previously unmeasured novel compounds. RESULTS: This work provides methods that solve three problems in QSAR modelling: (i) a method for comparing the information content between finite-dimensional representations of molecular structures (fingerprints) with respect to the target of interest, (ii) a method that quantifies how the accuracy of the model prediction degrades as a function of the distance between the testing and training data and (iii) a method to adjust for screening-dependent selection bias inherent in many training datasets. For example, in the most extreme cases, only compounds which pass an activity-dependent screening threshold are reported. A semi-supervised learning framework combines (ii) and (iii) and can make predictions that take into account the similarity of the testing compounds to those in the training data and adjust for the reporting selection bias. We illustrate the three methods using publicly available structure-activity data for a large set of compounds reported by GlaxoSmithKline (the Tres Cantos AntiMalarial Set, TCAMS) to inhibit asexual in vitro Plasmodium falciparum growth. AVAILABILITY AND IMPLEMENTATION: https://github.com/owatson/PenalizedPrediction. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Oliver Watson
  - Evariste Technologies Ltd, Goring on Thames RG8 9AL, UK
- Isidro Cortes-Ciriano
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, UK
- James A Watson
  - Centre for Tropical Medicine and Global Health, Nuffield Department of Medicine, University of Oxford, Oxford OX1 2JD, UK
  - Mahidol-Oxford Tropical Medicine Research Unit, Faculty of Tropical Medicine, Mahidol University, Bangkok 10400, Thailand
10
Miranda-Quintana RA, Bajusz D, Rácz A, Héberger K. Extended similarity indices: the benefits of comparing more than two objects simultaneously. Part 1: Theory and characteristics †. J Cheminform 2021; 13:32. [PMID: 33892802 PMCID: PMC8067658 DOI: 10.1186/s13321-021-00505-3] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 09/15/2020] [Accepted: 03/12/2021] [Indexed: 12/14/2022] Open
Abstract
Quantification of the similarity of objects is a key concept in many areas of computational science. This includes cheminformatics, where molecular similarity is usually quantified based on binary fingerprints. While there is a wide selection of available molecular representations and similarity metrics, there were no previous efforts to extend the computational framework of similarity calculations to the simultaneous comparison of more than two objects (molecules) at the same time. The present study bridges this gap, by introducing a straightforward computational framework for comparing multiple objects at the same time and providing extended formulas for as many similarity metrics as possible. In the binary case (i.e. when comparing two molecules pairwise) these are naturally reduced to their well-known formulas. We provide a detailed analysis on the effects of various parameters on the similarity values calculated by the extended formulas. The extended similarity indices are entirely general and do not depend on the fingerprints used. Two types of variance analysis (ANOVA) help to understand the main features of the indices: (i) ANOVA of mean similarity indices; (ii) ANOVA of sum of ranking differences (SRD). Practical aspects and applications of the extended similarity indices are detailed in the accompanying paper: Miranda-Quintana et al. J Cheminform. 2021. https://doi.org/10.1186/s13321-021-00504-4 . Python code for calculating the extended similarity metrics is freely available at: https://github.com/ramirandaq/MultipleComparisons .
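As a point of reference for the binary case mentioned above, the following is a minimal sketch of the familiar pairwise Tanimoto (Jaccard) coefficient on binary fingerprints. It is not the extended n-ary formulation introduced in the paper. Fingerprints are represented here as sets of "on" bit positions, and the example bit sets are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Pairwise Tanimoto coefficient: |A ∩ B| / |A ∪ B| over sets of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints: returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return len(a & b) / union


# Hypothetical fingerprints given as positions of set bits.
fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8, 9, 12}
print(tanimoto(fp1, fp2))
```

Here three bits are shared and six distinct bits are set overall, so the coefficient is 3/6.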
Affiliation(s)
- Dávid Bajusz
  - Medicinal Chemistry Research Group, Research Centre for Natural Sciences, Magyar tudósok krt. 2, 1117, Budapest, Hungary
- Anita Rácz
  - Plasma Chemistry Research Group, ELKH Research Centre for Natural Sciences, Magyar tudósok krt. 2, 1117, Budapest, Hungary
- Károly Héberger
  - Plasma Chemistry Research Group, ELKH Research Centre for Natural Sciences, Magyar tudósok krt. 2, 1117, Budapest, Hungary.
11
Lane TR, Foil DH, Minerali E, Urbina F, Zorn KM, Ekins S. Bioactivity Comparison across Multiple Machine Learning Algorithms Using over 5000 Datasets for Drug Discovery. Mol Pharm 2020; 18:403-415. [PMID: 33325717 DOI: 10.1021/acs.molpharmaceut.0c01013] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Indexed: 12/14/2022]
Abstract
Machine learning methods are attracting considerable attention from the pharmaceutical industry for use in drug discovery and applications beyond. In recent studies, we and others have applied multiple machine learning algorithms and modeling metrics and, in some cases, compared molecular descriptors to build models for individual targets or properties on a relatively small scale. Several research groups have used large numbers of datasets from public databases such as ChEMBL in order to evaluate machine learning methods of interest to them. The largest of these types of studies used on the order of 1400 datasets. We have now extracted well over 5000 datasets from ChEMBL for use with the ECFP6 fingerprint and in comparison of our proprietary software Assay Central with random forest, k-nearest neighbors, support vector classification, naïve Bayesian, AdaBoosted decision trees, and deep neural networks (three layers). Model performance was assessed using an array of fivefold cross-validation metrics including area-under-the-curve, F1 score, Cohen's kappa, and Matthews correlation coefficient. Based on ranked normalized scores for the metrics or datasets, all methods appeared comparable, while the distance from the top indicated that Assay Central and support vector classification were comparable. Unlike prior studies which have placed considerable emphasis on deep neural networks (deep learning), no advantage was seen in this case. If anything, Assay Central may have been at a slight advantage as the activity cutoff for each of the over 5000 datasets representing over 570,000 unique compounds was based on Assay Central performance, although support vector classification seems to be a strong competitor. We also applied Assay Central to perform prospective predictions for the toxicity targets PXR and hERG to further validate these models. This work appears to be the largest scale comparison of these machine learning algorithms to date. Future studies will likely evaluate additional databases, descriptors, and machine learning algorithms and further refine the methods for evaluating and comparing such models.
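Three of the metrics named above (F1 score, Cohen's kappa, and the Matthews correlation coefficient) can all be derived from a single binary confusion matrix. The sketch below uses their standard textbook definitions; the function name and the example counts are hypothetical and do not come from the study, and the zero-denominator fallbacks are conventions chosen for this example.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """F1, Cohen's kappa and MCC from binary confusion-matrix counts."""
    n = tp + fp + fn + tn
    if n == 0:
        raise ValueError("confusion matrix is empty")

    # F1: harmonic mean of precision and recall.
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 0.0

    # Matthews correlation coefficient.
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0

    return {"f1": f1, "kappa": kappa, "mcc": mcc}


# Hypothetical counts for one cross-validation fold.
scores = classification_metrics(tp=40, fp=10, fn=10, tn=40)
print(scores)
```

In a fivefold setting these values would typically be computed per fold and then averaged. Area-under-the-curve is omitted here because it needs ranked prediction scores rather than hard class labels.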
Affiliation(s)
- Thomas R Lane
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
- Daniel H Foil
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
- Eni Minerali
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
- Fabio Urbina
  - Department of Cell Biology and Physiology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599-7545, United States
- Kimberley M Zorn
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
- Sean Ekins
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
12
|
Jiang X, Li S, Zhang H, Wang LL. Discovery of potentially biased agonists of mu-opioid receptor (MOR) through molecular docking, pharmacophore modeling, and MD simulation. Comput Biol Chem 2020; 90:107405. [PMID: 33184004 DOI: 10.1016/j.compbiolchem.2020.107405] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 05/04/2020] [Revised: 10/08/2020] [Accepted: 10/12/2020] [Indexed: 02/06/2023]
Abstract
Opioids are well known for their potent analgesic efficacy and severe side effects. Studies have shown that analgesic effects are mediated by the downstream G-protein-dependent pathway of the μ-opioid receptor (MOR), whereas a separate β-arrestin-dependent pathway mediates side effects such as respiratory depression, constipation, and tolerance. TRV130 is a biased ligand for the G-protein-dependent pathway that provides strong analgesia with fewer side effects than morphine. In this study, a structure similarity search was performed on the IBSSC database using oliceridine (TRV130) and PZM21 as templates. A 3D structure-based pharmacophore model was built and combined with molecular docking to filter small molecules. Finally, based on affinity prediction, four candidate molecules were obtained. Molecular dynamics simulations were used to explore the detailed interaction mechanism between the protein and the small molecules. These results suggest that these candidate molecules are potential MOR agonists.
Affiliation(s)
- Xuan Jiang
- Key Laboratory of Medicinal Chemistry for Natural Resource (Yunnan University), Ministry of Education, School of Chemical Science and Technology, Yunnan University, Kunming, 650091, People's Republic of China; State Key Laboratory of Phytochemistry and Plant Resources in West China, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, People's Republic of China
| | - Shuxiang Li
- Key Laboratory of Medicinal Chemistry for Natural Resource (Yunnan University), Ministry of Education, School of Chemical Science and Technology, Yunnan University, Kunming, 650091, People's Republic of China; State Key Laboratory of Phytochemistry and Plant Resources in West China, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, People's Republic of China
| | - Hongbin Zhang
- Key Laboratory of Medicinal Chemistry for Natural Resource (Yunnan University), Ministry of Education, School of Chemical Science and Technology, Yunnan University, Kunming, 650091, People's Republic of China.
| | - Liang-Liang Wang
- State Key Laboratory of Phytochemistry and Plant Resources in West China, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, 650201, People's Republic of China.
| |
|
13
|
Dreiman GHS, Bictash M, Fish PV, Griffin L, Svensson F. Changing the HTS Paradigm: AI-Driven Iterative Screening for Hit Finding. SLAS DISCOVERY 2020; 26:257-262. [PMID: 32808550 PMCID: PMC7838329 DOI: 10.1177/2472555220949495] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Indexed: 11/16/2022]
Abstract
Iterative screening is a process in which screening is done in batches, with each batch filled by using machine learning to select the most promising compounds from the library based on the previous results. We believe iterative screening is poised to enhance the screening process by improving hit finding while at the same time reducing the number of compounds screened. In addition, we see this process as a key enabler of next-generation high-throughput screening (HTS), which uses more complex assays that better describe the biology but demand more resource per screened compound. To demonstrate the utility of these methods, we retrospectively analyze HTS data from PubChem with a focus on machine learning–based screening strategies that can be readily implemented in practice. Our results show that over a variety of HTS experimental paradigms, an iterative screening setup that screens a total of 35% of the screening collection over as few as three iterations has a median return rate of approximately 70% of the active compounds. Increasing the portion of the library screened to 50% yields median returns of approximately 80% of actives. Using six iterations increases these return rates to 78% and 90%, respectively. The best results were achieved with machine learning models that can be run on a standard desktop. By demonstrating that the utility of iterative screening holds true even with a small number of iterations, and without requiring significant computational resources, we provide a roadmap for the practical implementation of these techniques in hit finding.
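The batch-wise loop described in this abstract can be sketched in a few lines. The example below is a toy illustration under loud assumptions: the library is a dictionary of made-up one-dimensional descriptors, the `assay` callable stands in for the experiment, and a simple nearest-known-active ranking stands in for the machine learning model. None of these names or choices come from the cited paper.

```python
import random


def iterative_screen(library, assay, n_iterations=3, batch_size=10, seed=0):
    """Screen a library in batches, re-ranking untested compounds each round.

    library: dict mapping compound id -> descriptor (float), hypothetical.
    assay:   callable taking a compound id and returning True for a hit.
    The ranking is a nearest-active stand-in for a trained model.
    """
    rng = random.Random(seed)
    untested = set(library)
    results = {}

    # The first batch is chosen at random because no data exist yet.
    batch = rng.sample(sorted(untested), min(batch_size, len(untested)))

    for iteration in range(n_iterations):
        # "Run the assay" on the current batch and record the outcomes.
        for cid in batch:
            results[cid] = assay(cid)
        untested.difference_update(batch)
        if iteration == n_iterations - 1 or not untested:
            break

        # Pick the next batch: compounds closest to any known active.
        actives = [library[c] for c, hit in results.items() if hit]
        if actives:
            ranked = sorted(
                untested,
                key=lambda c: (min(abs(library[c] - a) for a in actives), c),
            )
            batch = ranked[:batch_size]
        else:
            # No hits yet, so fall back to another random batch.
            batch = rng.sample(sorted(untested),
                               min(batch_size, len(untested)))

    return results


if __name__ == "__main__":
    toy_library = {i: i / 100 for i in range(100)}
    is_active = lambda cid: 0.40 <= toy_library[cid] < 0.50
    outcome = iterative_screen(toy_library, is_active)
    print(len(outcome), "compounds screened,", sum(outcome.values()), "hits")
```

In practice the ranking step would be replaced by a real classifier retrained on all results so far; the loop structure (screen, update, re-rank, select) is the part this sketch is meant to show.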
Affiliation(s)
- Gabriel H S Dreiman
- The Alzheimer's Research UK University College London Drug Discovery Institute, London, UK; Department of Computer Science, University College London, London, UK
| | - Magda Bictash
- The Alzheimer's Research UK University College London Drug Discovery Institute, London, UK
| | - Paul V Fish
- The Alzheimer's Research UK University College London Drug Discovery Institute, London, UK
| | - Lewis Griffin
- Department of Computer Science, University College London, London, UK
| | - Fredrik Svensson
- The Alzheimer's Research UK University College London Drug Discovery Institute, London, UK
| |
|
14
|
Watson OP, Cortes-Ciriano I, Taylor AR, Watson JA. A decision-theoretic approach to the evaluation of machine learning algorithms in computational drug discovery. Bioinformatics 2020; 35:4656-4663. [PMID: 31070704 PMCID: PMC6853675 DOI: 10.1093/bioinformatics/btz293] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 12/10/2018] [Revised: 03/22/2019] [Accepted: 04/17/2019] [Indexed: 02/07/2023] Open
Abstract
Motivation: Artificial intelligence, trained via machine learning (e.g. neural nets, random forests) or computational statistical algorithms (e.g. support vector machines, ridge regression), holds much promise for the improvement of small-molecule drug discovery. However, small-molecule structure-activity data are high dimensional with low signal-to-noise ratios and proper validation of predictive methods is difficult. It is poorly understood which, if any, of the currently available machine learning algorithms will best predict new candidate drugs.
Results: The quantile-activity bootstrap is proposed as a new model validation framework using quantile splits on the activity distribution function to construct training and testing sets. In addition, we propose two novel rank-based loss functions which penalize only the out-of-sample predicted ranks of high-activity molecules. The combination of these methods was used to assess the performance of neural nets, random forests, support vector machines (regression) and ridge regression applied to 25 diverse high-quality structure-activity datasets publicly available on ChEMBL. Model validation based on random partitioning of available data favours models that overfit and ‘memorize’ the training set, namely random forests and deep neural nets. Partitioning based on quantiles of the activity distribution correctly penalizes extrapolation of models onto structurally different molecules outside of the training data. Simpler, traditional statistical methods such as ridge regression can outperform state-of-the-art machine learning methods in this setting. In addition, our new rank-based loss functions give considerably different results from mean squared error, highlighting the necessity to define model optimality with respect to the decision task at hand.
Availability and implementation: All software and data are available as Jupyter notebooks found at https://github.com/owatson/QuantileBootstrap.
Supplementary information: Supplementary data are available at Bioinformatics online.
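The idea of partitioning on quantiles of the activity distribution, rather than at random, can be illustrated with a short sketch. This is a simplified assumption of how such a split might look (train on the lower-activity part, test on the top quantile); it omits the bootstrap resampling of the published method, and the function name and example values are hypothetical.

```python
def quantile_split(records, quantile=0.75):
    """Split (compound_id, activity) records at an activity quantile.

    Records below the cut form the training set and the highest-activity
    records form the test set, so the model is evaluated on extrapolation
    to more active molecules. Simplified illustration only.
    """
    ranked = sorted(records, key=lambda record: record[1])
    cut = int(len(ranked) * quantile)
    return ranked[:cut], ranked[cut:]


if __name__ == "__main__":
    # Made-up pIC50-like activity values.
    data = [("c1", 5.1), ("c2", 7.9), ("c3", 6.0), ("c4", 4.2),
            ("c5", 8.4), ("c6", 6.7), ("c7", 5.5), ("c8", 7.0)]
    train, test = quantile_split(data, quantile=0.75)
    print("train:", [cid for cid, _ in train])
    print("test: ", [cid for cid, _ in test])
```

A random split would scatter high-activity compounds across both sets, which is the situation the abstract argues rewards memorization of the training data.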
Affiliation(s)
| | - Isidro Cortes-Ciriano
- Evariste Technologies Ltd., Goring on Thames RG8 9AL, UK; Department of Chemistry, Centre for Molecular Science Informatics, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK
| | - Aimee R Taylor
- Department of Epidemiology, Center for Communicable Disease Dynamics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA; Infectious Disease Microbiome Program, Broad Institute, Cambridge, MA 02142, USA
| | - James A Watson
- Nuffield Department of Medicine, Centre for Tropical Medicine and Global Health, University of Oxford, Oxford OX3 7LF, UK; Mahidol-Oxford Tropical Medicine Research Unit, Faculty of Tropical Medicine, Mahidol University, Bangkok 10400, Thailand
| |
|
15
|
Drakakis G, Cortés-Ciriano I, Alexander-Dann B, Bender A. Elucidating Compound Mechanism of Action and Predicting Cytotoxicity Using Machine Learning Approaches, Taking Prediction Confidence into Account. Curr Protoc Chem Biol 2020; 11:e73. [PMID: 31483099 DOI: 10.1002/cpch.73] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 02/06/2023]
Abstract
The modes of action (MoAs) of drugs frequently are unknown, because many are small molecules initially identified from phenotypic screens, giving rise to the need to elucidate their MoAs. In addition, the high attrition rate for candidate drugs in preclinical studies due to intolerable toxicity has motivated the development of computational approaches to predict drug candidate (cyto)toxicity as early as possible in the drug-discovery process. Here, we provide detailed instructions for capitalizing on bioactivity predictions to elucidate the MoAs of small molecules and infer their underlying phenotypic effects. We illustrate how these predictions can be used to infer the underlying antidepressive effects of marketed drugs. We also provide the necessary functionalities to model cytotoxicity data using single and ensemble machine-learning algorithms. Finally, we give detailed instructions on how to calculate confidence intervals for individual predictions using the conformal prediction framework. © 2019 by John Wiley & Sons, Inc.
Affiliation(s)
- Georgios Drakakis
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, United Kingdom
| | - Isidro Cortés-Ciriano
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, United Kingdom
| | - Ben Alexander-Dann
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, United Kingdom
| | - Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, United Kingdom
| |
|
16
|
Bosc N, Atkinson F, Félix E, Gaulton A, Hersey A, Leach AR. Reply to "Missed opportunities in large scale comparison of QSAR and conformal prediction methods and their applications in drug discovery". J Cheminform 2019; 11:64. [PMID: 33430932 PMCID: PMC6831531 DOI: 10.1186/s13321-019-0388-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 10/04/2019] [Accepted: 10/22/2019] [Indexed: 11/10/2022] Open
Abstract
In response to Krstajic's letter to the editor concerning our published paper, we here take the opportunity to reply, to re-iterate that no errors in our work were identified, to provide further details, and to re-emphasise the outputs of our study. Moreover, we highlight that all of the data are freely available for the wider scientific community (including the aforementioned correspondent) to undertake follow-on studies and comparisons.
Affiliation(s)
- Nicolas Bosc
- Chemogenomics Team, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
| | - Francis Atkinson
- Chemogenomics Team, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Eloy Félix
- Chemogenomics Team, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Anna Gaulton
- Chemogenomics Team, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Anne Hersey
- Chemogenomics Team, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Andrew R Leach
- Chemogenomics Team, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| |
|
17
|
Predicting kinase inhibitors using bioactivity matrix derived informer sets. PLoS Comput Biol 2019; 15:e1006813. [PMID: 31381559 PMCID: PMC6695194 DOI: 10.1371/journal.pcbi.1006813] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 01/24/2019] [Revised: 08/15/2019] [Accepted: 07/13/2019] [Indexed: 12/21/2022] Open
Abstract
Prediction of compounds that are active against a desired biological target is a common step in drug discovery efforts. Virtual screening methods seek some active-enriched fraction of a library for experimental testing. Where data are too scarce to train supervised learning models for compound prioritization, initial screening must provide the necessary data. Commonly, such an initial library is selected on the basis of chemical diversity by some pseudo-random process (for example, the first few plates of a larger library) or by selecting an entire smaller library. These approaches may not produce a sufficient number or diversity of actives. An alternative approach is to select an informer set of screening compounds on the basis of chemogenomic information from previous testing of compounds against a large number of targets. We compare different ways of using chemogenomic data to choose a small informer set of compounds based on previously measured bioactivity data. We develop this Informer-Based-Ranking (IBR) approach using the Published Kinase Inhibitor Sets (PKIS) as the chemogenomic data to select the informer sets. We test the informer compounds on a target that is not part of the chemogenomic data, then predict the activity of the remaining compounds based on the experimental informer data and the chemogenomic data. Through new chemical screening experiments, we demonstrate the utility of IBR strategies in a prospective test on three kinase targets not included in the PKIS.
In the early stages of drug discovery efforts, computational models are used to predict activity and prioritize compounds for experimental testing. New targets commonly lack the data necessary to build effective models, and the screening needed to generate that experimental data can be costly. We seek to improve the efficiency of the initial screening phase, and of the process of prioritizing compounds for subsequent screening. We choose a small informer set of compounds based on publicly available prior screening data on distinct targets. We then collect experimental data on these informer compounds and use that data to predict the activity of other compounds in the set for the target of interest. Computational and statistical tools are needed to identify informer compounds and to prioritize other compounds for subsequent phases of screening. We find that selection of informer compounds on the basis of bioactivity data from previous screening efforts is superior to the traditional approach of selection of a chemically diverse subset of compounds. We demonstrate the success of this approach in retrospective tests on the Published Kinase Inhibitor Sets (PKIS) chemogenomic data and in prospective experimental screens against three additional non-human kinase targets.
|
18
|
Oranje P, Gouka R, Burggraaff L, Vermeer M, Chalet C, Duchateau G, van der Pijl P, Geldof M, de Roo N, Clauwaert F, Vanpaeschen T, Nicolaï J, de Bruyn T, Annaert P, IJzerman AP, van Westen GJP. Novel natural and synthetic inhibitors of solute carriers SGLT1 and SGLT2. Pharmacol Res Perspect 2019; 7:e00504. [PMID: 31384471 PMCID: PMC6664820 DOI: 10.1002/prp2.504] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 03/21/2019] [Revised: 06/05/2019] [Accepted: 06/06/2019] [Indexed: 12/12/2022] Open
Abstract
Selective analogs of the natural glycoside phloridzin are marketed drugs that reduce hyperglycemia in diabetes by inhibiting the active sodium glucose cotransporter SGLT2 in the kidneys. In addition, intestinal SGLT1 is now recognized as a target for glycemic control. To expand available type 2 diabetes remedies, we aimed to find novel SGLT1 inhibitors beyond the chemical space of glycosides. We screened a bioactive compound library for SGLT1 inhibitors and tested primary hits and additional structurally similar molecules on SGLT1 and SGLT2 (SGLT1/2). Novel SGLT1/2 inhibitors were discovered in separate chemical clusters of natural and synthetic compounds. These have IC50-values in the 10-100 μmol/L range. The most potent identified novel inhibitors from different chemical clusters are (SGLT1-IC50 Mean ± SD, SGLT2-IC50 Mean ± SD): (+)-pteryxin (12 ± 2 μmol/L, 9 ± 4 μmol/L), (+)-ε-viniferin (58 ± 18 μmol/L, 110 μmol/L), quinidine (62 μmol/L, 56 μmol/L), cloperastine (9 ± 3 μmol/L, 9 ± 7 μmol/L), bepridil (10 ± 5 μmol/L, 14 ± 12 μmol/L), trihexyphenidyl (12 ± 1 μmol/L, 20 ± 13 μmol/L) and bupivacaine (23 ± 14 μmol/L, 43 ± 29 μmol/L). The discovered natural inhibitors may be further investigated as new potential (prophylactic) agents for controlling dietary glucose uptake. The new diverse structure activity data can provide a starting point for the optimization of novel SGLT1/2 inhibitors and support the development of virtual SGLT1/2 inhibitor screening models.
Affiliation(s)
- Paul Oranje
- Unilever Research & Development, Vlaardingen, The Netherlands
| | - Robin Gouka
- Unilever Research & Development, Vlaardingen, The Netherlands
| | - Lindsey Burggraaff
- Division of Drug Discovery & Safety, Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands
| | - Mario Vermeer
- Unilever Research & Development, Vlaardingen, The Netherlands
| | - Clément Chalet
- Unilever Research & Development, Vlaardingen, The Netherlands
| | - Guus Duchateau
- Unilever Research & Development, Vlaardingen, The Netherlands
| | | | - Marian Geldof
- Unilever Research & Development, Vlaardingen, The Netherlands
| | - Niels de Roo
- Unilever Research & Development, Vlaardingen, The Netherlands
| | - Fenja Clauwaert
- Drug Delivery and Disposition, Department of Pharmaceutical and Pharmacological Sciences, KU Leuven, Leuven, Belgium
| | - Toon Vanpaeschen
- Drug Delivery and Disposition, Department of Pharmaceutical and Pharmacological Sciences, KU Leuven, Leuven, Belgium
| | - Johan Nicolaï
- Drug Delivery and Disposition, Department of Pharmaceutical and Pharmacological Sciences, KU Leuven, Leuven, Belgium
| | - Tom de Bruyn
- Drug Delivery and Disposition, Department of Pharmaceutical and Pharmacological Sciences, KU Leuven, Leuven, Belgium
| | - Pieter Annaert
- Drug Delivery and Disposition, Department of Pharmaceutical and Pharmacological Sciences, KU Leuven, Leuven, Belgium
| | - Adriaan P. IJzerman
- Division of Drug Discovery & Safety, Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands
| | - Gerard J. P. van Westen
- Division of Drug Discovery & Safety, Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands
| |
|
19
|
Cortés-Ciriano I, Bender A. Reliable Prediction Errors for Deep Neural Networks Using Test-Time Dropout. J Chem Inf Model 2019; 59:3330-3339. [PMID: 31241929 DOI: 10.1021/acs.jcim.9b00297] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Indexed: 12/14/2022]
Abstract
While the use of deep learning in drug discovery is gaining increasing attention, the lack of methods to compute reliable errors in prediction for Neural Networks prevents their application to guide decision making in domains where identifying unreliable predictions is essential, e.g., precision medicine. Here, we present a framework to compute reliable errors in prediction for Neural Networks using Test-Time Dropout and Conformal Prediction. Specifically, the algorithm consists of training a single Neural Network using dropout, and then applying it N times to both the validation and test sets, also employing dropout in this step. Therefore, for each instance in the validation and test sets an ensemble of predictions is generated. The residuals and absolute errors in prediction for the validation set are then used to compute prediction errors for the test set instances using Conformal Prediction. We show using 24 bioactivity data sets from ChEMBL 23 that Dropout Conformal Predictors are valid (i.e., the fraction of instances whose true value lies within the predicted interval strongly correlates with the confidence level) and efficient, as the predicted confidence intervals span a narrower set of values than those computed with Conformal Predictors generated using Random Forest (RF) models. Lastly, we show in retrospective virtual screening experiments that dropout and RF-based Conformal Predictors lead to comparable retrieval rates of active compounds. Overall, we propose a computationally efficient framework (as only N extra forward passes are required in addition to training a single network) to harness Test-Time Dropout and the Conformal Prediction framework, which is generally applicable to generate reliable prediction errors for Deep Neural Networks in drug discovery and beyond.
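The step from an ensemble of dropout predictions to a prediction interval can be sketched as follows. This is a minimal illustration of split conformal prediction under stated assumptions: each instance already has a list of N predictions (here hard-coded rather than produced by a network), and the nonconformity score is the absolute residual divided by the ensemble standard deviation. That normalization is an assumed, commonly used choice and may differ from the exact scoring in the cited paper; all names are hypothetical.

```python
import math
import statistics


def conformal_intervals(cal_preds, cal_true, test_preds, confidence=0.8):
    """Build prediction intervals from ensembles of predictions.

    cal_preds:  list of prediction lists, one per calibration instance.
    cal_true:   observed values for the calibration instances.
    test_preds: list of prediction lists, one per test instance.
    Nonconformity = |y - ensemble mean| / ensemble std (assumed choice).
    """
    scores = []
    for preds, y in zip(cal_preds, cal_true):
        mean = statistics.fmean(preds)
        spread = statistics.pstdev(preds) or 1e-8  # guard against zero spread
        scores.append(abs(y - mean) / spread)
    scores.sort()

    # Conformal quantile with the usual (n + 1) finite-sample correction.
    rank = math.ceil((len(scores) + 1) * confidence) - 1
    q = scores[min(rank, len(scores) - 1)]

    intervals = []
    for preds in test_preds:
        mean = statistics.fmean(preds)
        spread = statistics.pstdev(preds) or 1e-8
        intervals.append((mean - q * spread, mean + q * spread))
    return intervals


if __name__ == "__main__":
    # Made-up ensembles standing in for N stochastic forward passes.
    calibration = [[1.0, 3.0], [4.0, 6.0], [0.0, 2.0], [9.0, 11.0]]
    observed = [2.5, 4.0, 1.0, 12.0]
    print(conformal_intervals(calibration, observed, [[2.0, 4.0]],
                              confidence=0.5))
```

Scaling by the ensemble spread gives wider intervals to instances on which the stochastic passes disagree, which is the motivation for combining an ensemble with conformal prediction in the first place.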
Affiliation(s)
- Isidro Cortés-Ciriano
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
| | - Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
| |
|
20
|
Miyao T, Funatsu K. Iterative Screening Methods for Identification of Chemical Compounds with Specific Values of Various Properties. J Chem Inf Model 2019; 59:2626-2641. [PMID: 31058504 DOI: 10.1021/acs.jcim.9b00093] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 11/28/2022]
Abstract
Identification of chemical compounds having desirable properties is a central goal of screening campaigns. Iterative screening is a means of surveying a set of compounds, during which their property values are determined and used as feedback for regression models. Quantitative models that assess the relationships between chemical structures and property/activity are repeatedly updated through this type of cycle, and the efficient sampling of compounds for the subsequent test is a key factor in the early identification of target compounds. Nevertheless, methodological approaches to comparisons and to establishing the degree of extrapolation of sampled compounds, including the effects of applicability domains, are still required. In the present study, we conducted a series of virtual experiments to assess the characteristics of different iterative screening methods. Genetic algorithm-based partial least-squares regression, support vector regression, Bayesian optimization with Gaussian Process (GP), and batch-based Bayesian optimization with GP (GP_batch) were all compared, based on the analysis of one million compounds extracted from the ZINC database. Our results show that, irrespective of the diversity of the initial set of compounds, it was possible to identify a compound having the desired property value using the appropriate screening method. However, overall, the GP_batch method was found to be preferable when evaluating properties that are either difficult to predict or for which a key factor is present in the set of molecular descriptors.
Affiliation(s)
- Tomoyuki Miyao
- Data Science Center and Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
| | - Kimito Funatsu
- Data Science Center and Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan; Department of Chemical System Engineering, School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
| |
|
21
|
Cortés-Ciriano I, Bender A. KekuleScope: prediction of cancer cell line sensitivity and compound potency using convolutional neural networks trained on compound images. J Cheminform 2019; 11:41. [PMID: 31218493 PMCID: PMC6582521 DOI: 10.1186/s13321-019-0364-5] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Received: 02/19/2019] [Accepted: 06/09/2019] [Indexed: 02/08/2023] Open
Abstract
The application of convolutional neural networks (ConvNets) to harness high-content screening images or 2D compound representations is gaining increasing attention in drug discovery. However, existing applications often require large data sets for training, or sophisticated pretraining schemes. Here, we show using 33 IC50 data sets from ChEMBL 23 that the in vitro activity of compounds on cancer cell lines and protein targets can be accurately predicted on a continuous scale from their Kekulé structure representations alone by extending existing architectures (AlexNet, DenseNet-201, ResNet152 and VGG-19), which were pretrained on unrelated image data sets. We show that the predictive power of the generated models, which just require standard 2D compound representations as input, is comparable to that of Random Forest (RF) models and fully-connected Deep Neural Networks trained on circular (Morgan) fingerprints. Notably, including additional fully-connected layers further increases the predictive power of the ConvNets by up to 10%. Analysis of the predictions generated by RF models and ConvNets shows that simply averaging the output of the RF models and ConvNets yields significantly lower errors in prediction for multiple data sets than those obtained with either model alone, although the effect size is small, indicating that the features extracted by the convolutional layers of the ConvNets provide complementary predictive signal to Morgan fingerprints. Lastly, we show that multi-task ConvNets trained on compound images make it possible to model COX isoform selectivity on a continuous scale with errors in prediction comparable to the uncertainty of the data. Overall, in this work we present a set of ConvNet architectures for the prediction of compound activity from their Kekulé structure representations with state-of-the-art performance, which require no generation of compound descriptors or use of sophisticated image processing techniques.
The code needed to reproduce the results presented in this study and all the data sets are provided at https://github.com/isidroc/kekulescope .
Affiliation(s)
- Isidro Cortés-Ciriano
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW UK
| | - Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW UK
| |
|
22
|
Jansen JM, De Pascale G, Fong S, Lindvall M, Moser HE, Pfister K, Warne B, Wartchow C. Biased Complement Diversity Selection for Effective Exploration of Chemical Space in Hit-Finding Campaigns. J Chem Inf Model 2019; 59:1709-1714. [DOI: 10.1021/acs.jcim.9b00048] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 01/02/2023]
Affiliation(s)
- Johanna M. Jansen
- Novartis Institutes for BioMedical Research, 5300 Chiron Way, Emeryville, California 94608, United States
| | - Gianfranco De Pascale
- Novartis Institutes for BioMedical Research, 5300 Chiron Way, Emeryville, California 94608, United States
| | - Susan Fong
- Novartis Institutes for BioMedical Research, 5300 Chiron Way, Emeryville, California 94608, United States
| | - Mika Lindvall
- Novartis Institutes for BioMedical Research, 5300 Chiron Way, Emeryville, California 94608, United States
| | - Heinz E. Moser
- Novartis Institutes for BioMedical Research, 5300 Chiron Way, Emeryville, California 94608, United States
| | - Keith Pfister
- Novartis Institutes for BioMedical Research, 5300 Chiron Way, Emeryville, California 94608, United States
| | - Bob Warne
- Novartis Institutes for BioMedical Research, 5300 Chiron Way, Emeryville, California 94608, United States
| | - Charles Wartchow
- Novartis Institutes for BioMedical Research, 5300 Chiron Way, Emeryville, California 94608, United States
| |
|
23
|
Raschka S. Automated discovery of GPCR bioactive ligands. Curr Opin Struct Biol 2019; 55:17-24. [PMID: 30909105 DOI: 10.1016/j.sbi.2019.02.011] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 12/15/2018] [Accepted: 02/19/2019] [Indexed: 12/22/2022]
Abstract
While G-protein-coupled receptors (GPCRs) constitute the largest class of membrane proteins, structures and endogenous ligands of a large portion of GPCRs remain unknown. Because of the involvement of GPCRs in various signaling pathways and physiological roles, the identification of endogenous ligands as well as designing novel drugs is of high interest to the research and medical communities. Along with highlighting the recent advances in structure-based ligand discovery, including docking and molecular dynamics, this article focuses on the latest advances for automating the discovery of bioactive ligands using machine learning. Machine learning is centered around the development and applications of algorithms that can learn from data automatically. Such an approach offers immense opportunities for bioactivity prediction as well as quantitative structure-activity relationship studies. This review describes the most recent and successful applications of machine learning for bioactive ligand discovery, concluding with an outlook on deep learning methods that are capable of automatically extracting salient information from structural data as a promising future direction for rapid and efficient bioactive ligand discovery.
Affiliation(s)
- Sebastian Raschka
- Department of Statistics, University of Wisconsin-Madison, 1300 Medical Sciences Center, Madison, WI 53706, USA.
| |
|
24
|
Cortés-Ciriano I, Bender A. Deep Confidence: A Computationally Efficient Framework for Calculating Reliable Prediction Errors for Deep Neural Networks. J Chem Inf Model 2018; 59:1269-1281. [DOI: 10.1021/acs.jcim.8b00542] [Citation(s) in RCA: 46] [Impact Index Per Article: 6.6] [Indexed: 02/05/2023]
Affiliation(s)
- Isidro Cortés-Ciriano
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
| | - Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
| |
|