1
Shen J, Nicolaou CA. Molecular property prediction: recent trends in the era of artificial intelligence. Drug Discovery Today: Technologies 2020; 32-33:29-36. [PMID: 33386091 DOI: 10.1016/j.ddtec.2020.05.001] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 12/11/2019] [Revised: 03/10/2020] [Accepted: 04/06/2020] [Indexed: 12/18/2022]
Abstract
Artificial intelligence (AI) has become a powerful tool in many fields, including drug discovery. Among various AI applications, molecular property prediction can have more significant immediate impact to the drug discovery process since most algorithms and methods use predicted properties to evaluate, select, and generate molecules. Herein, we provide a brief review of the state-of-art molecular property prediction methodologies and discuss examples reported recently. We highlight key techniques that have been applied to molecular property prediction such as learned representation, multi-task learning, transfer learning, and federated learning. We also point out some critical but less discussed issues such as data set quality, benchmark, model performance evaluation, and prediction confidence quantification.
Affiliation(s)
- Jie Shen
- Advanced Analytics and Data Sciences, Eli Lilly and Company, Indianapolis, IN 46285, United States.
- Christos A Nicolaou
- Discovery Chemistry Research & Technologies, Eli Lilly and Company, Indianapolis, IN 46285, United States.
2
Irwin BWJ, Levell JR, Whitehead TM, Segall MD, Conduit GJ. Practical Applications of Deep Learning To Impute Heterogeneous Drug Discovery Data. J Chem Inf Model 2020; 60:2848-2857. [PMID: 32478517 DOI: 10.1021/acs.jcim.0c00443] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Indexed: 01/20/2023]
Abstract
Contemporary deep learning approaches still struggle to bring a useful improvement in the field of drug discovery because of the challenges of sparse, noisy, and heterogeneous data that are typically encountered in this context. We use a state-of-the-art deep learning method, Alchemite, to impute data from drug discovery projects, including multitarget biochemical activities, phenotypic activities in cell-based assays, and a variety of absorption, distribution, metabolism, and excretion (ADME) endpoints. The resulting model gives excellent predictions for activity and ADME endpoints, offering an average increase in R² of 0.22 versus quantitative structure-activity relationship methods. The model accuracy is robust to combining data across uncorrelated endpoints and projects with different chemical spaces, enabling a single model to be trained for all compounds and endpoints. We demonstrate improvements in accuracy on the latest chemistry and data when updating models with new data as an ongoing medicinal chemistry project progresses.
Affiliation(s)
- Benedict W J Irwin
- Optibrium Limited, Cambridge Innovation Park, Denny End Rd, Cambridge CB25 9PB, U.K.; Cavendish Laboratory, University of Cambridge, 19 JJ Thomson Avenue, Cambridge CB3 0HE, U.K.
- Julian R Levell
- Constellation Pharmaceuticals Inc., 215 First St Suite 200, Cambridge, Massachusetts 02142, United States
- Thomas M Whitehead
- Intellegens Limited, Eagle Labs, 28 Chesterton Road, Cambridge CB4 3AZ, U.K.
- Matthew D Segall
- Optibrium Limited, Cambridge Innovation Park, Denny End Rd, Cambridge CB25 9PB, U.K.
- Gareth J Conduit
- Intellegens Limited, Eagle Labs, 28 Chesterton Road, Cambridge CB4 3AZ, U.K.; Cavendish Laboratory, University of Cambridge, 19 JJ Thomson Avenue, Cambridge CB3 0HE, U.K.
3
Nicolaou CA, Humblet C, Hu H, Martin EM, Dorsey FC, Castle TM, Burton KI, Hu H, Hendle J, Hickey MJ, Duerksen J, Wang J, Erickson JA. Idea2Data: Toward a New Paradigm for Drug Discovery. ACS Med Chem Lett 2019; 10:278-286. [PMID: 30891127 DOI: 10.1021/acsmedchemlett.8b00488] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 10/18/2018] [Accepted: 02/04/2019] [Indexed: 12/14/2022] Open
Abstract
Increasing the success rate and throughput of drug discovery will require efficiency improvements throughout the process that is currently used in the pharmaceutical community, including the crucial step of identifying hit compounds to act as drivers for subsequent optimization. Hit identification can be carried out through large compound collection screening and often involves the generation and testing of many hypotheses based on available knowledge. In practice, hypothesis generation can involve the selection of promising chemical structures from compound collections using predictive models built from previous screening/assay results. Available physical collections, typically used during hit identification, are of the order of 10^6 compounds but represent only a small fraction of the small molecule drug-like chemical space. In an effort to survey a larger portion of chemical space and eliminate inefficiencies during hit identification, we introduce a new process, termed Idea2Data (I2D) that tightly integrates computational and experimental components of the drug discovery process. I2D provides the ability to connect a vast virtual collection of compounds readily synthesizable on automated synthesis systems with computational predictive models for the identification of promising structures. This new paradigm enables researchers to process billions of virtual molecules and select structures that can be prepared on automated systems and made available for biological testing, allowing for timely hypothesis testing and follow-up. Since its introduction, I2D has positively impacted several portfolio efforts through identification of new chemical scaffolds and functionalization of existing scaffolds. In this Innovations paper, we describe the I2D process and present an application for the discovery of new ULK inhibitors.
Affiliation(s)
- Christos A. Nicolaou
- Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
- Christine Humblet
- Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
- Hong Hu
- Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
- Eva M. Martin
- Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
- Frank C. Dorsey
- Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
- Thomas M. Castle
- Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
- Keith Ian Burton
- Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
- Haitao Hu
- Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
- Jorg Hendle
- Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
- Michael J. Hickey
- Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
- Joel Duerksen
- Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
- Jibo Wang
- Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
- Jon A. Erickson
- Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
4
Whitehead TM, Irwin BWJ, Hunt P, Segall MD, Conduit GJ. Imputation of Assay Bioactivity Data Using Deep Learning. J Chem Inf Model 2019; 59:1197-1204. [PMID: 30753070 DOI: 10.1021/acs.jcim.8b00768] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Indexed: 11/28/2022]
Abstract
We describe a novel deep learning neural network method and its application to impute assay pIC50 values. Unlike conventional machine learning approaches, this method is trained on sparse bioactivity data as input, typical of that found in public and commercial databases, enabling it to learn directly from correlations between activities measured in different assays. In two case studies on public domain data sets we show that the neural network method outperforms traditional quantitative structure-activity relationship (QSAR) models and other leading approaches. Furthermore, by focusing on only the most confident predictions the accuracy is increased to R² > 0.9 using our method, as compared to R² = 0.44 when reporting all predictions.
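The confidence-filtering effect mentioned in this abstract can be illustrated with a small sketch. The numbers below are made-up illustrative values, not results from the paper, and the helper names (`r_squared`, `most_confident`, `score`) are hypothetical. The paper's own uncertainty estimates come from its neural network, which is not reproduced here.

```python
from statistics import mean

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def most_confident(records, fraction):
    """Keep the given fraction of records with the smallest predicted uncertainty."""
    ranked = sorted(records, key=lambda r: r[2])
    keep = max(2, round(len(ranked) * fraction))  # need at least 2 points for R^2
    return ranked[:keep]

def score(subset):
    """R^2 of predicted versus measured values for a list of records."""
    return r_squared([r[0] for r in subset], [r[1] for r in subset])

# (measured pIC50, predicted pIC50, predicted uncertainty): illustrative values only
records = [
    (5.0, 5.1, 0.1),
    (6.0, 6.1, 0.1),
    (7.0, 6.9, 0.2),
    (8.0, 8.1, 0.2),
    (5.5, 7.0, 0.9),
    (7.5, 6.0, 1.0),
]

r2_all = score(records)
r2_confident = score(most_confident(records, 0.67))
print(f"R2, all predictions: {r2_all:.2f}")
print(f"R2, most confident:  {r2_confident:.2f}")
```

In this toy data the two high-uncertainty predictions are also the least accurate, so dropping them raises R². That only holds in practice when the uncertainty estimates are well calibrated.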
Affiliation(s)
- T M Whitehead
- Intellegens, Eagle Labs, Chesterton Road, Cambridge CB4 3AZ, United Kingdom
- B W J Irwin
- Optibrium, F5-6 Blenheim House, Cambridge Innovation Park, Denny End Road, Cambridge CB25 9PB, United Kingdom
- P Hunt
- Optibrium, F5-6 Blenheim House, Cambridge Innovation Park, Denny End Road, Cambridge CB25 9PB, United Kingdom
- M D Segall
- Optibrium, F5-6 Blenheim House, Cambridge Innovation Park, Denny End Road, Cambridge CB25 9PB, United Kingdom
- G J Conduit
- Intellegens, Eagle Labs, Chesterton Road, Cambridge CB4 3AZ, United Kingdom; Cavendish Laboratory, University of Cambridge, J.J. Thomson Avenue, Cambridge CB3 0HE, United Kingdom
5
Zhou Y, Cahya S, Combs SA, Nicolaou CA, Wang J, Desai PV, Shen J. Exploring Tunable Hyperparameters for Deep Neural Networks with Industrial ADME Data Sets. J Chem Inf Model 2019; 59:1005-1016. [PMID: 30586300 DOI: 10.1021/acs.jcim.8b00671] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Indexed: 01/10/2023]
Abstract
Deep learning has drawn significant attention in different areas including drug discovery. It has been proposed that it could outperform other machine learning algorithms, especially with big data sets. In the field of pharmaceutical industry, machine learning models are built to understand quantitative structure-activity relationships (QSARs) and predict molecular activities, including absorption, distribution, metabolism, and excretion (ADME) properties, using only molecular structures. Previous reports have demonstrated the advantages of using deep neural networks (DNNs) for QSAR modeling. One of the challenges while building DNN models is identifying the hyperparameters that lead to better generalization of the models. In this study, we investigated several tunable hyperparameters of deep neural network models on 24 industrial ADME data sets. We analyzed the sensitivity and influence of five different hyperparameters including the learning rate, weight decay for L2 regularization, dropout rate, activation function, and the use of batch normalization. This paper focuses on strategies and practices for DNN model building. Further, the optimized model for each data set was built and compared with the benchmark models used in production. Based on our benchmarking results, we propose several practices for building DNN QSAR models.
Affiliation(s)
- Yadi Zhou
- Department of Chemistry and Biochemistry, Ohio University, Athens, Ohio 45701, United States
6
Martin EJ, Polyakov VR, Tian L, Perez RC. Profile-QSAR 2.0: Kinase Virtual Screening Accuracy Comparable to Four-Concentration IC50s for Realistically Novel Compounds. J Chem Inf Model 2017. [PMID: 28651433 DOI: 10.1021/acs.jcim.7b00166] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.7] [Indexed: 11/28/2022]
Abstract
While conventional random forest regression (RFR) virtual screening models appear to have excellent accuracy on random held-out test sets, they prove lacking in actual practice. Analysis of 18 historical virtual screens showed that random test sets are far more similar to their training sets than are the compounds project teams actually order. A new, cluster-based "realistic" training/test set split, which mirrors the chemical novelty of real-life virtual screens, recapitulates the poor predictive power of RFR models in real projects. The original Profile-QSAR (pQSAR) method greatly broadened the domain of applicability over conventional models by using as independent variables a profile of activity predictions from all historical assays in a large protein family. However, the accuracy still fell short of experiment on realistic test sets. The improved "pQSAR 2.0" method replaces probabilities of activity from naïve Bayes categorical models at several thresholds with predicted IC50s from RFR models. Unexpectedly, the high accuracy also requires removing the RFR model for the actual assay of interest from the independent variable profile. With these improvements, pQSAR 2.0 activity predictions are now statistically comparable to medium-throughput four-concentration IC50 measurements even on the realistic test set. Beyond the yes/no activity predictions from a typical high-throughput screen (HTS) or conventional virtual screen, these semiquantitative IC50 predictions allow for predicted potency, ligand efficiency, lipophilic efficiency, and selectivity against antitargets, greatly facilitating hitlist triaging and enabling virtual screening panels such as toxicity panels and overall promiscuity predictions.
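The idea of a cluster-based split, as opposed to a random one, can be sketched as follows. This is a simplified illustration under stated assumptions: cluster labels are taken as given (the paper's clustering procedure is not reproduced), and the function name `cluster_split` and the rule of holding out the smallest clusters first are hypothetical choices, not the authors' published method.

```python
from collections import defaultdict

def cluster_split(compounds, clusters, test_fraction=0.25):
    """Split compounds so that no cluster appears in both train and test.

    Whole clusters are assigned to the test set, smallest first (an assumed
    rule), until the test set reaches roughly test_fraction of the data.
    """
    groups = defaultdict(list)
    for compound, label in zip(compounds, clusters):
        groups[label].append(compound)

    # Smallest clusters first; ties broken by label so the result is deterministic.
    ordered = sorted(groups.items(), key=lambda kv: (len(kv[1]), str(kv[0])))

    target = test_fraction * len(compounds)
    train, test = [], []
    for label, members in ordered:
        if len(test) < target:
            test.extend(members)
        else:
            train.extend(members)
    return train, test

# Hypothetical compound ids and cluster labels
compounds = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
clusters = ["A", "A", "A", "A", "B", "B", "C", "D"]

train, test = cluster_split(compounds, clusters)
print("train:", train)
print("test: ", test)
```

Because every test compound comes from a cluster the model never saw during training, the test set is structurally less similar to the training set than a random hold-out would be, which is the property the abstract attributes to the "realistic" split.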
Affiliation(s)
- Eric J Martin
- Novartis Institutes for Biomedical Research, 5300 Chiron Way, Emeryville, California 94608-2916, United States
- Valery R Polyakov
- Novartis Institutes for Biomedical Research, 5300 Chiron Way, Emeryville, California 94608-2916, United States
- Li Tian
- Novartis Institutes for Biomedical Research, 5300 Chiron Way, Emeryville, California 94608-2916, United States
- Rolando C Perez
- Novartis Institutes for Biomedical Research, 5300 Chiron Way, Emeryville, California 94608-2916, United States
7
Narayanan D, Gani OABSM, Gruber FXE, Engh RA. Data driven polypharmacological drug design for lung cancer: analyses for targeting ALK, MET, and EGFR. J Cheminform 2017; 9:43. [PMID: 29086093 PMCID: PMC5496928 DOI: 10.1186/s13321-017-0229-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 10/30/2016] [Accepted: 06/18/2017] [Indexed: 12/14/2022] Open
Abstract
Drug design of protein kinase inhibitors is now greatly enabled by thousands of publicly available X-ray structures, extensive ligand binding data, and optimized scaffolds coming off patent. The extensive data begin to enable design against a spectrum of targets (polypharmacology); however, the data also reveal heterogeneities of structure, subtleties of chemical interactions, and apparent inconsistencies between diverse data types. As a result, incorporation of all relevant data requires expert choices to combine computational and informatics methods, along with human insight. Here we consider polypharmacological targeting of protein kinases ALK, MET, and EGFR (and its drug resistant mutant T790M) in non small cell lung cancer as an example. Both EGFR and ALK represent sources of primary oncogenic lesions, while drug resistance arises from MET amplification and EGFR mutation. A drug which inhibits these targets will expand relevant patient populations and forestall drug resistance. Crizotinib co-targets ALK and MET. Analysis of the crystal structures reveals few shared interaction types, highlighting proton-arene and key CH–O hydrogen bonding interactions. These are not typically encoded into molecular mechanics force fields. Cheminformatics analyses of binding data show EGFR to be dissimilar to ALK and MET, but its structure shows how it may be co-targeted with the addition of a covalent trap. This suggests a strategy for the design of a focussed chemical library based on a pan-kinome scaffold. Tests of model compounds show these to be compatible with the goal of ALK, MET, and EGFR polypharmacology.
Affiliation(s)
- Dilip Narayanan
- The Norwegian Structural Biology Center, Department of Chemistry, Faculty of Science, UiT The Arctic University of Norway, Tromsø, Norway
- Osman A B S M Gani
- The Norwegian Structural Biology Center, Department of Chemistry, Faculty of Science, UiT The Arctic University of Norway, Tromsø, Norway
- Franz X E Gruber
- The Norwegian Structural Biology Center, Department of Chemistry, Faculty of Science, UiT The Arctic University of Norway, Tromsø, Norway
- Richard A Engh
- The Norwegian Structural Biology Center, Department of Chemistry, Faculty of Science, UiT The Arctic University of Norway, Tromsø, Norway.
8
Tarasova OA, Urusova AF, Filimonov DA, Nicklaus MC, Zakharov AV, Poroikov VV. QSAR Modeling Using Large-Scale Databases: Case Study for HIV-1 Reverse Transcriptase Inhibitors. J Chem Inf Model 2015; 55:1388-99. [PMID: 26046311 DOI: 10.1021/acs.jcim.5b00019] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Indexed: 11/30/2022]
Abstract
Large-scale databases are important sources of training sets for various QSAR modeling approaches. Generally, these databases contain information extracted from different sources. This variety of sources can produce inconsistency in the data, defined as sometimes widely diverging activity results for the same compound against the same target. Because such inconsistency can reduce the accuracy of predictive models built from these data, we are addressing the question of how best to use data from publicly and commercially accessible databases to create accurate and predictive QSAR models. We investigate the suitability of commercially and publicly available databases to QSAR modeling of antiviral activity (HIV-1 reverse transcriptase (RT) inhibition). We present several methods for the creation of modeling (i.e., training and test) sets from two, either commercially or freely available, databases: Thomson Reuters Integrity and ChEMBL. We found that the typical predictivities of QSAR models obtained using these different modeling set compilation methods differ significantly from each other. The best results were obtained using training sets compiled for compounds tested using only one method and material (i.e., a specific type of biological assay). Compound sets aggregated by target only typically yielded poorly predictive models. We discuss the possibility of "mix-and-matching" assay data across aggregating databases such as ChEMBL and Integrity and their current severe limitations for this purpose. One of them is the general lack of complete and semantic/computer-parsable descriptions of assay methodology carried by these databases that would allow one to determine mix-and-matchability of result sets at the assay level.
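The notion of inconsistency used in this abstract (widely diverging activity values for the same compound against the same target) can be made concrete with a short sketch. The record layout, the compound identifiers, and the 1.0 log-unit threshold are assumptions chosen for illustration; they are not taken from the paper, ChEMBL, or Integrity.

```python
from collections import defaultdict

def find_inconsistent(records, max_spread=1.0):
    """Return {(compound, target): spread} for pairs whose reported activity
    values differ by more than max_spread (in log units)."""
    values = defaultdict(list)
    for compound, target, activity in records:
        values[(compound, target)].append(activity)
    return {
        pair: max(acts) - min(acts)
        for pair, acts in values.items()
        if len(acts) > 1 and max(acts) - min(acts) > max_spread
    }

# (compound id, target, pIC50): hypothetical records pooled from several sources
records = [
    ("cmpd-1", "HIV-1 RT", 6.2),
    ("cmpd-1", "HIV-1 RT", 6.4),
    ("cmpd-2", "HIV-1 RT", 5.0),
    ("cmpd-2", "HIV-1 RT", 7.5),
    ("cmpd-3", "HIV-1 RT", 8.1),
]

flagged = find_inconsistent(records)
print(flagged)
```

A check like this only groups by compound and target. The abstract's finding that single-assay training sets perform best suggests that assay method would be a further grouping key, where such metadata is available.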
Affiliation(s)
- Olga A Tarasova
- Institute of Biochemical Chemistry, 10-8, Pogodinskaya St., 119121, Moscow, Russia
- Aleksandra F Urusova
- Institute of Biochemical Chemistry, 10-8, Pogodinskaya St., 119121, Moscow, Russia
- Dmitry A Filimonov
- Institute of Biochemical Chemistry, 10-8, Pogodinskaya St., 119121, Moscow, Russia
- Marc C Nicklaus
- CADD Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, DHHS, NCI-Frederick, 376 Boyles St., Frederick, Maryland 21702, United States
- Alexey V Zakharov
- CADD Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, DHHS, NCI-Frederick, 376 Boyles St., Frederick, Maryland 21702, United States
- Vladimir V Poroikov
- Institute of Biochemical Chemistry, 10-8, Pogodinskaya St., 119121, Moscow, Russia
9
Discovery of selective RIO2 kinase small molecule ligand. Biochimica et Biophysica Acta - Proteins and Proteomics 2015; 1854:1630-6. [PMID: 25891899 DOI: 10.1016/j.bbapap.2015.04.006] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 01/16/2015] [Revised: 04/03/2015] [Accepted: 04/08/2015] [Indexed: 11/23/2022]
Abstract
We report the discovery and initial optimization of diphenpyramide and several of its analogs as hRIO2 kinase ligands. One of these analogs is the most selective hRIO2 ligand reported to date. Diphenpyramide is a Cyclooxygenase 1 and 2 inhibitor that was used as an anti-inflammatory agent. The RIO2 kinase affinity of diphenpyramide was discovered by serendipity while profiling of 13 marketed drugs on a large 456 kinase assay panel. The inhibition values also suggested a relative selectivity of diphenpyramide for RIO2 against the other kinases in the panel. Subsequently three available and eight newly synthesized analogs were assayed, one of which showed a 10 fold increased hRIO2 binding affinity. Additionally, this compound shows significantly better selectivity over assayed kinases, when compared to currently known RIO2 inhibitors. As RIO2 is involved in the biosynthesis of the ribosome and cell cycle regulation, our selective ligand may be useful for the delineation of the biological role of this kinase. This article is part of a Special Issue entitled: Inhibitors of Protein Kinases.
10
Perspective on computational and structural aspects of kinase discovery from IPK2014. Biochimica et Biophysica Acta - Proteins and Proteomics 2015; 1854:1595-604. [PMID: 25861861 DOI: 10.1016/j.bbapap.2015.03.014] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 01/20/2015] [Revised: 03/29/2015] [Accepted: 03/30/2015] [Indexed: 01/16/2023]
Abstract
Recent advances in understanding the activity and selectivity of kinase inhibitors and their relationships to protein structure are presented. Conformational selection in kinases is studied from empirical, data-driven and simulation approaches. Ligand binding and its affinity are, in many cases, determined by the predetermined active and inactive conformation of kinases. Binding affinity and selectivity predictions highlight the current state of the art and advances in computational chemistry as it applies to kinase inhibitor discovery. Kinome wide inhibitor profiling and cell panel profiling lead to a better understanding of selectivity and allow for target validation and patient tailoring hypotheses. This article is part of a Special Issue entitled: Inhibitors of Protein Kinases.
11
Minie M, Chopra G, Sethi G, Horst J, White G, Roy A, Hatti K, Samudrala R. CANDO and the infinite drug discovery frontier. Drug Discov Today 2014; 19:1353-63. [PMID: 24980786 DOI: 10.1016/j.drudis.2014.06.018] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Received: 01/10/2014] [Revised: 06/18/2014] [Accepted: 06/19/2014] [Indexed: 12/21/2022]
Abstract
The Computational Analysis of Novel Drug Opportunities (CANDO) platform (http://protinfo.org/cando) uses similarity of compound-proteome interaction signatures to infer homology of compound/drug behavior. We constructed interaction signatures for 3733 human ingestible compounds covering 48,278 protein structures mapping to 2030 indications based on basic science methodologies to predict and analyze protein structure, function, and interactions developed by us and others. Our signature comparison and ranking approach yielded benchmarking accuracies of 12-25% for 1439 indications with at least two approved compounds. We prospectively validated 49/82 'high value' predictions from nine studies covering seven indications, with comparable or better activity to existing drugs, which serve as novel repurposed therapeutics. Our approach may be generalized to compounds beyond those approved by the FDA, and can also consider mutations in protein structures to enable personalization. Our platform provides a holistic multiscale modeling framework of complex atomic, molecular, and physiological systems with broader applications in medicine and engineering.
Affiliation(s)
- Mark Minie
- University of Washington, Department of Bioengineering, Seattle, WA 98109, United States
- Gaurav Chopra
- University of Washington, Department of Microbiology, Seattle, WA 98109, United States; University of California, San Francisco, Diabetes Center, San Francisco, CA 94143, United States
- Geetika Sethi
- University of Washington, Department of Microbiology, Seattle, WA 98109, United States
- Jeremy Horst
- University of California, School of Medicine, San Francisco, CA 94143, United States
- George White
- University of Washington, Department of Microbiology, Seattle, WA 98109, United States
- Ambrish Roy
- Georgia Institute of Technology, Center for the Study of Systems Biology, Atlanta, GA 30318, United States
- Kaushik Hatti
- Molecular Biophysics Unit, Indian Institute of Science Bangalore, 560012, India
- Ram Samudrala
- University of Washington, Department of Microbiology, Seattle, WA 98109, United States.