1
Ambreen S, Umar M, Noor A, Jain H, Ali R. Advanced AI and ML frameworks for transforming drug discovery and optimization: With innovative insights in polypharmacology, drug repurposing, combination therapy and nanomedicine. Eur J Med Chem 2025; 284:117164. [PMID: 39721292 DOI: 10.1016/j.ejmech.2024.117164] [Received: 09/27/2024] [Revised: 11/24/2024] [Accepted: 11/27/2024] [Indexed: 12/28/2024]
Abstract
Artificial Intelligence (AI) and Machine Learning (ML) are transforming drug discovery by overcoming traditional challenges such as high costs, long timelines, and frequent failures. AI-driven approaches streamline key phases, including target identification, lead optimization, de novo drug design, and drug repurposing. Frameworks such as deep neural networks (DNNs), convolutional neural networks (CNNs), and deep reinforcement learning (DRL) models have shown promise in identifying drug targets, optimizing delivery systems, and accelerating drug repurposing. Generative adversarial networks (GANs) and variational autoencoders (VAEs) aid de novo drug design by creating novel drug-like compounds with desired properties. Case studies, such as DDR1 kinase inhibitors designed using generative models and CDK20 inhibitors developed via structure-based methods, highlight AI's ability to produce highly specific therapeutics. Models like SNF-CVAE and DeepDR further advance drug repurposing by uncovering new therapeutic applications for existing drugs. Advanced ML algorithms enhance precision in predicting drug efficacy, toxicity, and ADME-Tox properties, reducing development costs and improving drug-target interactions. AI also supports polypharmacology by optimizing multi-target drug interactions and enhances combination therapy through predictions of drug synergies and antagonisms. In nanomedicine, AI models like CURATE.AI and the Hartung algorithm optimize personalized treatments by predicting toxicological risks and real-time dosing adjustments with high accuracy. Despite this potential, challenges such as data quality, model interpretability, and ethical concerns must be addressed. High-quality datasets, transparent models, and unbiased algorithms are essential for reliable AI applications. As AI continues to evolve, it is poised to revolutionize drug discovery and personalized medicine, advancing therapeutic development and patient care.
Affiliation(s)
- Subiya Ambreen
  - Department of Pharmaceutical Chemistry, Delhi Institute of Pharmaceutical Sciences and Research (DIPSAR), DPSRU, Pushp Vihar, New Delhi, 110017, India
- Mohammad Umar
  - Department of Pharmaceutical Chemistry, Delhi Institute of Pharmaceutical Sciences and Research (DIPSAR), DPSRU, Pushp Vihar, New Delhi, 110017, India
- Aaisha Noor
  - Department of Pharmaceutical Chemistry, Delhi Institute of Pharmaceutical Sciences and Research (DIPSAR), DPSRU, Pushp Vihar, New Delhi, 110017, India
- Himangini Jain
  - Department of Pharmaceutical Chemistry, Delhi Institute of Pharmaceutical Sciences and Research (DIPSAR), DPSRU, Pushp Vihar, New Delhi, 110017, India
- Ruhi Ali
  - Department of Pharmaceutical Chemistry, Delhi Institute of Pharmaceutical Sciences and Research (DIPSAR), DPSRU, Pushp Vihar, New Delhi, 110017, India
2
García-Ortegón M, Seal S, Rasmussen C, Bender A, Bacallado S. Graph neural processes for molecules: an evaluation on docking scores and strategies to improve generalization. J Cheminform 2024; 16:115. [PMID: 39443970 PMCID: PMC11515514 DOI: 10.1186/s13321-024-00904-2] [Received: 02/29/2024] [Accepted: 09/13/2024] [Indexed: 10/25/2024]
Abstract
Neural processes (NPs) are models for meta-learning which output uncertainty estimates. So far, most studies of NPs have focused on low-dimensional datasets of highly-correlated tasks. While these homogeneous datasets are useful for benchmarking, they may not be representative of realistic transfer learning. In particular, applications in scientific research may prove especially challenging due to the potential novelty of meta-testing tasks. Molecular property prediction is one such research area that is characterized by sparse datasets of many functions on a shared molecular space. In this paper, we study the application of graph NPs to molecular property prediction with DOCKSTRING, a diverse dataset of docking scores. Graph NPs show competitive performance in few-shot learning tasks relative to supervised learning baselines common in chemoinformatics, as well as alternative techniques for transfer learning and meta-learning. In order to increase meta-generalization to divergent test functions, we propose fine-tuning strategies that adapt the parameters of NPs. We find that adaptation can substantially increase NPs' regression performance while maintaining good calibration of uncertainty estimates. Finally, we present a Bayesian optimization experiment which showcases the potential advantages of NPs over Gaussian processes in iterative screening. Overall, our results suggest that NPs on molecular graphs hold great potential for molecular property prediction in the low-data setting. SCIENTIFIC CONTRIBUTION: Neural processes are a family of meta-learning algorithms which deal with data scarcity by transferring information across tasks and making probabilistic predictions. We evaluate their performance on regression and optimization molecular tasks using docking scores, finding them to outperform classical single-task and transfer-learning models. 
We examine the issue of generalization to divergent test tasks, which is a general concern of meta-learning algorithms in science, and propose strategies to alleviate it.
Affiliation(s)
- Miguel García-Ortegón
  - Statistical Laboratory, University of Cambridge, Wilberforce Rd, Cambridge, CB3 0WA, UK
  - Department of Engineering, University of Cambridge, Trumpington St, Cambridge, CB2 1PZ, UK
  - Department of Chemistry, University of Cambridge, Lensfield Rd, Cambridge, CB2 1EW, UK
- Srijit Seal
  - Imaging Platform, Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA, 02142, USA
- Carl Rasmussen
  - Department of Engineering, University of Cambridge, Trumpington St, Cambridge, CB2 1PZ, UK
- Andreas Bender
  - Department of Chemistry, University of Cambridge, Lensfield Rd, Cambridge, CB2 1EW, UK
- Sergio Bacallado
  - Statistical Laboratory, University of Cambridge, Wilberforce Rd, Cambridge, CB3 0WA, UK
3
Miljković F, Bajorath J. Kinase Drug Discovery: Impact of Open Science and Artificial Intelligence. Mol Pharm 2024; 21:4849-4859. [PMID: 39240193 DOI: 10.1021/acs.molpharmaceut.4c00659] [Indexed: 09/07/2024]
Abstract
Given their central role in signal transduction, protein kinases (PKs) were first implicated in cancer development, caused by aberrant intracellular signaling events. Since then, PKs have become major targets in different therapeutic areas. The preferred approach to therapeutic intervention of PK-dependent diseases is the use of small molecules to inhibit their catalytic phosphate group transfer activity. PK inhibitors (PKIs) are among the most intensely pursued drug candidates, with currently 80 approved compounds and several hundred in clinical trials. Following the elucidation of the human kinome and development of robust PK expression systems and high-throughput assays, large volumes of PK/PKI data have been produced in industrial and academic environments, more so than for many other pharmaceutical targets. In addition, hundreds of X-ray structures of PKs and their complexes with PKIs have been reported. Substantial amounts of PK/PKI data have been made publicly available in part as a result of open science initiatives. PK drug discovery is further supported through the incorporation of data science approaches, including the development of various specialized databases and online resources. Compound and activity data wealth compared to other targets has also made PKs a focal point for the application of artificial intelligence (AI) in pharmaceutical research. Herein, we discuss the interplay of open and data science in PK drug discovery and review exemplary studies that have substantially contributed to its development, including kinome profiling or the analysis of PKI promiscuity versus selectivity. We also take a close look at how AI approaches are beginning to impact PK drug discovery in light of their increasing data orientation.
Affiliation(s)
- Filip Miljković
  - Medicinal Chemistry, Research and Early Development, Cardiovascular, Renal and Metabolism (CVRM), BioPharmaceuticals R&D, AstraZeneca, Pepparedsleden 1, SE-43183 Gothenburg, Sweden
- Jürgen Bajorath
  - Department of Life Science Informatics and Data Science, B-IT, Lamarr Institute for Machine Learning and Artificial Intelligence, LIMES Program Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115 Bonn, Germany
4
Bharadwaj S, Deepika K, Kumar A, Jaiswal S, Miglani S, Singh D, Fartyal P, Kumar R, Singh S, Singh MP, Gaidhane AM, Kumar B, Jha V. Exploring the Artificial Intelligence and Its Impact in Pharmaceutical Sciences: Insights Toward the Horizons Where Technology Meets Tradition. Chem Biol Drug Des 2024; 104:e14639. [PMID: 39396920 DOI: 10.1111/cbdd.14639] [Received: 07/27/2024] [Revised: 09/03/2024] [Accepted: 09/24/2024] [Indexed: 10/15/2024]
Abstract
The technological revolutions in computers and the advancement of high-throughput screening technologies have driven the application of artificial intelligence (AI) to faster, more efficient, and more cost-effective discovery of hit or lead molecules. The ability of software and network frameworks to interpret representations of molecular structures and establish relationships and correlations has enabled various research teams to develop numerous AI platforms for identifying new lead molecules or discovering new targets for already established drug molecules. Predicting biological activity, ADME properties, and toxicity parameters at early stages has reduced the chance of failure, and the associated costs, in later clinical stages; such failures were frequent in the tedious, expensive, and laborious traditional drug discovery process. This review focuses on the different AI and machine learning (ML) techniques and their applications, mainly in the pharmaceutical industry. The applications of AI frameworks in molecular target identification, hit identification/hit-to-lead optimization, analysis of drug-receptor interactions, drug repurposing, polypharmacology, synthetic accessibility, clinical trial design, and pharmaceutical development are discussed in detail. We have also compiled details of various AI startups in this field. This review provides a comprehensive analysis and outlines various state-of-the-art AI/ML techniques, together with their framework applications. It also highlights the challenges in this field that need to be addressed for further success in pharmaceutical applications.
Affiliation(s)
- Shruti Bharadwaj
  - Center for SeNSE, Indian Institute of Technology Delhi (IIT), New Delhi, India
- Kumari Deepika
  - Department of Computer Engineering, Pune Institute of Computer Technology, Pune, India
- Asim Kumar
  - Amity Institute of Pharmacy (AIP), Amity University Haryana, Manesar, India
- Shivani Jaiswal
  - Institute of Pharmaceutical Research, GLA University, Mathura, India
- Shaweta Miglani
  - Department of Education, Central University of Punjab, Bathinda, India
- Damini Singh
  - IES Institute of Pharmacy, IES University, Bhopal, Madhya Pradesh, India
- Prachi Fartyal
  - Department of Mathematics, Govt PG College Bajpur (US Nagar), Bazpur, Uttarakhand, India
- Roshan Kumar
  - Department of Microbiology, Graphic Era (Deemed to be University), Dehradun, India
  - Department of Microbiology, Central University of Punjab, VPO-Ghudda, Punjab, India
- Shareen Singh
  - Centre for Research Impact & Outcome, Chitkara College of Pharmacy, Chitkara University, Rajpura, Punjab, India
- Mahendra Pratap Singh
  - Center for Global Health Research, Saveetha Medical College and Hospital, Saveetha Institute of Medical and Technical Sciences, Saveetha University, Chennai, India
- Abhay M Gaidhane
  - Jawaharlal Nehru Medical College, and Global Health Academy, School of Epidemiology and Public Health, Datta Meghe Institute of Higher Education, Wardha, India
- Bhupinder Kumar
  - Department of Pharmaceutical Science, Hemvati Nandan Bahuguna Garhwal (A Central) University, Srinagar, Uttarakhand, India
- Vibhu Jha
  - Institute of Cancer Therapeutics, School of Pharmacy and Medical Sciences, Faculty of Life Sciences, University of Bradford, Bradford, UK
5
Li X, Zhou Chen Y, Kalia A, Zhu H, Liu LP, Hassoun S. An Ensemble Spectral Prediction (ESP) model for metabolite annotation. Bioinformatics 2024; 40:btae490. [PMID: 39180771 PMCID: PMC11344591 DOI: 10.1093/bioinformatics/btae490] [Received: 07/19/2022] [Revised: 06/25/2024] [Indexed: 08/26/2024]
Abstract
MOTIVATION A key challenge in metabolomics is annotating measured spectra from a biological sample with chemical identities. Currently, only a small fraction of measurements can be assigned identities. Two complementary computational approaches have emerged to address the annotation problem: mapping candidate molecules to spectra, and mapping query spectra to molecular candidates. In essence, the candidate molecule with the spectrum that best explains the query spectrum is recommended as the target molecule. Despite candidate ranking being fundamental in both approaches, limited prior works incorporated rank learning tasks in determining the target molecule. RESULTS We propose a novel machine learning model, Ensemble Spectral Prediction (ESP), for metabolite annotation. ESP takes advantage of prior neural network-based annotation models that utilize multilayer perceptron (MLP) networks and Graph Neural Networks (GNNs). Based on the ranking results of the MLP- and GNN-based models, ESP learns a weighting for the outputs of MLP and GNN spectral predictors to generate a spectral prediction for a query molecule. Importantly, training data is stratified by molecular formula to provide candidate sets during model training. Further, baseline MLP and GNN models are enhanced by considering peak dependencies through label mixing and multi-tasking on spectral topic distributions. When trained on the NIST 2020 dataset and evaluated on the relevant candidate sets from PubChem, ESP improves average rank by 23.7% and 37.2% over the MLP and GNN baselines, respectively, demonstrating performance gain over state-of-the-art neural network approaches. However, MLP approaches remain strong contenders when considering top five ranks. Importantly, we show that annotation performance is dependent on the training dataset, the number of molecules in the candidate set and candidate similarity to the target molecule. 
AVAILABILITY AND IMPLEMENTATION The ESP code, a trained model, and a Jupyter notebook that guides users through the ESP tool are available at https://github.com/HassounLab/ESP.
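To make the ensemble step described in the ESP abstract concrete, the following Python sketch combines two predicted spectra with a single weight and ranks candidate molecules by cosine similarity to the query spectrum. It is an illustration under stated assumptions, not the ESP implementation: the scalar weight `w`, the binned-intensity vectors, and all function names are hypothetical, and the abstract describes a learned weighting where this sketch uses a fixed one.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ensemble_spectrum(mlp_pred, gnn_pred, w):
    """Convex combination of two predicted spectra.

    w is the weight on the MLP prediction (assumed fixed here;
    the abstract describes a weighting that is learned).
    """
    return [w * m + (1.0 - w) * g for m, g in zip(mlp_pred, gnn_pred)]


def rank_candidates(query, candidates, w):
    """Rank candidate molecules, best match first.

    candidates maps a hypothetical candidate name to a pair
    (mlp_predicted_spectrum, gnn_predicted_spectrum).
    """
    scores = {
        name: cosine(query, ensemble_spectrum(mlp, gnn, w))
        for name, (mlp, gnn) in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy four-bin spectra, invented for illustration only.
    query = [1.0, 0.0, 0.0, 1.0]
    candidates = {
        "A": ([1.0, 0.0, 0.0, 1.0], [0.8, 0.0, 0.2, 1.0]),
        "B": ([0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 0.5, 0.0]),
    }
    print(rank_candidates(query, candidates, w=0.5))
```

The rank of the true molecule in the returned list is the quantity the abstract reports improvements on; in a learned version, `w` would be fitted against that ranking objective over candidate sets grouped by molecular formula.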
Affiliation(s)
- Xinmeng Li
  - Department of Computer Science, Tufts University, Medford, MA, 02155, United States
- Yan Zhou Chen
  - Department of Computer Science, Tufts University, Medford, MA, 02155, United States
- Apurva Kalia
  - Department of Computer Science, Tufts University, Medford, MA, 02155, United States
- Hao Zhu
  - Department of Computer Science, Tufts University, Medford, MA, 02155, United States
- Li-ping Liu
  - Department of Computer Science, Tufts University, Medford, MA, 02155, United States
- Soha Hassoun
  - Department of Computer Science, Tufts University, Medford, MA, 02155, United States
  - Department of Chemical and Biological Engineering, Tufts University, Medford, MA, 02155, United States
6
Kumar N, Acharya V. Advances in machine intelligence-driven virtual screening approaches for big-data. Med Res Rev 2024; 44:939-974. [PMID: 38129992 DOI: 10.1002/med.21995] [Received: 09/12/2022] [Revised: 07/15/2023] [Accepted: 10/29/2023] [Indexed: 12/23/2023]
Abstract
Virtual screening (VS) is an integral and ever-evolving part of the drug discovery framework. VS is traditionally classified into ligand-based (LB) and structure-based (SB) approaches. Machine intelligence, or artificial intelligence, is widely applied in drug discovery to reduce time and resource consumption. Combined with machine intelligence algorithms, VS has become a technology that learns robust decision procedures for data curation and for screening hit molecules from large VS libraries in minutes or hours. The exponential growth of chemical and biological data in the public domain, now regarded as "big-data", demands modern machine intelligence-driven VS approaches that can screen hit molecules from ultra-large VS libraries. VS has evolved from individual LB or SB approaches to integrated LB and SB techniques that explore various ligand and target protein aspects to improve the rate of appropriate hit molecule prediction. Current trends demand advanced, intelligent solutions that handle enormous drug discovery datasets for screening and optimizing hits or leads with few or no false positives. This review responds to the shift toward big data and the tremendous growth in computational architecture. It categorizes and emphasizes individual VS techniques, details the literature on machine learning implementations and modern machine intelligence approaches, and discusses limitations and future prospects.
Affiliation(s)
- Neeraj Kumar
  - Artificial Intelligence for Computational Biology Lab (AICoB), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology, Palampur, Himachal Pradesh, India
  - Academy of Scientific and Innovative Research, Ghaziabad, India
- Vishal Acharya
  - Artificial Intelligence for Computational Biology Lab (AICoB), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology, Palampur, Himachal Pradesh, India
  - Academy of Scientific and Innovative Research, Ghaziabad, India
7
Wallach I, Bernard D, Nguyen K, Ho G, Morrison A, Stecula A, Rosnik A, O’Sullivan AM, Davtyan A, Samudio B, Thomas B, Worley B, Butler B, Laggner C, Thayer D, Moharreri E, Friedland G, Truong H, van den Bedem H, Ng HL, Stafford K, Sarangapani K, Giesler K, Ngo L, Mysinger M, Ahmed M, Anthis NJ, Henriksen N, Gniewek P, Eckert S, de Oliveira S, Suterwala S, PrasadPrasad SVK, Shek S, Contreras S, Hare S, Palazzo T, O’Brien TE, Van Grack T, Williams T, Chern TR, Kenyon V, Lee AH, Cann AB, Bergman B, Anderson BM, Cox BD, Warrington JM, Sorenson JM, Goldenberg JM, Young MA, DeHaan N, Pemberton RP, Schroedl S, Abramyan TM, Gupta T, Mysore V, Presser AG, Ferrando AA, Andricopulo AD, Ghosh A, Ayachi AG, Mushtaq A, Shaqra AM, Toh AKL, Smrcka AV, Ciccia A, de Oliveira AS, Sverzhinsky A, de Sousa AM, Agoulnik AI, Kushnir A, Freiberg AN, Statsyuk AV, Gingras AR, Degterev A, Tomilov A, Vrielink A, Garaeva AA, Bryant-Friedrich A, Caflisch A, Patel AK, Rangarajan AV, Matheeussen A, Battistoni A, Caporali A, Chini A, Ilari A, Mattevi A, Foote AT, Trabocchi A, Stahl A, Herr AB, Berti A, Freywald A, Reidenbach AG, Lam A, Cuddihy AR, White A, Taglialatela A, Ojha AK, Cathcart AM, Motyl AAL, Borowska A, D’Antuono A, Hirsch AKH, Porcelli AM, Minakova A, Montanaro A, Müller A, Fiorillo A, Virtanen A, O’Donoghue AJ, Del Rio Flores A, Garmendia AE, Pineda-Lucena A, Panganiban AT, Samantha A, Chatterjee AK, Haas AL, Paparella AS, John ALS, Prince A, ElSheikh A, Apfel AM, Colomba A, O’Dea A, Diallo BN, Ribeiro BMRM, Bailey-Elkin BA, Edelman BL, Liou B, Perry B, Chua BSK, Kováts B, Englinger B, Balakrishnan B, Gong B, Agianian B, Pressly B, Salas BPM, Duggan BM, Geisbrecht BV, Dymock BW, Morten BC, Hammock BD, Mota BEF, Dickinson BC, Fraser C, Lempicki C, Novina CD, Torner C, Ballatore C, Bon C, Chapman CJ, Partch CL, Chaton CT, Huang C, Yang CY, Kahler CM, Karan C, Keller C, Dieck CL, Huimei C, Liu C, Peltier C, Mantri CK, Kemet CM, Müller CE, Weber C, Zeina CM, Muli CS, Morisseau C, Alkan 
C, Reglero C, Loy CA, Wilson CM, Myhr C, Arrigoni C, Paulino C, Santiago C, Luo D, Tumes DJ, Keedy DA, Lawrence DA, Chen D, Manor D, Trader DJ, Hildeman DA, Drewry DH, Dowling DJ, Hosfield DJ, Smith DM, Moreira D, Siderovski DP, Shum D, Krist DT, Riches DWH, Ferraris DM, Anderson DH, Coombe DR, Welsbie DS, Hu D, Ortiz D, Alramadhani D, Zhang D, Chaudhuri D, Slotboom DJ, Ronning DR, Lee D, Dirksen D, Shoue DA, Zochodne DW, Krishnamurthy D, Duncan D, Glubb DM, Gelardi ELM, Hsiao EC, Lynn EG, Silva EB, Aguilera E, Lenci E, Abraham ET, Lama E, Mameli E, Leung E, Giles E, Christensen EM, Mason ER, Petretto E, Trakhtenberg EF, Rubin EJ, Strauss E, Thompson EW, Cione E, Lisabeth EM, Fan E, Kroon EG, Jo E, García-Cuesta EM, Glukhov E, Gavathiotis E, Yu F, Xiang F, Leng F, Wang F, Ingoglia F, van den Akker F, Borriello F, Vizeacoumar FJ, Luh F, Buckner FS, Vizeacoumar FS, Bdira FB, Svensson F, Rodriguez GM, Bognár G, Lembo G, Zhang G, Dempsey G, Eitzen G, Mayer G, Greene GL, Garcia GA, Lukacs GL, Prikler G, Parico GCG, Colotti G, De Keulenaer G, Cortopassi G, Roti G, Girolimetti G, Fiermonte G, Gasparre G, Leuzzi G, Dahal G, Michlewski G, Conn GL, Stuchbury GD, Bowman GR, Popowicz GM, Veit G, de Souza GE, Akk G, Caljon G, Alvarez G, Rucinski G, Lee G, Cildir G, Li H, Breton HE, Jafar-Nejad H, Zhou H, Moore HP, Tilford H, Yuan H, Shim H, Wulff H, Hoppe H, Chaytow H, Tam HK, Van Remmen H, Xu H, Debonsi HM, Lieberman HB, Jung H, Fan HY, Feng H, Zhou H, Kim HJ, Greig IR, Caliandro I, Corvo I, Arozarena I, Mungrue IN, Verhamme IM, Qureshi IA, Lotsaris I, Cakir I, Perry JJP, Kwiatkowski J, Boorman J, Ferreira J, Fries J, Kratz JM, Miner J, Siqueira-Neto JL, Granneman JG, Ng J, Shorter J, Voss JH, Gebauer JM, Chuah J, Mousa JJ, Maynes JT, Evans JD, Dickhout J, MacKeigan JP, Jossart JN, Zhou J, Lin J, Xu J, Wang J, Zhu J, Liao J, Xu J, Zhao J, Lin J, Lee J, Reis J, Stetefeld J, Bruning JB, Bruning JB, Coles JG, Tanner JJ, Pascal JM, So J, Pederick JL, Costoya JA, Rayman JB, Maciag 
JJ, Nasburg JA, Gruber JJ, Finkelstein JM, Watkins J, Rodríguez-Frade JM, Arias JAS, Lasarte JJ, Oyarzabal J, Milosavljevic J, Cools J, Lescar J, Bogomolovas J, Wang J, Kee JM, Kee JM, Liao J, Sistla JC, Abrahão JS, Sishtla K, Francisco KR, Hansen KB, Molyneaux KA, Cunningham KA, Martin KR, Gadar K, Ojo KK, Wong KS, Wentworth KL, Lai K, Lobb KA, Hopkins KM, Parang K, Machaca K, Pham K, Ghilarducci K, Sugamori KS, McManus KJ, Musta K, Faller KME, Nagamori K, Mostert KJ, Korotkov KV, Liu K, Smith KS, Sarosiek K, Rohde KH, Kim KK, Lee KH, Pusztai L, Lehtiö L, Haupt LM, Cowen LE, Byrne LJ, Su L, Wert-Lamas L, Puchades-Carrasco L, Chen L, Malkas LH, Zhuo L, Hedstrom L, Hedstrom L, Walensky LD, Antonelli L, Iommarini L, Whitesell L, Randall LM, Fathallah MD, Nagai MH, Kilkenny ML, Ben-Johny M, Lussier MP, Windisch MP, Lolicato M, Lolli ML, Vleminckx M, Caroleo MC, Macias MJ, Valli M, Barghash MM, Mellado M, Tye MA, Wilson MA, Hannink M, Ashton MR, Cerna MVC, Giorgis M, Safo MK, Maurice MS, McDowell MA, Pasquali M, Mehedi M, Serafim MSM, Soellner MB, Alteen MG, Champion MM, Skorodinsky M, O’Mara ML, Bedi M, Rizzi M, Levin M, Mowat M, Jackson MR, Paige M, Al-Yozbaki M, Giardini MA, Maksimainen MM, De Luise M, Hussain MS, Christodoulides M, Stec N, Zelinskaya N, Van Pelt N, Merrill NM, Singh N, Kootstra NA, Singh N, Gandhi NS, Chan NL, Trinh NM, Schneider NO, Matovic N, Horstmann N, Longo N, Bharambe N, Rouzbeh N, Mahmoodi N, Gumede NJ, Anastasio NC, Khalaf NB, Rabal O, Kandror O, Escaffre O, Silvennoinen O, Bishop OT, Iglesias P, Sobrado P, Chuong P, O’Connell P, Martin-Malpartida P, Mellor P, Fish PV, Moreira POL, Zhou P, Liu P, Liu P, Wu P, Agogo-Mawuli P, Jones PL, Ngoi P, Toogood P, Ip P, von Hundelshausen P, Lee PH, Rowswell-Turner RB, Balaña-Fouce R, Rocha REO, Guido RVC, Ferreira RS, Agrawal RK, Harijan RK, Ramachandran R, Verma R, Singh RK, Tiwari RK, Mazitschek R, Koppisetti RK, Dame RT, Douville RN, Austin RC, Taylor RE, Moore RG, Ebright RH, Angell RM, Yan R, 
Kejriwal R, Batey RA, Blelloch R, Vandenberg RJ, Hickey RJ, Kelm RJ, Lake RJ, Bradley RK, Blumenthal RM, Solano R, Gierse RM, Viola RE, McCarthy RR, Reguera RM, Uribe RV, do Monte-Neto RL, Gorgoglione R, Cullinane RT, Katyal S, Hossain S, Phadke S, Shelburne SA, Geden SE, Johannsen S, Wazir S, Legare S, Landfear SM, Radhakrishnan SK, Ammendola S, Dzhumaev S, Seo SY, Li S, Zhou S, Chu S, Chauhan S, Maruta S, Ashkar SR, Shyng SL, Conticello SG, Buroni S, Garavaglia S, White SJ, Zhu S, Tsimbalyuk S, Chadni SH, Byun SY, Park S, Xu SQ, Banerjee S, Zahler S, Espinoza S, Gustincich S, Sainas S, Celano SL, Capuzzi SJ, Waggoner SN, Poirier S, Olson SH, Marx SO, Van Doren SR, Sarilla S, Brady-Kalnay SM, Dallman S, Azeem SM, Teramoto T, Mehlman T, Swart T, Abaffy T, Akopian T, Haikarainen T, Moreda TL, Ikegami T, Teixeira TR, Jayasinghe TD, Gillingwater TH, Kampourakis T, Richardson TI, Herdendorf TJ, Kotzé TJ, O’Meara TR, Corson TW, Hermle T, Ogunwa TH, Lan T, Su T, Banjo T, O’Mara TA, Chou T, Chou TF, Baumann U, Desai UR, Pai VP, Thai VC, Tandon V, Banerji V, Robinson VL, Gunasekharan V, Namasivayam V, Segers VFM, Maranda V, Dolce V, Maltarollo VG, Scoffone VC, Woods VA, Ronchi VP, Van Hung Le V, Clayton WB, Lowther WT, Houry WA, Li W, Tang W, Zhang W, Van Voorhis WC, Donaldson WA, Hahn WC, Kerr WG, Gerwick WH, Bradshaw WJ, Foong WE, Blanchet X, Wu X, Lu X, Qi X, Xu X, Yu X, Qin X, Wang X, Yuan X, Zhang X, Zhang YJ, Hu Y, Aldhamen YA, Chen Y, Li Y, Sun Y, Zhu Y, Gupta YK, Pérez-Pertejo Y, Li Y, Tang Y, He Y, Tse-Dinh YC, Sidorova YA, Yen Y, Li Y, Frangos ZJ, Chung Z, Su Z, Wang Z, Zhang Z, Liu Z, Inde Z, Artía Z, Heifets A. AI is a viable alternative to high throughput screening: a 318-target study. Sci Rep 2024; 14:7526. 
[PMID: 38565852 PMCID: PMC10987645 DOI: 10.1038/s41598-024-54655-z] [Received: 09/15/2023] [Accepted: 02/15/2024] [Indexed: 04/04/2024]
Abstract
High throughput screening (HTS) is routinely used to identify bioactive small molecules. This requires physical compounds, which limits coverage of accessible chemical space. Computational approaches combined with vast on-demand chemical libraries can access far greater chemical space, provided that the predictive accuracy is sufficient to identify useful molecules. Through the largest and most diverse virtual HTS campaign reported to date, comprising 318 individual projects, we demonstrate that our AtomNet® convolutional neural network successfully finds novel hits across every major therapeutic area and protein class. We address historical limitations of computational screening by demonstrating success for target proteins without known binders, high-quality X-ray crystal structures, or manual cherry-picking of compounds. We show that the molecules selected by the AtomNet® model are novel drug-like scaffolds rather than minor modifications to known bioactive compounds. Our empirical results suggest that computational methods can substantially replace HTS as the first step of small-molecule drug discovery.
8
Abbasi M, Carvalho FG, Ribeiro B, Arrais JP. Predicting drug activity against cancer through genomic profiles and SMILES. Artif Intell Med 2024; 150:102820. [PMID: 38553160 DOI: 10.1016/j.artmed.2024.102820] [Received: 12/05/2022] [Revised: 01/09/2024] [Accepted: 02/21/2024] [Indexed: 04/02/2024]
Abstract
Due to the constant increase in cancer rates, the disease has become a leading cause of death worldwide, increasing the need for its detection and treatment. In the era of personalized medicine, the main goal is to incorporate individual variability in order to choose more precisely which therapy and prevention strategies suit each person. However, predicting the sensitivity of tumors to anticancer treatments remains a challenge. In this work, we propose two deep neural network models to predict the impact of anticancer drugs on tumors through the half-maximal inhibitory concentration (IC50). These models join biological and chemical data to capture relevant features of the genetic profile and the drug compounds, respectively. To predict the drug response in cancer cell lines, this study employed different deep learning (DL) methods, namely Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs). In the first stage, two autoencoders were pre-trained with high-dimensional gene expression and mutation data of tumors. Afterward, this genetic background is transferred to the prediction models, which return the IC50 value that portrays the potency of a substance in inhibiting a cancer cell line. When RSEM expected counts and TPM were compared as representations of gene expression data, RSEM performed better in deep models, and CNN models obtained better insight into these types of data. Moreover, the results reflect the effectiveness of the extracted deep representations in predicting the IC50 value, achieving a mean squared error of 1.06 and surpassing previous state-of-the-art models.
Affiliation(s)
- Maryam Abbasi
  - Centre for Informatics and Systems of the University of Coimbra, Department of Informatics Engineering, University of Coimbra, Coimbra, Portugal; Polytechnic Institute of Coimbra, Applied Research Institute, Coimbra, Portugal; Research Centre for Natural Resources Environment and Society (CERNAS), Polytechnic Institute of Coimbra, Coimbra, Portugal
- Filipa G Carvalho
  - Centre for Informatics and Systems of the University of Coimbra, Department of Informatics Engineering, University of Coimbra, Coimbra, Portugal
- Bernardete Ribeiro
  - Centre for Informatics and Systems of the University of Coimbra, Department of Informatics Engineering, University of Coimbra, Coimbra, Portugal
- Joel P Arrais
  - Centre for Informatics and Systems of the University of Coimbra, Department of Informatics Engineering, University of Coimbra, Coimbra, Portugal
9
Ong WJG, Kirubakaran P, Karanicolas J. Poor Generalization by Current Deep Learning Models for Predicting Binding Affinities of Kinase Inhibitors. bioRxiv 2023:2023.09.04.556234. [PMID: 37732243 PMCID: PMC10508770 DOI: 10.1101/2023.09.04.556234] [Indexed: 09/22/2023]
Abstract
The extreme surge of interest over the past decade surrounding the use of neural networks has inspired many groups to deploy them for predicting binding affinities of drug-like molecules to their receptors. A model that can accurately make such predictions has the potential to screen large chemical libraries and help streamline the drug discovery process. However, despite reports of models that accurately predict quantitative inhibition using protein kinase sequences and inhibitors' SMILES strings, it is still unclear whether these models can generalize to previously unseen data. Here, we build a Convolutional Neural Network (CNN) analogous to those previously reported and evaluate the model over four datasets commonly used for inhibitor/kinase predictions. We find that the model performs comparably to those previously reported, provided that the individual data points are randomly split between the training set and the test set. However, model performance is dramatically deteriorated when all data for a given inhibitor is placed together in the same training/testing fold, implying that information leakage underlies the models' performance. Through comparison to simple models in which the SMILES strings are tokenized, or in which test set predictions are simply copied from the closest training set data points, we demonstrate that there is essentially no generalization whatsoever in this model. In other words, the model has not learned anything about molecular interactions, and does not provide any benefit over much simpler and more transparent models. These observations strongly point to the need for richer structure-based encodings, to obtain useful prospective predictions of not-yet-synthesized candidate inhibitors.
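The contrast between the two evaluation protocols described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the record layout `(inhibitor, kinase, affinity)`, the function names, and the toy data are all assumptions made for illustration. A random split scatters the measurements of one inhibitor across both sets, whereas a grouped split keeps each inhibitor entirely on one side.

```python
import random
from collections import defaultdict


def random_split(records, test_fraction=0.2, seed=0):
    """Split individual (inhibitor, kinase, affinity) records at random."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def split_by_inhibitor(records, test_fraction=0.2, seed=0):
    """Keep every record of a given inhibitor on the same side of the split."""
    by_inhibitor = defaultdict(list)
    for record in records:
        by_inhibitor[record[0]].append(record)
    inhibitors = sorted(by_inhibitor)
    rng = random.Random(seed)
    rng.shuffle(inhibitors)
    n_test = max(1, int(len(inhibitors) * test_fraction))
    test_ids = set(inhibitors[:n_test])
    train = [r for r in records if r[0] not in test_ids]
    test = [r for r in records if r[0] in test_ids]
    return train, test


def shared_inhibitors(train, test):
    """Inhibitors that appear in both sets (a source of leakage)."""
    return {r[0] for r in train} & {r[0] for r in test}


# Hypothetical toy data: 10 inhibitors, each measured against 5 kinases.
records = [
    (f"CPD{i}", f"KIN{j}", 5.0 + 0.1 * i + 0.01 * j)
    for i in range(10)
    for j in range(5)
]

grouped_train, grouped_test = split_by_inhibitor(records)
print("inhibitors shared after grouped split:",
      len(shared_inhibitors(grouped_train, grouped_test)))
```

Running `shared_inhibitors` on the output of `random_split` for the same data is a quick way to see how many compounds a randomly split test set has in common with the training set.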
Affiliation(s)
- Wern Juin Gabriel Ong
- Cancer Signaling & Microenvironment Program, Fox Chase Cancer Center, Philadelphia, PA 19111
- Bowdoin College, Brunswick, ME 04011
- Palani Kirubakaran
- Cancer Signaling & Microenvironment Program, Fox Chase Cancer Center, Philadelphia, PA 19111
- John Karanicolas
- Cancer Signaling & Microenvironment Program, Fox Chase Cancer Center, Philadelphia, PA 19111
|
10
|
Luukkonen S, Meijer E, Tricarico GA, Hofmans J, Stouten PFW, van Westen GJP, Lenselink EB. Large-Scale Modeling of Sparse Protein Kinase Activity Data. J Chem Inf Model 2023. [PMID: 37294674 DOI: 10.1021/acs.jcim.3c00132] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Indexed: 06/11/2023]
Abstract
Protein kinases are a protein family that plays an important role in several complex diseases such as cancer and cardiovascular and immunological diseases. Protein kinases have conserved ATP binding sites, which when targeted can lead to similar activities of inhibitors against different kinases. This can be exploited to create multitarget drugs. On the other hand, selectivity (lack of similar activities) is desirable in order to avoid toxicity issues. There is a vast amount of protein kinase activity data in the public domain, which can be used in many different ways. Multitask machine learning models are expected to excel for these kinds of data sets because they can learn from implicit correlations between tasks (in this case activities against a variety of kinases). However, multitask modeling of sparse data poses two major challenges: (i) creating a balanced train-test split without data leakage and (ii) handling missing data. In this work, we construct a protein kinase benchmark set composed of two balanced splits without data leakage, using random and dissimilarity-driven cluster-based mechanisms, respectively. This data set can be used for benchmarking and developing protein kinase activity prediction models. Overall, the performance on the dissimilarity-driven cluster-based split is lower than on random split-based sets for all models, indicating poor generalizability of models. Nevertheless, we show that multitask deep learning models, on this very sparse data set, outperform single-task deep learning and tree-based models. Finally, we demonstrate that data imputation does not improve the performance of (multitask) models on this benchmark set.
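One common way to train or score a multitask model on a sparse activity matrix is to mask unmeasured compound-kinase pairs out of the loss. The sketch below shows that idea only; the abstract does not say this is how the authors handle missing data, and the function name and the use of `None`/NaN as the missing-value marker are assumptions.

```python
import math


def masked_mse(predictions, targets):
    """Mean squared error over measured entries only.

    `predictions` and `targets` are lists of rows (one row per compound,
    one column per kinase task). A target of None or NaN marks a
    compound-kinase pair that was never measured and is skipped.
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(predictions, targets):
        for pred, target in zip(pred_row, target_row):
            if target is None:
                continue
            if isinstance(target, float) and math.isnan(target):
                continue
            total += (pred - target) ** 2
            count += 1
    if count == 0:
        raise ValueError("no measured activities to score")
    return total / count


# Hypothetical 2 compounds x 2 kinases matrix with half the entries missing.
targets = [[6.0, None], [None, 7.0]]
predictions = [[5.0, 9.9], [0.0, 8.0]]
print(masked_mse(predictions, targets))
```

The alternative mentioned in the abstract, imputing the missing values before training, would fill the `None` cells instead of skipping them.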
Affiliation(s)
- Sohvi Luukkonen
- Leiden Academic Centre of Drug Research, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands
- Erik Meijer
- Leiden Academic Centre of Drug Research, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands
- Johan Hofmans
- Galapagos NV, Generaal De Wittelaan L11 A3, 2800 Mechelen, Belgium
- Pieter F W Stouten
- Leiden Academic Centre of Drug Research, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands
- Galapagos NV, Generaal De Wittelaan L11 A3, 2800 Mechelen, Belgium
- Stouten Pharma Consultancy BV, Kempenarestraat 47, 2860 Sint-Katelijne-Waver, Belgium
- Gerard J P van Westen
- Leiden Academic Centre of Drug Research, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands
|
11
|
Zhu X, Polyakov VR, Bajjuri K, Hu H, Maderna A, Tovee CA, Ward SC. Building Machine Learning Small Molecule Melting Points and Solubility Models Using CCDC Melting Points Dataset. J Chem Inf Model 2023; 63:2948-2959. [PMID: 37125691 DOI: 10.1021/acs.jcim.3c00308] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 05/02/2023]
Abstract
Predicting solubility of small molecules is a very difficult undertaking due to the lack of reliable and consistent experimental solubility data. It is well known that for a molecule in a crystal lattice to be dissolved, it must, first, dissociate from the lattice and then, second, be solvated. The melting point of a compound is proportional to the lattice energy, and the octanol-water partition coefficient (log P) is a measure of the compound's solvation efficiency. The CCDC's melting point dataset of almost one hundred thousand compounds was utilized to create widely applicable machine learning models of small molecule melting points. Using the general solubility equation, the aqueous thermodynamic solubilities of the same compounds can be predicted. The global model could be easily localized by adding additional melting point measurements for a chemical series of interest.
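The abstract names the general solubility equation without writing it out. A minimal sketch follows, assuming the commonly cited form log S = 0.5 - 0.01 (MP - 25) - log P, with the melting point MP in degrees Celsius and S in mol/L; the function name and the example inputs are illustrative and do not come from the paper.

```python
def gse_log_solubility(melting_point_c, log_p):
    """Estimate log10 aqueous solubility (mol/L) with the general solubility equation.

    Assumed form: log S = 0.5 - 0.01 * (MP - 25) - log P.
    For compounds that melt at or below 25 degrees C the melting-point
    term is taken as zero, the usual convention for liquids.
    """
    melting_term = max(melting_point_c - 25.0, 0.0)
    return 0.5 - 0.01 * melting_term - log_p


# Hypothetical compound: predicted melting point 125 degrees C, log P of 2.0.
print(gse_log_solubility(125.0, 2.0))
```

In the workflow the abstract describes, the melting point passed to such a function would come from the machine learning model rather than from measurement.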
Affiliation(s)
- Xiangwei Zhu
- Sutro Biopharma, 111 Oyster Point Blvd, South San Francisco, California 94080, United States
- Valery R Polyakov
- Sutro Biopharma, 111 Oyster Point Blvd, South San Francisco, California 94080, United States
- Krishna Bajjuri
- Sutro Biopharma, 111 Oyster Point Blvd, South San Francisco, California 94080, United States
- Huiyong Hu
- Sutro Biopharma, 111 Oyster Point Blvd, South San Francisco, California 94080, United States
- Andreas Maderna
- Sutro Biopharma, 111 Oyster Point Blvd, South San Francisco, California 94080, United States
- Clare A Tovee
- Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, U.K.
- Suzanna C Ward
- Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge CB2 1EZ, U.K.
|
12
|
Sosnina EA, Sosnin S, Fedorov MV. Improvement of multi-task learning by data enrichment: application for drug discovery. J Comput Aided Mol Des 2023; 37:183-200. [PMID: 36943645 DOI: 10.1007/s10822-023-00500-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2022] [Accepted: 02/21/2023] [Indexed: 03/23/2023]
Abstract
Multi-task learning in deep neural networks has become a topic of growing importance in many research fields, including drug discovery. However, applying multi-task learning poses new challenges in improving prediction performance. This study investigated the potential of training data enrichment to enhance multi-task model prediction quality in drug discovery. The study evaluated four scenarios with varying degrees of information capacity of the training data and applied two types of test data to evaluate prediction performance. We used three datasets: ViralChEMBL, which consisted of binary activities of compounds against viral species, was applied for the classification task; pQSAR(159) and pQSAR(4267), which consisted of bio-activities of compounds and assays from the research of the profile-QSAR method, were applied for regression tasks. We built multi-task models based on the feed-forward DNNs using the PyTorch framework. Our findings showed that training data enrichment could be an effective means of enhancing prediction performance in multi-task learning, but the degree of improvement depends on the quality of the training data. The more unique compounds and targets the training data included, the more new compound-target interactions are required for prediction improvement. Also, we found out that even using multi-task learning, one could not predict the interactions of compounds that are highly dissimilar from those used for model training. The study provides some recommendations for effectively employing multi-task learning in drug discovery to improve prediction accuracy and facilitate the discovery of novel drug candidates.
Affiliation(s)
- Ekaterina A Sosnina
- Center for Computational and Data-Intensive Science and Engineering, Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30/1, Moscow, Russia, 143026.
- Sergey Sosnin
- Department of Pharmaceutical Sciences, Faculty of Life Sciences, University of Vienna, Josef-Holaubek-Platz 2, 1190, Vienna, Austria
- Maxim V Fedorov
- Center for Computational and Data-Intensive Science and Engineering, Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30/1, Moscow, Russia, 143026
- Sirius University of Science and Technology, Olympiisky Prospect 1, Sochi, Russia, 354340
|
13
|
Yang J, Zhang D, Cai Y, Yu K, Li M, Liu L, Chen X. Computational Prediction of Drug Phenotypic Effects Based on Substructure-Phenotype Associations. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:256-265. [PMID: 35239490 DOI: 10.1109/tcbb.2022.3155453] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Identifying drug phenotypic effects, including therapeutic effects and adverse drug reactions (ADRs), is an essential part of evaluating the potential of new drug candidates (NDCs). However, current computational methods for predicting phenotypic effects of NDCs are mainly based on the overall structure of an NDC or a related target. These approaches often lead to inconsistencies between structures and functions and limit the prediction space of NDCs. In this study, we first constructed quantitative associations of substructure-domain, domain-ADR, and domain-ATC (Anatomical Therapeutic Chemical Classification System code) through L1LOG and L1SVM machine learning models. These associations represent relationships between phenotypes (ADRs and ATCs) and local structures of drugs and proteins. Then, based on these established associations, substructure-phenotype relationships were constructed and used to quantify drug-phenotype relationships. This approach can thus achieve high-throughput and effective evaluation of the druggability of NDCs by referring to the established substructure-phenotype relationships and the structural information of NDCs, without additional prior knowledge. Using this computational pipeline, 83,205 drug-ATC relationships (including 1,479 drugs and 178 ATCs) and 306,421 drug-ADR relationships (including 1,752 drugs and 454 ADRs) were predicted in total. The prediction results were validated at four levels: five-fold cross validation, public databases, literature, and molecular docking. Furthermore, three case studies demonstrated the feasibility of our method. For Maraviroc, an approved drug, 79 ATCs and 269 ADRs were predicted, including the existing antiviral effect in clinical use. Additionally, we found risk substructures for severe ADRs; for example, SUB215 (>= 1, saturated or only aromatic carbon ring size 7) can result in shock. We also analyzed the mechanism of action (MOA) of drugs of interest based on the established drug-substructure-domain-protein associations. In short, by establishing drug-substructure-phenotype relationships, this approach can achieve quantitative prediction of phenotypes for a given NDC or drug without any prior knowledge except its structural information. In this way, we can directly obtain the relationships between the substructures and phenotypes of a compound, which makes it more convenient to analyze the phenotypic mechanisms of drugs and accelerates rational drug design.
|
14
|
Liang L, Liu Y, Kang B, Wang R, Sun MY, Wu Q, Meng XF, Lin JP. Large-scale comparison of machine learning algorithms for target prediction of natural products. Brief Bioinform 2022; 23:6675751. [PMID: 36007240 DOI: 10.1093/bib/bbac359] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2022] [Revised: 07/26/2022] [Accepted: 07/31/2022] [Indexed: 11/13/2022] Open
Abstract
Natural products (NPs) and their derivatives are important resources for drug discovery. Many in silico target prediction methods have been reported; however, very few of them distinguish NPs from synthetic molecules. Because NPs and synthetic molecules differ in many characteristics, it is necessary to build target prediction models specific to NPs. Therefore, we collected the activity data of NPs and their derivatives from public databases and constructed four datasets: the NP dataset, the NPs and their first-class derivatives dataset, the NPs and all their derivatives dataset, and the ChEMBL26 compounds dataset. Conditions, including activity thresholds and input features, were explored to assess the performance of eight machine learning methods for target prediction of NPs: support vector machines (SVM), extreme gradient boosting, random forests, K-nearest neighbor, naive Bayes, feedforward neural networks (FNN), convolutional neural networks and recurrent neural networks. As a result, the NPs and all their derivatives datasets were selected to build the best NP-specific models. Furthermore, consensus models and voting models were applied to improve prediction performance. Further evaluation on the external validation set demonstrated that (1) the NP-specific model performed better on the target prediction of NPs than traditional models trained on all ChEMBL26 compounds, and (2) the consensus model of FNN + SVM had the best overall performance, while the voting model can significantly improve recall and specificity.
Affiliation(s)
- Lu Liang
- State Key Laboratory of Medicinal Chemical Biology, College of Pharmacy and Tianjin Key Laboratory of Molecular Drug Research, Nankai University, Haihe Education Park, 38 Tongyan Road, Tianjin 300353, China
- Ye Liu
- State Key Laboratory of Medicinal Chemical Biology, College of Pharmacy and Tianjin Key Laboratory of Molecular Drug Research, Nankai University, Haihe Education Park, 38 Tongyan Road, Tianjin 300353, China
- Bo Kang
- National Supercomputer Center in Tianjin, 10 Xinhuanxi Road, Tianjin Binhai New Area, Tianjin 300457, China
- Ru Wang
- State Key Laboratory of Medicinal Chemical Biology, College of Pharmacy and Tianjin Key Laboratory of Molecular Drug Research, Nankai University, Haihe Education Park, 38 Tongyan Road, Tianjin 300353, China
- Meng-Yu Sun
- State Key Laboratory of Medicinal Chemical Biology, College of Pharmacy and Tianjin Key Laboratory of Molecular Drug Research, Nankai University, Haihe Education Park, 38 Tongyan Road, Tianjin 300353, China
- Qi Wu
- National Supercomputer Center in Tianjin, 10 Xinhuanxi Road, Tianjin Binhai New Area, Tianjin 300457, China
- Xiang-Fei Meng
- National Supercomputer Center in Tianjin, 10 Xinhuanxi Road, Tianjin Binhai New Area, Tianjin 300457, China
- Jian-Ping Lin
- State Key Laboratory of Medicinal Chemical Biology, College of Pharmacy and Tianjin Key Laboratory of Molecular Drug Research, Nankai University, Haihe Education Park, 38 Tongyan Road, Tianjin 300353, China
- Biodesign Center, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, 32 West 7th Avenue, Tianjin Airport Economic Area, Tianjin 300308, China
- Platform of Pharmaceutical Intelligence, Tianjin International Joint Academy of Biomedicine, Tianjin 300457, China
|
15
|
Artificial intelligence and machine-learning approaches in structure and ligand-based discovery of drugs affecting central nervous system. Mol Divers 2022; 27:959-985. [PMID: 35819579 DOI: 10.1007/s11030-022-10489-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 04/27/2022] [Accepted: 06/21/2022] [Indexed: 12/11/2022]
Abstract
CNS disorders are indications with very high unmet medical needs, a relatively small number of available drugs, and a subpar satisfaction level among patients and caregivers. Discovery of CNS drugs is an extremely expensive affair with its own unique challenges, leading to extremely high attrition rates and low efficiency. With the explosion of data in the information age, there is hardly any aspect of life that has not been touched by data-driven technologies such as artificial intelligence (AI) and machine learning (ML). Drug discovery is no exception: the emergence of big data via genomic, proteomic, biological, and chemical technologies has driven pharmaceutical giants to collaborate with AI-oriented companies to revolutionise drug discovery, with the goal of increasing the efficiency of the process. In recent years, many examples of innovative applications of AI and ML techniques in CNS drug discovery have been reported. Research on therapeutics for diseases such as schizophrenia, Alzheimer's and Parkinsonism has been given new direction and thrust by these developments. AI and ML have been applied to both ligand-based and structure-based drug discovery and design of CNS therapeutics. In this review, we summarise the general aspects of AI and ML from the perspective of drug discovery, followed by comprehensive coverage of recent developments in the applications of AI/ML techniques in CNS drug discovery.
|
16
|
Walter M, Allen LN, de la Vega de León A, Webb SJ, Gillet VJ. Analysis of the benefits of imputation models over traditional QSAR models for toxicity prediction. J Cheminform 2022; 14:32. [PMID: 35672779 PMCID: PMC9172131 DOI: 10.1186/s13321-022-00611-w] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 04/01/2022] [Accepted: 05/12/2022] [Indexed: 11/21/2022] Open
Abstract
Recently, imputation techniques have been adapted to predict activity values among sparse bioactivity matrices, showing improvements in predictive performance over traditional QSAR models. These models are able to use experimental activity values for auxiliary assays when predicting the activity of a test compound on a specific assay. In this study, we tested three different multi-task imputation techniques on three classification-based toxicity datasets: two of small scale (12 assays each) and one large scale with 417 assays. Moreover, we analyzed in detail the improvements shown by the imputation models. We found that test compounds that were dissimilar to training compounds, as well as test compounds with a large number of experimental values for other assays, showed the largest improvements. We also investigated the impact of sparsity on the improvements seen as well as the relatedness of the assays being considered. Our results show that even a small amount of additional information can provide imputation methods with a strong boost in predictive performance over traditional single task and multi-task predictive models.
|
17
|
Rodríguez-Pérez R, Miljković F, Bajorath J. Machine Learning in Chemoinformatics and Medicinal Chemistry. Annu Rev Biomed Data Sci 2022; 5:43-65. [PMID: 35440144 DOI: 10.1146/annurev-biodatasci-122120-124216] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 11/09/2022]
Abstract
In chemoinformatics and medicinal chemistry, machine learning has evolved into an important approach. In recent years, increasing computational resources and new deep learning algorithms have put machine learning onto a new level, addressing previously unmet challenges in pharmaceutical research. In silico approaches for compound activity predictions, de novo design, and reaction modeling have been further advanced by new algorithmic developments and the emergence of big data in the field. Herein, novel applications of machine learning and deep learning in chemoinformatics and medicinal chemistry are reviewed. Opportunities and challenges for new methods and applications are discussed, placing emphasis on proper baseline comparisons, robust validation methodologies, and new applicability domains. Expected final online publication date for the Annual Review of Biomedical Data Science, Volume 5 is August 2022. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
Affiliation(s)
- Raquel Rodríguez-Pérez
- Department of Life Science Informatics, B-IT (Bonn-Aachen International Center for Information Technology), Chemical Biology and Medicinal Chemistry Program Unit, LIMES (Life and Medical Sciences Institute), Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany
- Current affiliation: Novartis Institutes for Biomedical Research, Novartis Campus, Basel, Switzerland
- Filip Miljković
- Department of Life Science Informatics, B-IT (Bonn-Aachen International Center for Information Technology), Chemical Biology and Medicinal Chemistry Program Unit, LIMES (Life and Medical Sciences Institute), Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany
- Current affiliation: Data Science and AI, Imaging and Data Analytics, Clinical Pharmacology and Safety Sciences, R&D AstraZeneca, Gothenburg, Sweden
- Jürgen Bajorath
- Department of Life Science Informatics, B-IT (Bonn-Aachen International Center for Information Technology), Chemical Biology and Medicinal Chemistry Program Unit, LIMES (Life and Medical Sciences Institute), Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany
|
18
|
Kang SG, Morrone JA, Weber JK, Cornell WD. Analysis of Training and Seed Bias in Small Molecules Generated with a Conditional Graph-Based Variational Autoencoder─Insights for Practical AI-Driven Molecule Generation. J Chem Inf Model 2022; 62:801-816. [PMID: 35130440 DOI: 10.1021/acs.jcim.1c01545] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/30/2022]
Abstract
The application of deep learning to generative molecule design has shown early promise for accelerating lead series development. However, questions remain concerning how factors like training, data set, and seed bias impact the technology's utility to medicinal and computational chemists. In this work, we analyze the impact of seed and training bias on the output of an activity-conditioned graph-based variational autoencoder (VAE). Leveraging a massive, labeled data set corresponding to the dopamine D2 receptor, our graph-based generative model is shown to excel in producing desired conditioned activities and favorable unconditioned physical properties in generated molecules. We implement an activity-swapping method that allows for the activation, deactivation, or retention of activity of molecular seeds, and we apply independent deep learning classifiers to verify the generative results. Overall, we uncover relationships between noise, molecular seeds, and training set selection across a range of latent-space sampling procedures, providing important insights for practical AI-driven molecule generation.
Affiliation(s)
- Seung-Gu Kang
- Computational Biology Center, IBM Thomas J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, New York 10594, United States
- Joseph A Morrone
- Computational Biology Center, IBM Thomas J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, New York 10594, United States
- Jeffrey K Weber
- Computational Biology Center, IBM Thomas J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, New York 10594, United States
- Wendy D Cornell
- Computational Biology Center, IBM Thomas J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, New York 10594, United States
|
19
|
Born J, Huynh T, Stroobants A, Cornell WD, Manica M. Active Site Sequence Representations of Human Kinases Outperform Full Sequence Representations for Affinity Prediction and Inhibitor Generation: 3D Effects in a 1D Model. J Chem Inf Model 2021; 62:240-257. [PMID: 34905358 DOI: 10.1021/acs.jcim.1c00889] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 12/11/2022]
Abstract
Recent advances in deep learning have enabled the development of large-scale multimodal models for virtual screening and de novo molecular design. The human kinome with its abundant sequence and inhibitor data presents an attractive opportunity to develop proteochemometric models that exploit the size and internal diversity of this family of targets. Here, we challenge a standard practice in sequence-based affinity prediction models: instead of leveraging the full primary structure of proteins, each target is represented by a sequence of 29 discontiguous residues defining the ATP binding site. In kinase-ligand binding affinity prediction, our results show that the reduced active site sequence representation is not only computationally more efficient but consistently yields significantly higher performance than the full primary structure. This trend persists across different models, data sets, and performance metrics and holds true when predicting pIC50 for both unseen ligands and kinases. Our interpretability analysis reveals a potential explanation for the superiority of the active site models: whereas only mild statistical effects about the extraction of three-dimensional (3D) interaction sites take place in the full sequence models, the active site models are equipped with an implicit but strong inductive bias about the 3D structure stemming from the discontiguity of the active sites. Moreover, in direct comparisons, our models perform similarly or better than previous state-of-the-art approaches in affinity prediction. We then investigate a de novo molecular design task and find that the active site provides benefits in the computational efficiency, but otherwise, both kinase representations yield similar optimized affinities (for both SMILES- and SELFIES-based molecular generators). Our work challenges the assumption that the full primary structure is indispensable for modeling human kinases.
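The reduced representation described in this abstract amounts to picking a fixed set of non-adjacent residues out of the full primary sequence and concatenating them. The sketch below shows only that mechanical step; the toy sequence and the positions are invented for illustration and are not the 29 ATP binding site residues used in the paper.

```python
def active_site_sequence(full_sequence, site_positions):
    """Concatenate the residues found at the given 1-based positions.

    The positions need not be adjacent, so the result is a short,
    discontiguous "sequence" of binding-site residues.
    """
    residues = []
    for position in site_positions:
        if not 1 <= position <= len(full_sequence):
            raise IndexError(f"position {position} is outside the sequence")
        residues.append(full_sequence[position - 1])
    return "".join(residues)


# Hypothetical inputs: a 20-residue toy sequence and five made-up positions.
toy_sequence = "ACDEFGHIKLMNPQRSTVWY"
toy_positions = [2, 5, 6, 11, 19]
print(active_site_sequence(toy_sequence, toy_positions))  # prints "CFGMW"
```

A sequence-based affinity model would then consume this short string in place of the full kinase sequence.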
Affiliation(s)
- Jannis Born
- IBM Research Europe, 8804 Rüschlikon, Switzerland
- Department of Biosystems Science and Engineering, ETH Zurich, 4058 Basel, Switzerland
- Tien Huynh
- IBM Research, Yorktown Heights, New York 10598, United States
- Astrid Stroobants
- Department of Chemistry, Imperial College London, SW7 2AZ London, United Kingdom
- Wendy D Cornell
- IBM Research, Yorktown Heights, New York 10598, United States
|
20
|
Simm J, Humbeck L, Zalewski A, Sturm N, Heyndrickx W, Moreau Y, Beck B, Schuffenhauer A. Splitting chemical structure data sets for federated privacy-preserving machine learning. J Cheminform 2021; 13:96. [PMID: 34876230 PMCID: PMC8650276 DOI: 10.1186/s13321-021-00576-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 07/28/2021] [Accepted: 11/22/2021] [Indexed: 11/10/2022] Open
Abstract
With the increase in applications of machine learning methods in drug design and related fields, the challenge of designing sound test sets becomes more and more prominent. The goal is to have a realistic split of chemical structures (compounds) between training, validation and test sets, such that the performance on the test set is meaningful for inferring the performance in a prospective application. This challenge is in its own right very interesting and relevant, but it is even more complex in a federated machine learning approach, where multiple partners jointly train a model under privacy-preserving conditions and chemical structures must not be shared between the participating parties. In this work we discuss three methods which provide a splitting of a data set and are applicable in a federated privacy-preserving setting, namely: (a) locality-sensitive hashing (LSH), (b) sphere exclusion clustering, and (c) scaffold-based binning (scaffold network). For evaluation of these splitting methods we consider the following quality criteria (compared to random splitting): bias in prediction performance, classification label and data imbalance, and similarity distance between the test and training set compounds. The main findings of the paper are that (a) both sphere exclusion clustering and scaffold-based binning result in high-quality splitting of the data sets, and (b) in terms of compute costs, sphere exclusion clustering is very expensive in the federated privacy-preserving setting.
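Of the three splitting methods listed, sphere exclusion clustering is the easiest to sketch in a few lines. The version below is a plain, single-party illustration, not the federated privacy-preserving procedure of the paper: fingerprints are modelled as Python sets of "on" bits, distance is one minus the Tanimoto similarity, and the fold assignment rule (largest cluster first, into the currently smallest fold) is an assumption made for the example.

```python
def tanimoto_distance(bits_a, bits_b):
    """One minus the Tanimoto similarity of two fingerprints given as bit sets."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0
    return 1.0 - len(bits_a & bits_b) / union


def sphere_exclusion(fingerprints, radius):
    """Greedy sphere exclusion clustering.

    The first unassigned compound becomes a cluster centre; every unassigned
    compound within `radius` of it joins that cluster and is excluded from
    later rounds.
    """
    unassigned = list(range(len(fingerprints)))
    clusters = []
    while unassigned:
        centre = unassigned[0]
        members = [
            i for i in unassigned
            if tanimoto_distance(fingerprints[centre], fingerprints[i]) <= radius
        ]
        clusters.append(members)
        member_set = set(members)
        unassigned = [i for i in unassigned if i not in member_set]
    return clusters


def assign_folds(clusters, n_folds):
    """Place whole clusters into folds so similar compounds stay together."""
    folds = [[] for _ in range(n_folds)]
    for cluster in sorted(clusters, key=len, reverse=True):
        min(folds, key=len).extend(cluster)
    return folds


# Hypothetical fingerprints for five compounds.
fingerprints = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}, {20, 21}]
clusters = sphere_exclusion(fingerprints, radius=0.5)
print(clusters)                   # [[0, 1], [2, 3], [4]]
print(assign_folds(clusters, 2))  # [[0, 1, 4], [2, 3]]
```

Because every cluster lands in exactly one fold, near-duplicate compounds cannot sit on both sides of a train/test boundary, which is the property the quality criteria in the abstract are meant to measure.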
Affiliation(s)
- Jaak Simm
- KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, 3001, Heverlee, Belgium
- Lina Humbeck
- Medicinal Chemistry Department, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Str. 65, 88397, Biberach an der Riss, Germany
- Adam Zalewski
- Amgen Research (Munich) GmbH, Staffelseestraße 2, 81477, Munich, Germany
- Noe Sturm
- Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002, Basel, Switzerland
- Wouter Heyndrickx
- Janssen Pharmaceutica N.V., Janssen Pharmaceutica, Turnhoutseweg 30, 2340, Beerse, Belgium
- Yves Moreau
- KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, 3001, Heverlee, Belgium
- Bernd Beck
- Medicinal Chemistry Department, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Str. 65, 88397, Biberach an der Riss, Germany
- Ansgar Schuffenhauer
- Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002, Basel, Switzerland.
|
21
|
Mathai N, Chen Y, Kirchmair J. Validation strategies for target prediction methods. Brief Bioinform 2021; 21:791-802. [PMID: 31220208 PMCID: PMC7299289 DOI: 10.1093/bib/bbz026] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 11/26/2018] [Revised: 01/14/2019] [Accepted: 02/17/2019] [Indexed: 12/11/2022] Open
Abstract
Computational methods for target prediction, based on molecular similarity and network-based approaches, machine learning, docking and others, have evolved as valuable and powerful tools to aid the challenging task of mode of action identification for bioactive small molecules such as drugs and drug-like compounds. Critical to discerning the scope and limitations of a target prediction method is understanding how its performance was evaluated and reported. Ideally, large-scale prospective experiments are conducted to validate the performance of a model; however, this expensive and time-consuming endeavor is often not feasible. Therefore, to estimate the predictive power of a method, statistical validation based on retrospective knowledge is commonly used. There are multiple statistical validation techniques that vary in rigor. In this review we discuss the validation strategies employed, highlighting the usefulness and constraints of the validation schemes and metrics that are employed to measure and describe performance. We address the limitations of measuring only generalized performance, given that the underlying bioactivity and structural data are biased towards certain small-molecule scaffolds and target families, and suggest additional aspects of performance to consider in order to produce more detailed and realistic estimates of predictive power. Finally, we describe the validation strategies that were employed by some of the most thoroughly validated and accessible target prediction methods.
Affiliation(s)
- Neann Mathai
  - Department of Chemistry, University of Bergen, Bergen, Norway
  - Computational Biology Unit (CBU), University of Bergen, Bergen, Norway
  - Center for Bioinformatics (ZBH), Department of Computer Science, Faculty of Mathematics, Informatics and Natural Sciences, Universität Hamburg, Hamburg, Germany
- Ya Chen
  - Center for Bioinformatics (ZBH), Department of Computer Science, Faculty of Mathematics, Informatics and Natural Sciences, Universität Hamburg, Hamburg, Germany
- Johannes Kirchmair
  - Department of Chemistry, University of Bergen, Bergen, Norway
  - Computational Biology Unit (CBU), University of Bergen, Bergen, Norway
  - Center for Bioinformatics (ZBH), Department of Computer Science, Faculty of Mathematics, Informatics and Natural Sciences, Universität Hamburg, Hamburg, Germany
|
22
|
Zuo Z, Wang P, Chen X, Tian L, Ge H, Qian D. SWnet: a deep learning model for drug response prediction from cancer genomic signatures and compound chemical structures. BMC Bioinformatics 2021; 22:434. [PMID: 34507532 PMCID: PMC8434731 DOI: 10.1186/s12859-021-04352-9] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 03/30/2021] [Accepted: 08/31/2021] [Indexed: 12/13/2022] Open
Abstract
Background One of the major challenges in precision medicine is accurate prediction of an individual patient’s response to drugs. A great number of computational methods have been developed to predict compound activity using genomic profiles or chemical structures, but more exploration is needed to combine genetic mutation, gene expression, and cheminformatics in one machine learning model. Results We present a novel deep-learning model that integrates gene expression, genetic mutation, and chemical structure of compounds in a multi-task convolutional architecture. We applied our model to the Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE) datasets. We selected relevant cancer-related genes based on an oncology genetics database and the L1000 landmark genes, and used their expression and mutations as genomic features in model training. We obtained the cheminformatics features for compounds from PubChem or ChEMBL. We found that combining gene expression, genetic mutation, and cheminformatics features greatly enhances predictive performance. Conclusion We implemented an extended Graph Neural Network for molecular graphs and a Convolutional Neural Network for gene features. By employing multi-tasking and self-attention functions to monitor the similarity between compounds, our model outperforms recently published methods using the same training and testing datasets. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04352-9.
Affiliation(s)
- Zhaorui Zuo
  - Institute of Medical Robotics, Shanghai Jiao Tong University, 2F of the Translational Medicine Building, No. 800 Dongchuan Road, Shanghai, 200000, China
- Penglei Wang
  - Institute of Medical Robotics, Shanghai Jiao Tong University, 2F of the Translational Medicine Building, No. 800 Dongchuan Road, Shanghai, 200000, China
- Xiaowei Chen
  - Novartis Institutes for Biomedical Research, 4218 Jinke Road, Pudong, Shanghai, 201203, China
- Li Tian
  - Novartis Institutes for Biomedical Research, 4218 Jinke Road, Pudong, Shanghai, 201203, China
- Hui Ge
  - Novartis Institutes for Biomedical Research, 4218 Jinke Road, Pudong, Shanghai, 201203, China
- Dahong Qian
  - Institute of Medical Robotics, Shanghai Jiao Tong University, 2F of the Translational Medicine Building, No. 800 Dongchuan Road, Shanghai, 200000, China
|
23
|
Visani GM, Hughes MC, Hassoun S. Enzyme Promiscuity Prediction Using Hierarchy-Informed Multi-Label Classification. Bioinformatics 2021; 37:2017–2024. [PMID: 33515234 PMCID: PMC8337005 DOI: 10.1093/bioinformatics/btab054] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 07/31/2020] [Revised: 12/30/2020] [Accepted: 01/22/2021] [Indexed: 11/25/2022] Open
Abstract
MOTIVATION As experimental efforts are costly and time consuming, computational characterization of enzyme capabilities is an attractive alternative. We present and evaluate several machine-learning models to predict which of 983 distinct enzymes, as defined via the Enzyme Commission (EC) numbers, are likely to interact with a given query molecule. Our data consists of enzyme-substrate interactions from the BRENDA database. Some interactions are attributed to natural selection and involve the enzyme's natural substrates. The majority of the interactions however involve non-natural substrates, thus reflecting promiscuous enzymatic activities. RESULTS We frame this "enzyme promiscuity prediction" problem as a multi-label classification task. We maximally utilize inhibitor and unlabelled data to train prediction models that can take advantage of known hierarchical relationships between enzyme classes. We report that a hierarchical multi-label neural network, EPP-HMCNF, is the best model for solving this problem, outperforming k-nearest neighbours similarity-based and other machine learning models. We show that inhibitor information during training consistently improves predictive power, particularly for EPP-HMCNF. We also show that all promiscuity prediction models perform worse under a realistic data split when compared to a random data split, and when evaluating performance on non-natural substrates compared to natural substrates. AVAILABILITY AND IMPLEMENTATION We provide Python code for EPP-HMCNF and other models in a repository termed EPP (Enzyme Promiscuity Prediction) at https://github.com/hassounlab/EPP. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Gian Marco Visani
  - Department of Computer Science, Tufts University, Medford, MA 02155, USA
- Michael C Hughes
  - Department of Computer Science, Tufts University, Medford, MA 02155, USA
- Soha Hassoun
  - Department of Computer Science, Tufts University, Medford, MA 02155, USA
  - Department of Chemical and Biological Engineering, Tufts University, Medford, MA 02155, USA
|
24
|
Vatansever S, Schlessinger A, Wacker D, Kaniskan HÜ, Jin J, Zhou M, Zhang B. Artificial intelligence and machine learning-aided drug discovery in central nervous system diseases: State-of-the-arts and future directions. Med Res Rev 2021; 41:1427-1473. [PMID: 33295676 PMCID: PMC8043990 DOI: 10.1002/med.21764] [Citation(s) in RCA: 131] [Impact Index Per Article: 32.8] [Received: 07/23/2020] [Revised: 10/30/2020] [Accepted: 11/20/2020] [Indexed: 01/11/2023]
Abstract
Neurological disorders significantly outnumber diseases in other therapeutic areas. However, developing drugs for central nervous system (CNS) disorders remains the most challenging area in drug discovery, accompanied by long timelines and high attrition rates. With the rapid growth of biomedical data enabled by advanced experimental technologies, artificial intelligence (AI) and machine learning (ML) have emerged as indispensable tools to draw meaningful insights and improve decision making in drug discovery. Thanks to advances in AI and ML algorithms, AI/ML-driven solutions now have unprecedented potential to accelerate the process of CNS drug discovery with a better success rate. In this review, we comprehensively summarize AI/ML-powered pharmaceutical discovery efforts and their implementations in the CNS area. After introducing the AI/ML models as well as the conceptualization and data preparation, we outline the applications of AI/ML technologies to several key procedures in drug discovery, including target identification, compound screening, hit/lead generation and optimization, drug response and synergy prediction, de novo drug design, and drug repurposing. We review the current state of the art of AI/ML-guided CNS drug discovery, focusing on blood-brain barrier permeability prediction and implementation into therapeutic discovery for neurological diseases. Finally, we discuss the major challenges and limitations of current approaches and possible future directions that may resolve these difficulties.
Affiliation(s)
- Sezen Vatansever
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Mount Sinai Center for Transformative Disease Modeling, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Avner Schlessinger
  - Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Mount Sinai Center for Therapeutics Discovery, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Daniel Wacker
  - Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Mount Sinai Center for Therapeutics Discovery, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Department of Neuroscience, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- H. Ümit Kaniskan
  - Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Mount Sinai Center for Therapeutics Discovery, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Department of Oncological Sciences, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Jian Jin
  - Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Mount Sinai Center for Therapeutics Discovery, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Department of Oncological Sciences, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Ming‐Ming Zhou
  - Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Department of Oncological Sciences, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Bin Zhang
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Mount Sinai Center for Transformative Disease Modeling, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
|
25
|
Martin EJ, Zhu XW. Collaborative Profile-QSAR: A Natural Platform for Building Collaborative Models among Competing Companies. J Chem Inf Model 2021; 61:1603-1616. [PMID: 33844519 DOI: 10.1021/acs.jcim.0c01342] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/29/2022]
Abstract
Massively multitask bioactivity models that transfer learning between thousands of assays have been shown to work dramatically better than separate models trained on each individual assay. In particular, the applicability domain for a given model can expand from compounds similar to those tested in that specific assay to those tested across the full complement of contributing assays. If many large companies shared their assay data and trained models on the superset, predictions should be better than what each company can achieve alone. However, a company's compounds, targets, and activities are among its most guarded trade secrets. Strategies have been proposed to share just the individual collaborators' models, without exposing any of the training data. Profile-QSAR (pQSAR) is a two-level, multitask, stacked model. It uses profiles of level-1 predictions from single-task models for thousands of assays as compound descriptors for level-2 models. This work describes its simple and natural adaptation to safe collaboration by model sharing. Broad model sharing has not yet been implemented across multiple large companies, so there are numerous unanswered questions. Novartis was formed from several mergers and acquisitions. In principle, this should allow an internal simulation of model sharing. In practice, the lack of metadata about the origins of compounds and assays made this difficult. Nevertheless, we have attempted to simulate this process and propose some findings: multitask pQSAR is always an improvement over single-task models; collaborative multitask modeling did not improve predictions on internal compounds; collaboration did improve predictions for external compounds but far less than the purely internal multitask modeling for internal compounds; collaborative models for external compounds increasingly improve as overlap between compound collections increases; combining profiles from inside and outside the company is not best, with internal predictions better using only the inside profile and external using only the outside profile, but a consensus of models using all three profiles is best on external compounds and a good compromise on internal compounds. We anticipate similar results from other model-sharing approaches. Indeed, since collaborative pQSAR through model sharing is mathematically identical to pQSAR using actual shared data, we believe our conclusions should apply to collaborative modeling by any current method, even including the unlikely scenario of directly sharing all chemical structures and assay data.
Affiliation(s)
- Eric J Martin
  - Novartis Institute for Biomedical Research, 5959 Horton Street, Emeryville, California 94608-2916, United States
- Xiang-Wei Zhu
  - Novartis Institute for Biomedical Research, 5959 Horton Street, Emeryville, California 94608-2916, United States
|
26
|
Schuffenhauer A, Schneider N, Hintermann S, Auld D, Blank J, Cotesta S, Engeloch C, Fechner N, Gaul C, Giovannoni J, Jansen J, Joslin J, Krastel P, Lounkine E, Manchester J, Monovich LG, Pelliccioli AP, Schwarze M, Shultz MD, Stiefl N, Baeschlin DK. Evolution of Novartis' Small Molecule Screening Deck Design. J Med Chem 2020; 63:14425-14447. [PMID: 33140646 DOI: 10.1021/acs.jmedchem.0c01332] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Indexed: 12/22/2022]
Abstract
This article summarizes the evolution of the screening deck at the Novartis Institutes for BioMedical Research (NIBR). Historically, the screening deck was an assembly of all available compounds. In 2015, we designed a first deck to facilitate access to diverse subsets with optimized properties. We allocated the compounds as plated subsets on a 2D grid with property based ranking in one dimension and increasing structural redundancy in the other. The learnings from the 2015 screening deck were applied to the design of a next generation in 2019. We found that using traditional leadlikeness criteria (mainly MW, clogP) reduces the hit rates of attractive chemical starting points in subset screening. Consequently, the 2019 deck relies on solubility and permeability to select preferred compounds. The 2019 design also uses NIBR's experimental assay data and inferred biological activity profiles in addition to structural diversity to define redundancy across the compound sets.
Affiliation(s)
- Ansgar Schuffenhauer
  - Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Nadine Schneider
  - Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Samuel Hintermann
  - Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Douglas Auld
  - Novartis Institutes for BioMedical Research Inc., 181 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Jutta Blank
  - Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Simona Cotesta
  - Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Caroline Engeloch
  - Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Nikolas Fechner
  - Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Christoph Gaul
  - Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Jerome Giovannoni
  - Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Johanna Jansen
  - Novartis Institutes for BioMedical Research-Emeryville, 5300 Chiron Way, Emeryville, California 94608-2916, United States
- John Joslin
  - Genomics Institute of the Novartis Foundation, San Diego, California 92121, United States
- Philipp Krastel
  - Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Eugen Lounkine
  - Novartis Institutes for BioMedical Research Inc., 181 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- John Manchester
  - Novartis Institutes for BioMedical Research Inc., 181 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Lauren G Monovich
  - Novartis Institutes for BioMedical Research Inc., 181 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Anna Paola Pelliccioli
  - Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Manuel Schwarze
  - Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Michael D Shultz
  - Novartis Institutes for BioMedical Research Inc., 181 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Nikolaus Stiefl
  - Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Daniel K Baeschlin
  - Novartis Institutes for BioMedical Research, Novartis Campus, CH-4002 Basel, Switzerland
|
27
|
Stecula A, Hussain MS, Viola RE. Discovery of Novel Inhibitors of a Critical Brain Enzyme Using a Homology Model and a Deep Convolutional Neural Network. J Med Chem 2020; 63:8867-8875. [DOI: 10.1021/acs.jmedchem.0c00473] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Indexed: 12/24/2022]
Affiliation(s)
- Adrian Stecula
  - Atomwise Inc., San Francisco, California 94103, United States
- Muhammad S. Hussain
  - Department of Chemistry and Biochemistry, University of Toledo, Toledo, Ohio 43606, United States
- Ronald E. Viola
  - Department of Chemistry and Biochemistry, University of Toledo, Toledo, Ohio 43606, United States
|
28
|
Morris P, St. Clair R, Hahn WE, Barenholtz E. Predicting Binding from Screening Assays with Transformer Network Embeddings. J Chem Inf Model 2020; 60:4191-4199. [DOI: 10.1021/acs.jcim.9b01212] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 01/01/2023]
Affiliation(s)
- Paul Morris
  - Center for Complex Systems and Brain Sciences, Florida Atlantic University, Boca Raton, Florida 33431, United States
- Rachel St. Clair
  - Center for Complex Systems and Brain Sciences, Florida Atlantic University, Boca Raton, Florida 33431, United States
- William Edward Hahn
  - Center for Complex Systems and Brain Sciences, Florida Atlantic University, Boca Raton, Florida 33431, United States
- Elan Barenholtz
  - Center for Complex Systems and Brain Sciences, Florida Atlantic University, Boca Raton, Florida 33431, United States
|
29
|
Irwin BWJ, Levell JR, Whitehead TM, Segall MD, Conduit GJ. Practical Applications of Deep Learning To Impute Heterogeneous Drug Discovery Data. J Chem Inf Model 2020; 60:2848-2857. [PMID: 32478517 DOI: 10.1021/acs.jcim.0c00443] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Indexed: 01/20/2023]
Abstract
Contemporary deep learning approaches still struggle to bring a useful improvement in the field of drug discovery because of the challenges of sparse, noisy, and heterogeneous data that are typically encountered in this context. We use a state-of-the-art deep learning method, Alchemite, to impute data from drug discovery projects, including multitarget biochemical activities, phenotypic activities in cell-based assays, and a variety of absorption, distribution, metabolism, and excretion (ADME) endpoints. The resulting model gives excellent predictions for activity and ADME endpoints, offering an average increase in R² of 0.22 versus quantitative structure-activity relationship methods. The model accuracy is robust to combining data across uncorrelated endpoints and projects with different chemical spaces, enabling a single model to be trained for all compounds and endpoints. We demonstrate improvements in accuracy on the latest chemistry and data when updating models with new data as an ongoing medicinal chemistry project progresses.
Affiliation(s)
- Benedict W J Irwin
  - Optibrium Limited, Cambridge Innovation Park, Denny End Rd, Cambridge CB25 9PB, U.K.
  - Cavendish Laboratory, University of Cambridge, 19 JJ Thomson Avenue, Cambridge CB3 0HE, U.K.
- Julian R Levell
  - Constellation Pharmaceuticals Inc., 215 First St Suite 200, Cambridge, Massachusetts 02142, United States
- Thomas M Whitehead
  - Intellegens Limited, Eagle Labs, 28 Chesterton Road, Cambridge CB4 3AZ, U.K.
- Matthew D Segall
  - Optibrium Limited, Cambridge Innovation Park, Denny End Rd, Cambridge CB25 9PB, U.K.
- Gareth J Conduit
  - Intellegens Limited, Eagle Labs, 28 Chesterton Road, Cambridge CB4 3AZ, U.K.
  - Cavendish Laboratory, University of Cambridge, 19 JJ Thomson Avenue, Cambridge CB3 0HE, U.K.
|
30
|
Cortés-Ciriano I, Škuta C, Bender A, Svozil D. QSAR-derived affinity fingerprints (part 2): modeling performance for potency prediction. J Cheminform 2020; 12:41. [PMID: 33431016 PMCID: PMC7339533 DOI: 10.1186/s13321-020-00444-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 08/07/2019] [Accepted: 05/16/2020] [Indexed: 01/22/2023] Open
Abstract
Affinity fingerprints report the activity of small molecules across a set of assays. They thus make it possible to gather information about the bioactivities of structurally dissimilar compounds, where models based on chemical structure alone are often limited, and to model complex biological endpoints, such as human toxicity and in vitro cancer cell line sensitivity. Here, we propose to model in vitro compound activity using computationally predicted bioactivity profiles as compound descriptors. To this end, we apply and validate a framework for the calculation of QSAR-derived affinity fingerprints (QAFFP) using a set of 1360 QSAR models generated using Ki, Kd, IC50 and EC50 data from the ChEMBL database. QAFFP thus represent a method to encode and relate compounds on the basis of their similarity in bioactivity space. To benchmark the predictive power of QAFFP we assembled IC50 data from the ChEMBL database for 18 diverse cancer cell lines widely used in preclinical drug discovery, and 25 diverse protein target data sets. This study complements part 1, where the performance of QAFFP in similarity searching, scaffold hopping, and bioactivity classification is evaluated. Although QAFFP are inherently noisy, we show that using them as descriptors leads to errors in prediction on the test set in the ~0.65-0.95 pIC50 units range, which are comparable to the estimated uncertainty of bioactivity data in ChEMBL (0.76-1.00 pIC50 units). We find that the predictive power of QAFFP is slightly worse than that of Morgan2 fingerprints and 1D and 2D physicochemical descriptors, with an effect size in the 0.02-0.08 pIC50 units range. Including QSAR models with low predictive power in the generation of QAFFP does not lead to improved predictive power. Given that the QSAR models we used to compute the QAFFP were selected on the basis of data availability alone, we anticipate better modeling results for QAFFP generated using more diverse and biologically meaningful targets. Data sets and Python code are publicly available at https://github.com/isidroc/QAFFP_regression.
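To make the pIC50 units in this abstract concrete: pIC50 is the negative base-10 logarithm of the IC50 expressed in mol/L, so an error of 1 pIC50 unit corresponds to a tenfold error in IC50. The Python sketch below shows the conversion and a root-mean-square error on the pIC50 scale. The function names are hypothetical, and the choice of RMSE is an assumption made here; the abstract does not state which error metric was used.

```python
import math
from typing import Sequence


def pic50_from_nanomolar(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 mol/L, hence the offset of 9.
    return 9.0 - math.log10(ic50_nm)


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-square error between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))
```

On this scale, a compound with an IC50 of 1000 nM (1 µM) has a pIC50 of 6, and a model whose predictions are each off by half a log unit has an RMSE of 0.5 pIC50 units.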
Affiliation(s)
- Isidro Cortés-Ciriano
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK
- Ctibor Škuta
  - CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Institute of Molecular Genetics of the ASCR, v. v. i., Vídeňská 1083, 142 20, Prague, Czech Republic
- Andreas Bender
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
- Daniel Svozil
  - CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Institute of Molecular Genetics of the ASCR, v. v. i., Vídeňská 1083, 142 20, Prague, Czech Republic
  - CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Technická 5, 166 28, Prague, Czech Republic
|
31
|
Škuta C, Cortés-Ciriano I, Dehaen W, Kříž P, van Westen GJP, Tetko IV, Bender A, Svozil D. QSAR-derived affinity fingerprints (part 1): fingerprint construction and modeling performance for similarity searching, bioactivity classification and scaffold hopping. J Cheminform 2020; 12:39. [PMID: 33431038 PMCID: PMC7260783 DOI: 10.1186/s13321-020-00443-6] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 08/07/2019] [Accepted: 05/16/2020] [Indexed: 02/11/2023] Open
Abstract
An affinity fingerprint is a vector consisting of a compound’s affinity or potency against a reference panel of protein targets. Here, we present the QAFFP fingerprint, a 440-element in silico QSAR-based affinity fingerprint whose components are predicted by Random Forest regression models trained on bioactivity data from the ChEMBL database. Both real-valued (rv-QAFFP) and binary (b-QAFFP) versions of the QAFFP fingerprint were implemented, and their performance in similarity searching, biological activity classification and scaffold hopping was assessed and compared to that of the 1024-bit Morgan2 fingerprint (the RDKit implementation of the ECFP4 fingerprint). In both similarity searching and biological activity classification, the QAFFP fingerprint yields retrieval rates, measured by AUC (~0.65 and ~0.70 for similarity searching depending on data sets, and ~0.85 for classification) and EF5 (~4.67 and ~5.82 for similarity searching depending on data sets, and ~2.10 for classification), comparable to those of the Morgan2 fingerprint (similarity searching AUC of ~0.57 and ~0.66, and EF5 of ~4.09 and ~6.41, depending on data sets; classification AUC of ~0.87 and EF5 of ~2.16). However, the QAFFP fingerprint outperforms the Morgan2 fingerprint in scaffold hopping, as it is able to retrieve 1146 of the 1749 existing scaffolds, while the Morgan2 fingerprint retrieves only 864 scaffolds.
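The EF5 values quoted in this abstract are presumably enrichment factors at the top 5% of the ranked list. Under the standard definition, the enrichment factor is the fraction of actives among the top-ranked compounds divided by the fraction of actives in the whole set. The Python sketch below illustrates that definition only; the function name, the rounding rule for the size of the top slice, and the handling of ties are assumptions made here, not details taken from the paper.

```python
from typing import Sequence


def enrichment_factor(
    scores: Sequence[float],
    is_active: Sequence[bool],
    fraction: float = 0.05,
) -> float:
    """Enrichment factor for the top `fraction` of compounds ranked by score.

    EF = (actives in top slice / size of top slice) / (all actives / all compounds)
    """
    n = len(scores)
    if n == 0 or n != len(is_active):
        raise ValueError("scores and is_active must be non-empty and equal length")
    total_actives = sum(1 for a in is_active if a)
    if total_actives == 0:
        raise ValueError("at least one active compound is required")
    # Size of the top slice; at least one compound (rounding rule is an assumption).
    n_top = max(1, int(round(fraction * n)))
    # Rank by descending score; ties keep their input order (stable sort).
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    hits_in_top = sum(1 for _, active in ranked[:n_top] if active)
    return (hits_in_top * n) / (n_top * total_actives)
```

An enrichment factor of 1 corresponds to random ranking, so values such as the ~4 to ~6 reported for similarity searching indicate that actives are several times more concentrated in the top slice than in the data set as a whole.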
Affiliation(s)
- C Škuta
  - CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Institute of Molecular Genetics of the ASCR, v. v. i., Vídeňská 1083, 142 20, Prague 4, Czech Republic
- I Cortés-Ciriano
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
- W Dehaen
  - CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Institute of Molecular Genetics of the ASCR, v. v. i., Vídeňská 1083, 142 20, Prague 4, Czech Republic
  - CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Technická 5, 166 28, Prague, Czech Republic
- P Kříž
  - Department of Mathematics, Faculty of Chemical Engineering, University of Chemistry and Technology Prague, Technická 5, 166 28, Prague, Czech Republic
- G J P van Westen
  - Computational Drug Discovery, Drug Discovery and Safety, LACDR, Leiden University, Einsteinweg 55, 2333 CC, Leiden, The Netherlands
- I V Tetko
  - Helmholtz Zentrum Muenchen - German Research Center for Environmental Health (GmbH) and BIGCHEM GmbH, Ingolstaedter Landstrasse 1, 85764, Neuherberg, Germany
- A Bender
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
- D Svozil
  - CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Institute of Molecular Genetics of the ASCR, v. v. i., Vídeňská 1083, 142 20, Prague 4, Czech Republic
  - CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Technická 5, 166 28, Prague, Czech Republic
|
32
|
Norinder U, Spjuth O, Svensson F. Using Predicted Bioactivity Profiles to Improve Predictive Modeling. J Chem Inf Model 2020; 60:2830-2837. [PMID: 32374618 DOI: 10.1021/acs.jcim.0c00250] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 12/28/2022]
Abstract
Predictive modeling is a cornerstone in early drug development. Using information for multiple domains or across prediction tasks has the potential to improve the performance of predictive modeling. However, aggregating data often leads to incomplete data matrices that might be limiting for modeling. In line with previous studies, we show that by generating predicted bioactivity profiles, and using these as additional features, prediction accuracy of biological endpoints can be improved. Using conformal prediction, a type of confidence predictor, we present a robust framework for the calculation of these profiles and the evaluation of their impact. We report on the outcomes from several approaches to generate the predicted profiles on 16 datasets in cytotoxicity and bioactivity and show that efficiency is improved the most when including the p-values from conformal prediction as bioactivity profiles.
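The conformal p-values that the abstract uses as bioactivity-profile features can be illustrated with a minimal sketch. It assumes an inductive (split) conformal setup with class-conditional calibration, which the abstract does not spell out; the function name `conformal_p_value` and all nonconformity scores below are hypothetical.

```python
def conformal_p_value(calibration_scores, test_score):
    """Inductive conformal p-value for one candidate label.

    calibration_scores: nonconformity scores of held-out calibration
    examples that carry the candidate label (class-conditional).
    test_score: nonconformity score of the new compound for that label.
    """
    # Count calibration examples at least as nonconforming as the test one.
    at_least_as_strange = sum(1 for s in calibration_scores if s >= test_score)
    # The +1 terms count the test example itself, keeping p in (0, 1].
    return (at_least_as_strange + 1) / (len(calibration_scores) + 1)


# Hypothetical nonconformity scores (e.g. 1 - predicted class probability).
cal_active = [0.10, 0.20, 0.35, 0.50, 0.80]
cal_inactive = [0.05, 0.15, 0.30, 0.60, 0.90]

# One new compound scored against each label gives two profile features.
p_active = conformal_p_value(cal_active, 0.40)
p_inactive = conformal_p_value(cal_inactive, 0.70)
profile_features = [p_active, p_inactive]
```

Repeating this for every endpoint model would give each compound a dense vector of p-values, even where the measured data matrix is incomplete.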
Affiliation(s)
- Ulf Norinder
- Department of Computer and Systems Sciences, Stockholm University, Box 7003, SE-164 07 Kista, Sweden; Department of Pharmaceutical Biosciences, Uppsala University, Box 591, SE-75124 Uppsala, Sweden; MTM Research Centre, School of Science and Technology, Örebro University, SE-70182 Örebro, Sweden
| | - Ola Spjuth
- Department of Pharmaceutical Biosciences, Uppsala University, Box 591, SE-75124 Uppsala, Sweden; Science for Life Laboratory, Uppsala University, Box 591, SE-75124 Uppsala, Sweden
| | - Fredrik Svensson
- The Alzheimer's Research UK University College London Drug Discovery Institute, The Cruciform Building, Gower Street, WC1E 6BT London, U.K
| |
|
33
|
Imputation versus prediction: applications in machine learning for drug discovery. FUTURE DRUG DISCOVERY 2020. [DOI: 10.4155/fdd-2020-0008] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
Imputation is a powerful statistical method that is distinct from the predictive modelling techniques more commonly used in drug discovery. Imputation uses sparse experimental data in an incomplete dataset to predict missing values by leveraging correlations between experimental assays. This contrasts with quantitative structure–activity relationship methods that use only descriptor – assay correlations. We summarize three recent imputation strategies – heterogeneous deep imputation, assay profile methods and matrix factorization – and compare these with quantitative structure–activity relationship methods, including deep learning, in drug discovery settings. We comment on the value added by imputation methods when used in an ongoing project and find that imputation produces stronger models, earlier in the project, over activity and absorption, distribution, metabolism and elimination end points.
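The contrast drawn above, imputation from assay-to-assay correlations rather than descriptor-to-assay correlations, can be sketched with a deliberately simple stand-in. This is not one of the three strategies the review summarizes: it assumes just two assays related by a straight line, and the compound names, pIC50 values and helper `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical sparse pIC50 table: None marks an unmeasured value.
assay_a = {"cpd1": 5.0, "cpd2": 6.0, "cpd3": 7.0, "cpd4": 8.0}
assay_b = {"cpd1": 5.5, "cpd2": 6.5, "cpd3": 7.5, "cpd4": None}

# Learn the assay-to-assay relationship from compounds measured in both.
shared = [c for c in assay_a if assay_b[c] is not None]
slope, intercept = fit_line([assay_a[c] for c in shared],
                            [assay_b[c] for c in shared])

# Impute the missing assay B value from the measured assay A value.
imputed_b = {c: (v if v is not None else slope * assay_a[c] + intercept)
             for c, v in assay_b.items()}
```

No structural descriptor enters the calculation; the only input for the missing value is another experimental measurement on the same compound.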
|
34
|
Abstract
Virtual screening is no longer merely a matter of identifying the subset of compounds from a large collection likely to be active against a particular endpoint. This viewpoint shares some distinctive practices at Novartis, where virtual screening combines multiple computational tools that marry the competing goals of biasing the selection of compounds toward multiple desired properties, while diversifying the selection to sample the available chemistry space, identifying quality compounds that inform drug discovery. Topics include the various considerations needed for a successful virtual screening practice: triaging, compound quality, accuracy and test sets, activity prediction including multitask modeling, virtual profiling, automation, multiproperty bias, diversity and property spaces, and biased-diversity designs.
Affiliation(s)
- Eric J Martin
- Novartis Institutes for BioMedical Research, 5300 Chiron Way, Emeryville, California 94608, United States
| | - Johanna M Jansen
- Novartis Institutes for BioMedical Research, 5300 Chiron Way, Emeryville, California 94608, United States
| |
|
35
|
|
36
|
Schroedl S. Current methods and challenges for deep learning in drug discovery. DRUG DISCOVERY TODAY. TECHNOLOGIES 2019; 32-33:9-17. [PMID: 33386100 DOI: 10.1016/j.ddtec.2020.07.003] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/15/2020] [Revised: 06/17/2020] [Accepted: 07/24/2020] [Indexed: 12/18/2022]
Abstract
Driven by rapid advances in computer hardware and publicly available datasets over the past decade, deep learning has achieved tremendous success in the transformation of many computational disciplines. These novel technologies have had considerable impact on computer-aided drug design as well, throughout all stages of the development pipeline. A flexible toolbox of neural architectures has been developed that are well-suited to represent the sequential, topological, or geometrical concepts of chemistry and biology; and that are able to either discriminate existing molecules or to generate new ones from scratch. For some biochemical prediction tasks, the state of the art has been advanced; however, for complex and practically relevant projects, the outcomes are less clear-cut. Current deep learning methods rely on massive amounts of labeled examples, but drug discovery data is comparatively limited in quantity and quality. These problems need to be resolved and existing sources used more effectively to demonstrate that deep learning can revolutionize the field in general.
|
37
|
Zakharov AV, Zhao T, Nguyen DT, Peryea T, Sheils T, Yasgar A, Huang R, Southall N, Simeonov A. Novel Consensus Architecture To Improve Performance of Large-Scale Multitask Deep Learning QSAR Models. J Chem Inf Model 2019; 59:4613-4624. [PMID: 31584270 DOI: 10.1021/acs.jcim.9b00526] [Citation(s) in RCA: 37] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Abstract
Advances in the development of high-throughput screening and automated chemistry have rapidly accelerated the production of chemical and biological data, much of them freely accessible through literature aggregator services such as ChEMBL and PubChem. Here, we explore how to use this comprehensive mapping of chemical biology space to support the development of large-scale quantitative structure-activity relationship (QSAR) models. We propose a new deep learning consensus architecture (DLCA) that combines consensus and multitask deep learning approaches together to generate large-scale QSAR models. This method improves knowledge transfer across different target/assays while also integrating contributions from models based on different descriptors. The proposed approach was validated and compared with proteochemometrics, multitask deep learning, and Random Forest methods paired with various descriptors types. DLCA models demonstrated improved prediction accuracy for both regression and classification tasks. The best models together with their modeling sets are provided through publicly available web services at https://predictor.ncats.io.
Affiliation(s)
- Alexey V Zakharov
- National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, 9800 Medical Center Drive, Rockville, Maryland 20850, United States
| | - Tongan Zhao
- National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, 9800 Medical Center Drive, Rockville, Maryland 20850, United States
| | - Dac-Trung Nguyen
- National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, 9800 Medical Center Drive, Rockville, Maryland 20850, United States
| | - Tyler Peryea
- National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, 9800 Medical Center Drive, Rockville, Maryland 20850, United States
| | - Timothy Sheils
- National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, 9800 Medical Center Drive, Rockville, Maryland 20850, United States
| | - Adam Yasgar
- National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, 9800 Medical Center Drive, Rockville, Maryland 20850, United States
| | - Ruili Huang
- National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, 9800 Medical Center Drive, Rockville, Maryland 20850, United States
| | - Noel Southall
- National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, 9800 Medical Center Drive, Rockville, Maryland 20850, United States
| | - Anton Simeonov
- National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, 9800 Medical Center Drive, Rockville, Maryland 20850, United States
| |
|
38
|
Martin EJ, Polyakov VR, Zhu XW, Tian L, Mukherjee P, Liu X. All-Assay-Max2 pQSAR: Activity Predictions as Accurate as Four-Concentration IC50s for 8558 Novartis Assays. J Chem Inf Model 2019; 59:4450-4459. [PMID: 31518124 DOI: 10.1021/acs.jcim.9b00375] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/20/2023]
Abstract
Profile-quantitative structure-activity relationship (pQSAR) is a massively multitask, two-step machine learning method with unprecedented scope, accuracy, and applicability domain. In step one, a "profile" of conventional single-assay random forest regression models are trained on a very large number of biochemical and cellular pIC50 assays using Morgan 2 substructural fingerprints as compound descriptors. In step two, a panel of partial least squares (PLS) models are built using the profile of pIC50 predictions from those random forest regression models as compound descriptors (hence the name). Previously described for a panel of 728 biochemical and cellular kinase assays, we have now built an enormous pQSAR from 11 805 diverse Novartis (NVS) IC50 and EC50 assays. This large number of assays, and hence of compound descriptors for PLS, dictated reducing the profile by only including random forest regression models whose predictions correlate with the assay being modeled. The random forest regression and pQSAR models were evaluated with our "realistically novel" held-out test set, whose median average similarity to the nearest training set member across the 11 805 assays was only 0.34, comparable to the novelty of compounds actually selected from virtual screens. For the 11 805 single-assay random forest regression models, the median correlation of prediction with the experiment was only r²_ext = 0.05, virtually random, and only 8% of the models achieved our standard success threshold of r²_ext = 0.30. For pQSAR, the median correlation was r²_ext = 0.53, comparable to four-concentration experimental IC50s, and 72% of the models met our r²_ext > 0.30 standard, totaling 8558 successful models. The successful models included assays from all of the 51 annotated target subclasses, as well as 4196 phenotypic assays, indicating that pQSAR can be applied to virtually any disease area.
Every month, all models are updated to include new measurements, and predictions are made for 5.5 million NVS compounds, totaling 50 billion predictions. Common uses have included virtual screening, selectivity design, toxicity and promiscuity prediction, mechanism-of-action prediction, and others. Several such actual applications are described.
Affiliation(s)
- Eric J Martin
- Novartis Institute for Biomedical Research, 5300 Chiron Way, Emeryville, California 94608-2916, United States
| | - Valery R Polyakov
- Novartis Institute for Biomedical Research, 5300 Chiron Way, Emeryville, California 94608-2916, United States
| | - Xiang-Wei Zhu
- Novartis Institute for Biomedical Research, 5300 Chiron Way, Emeryville, California 94608-2916, United States
| | - Li Tian
- Novartis Institute for Biomedical Research, 5300 Chiron Way, Emeryville, California 94608-2916, United States; China Novartis Institutes for BioMedical Research Company, Limited, 2F, Building 4, Novartis Campus, No. 4218 Jinke Road, Zhangjiang, Pudong, Shanghai 201203, China
| | - Prasenjit Mukherjee
- Novartis Institute for Biomedical Research, 5300 Chiron Way, Emeryville, California 94608-2916, United States
| | - Xin Liu
- Novartis Institute for Biomedical Research, 5300 Chiron Way, Emeryville, California 94608-2916, United States; China Novartis Institutes for BioMedical Research Company, Limited, 2F, Building 4, Novartis Campus, No. 4218 Jinke Road, Zhangjiang, Pudong, Shanghai 201203, China
| |
|
39
|
Yang X, Wang Y, Byrne R, Schneider G, Yang S. Concepts of Artificial Intelligence for Computer-Assisted Drug Discovery. Chem Rev 2019; 119:10520-10594. [PMID: 31294972 DOI: 10.1021/acs.chemrev.8b00728] [Citation(s) in RCA: 383] [Impact Index Per Article: 63.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]
Abstract
Artificial intelligence (AI), and, in particular, deep learning as a subcategory of AI, provides opportunities for the discovery and development of innovative drugs. Various machine learning approaches have recently (re)emerged, some of which may be considered instances of domain-specific AI which have been successfully employed for drug discovery and design. This review provides a comprehensive portrayal of these machine learning techniques and of their applications in medicinal chemistry. After introducing the basic principles, alongside some application notes, of the various machine learning algorithms, the current state-of-the art of AI-assisted pharmaceutical discovery is discussed, including applications in structure- and ligand-based virtual screening, de novo drug design, physicochemical and pharmacokinetic property prediction, drug repurposing, and related aspects. Finally, several challenges and limitations of the current methods are summarized, with a view to potential future directions for AI-assisted drug discovery and design.
Affiliation(s)
- Xin Yang
- State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Yifei Wang
- State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Ryan Byrne
- ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 4, CH-8093 Zurich, Switzerland
| | - Gisbert Schneider
- ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 4, CH-8093 Zurich, Switzerland
| | - Shengyong Yang
- State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| |
|
40
|
Bhhatarai B, Walters WP, Hop CECA, Lanza G, Ekins S. Opportunities and challenges using artificial intelligence in ADME/Tox. NATURE MATERIALS 2019; 18:418-422. [PMID: 31000801 PMCID: PMC6594826 DOI: 10.1038/s41563-019-0332-5] [Citation(s) in RCA: 58] [Impact Index Per Article: 9.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/17/2023]
Abstract
A recent conference organized a panel of scientists representing small and big pharma companies, who work at the interface of machine learning (ML) and absorption, distribution, metabolism, excretion, and toxicology (ADME/Tox). With the recent rebirth of AI related to pharma, it is timely to present this collaborative commentary to capture the diverging opinions on the past, present and future role of AI for ADME/Tox and how it can be applied in newer areas such as nanomaterials.
Affiliation(s)
- Barun Bhhatarai
- Novartis Institutes for Biomedical Research, Cambridge, MA, USA
| | | | | | | | - Sean Ekins
- Collaborations Pharmaceuticals Inc., Raleigh, NC, USA.
| |
|
41
|
Whitehead TM, Irwin BWJ, Hunt P, Segall MD, Conduit GJ. Imputation of Assay Bioactivity Data Using Deep Learning. J Chem Inf Model 2019; 59:1197-1204. [PMID: 30753070 DOI: 10.1021/acs.jcim.8b00768] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
We describe a novel deep learning neural network method and its application to impute assay pIC50 values. Unlike conventional machine learning approaches, this method is trained on sparse bioactivity data as input, typical of that found in public and commercial databases, enabling it to learn directly from correlations between activities measured in different assays. In two case studies on public domain data sets we show that the neural network method outperforms traditional quantitative structure-activity relationship (QSAR) models and other leading approaches. Furthermore, by focusing on only the most confident predictions the accuracy is increased to R² > 0.9 using our method, as compared to R² = 0.44 when reporting all predictions.
Affiliation(s)
- T M Whitehead
- Intellegens, Eagle Labs, Chesterton Road, Cambridge CB4 3AZ, United Kingdom
| | - B W J Irwin
- Optibrium, F5-6 Blenheim House, Cambridge Innovation Park, Denny End Road, Cambridge CB25 9PB, United Kingdom
| | - P Hunt
- Optibrium, F5-6 Blenheim House, Cambridge Innovation Park, Denny End Road, Cambridge CB25 9PB, United Kingdom
| | - M D Segall
- Optibrium, F5-6 Blenheim House, Cambridge Innovation Park, Denny End Road, Cambridge CB25 9PB, United Kingdom
| | - G J Conduit
- Intellegens, Eagle Labs, Chesterton Road, Cambridge CB4 3AZ, United Kingdom; Cavendish Laboratory, University of Cambridge, J.J. Thomson Avenue, Cambridge CB3 0HE, United Kingdom
| |
|
42
|
Feinberg EN, Sur D, Wu Z, Husic BE, Mai H, Li Y, Sun S, Yang J, Ramsundar B, Pande VS. PotentialNet for Molecular Property Prediction. ACS CENTRAL SCIENCE 2018; 4:1520-1530. [PMID: 30555904 PMCID: PMC6276035 DOI: 10.1021/acscentsci.8b00507] [Citation(s) in RCA: 238] [Impact Index Per Article: 34.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/27/2018] [Indexed: 05/11/2023]
Abstract
The arc of drug discovery entails a multiparameter optimization problem spanning vast length scales. The key parameters range from solubility (angstroms) to protein-ligand binding (nanometers) to in vivo toxicity (meters). Through feature learning, instead of feature engineering, deep neural networks promise to outperform both traditional physics-based and knowledge-based machine learning models for predicting molecular properties pertinent to drug discovery. To this end, we present the PotentialNet family of graph convolutions. These models are specifically designed for and achieve state-of-the-art performance for protein-ligand binding affinity. We further validate these deep neural networks by setting new standards of performance in several ligand-based tasks. In parallel, we introduce a new metric, the Regression Enrichment Factor EF_χ^(R), to measure the early enrichment of computational models for chemical data. Finally, we introduce a cross-validation strategy based on structural homology clustering that can more accurately measure model generalizability, which crucially distinguishes the aims of machine learning for drug discovery from standard machine learning tasks.
Affiliation(s)
- Evan N. Feinberg
- Program in Biophysics, Stanford University, Stanford, California 94305, United States
| | - Debnil Sur
- Department of Computer Science, Stanford University, Stanford, California 94305, United States
| | - Zhenqin Wu
- Department of Chemistry, Stanford University, Stanford, California 94305, United States
| | - Brooke E. Husic
- Department of Chemistry, Stanford University, Stanford, California 94305, United States
| | - Huanghao Mai
- Department of Computer Science, Stanford University, Stanford, California 94305, United States
| | - Yang Li
- School of Mathematical Sciences and College of Life Sciences, Nankai University, Tianjin 300071, China
| | - Saisai Sun
- School of Mathematical Sciences and College of Life Sciences, Nankai University, Tianjin 300071, China
| | - Jianyi Yang
- School of Mathematical Sciences and College of Life Sciences, Nankai University, Tianjin 300071, China
| | - Bharath Ramsundar
- Department of Computer Science, Stanford University, Stanford, California 94305, United States
| | - Vijay S. Pande
- Department of Bioengineering, Stanford University, Stanford, California 94305, United States
| |
|
43
|
Cortés-Ciriano I, Firth NC, Bender A, Watson O. Discovering Highly Potent Molecules from an Initial Set of Inactives Using Iterative Screening. J Chem Inf Model 2018; 58:2000-2014. [PMID: 30130102 DOI: 10.1021/acs.jcim.8b00376] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]
Abstract
The versatility of similarity searching and quantitative structure-activity relationships to model the activity of compound sets within given bioactivity ranges (i.e., interpolation) is well established. However, their relative performance in the common scenario in early stage drug discovery where lots of inactive data but no active data points are available (i.e., extrapolation from the low-activity to the high-activity range) has not been thoroughly examined yet. To this aim, we have designed an iterative virtual screening strategy which was evaluated on 25 diverse bioactivity data sets from ChEMBL. We benchmark the efficiency of random forest (RF), multiple linear regression, ridge regression, similarity searching, and random selection of compounds to identify a highly active molecule in the test set among a large number of low-potency compounds. We use the number of iterations required to find this active molecule to evaluate the performance of each experimental setup. We show that linear and ridge regression often outperform RF and similarity searching, reducing the number of iterations to find an active compound by a factor of 2 or more. Even simple regression methods seem better able to extrapolate to high-bioactivity ranges than RF, which only provides output values in the range covered by the training set. In addition, examination of the scaffold diversity in the data sets used shows that in some cases similarity searching and RF require two times as many iterations as random selection depending on the chemical space covered in the initial training data. Lastly, we show using bioactivity data for COX-1 and COX-2 that our framework can be extended to multitarget drug discovery, where compounds are selected by concomitantly considering their activity against multiple targets. 
Overall, this study provides an approach for iterative screening where only inactive data are present in early stages of drug discovery in order to discover highly potent compounds and the best experimental set up in which to do so.
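The extrapolation point made in this abstract, that averaging-based models cannot predict outside the training range while a linear fit can, can be sketched as follows. A k-nearest-neighbour average is swapped in here as a simple stand-in for a random forest (both return averages of training targets); it is not the method benchmarked in the study, and the one-dimensional descriptor, data values and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one descriptor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def neighbor_average(xs, ys, x_new, k=2):
    """Mean target of the k training points closest to x_new."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k


# Hypothetical training set of low-potency compounds: descriptor, pIC50.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [4.0, 4.5, 5.0, 5.5]

# A candidate whose descriptor lies well outside the training range.
x_candidate = 8.0
slope, intercept = fit_line(train_x, train_y)
linear_pred = slope * x_candidate + intercept
average_pred = neighbor_average(train_x, train_y, x_candidate)
```

In this toy setting the linear prediction exceeds every training activity, whereas the averaged prediction is bounded by the largest training value, so only the linear model can rank the candidate as more potent than anything already measured.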
Affiliation(s)
- Isidro Cortés-Ciriano
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
| | - Nicholas C Firth
- Centre for Medical Image Computing, Department of Computer Science, UCL, London WC1E 6BT, United Kingdom; Evariste Technologies Ltd, Goring on Thames RG8 9AL, United Kingdom
| | - Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
| | - Oliver Watson
- Evariste Technologies Ltd, Goring on Thames RG8 9AL, United Kingdom
| |
|
44
|
Kallen J, Bergsdorf C, Arnaud B, Bernhard M, Brichet M, Cobos-Correa A, Elhajouji A, Freuler F, Galimberti I, Guibourdenche C, Haenni S, Holzinger S, Hunziker J, Izaac A, Kaufmann M, Leder L, Martus HJ, von Matt P, Polyakov V, Roethlisberger P, Roma G, Stiefl N, Uteng M, Lerchner A. X-ray Structures and Feasibility Assessment of CLK2 Inhibitors for Phelan-McDermid Syndrome. ChemMedChem 2018; 13:1997-2007. [PMID: 29985556 DOI: 10.1002/cmdc.201800344] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2018] [Indexed: 01/15/2023]
Abstract
CLK2 inhibition has been proposed as a potential mechanism to improve autism and neuronal functions in Phelan-McDermid syndrome (PMDS). Herein, the discovery of a very potent indazole CLK inhibitor series and the CLK2 X-ray structure of the most potent analogue are reported. This new indazole series was identified through a biochemical CLK2 Caliper assay screen with 30k compounds selected by an in silico approach. Novel high-resolution X-ray structures of all CLKs, including the first CLK4 X-ray structure, bound to known CLK2 inhibitor tool compounds (e.g., TG003, CX-4945), are also shown and yield insight into inhibitor selectivity in the CLK family. The efficacy of the new CLK2 inhibitors from the indazole series was demonstrated in the mouse brain slice assay, and potential safety concerns were investigated. Genotoxicity findings in the human lymphocyte micronucleus test (MNT) assay are shown by using two structurally different CLK inhibitors to reveal a major concern for pan-CLK inhibition in PMDS.
Affiliation(s)
- Joerg Kallen
- Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Christian Bergsdorf
- Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Bertrand Arnaud
- Global Discovery Chemistry, Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Mario Bernhard
- Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Murielle Brichet
- Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Amanda Cobos-Correa
- Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Azeddine Elhajouji
- Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Felix Freuler
- Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Ivan Galimberti
- Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Christel Guibourdenche
- Global Discovery Chemistry, Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Simon Haenni
- Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Sandra Holzinger
- Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Juerg Hunziker
- Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Aude Izaac
- Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Markus Kaufmann
- Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Lukas Leder
- Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Hans-Joerg Martus
- Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Peter von Matt
- Global Discovery Chemistry, Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Valery Polyakov
- Global Discovery Chemistry, Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Patrik Roethlisberger
- Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Guglielmo Roma
- Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Nikolaus Stiefl
- Global Discovery Chemistry, Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Marianne Uteng
- Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| | - Andreas Lerchner
- Global Discovery Chemistry, Novartis Institutes for BioMedical Research, Novartis Campus, 4002, Basel, Switzerland
| |
|
45
|
Valeur E, Jimonet P. New Modalities, Technologies, and Partnerships in Probe and Lead Generation: Enabling a Mode-of-Action Centric Paradigm. J Med Chem 2018; 61:9004-9029. [DOI: 10.1021/acs.jmedchem.8b00378] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/26/2022]
Affiliation(s)
- Eric Valeur
- Medicinal Chemistry, Cardiovascular, Renal and Metabolism, IMED Biotech Unit, AstraZeneca, Pepparedsleden 1, Mölndal 431 83, Sweden
| | - Patrick Jimonet
- External Innovation Drug Discovery, Global Business Development & Licensing, Sanofi, 13 quai Jules Guesde, 94400 Vitry-sur-Seine, France
| |
|
46
|
Wallach I, Heifets A. Most Ligand-Based Classification Benchmarks Reward Memorization Rather than Generalization. J Chem Inf Model 2018; 58:916-932. [PMID: 29698607 DOI: 10.1021/acs.jcim.7b00403] [Citation(s) in RCA: 106] [Impact Index Per Article: 15.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Undetected overfitting can occur when there are significant redundancies between training and validation data. We describe AVE, a new measure of training-validation redundancy for ligand-based classification problems, that accounts for the similarity among inactive molecules as well as active ones. We investigated seven widely used benchmarks for virtual screening and classification, and we show that the amount of AVE bias strongly correlates with the performance of ligand-based predictive methods irrespective of the predicted property, chemical fingerprint, similarity measure, or previously applied unbiasing techniques. Therefore, it may be the case that the previously reported performance of most ligand-based methods can be explained by overfitting to benchmarks rather than good prospective accuracy.
Affiliation(s)
- Izhar Wallach
- Atomwise Inc., 221 Main Street, Suite 1350, San Francisco, California 94105, United States
| | - Abraham Heifets
- Atomwise Inc., 221 Main Street, Suite 1350, San Francisco, California 94105, United States
| |
|
47
|
Kooistra AJ, Vass M, McGuire R, Leurs R, de Esch IJP, Vriend G, Verhoeven S, de Graaf C. 3D-e-Chem: Structural Cheminformatics Workflows for Computer-Aided Drug Discovery. ChemMedChem 2018; 13:614-626. [PMID: 29337438 PMCID: PMC5900740 DOI: 10.1002/cmdc.201700754] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2017] [Revised: 01/11/2018] [Indexed: 01/06/2023]
Abstract
eScience technologies are needed to process the information available in many heterogeneous types of protein-ligand interaction data and to capture these data into models that enable the design of efficacious and safe medicines. Here we present scientific KNIME tools and workflows that enable the integration of chemical, pharmacological, and structural information for: i) structure-based bioactivity data mapping, ii) structure-based identification of scaffold replacement strategies for ligand design, iii) ligand-based target prediction, iv) protein sequence-based binding site identification and ligand repurposing, and v) structure-based pharmacophore comparison for ligand repurposing across protein families. The modular setup of the workflows and the use of well-established standards allows the re-use of these protocols and facilitates the design of customized computer-aided drug discovery workflows.
Affiliation(s)
- Albert J. Kooistra
- Centre for Molecular and Biomolecular Informatics (CMBI), Radboud University Medical Center (RadboudUMC), Nijmegen, The Netherlands
- Division of Medicinal Chemistry, Faculty of Science, Amsterdam Institute for Molecules, Medicines and Systems (AIMMS), Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
| | - Márton Vass
- Division of Medicinal Chemistry, Faculty of Science, Amsterdam Institute for Molecules, Medicines and Systems (AIMMS), Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
| | - Ross McGuire
- Centre for Molecular and Biomolecular Informatics (CMBI), Radboud University Medical Center (RadboudUMC), Nijmegen, The Netherlands
- BioAxis Research, Pivot Park, Oss, The Netherlands
| | - Rob Leurs
- Division of Medicinal Chemistry, Faculty of Science, Amsterdam Institute for Molecules, Medicines and Systems (AIMMS), Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
| | - Iwan J. P. de Esch
- Division of Medicinal Chemistry, Faculty of Science, Amsterdam Institute for Molecules, Medicines and Systems (AIMMS), Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
| | - Gert Vriend
- Centre for Molecular and Biomolecular Informatics (CMBI), Radboud University Medical Center (RadboudUMC), Nijmegen, The Netherlands
| | | | - Chris de Graaf
- Division of Medicinal Chemistry, Faculty of Science, Amsterdam Institute for Molecules, Medicines and Systems (AIMMS), Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
| |
|