1
Esaki T, Yonezawa T, Ikeda K. A new workflow for the effective curation of membrane permeability data from open ADME information. J Cheminform 2024; 16:30. [PMID: 38481269 PMCID: PMC10938840 DOI: 10.1186/s13321-024-00826-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Accepted: 03/10/2024] [Indexed: 03/17/2024] Open
Abstract
Membrane permeability is an in vitro parameter that represents the apparent permeability (Papp) of a compound, and is a key absorption, distribution, metabolism, and excretion parameter in drug development. Although the Caco-2 cell lines are the most used cell lines to measure Papp, other cell lines, such as the Madin-Darby Canine Kidney (MDCK), LLC-Pig Kidney 1 (LLC-PK1), and Ralph Russ Canine Kidney (RRCK) cell lines, can also be used to estimate Papp. Therefore, constructing in silico models for Papp estimation using the MDCK, LLC-PK1, and RRCK cell lines requires collecting extensive amounts of in vitro Papp data. An open database offers extensive measurements of various compounds covering a vast chemical space; however, concerns were reported on the use of data published in open databases without the appropriate accuracy and quality checks. Ensuring the quality of datasets for training in silico models is critical because artificial intelligence (AI, including deep learning) was used to develop models to predict various pharmacokinetic properties, and data quality affects the performance of these models. Hence, careful curation of the collected data is imperative. Herein, we developed a new workflow that supports automatic curation of Papp data measured in the MDCK, LLC-PK1, and RRCK cell lines collected from ChEMBL using KNIME. The workflow consisted of four main phases. Data were extracted from ChEMBL and filtered to identify the target protocols. A total of 1661 high-quality entries were retained after checking 436 articles. The workflow is freely available, can be updated, and has high reusability. Our study provides a novel approach for data quality analysis and accelerates the development of helpful in silico models for effective drug discovery. Scientific Contribution: The cost of building highly accurate predictive models can be significantly reduced by automating the collection of reliable measurement data. 
Our tool reduces the time and effort required for data collection and will enable researchers to focus on constructing high-performance in silico models for other types of analysis. To the best of our knowledge, no such tool is available in the literature.
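As a rough illustration of the filtering phase described in this abstract, the sketch below keeps only permeability records whose assay description mentions one of the target cell lines. It is not the authors' KNIME workflow: the record fields (`assay_description`, `standard_type`, `standard_value`) and the sample rows are assumptions made purely for illustration.

```python
import re

# Target cell lines named in the abstract.
CELL_LINES = ("MDCK", "LLC-PK1", "RRCK")
_PATTERN = re.compile("|".join(re.escape(c) for c in CELL_LINES), re.IGNORECASE)


def filter_papp_records(records):
    """Keep Papp records that mention a target cell line and carry a positive value."""
    kept = []
    for rec in records:
        if rec.get("standard_type") != "Papp":
            continue  # not a permeability measurement
        value = rec.get("standard_value")
        if value is None or value <= 0:
            continue  # missing or non-physical value
        match = _PATTERN.search(rec.get("assay_description", ""))
        if match is None:
            continue  # measured in some other cell line
        kept.append({**rec, "cell_line": match.group(0).upper()})
    return kept


# Hypothetical rows, not real ChEMBL data.
SAMPLE = [
    {"assay_description": "Apparent permeability in MDCK cells", "standard_type": "Papp", "standard_value": 12.5},
    {"assay_description": "Permeability in Caco-2 cells", "standard_type": "Papp", "standard_value": 3.0},
    {"assay_description": "Permeability in llc-pk1 cells", "standard_type": "Papp", "standard_value": None},
    {"assay_description": "Permeability across RRCK monolayer", "standard_type": "Papp", "standard_value": 0.8},
    {"assay_description": "Inhibition in MDCK cells", "standard_type": "IC50", "standard_value": 50.0},
]

if __name__ == "__main__":
    for row in filter_papp_records(SAMPLE):
        print(row["cell_line"], row["standard_value"])
```

A real curation step would also need unit harmonisation and a check of each source article's protocol, which this sketch does not attempt.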
Affiliation(s)
- Tsuyoshi Esaki
  - Faculty of Data Science, Shiga University, 1-1-1 Banba, Hikone, Shiga, 522-8522, Japan
  - Faculty of Culture and Information Science, Doshisha University, 1-3 Tatara Miyakodani, Kyotanabe, Kyoto, 610-0394, Japan
- Tomoki Yonezawa
  - Faculty of Pharmacy, Keio University, 1-5-30 Shibakoen, Minato-ku, Tokyo, 105-8512, Japan
- Kazuyoshi Ikeda
  - Faculty of Pharmacy, Keio University, 1-5-30 Shibakoen, Minato-ku, Tokyo, 105-8512, Japan
  - HPC- and AI-Driven Drug Development Platform Division, RIKEN Center for Computational Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan
2
Gonzalez-Ponce K, Horta Andrade C, Hunter F, Kirchmair J, Martinez-Mayorga K, Medina-Franco JL, Rarey M, Tropsha A, Varnek A, Zdrazil B. School of cheminformatics in Latin America. J Cheminform 2023; 15:82. [PMID: 37726809 PMCID: PMC10507835 DOI: 10.1186/s13321-023-00758-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2023] [Accepted: 09/10/2023] [Indexed: 09/21/2023] Open
Abstract
We report the major highlights of the School of Cheminformatics in Latin America, Mexico City, November 24-25, 2022. Six lectures, one workshop, and one roundtable with four editors were presented during an online public event with speakers from academia, big pharma, and public research institutions. One thousand one hundred eighty-one students and academics from seventy-nine countries registered for the meeting. As part of the meeting, advances in enumeration and visualization of chemical space, applications in natural product-based drug discovery, drug discovery for neglected diseases, toxicity prediction, and general guidelines for data analysis were discussed. Experts from ChEMBL presented a workshop on how to use the resources of this major compounds database used in cheminformatics. The school also included a round table with editors of cheminformatics journals. The full program of the meeting and the recordings of the sessions are publicly available at https://www.youtube.com/@SchoolChemInfLA/featured .
Affiliation(s)
- Karla Gonzalez-Ponce
  - Institute of Chemistry, Campus Merida, National Autonomous University of Mexico, Merida-Tetiz Highway, Km. 4.5, Ucu, Yucatan, Mexico
- Carolina Horta Andrade
  - LabMol - Laboratory for Molecular Modeling and Drug Design, Faculdade de Farmacia, Universidade Federal de Goias, Goiania, GO, Brazil
- Fiona Hunter
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, Cambridgeshire, UK
- Johannes Kirchmair
  - Division of Pharmaceutical Chemistry, Department of Pharmaceutical Sciences, University of Vienna, Josef-Holaubek-Platz 2, 2D 303, 1090, Vienna, Austria
- Karina Martinez-Mayorga
  - Institute of Chemistry, Campus Merida, National Autonomous University of Mexico, Merida-Tetiz Highway, Km. 4.5, Ucu, Yucatan, Mexico
  - Institute for Applied Mathematics and Systems, Merida Research Unit, National Autonomous University of Mexico, Sierra Papacal, Merida, Yucatan, Mexico
- José L Medina-Franco
  - DIFACQUIM Research Group, Department of Pharmacy, School of Chemistry, National Autonomous University of Mexico, Avenida Universidad 3000, 04510, Mexico City, Mexico
- Matthias Rarey
  - ZBH - Center for Bioinformatics, Universität Hamburg, Bundesstraße 43, 20146, Hamburg, Germany
- Alexander Tropsha
  - Molecular Modeling Laboratory, UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
- Alexandre Varnek
  - Laboratoire d'Infochimie, UMR 7177 CNRS, Université de Strasbourg, 4, Rue B. Pascal, 67000, Strasbourg, France
- Barbara Zdrazil
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, Cambridgeshire, UK
3
Padalino G, Coghlan A, Pagliuca G, Forde-Thomas JE, Berriman M, Hoffmann KF. Using ChEMBL to Complement Schistosome Drug Discovery. Pharmaceutics 2023; 15:1359. [PMID: 37242601 DOI: 10.3390/pharmaceutics15051359] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 03/22/2023] [Revised: 04/25/2023] [Accepted: 04/26/2023] [Indexed: 05/28/2023] Open
Abstract
Schistosomiasis is one of the most important neglected tropical diseases. Until an effective vaccine is registered for use, the cornerstone of schistosomiasis control remains chemotherapy with praziquantel. The sustainability of this strategy is at substantial risk due to the possibility of praziquantel insensitive/resistant schistosomes developing. Considerable time and effort could be saved in the schistosome drug discovery pipeline if available functional genomics, bioinformatics, cheminformatics and phenotypic resources are systematically leveraged. Our approach, described here, outlines how schistosome-specific resources/methodologies, coupled to the open-access drug discovery database ChEMBL, can be cooperatively used to accelerate early-stage, schistosome drug discovery efforts. Our process identified seven compounds (fimepinostat, trichostatin A, NVP-BEP800, luminespib, epoxomicin, CGP60474 and staurosporine) with ex vivo anti-schistosomula potencies in the sub-micromolar range. Three of those compounds (epoxomicin, CGP60474 and staurosporine) also demonstrated potent and fast-acting ex vivo effects on adult schistosomes and completely inhibited egg production. ChEMBL toxicity data were also leveraged to provide further support for progressing CGP60474 (as well as luminespib and TAE684) as a novel anti-schistosomal compound. As very few compounds are currently at the advanced stages of the anti-schistosomal pipeline, our approaches highlight a strategy by which new chemical matter can be identified and quickly progressed through preclinical development.
Affiliation(s)
- Gilda Padalino
  - School of Pharmacy and Pharmaceutical Sciences, Cardiff University, Redwood Building, King Edward VII Avenue, Cardiff CF10 3NB, UK
- Avril Coghlan
  - Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge CB10 1SA, UK
- Matthew Berriman
  - Wellcome Centre for Integrative Parasitology, School of Infection and Immunity, University of Glasgow, 120 University Place, Glasgow G12 8TA, UK
- Karl F Hoffmann
  - The Department of Life Sciences (DLS), Aberystwyth University, Aberystwyth SY23 3DA, UK
4
Aldahish A, Balaji P, Vasudevan R, Kandasamy G, James JP, Prabahar K. Elucidating the Potential Inhibitor against Type 2 Diabetes Mellitus Associated Gene of GLUT4. J Pers Med 2023; 13:660. [PMID: 37109046 PMCID: PMC10146764 DOI: 10.3390/jpm13040660] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2023] [Revised: 04/02/2023] [Accepted: 04/10/2023] [Indexed: 04/29/2023] Open
Abstract
Diabetes is a chronic hyperglycemic disorder that leads to a group of metabolic diseases. This condition of chronic hyperglycemia is caused by abnormal insulin levels. The impact of hyperglycemia on the human vascular tree is the leading cause of disease and death in type 1 and type 2 diabetes. People with type 2 diabetes mellitus (T2DM) have abnormal secretion as well as the action of insulin. Type 2 (non-insulin-dependent) diabetes is caused by a combination of genetic factors associated with decreased insulin production, insulin resistance, and environmental conditions. These conditions include overeating, lack of exercise, obesity, and aging. Glucose transport limits the rate of dietary glucose used by fat and muscle. The glucose transporter GLUT4 is kept intracellular and sorted dynamically, and GLUT4 translocation or insulin-regulated vesicular traffic distributes it to the plasma membrane. Different chemical compounds have antidiabetic properties. The complexity, metabolism, digestion, and interaction of these chemical compounds make it difficult to understand and apply them to reduce chronic inflammation and thus prevent chronic disease. In this study, we have applied a virtual screening approach to screen the most suitable and drug-able chemical compounds to be used as potential drug targets against T2DM. We have found that out of 5000 chemical compounds that we have analyzed, only two are known to be more effective as per our experiments based upon molecular docking studies and virtual screening through Lipinski's rule and ADMET properties.
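The Lipinski filter mentioned in this abstract can be sketched in a few lines. The thresholds below are the standard rule-of-five limits; the `Descriptors` container, the tolerance of one violation, and the example values are assumptions for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container)."""
    mol_weight: float   # Da
    logp: float         # octanol/water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five limits are exceeded."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """A compound passes if it breaks no more than `max_violations` limits."""
    return lipinski_violations(d) <= max_violations


if __name__ == "__main__":
    small = Descriptors(mol_weight=180.2, logp=1.2, h_donors=1, h_acceptors=4)
    large = Descriptors(mol_weight=800.0, logp=6.5, h_donors=6, h_acceptors=12)
    print(passes_rule_of_five(small), passes_rule_of_five(large))
```

In a screening pipeline such a filter would normally run before docking, so that only drug-like candidates reach the more expensive step.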
Affiliation(s)
- Afaf Aldahish
  - Department of Pharmacology, College of Pharmacy, King Khalid University, Abha 61421, Saudi Arabia
- Rajalakshimi Vasudevan
  - Department of Pharmacology, College of Pharmacy, King Khalid University, Abha 61421, Saudi Arabia
- Geetha Kandasamy
  - Department of Clinical Pharmacy, College of Pharmacy, King Khalid University, Abha 62529, Saudi Arabia
- Jainey P James
  - Department of Pharmaceutical Chemistry, NGSM Institute of Pharmaceutical Sciences (NGSMIPS), Nitte (Deemed to be University), Deralakatte, Mangaluru 575018, Karnataka, India
- Kousalya Prabahar
  - Department of Pharmacy Practice, Faculty of Pharmacy, University of Tabuk, Tabuk 71491, Saudi Arabia
5
Oleneva P, Zabolotna Y, Horvath D, Marcou G, Bonachera F, Varnek A. French dispatch: GTM-based analysis of the Chimiothèque Nationale Chemical Space. Mol Inform 2023; 42:e2200208. [PMID: 36604304 DOI: 10.1002/minf.202200208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2022] [Revised: 12/29/2022] [Accepted: 01/05/2023] [Indexed: 01/07/2023]
Abstract
To analyze the Chimiothèque Nationale (CN), the French National Compound Library, in the context of screening and biologically relevant compounds, the library was compared with the ZINC in-stock collection and ChEMBL. The comparison covers chemical space coverage, physicochemical properties and Bemis-Murcko (BM) scaffold populations. More than 5 K CN-unique scaffolds (relative to the ZINC and ChEMBL collections) were identified. Generative Topographic Maps (GTMs) accommodating those libraries were generated and used to compare the compound populations. Hierarchical GTM («zooming») was applied to generate an ensemble of maps at various resolution levels, from global overview to precise mapping of individual structures. The respective maps were added to the ChemSpace Atlas website. The analysis of synthetic accessibility in the context of combinatorial chemistry showed that only 29.7% of CN compounds can be fully synthesized using commercially available building blocks.
Affiliation(s)
- Polina Oleneva
  - Laboratoire de Chémoinformatique, UMR7140 CNRS/UniStra, University of Strasbourg, 4 rue Blaise Pascal, 67081, Strasbourg, France
- Yuliana Zabolotna
  - Laboratoire de Chémoinformatique, UMR7140 CNRS/UniStra, University of Strasbourg, 4 rue Blaise Pascal, 67081, Strasbourg, France
- Dragos Horvath
  - Laboratoire de Chémoinformatique, UMR7140 CNRS/UniStra, University of Strasbourg, 4 rue Blaise Pascal, 67081, Strasbourg, France
- Gilles Marcou
  - Laboratoire de Chémoinformatique, UMR7140 CNRS/UniStra, University of Strasbourg, 4 rue Blaise Pascal, 67081, Strasbourg, France
- Fanny Bonachera
  - Laboratoire de Chémoinformatique, UMR7140 CNRS/UniStra, University of Strasbourg, 4 rue Blaise Pascal, 67081, Strasbourg, France
- Alexandre Varnek
  - Laboratoire de Chémoinformatique, UMR7140 CNRS/UniStra, University of Strasbourg, 4 rue Blaise Pascal, 67081, Strasbourg, France
6
Pietruś W, Kurczab R, Warszycki D, Bojarski AJ, Bajorath J. Isomeric Activity Cliffs: A Case Study for Fluorine Substitution of Aminergic G Protein-Coupled Receptor Ligands. Molecules 2023; 28:490. [PMID: 36677547 DOI: 10.3390/molecules28020490] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/18/2022] [Revised: 12/30/2022] [Accepted: 01/01/2023] [Indexed: 01/06/2023] Open
Abstract
Currently, G protein-coupled receptors (GPCRs) constitute a significant group of membrane-bound receptors representing more than 30% of therapeutic targets. Fluorine is commonly used in designing highly active biological compounds, as evidenced by the steadily increasing number of fluorine-containing drugs approved by the Food and Drug Administration (FDA). Herein, we identified and analyzed 898 target-based F-containing isomeric analog sets for SAR analysis (FiSAR sets) in the ChEMBL database, active against 33 different aminergic GPCRs and comprising a total of 2163 fluorinated (1201 unique) compounds. We found that 30 FiSAR sets contain activity cliffs (ACs), defined as pairs of structurally similar compounds showing significant differences in affinity (≥50-fold change), where a change of fluorine position may lead to up to a 1300-fold change in potency. The analysis of matched molecular pair (MMP) networks indicated that the fluorination of aromatic rings showed no clear trend toward a positive or negative effect on affinity. Additionally, we propose an in silico workflow (including induced-fit docking, molecular dynamics, quantum polarized ligand docking, and binding free energy calculations based on the Generalized-Born Surface-Area (GBSA) model) to score the fluorine positions in the molecule.
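The activity-cliff criterion in this abstract (a fold change in affinity of at least 50 between structurally similar compounds) can be sketched as follows. The sketch assumes that every compound passed in already belongs to one isomeric analog set for one target, so no similarity check is performed; the compound labels and Ki values are hypothetical.

```python
from itertools import combinations


def fold_change(ki_a: float, ki_b: float) -> float:
    """Ratio of the weaker to the stronger affinity value (always >= 1)."""
    return max(ki_a, ki_b) / min(ki_a, ki_b)


def find_activity_cliffs(affinities_nm: dict, min_fold: float = 50.0):
    """Return (compound_a, compound_b, fold) for every pair at or above min_fold.

    `affinities_nm` maps a compound label to its Ki in nM; all compounds are
    assumed to be positional isomers measured against the same target.
    """
    cliffs = []
    for (a, ki_a), (b, ki_b) in combinations(sorted(affinities_nm.items()), 2):
        fold = fold_change(ki_a, ki_b)
        if fold >= min_fold:
            cliffs.append((a, b, fold))
    return cliffs


if __name__ == "__main__":
    # Hypothetical Ki values (nM) for three fluorine positional isomers.
    isomers = {"ortho-F": 2.0, "meta-F": 150.0, "para-F": 40.0}
    print(find_activity_cliffs(isomers))
```

Only the ortho/meta pair differs by more than 50-fold in this toy set, so it is the only pair reported.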
7
Isigkeit L, Merk D. Compilation of Custom Compound/Bioactivity Datasets from Public Repositories. Methods Mol Biol 2023; 2706:25-50. [PMID: 37558939 DOI: 10.1007/978-1-0716-3397-7_3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/11/2023]
Abstract
Public repositories containing compound-bioactivity data for millions of small molecules offer a valuable resource for chemogenomic compound candidate search. Nonetheless, owing to nonuniform data mining, these databases are often incomplete, thus advocating the combined use of data from several repositories to increase target coverage and data accuracy. Here, we present a workflow to generate custom datasets from public databases for mining chemogenomic compound candidates. The compiled set provides flags for differences in structural and bioactivity data and enables rapid extraction of potent and selective bioactive compounds.
Affiliation(s)
- Laura Isigkeit
  - Institute of Pharmaceutical Chemistry, Goethe University Frankfurt, Frankfurt, Germany
- Daniel Merk
  - Institute of Pharmaceutical Chemistry, Goethe University Frankfurt, Frankfurt, Germany
  - Department of Pharmacy, Ludwig-Maximilians-Universität München, Munich, Germany
8
Diéguez-Santana K, Casañola-Martin GM, Torres R, Rasulev B, Green JR, González-Díaz H. Machine Learning Study of Metabolic Networks vs ChEMBL Data of Antibacterial Compounds. Mol Pharm 2022; 19:2151-2163. [PMID: 35671399 PMCID: PMC9986951 DOI: 10.1021/acs.molpharmaceut.2c00029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Abstract
Antibacterial drugs (AD) change the metabolic status of bacteria, contributing to bacterial death. However, antibiotic resistance and the emergence of multidrug-resistant bacteria increase interest in understanding metabolic network (MN) mutations and the interaction of AD vs MN. In this study, we employed the IFPTML = Information Fusion (IF) + Perturbation Theory (PT) + Machine Learning (ML) algorithm on a huge dataset from the ChEMBL database, which contains >155,000 AD assays vs >40 MNs of multiple bacteria species. We built a linear discriminant analysis (LDA) and 17 ML models centered on the linear index and based on atoms to predict antibacterial compounds. The IFPTML-LDA model presented the following results for the training subset: specificity (Sp) = 76% out of 70,000 cases, sensitivity (Sn) = 70%, and Accuracy (Acc) = 73%. The same model also presented the following results for the validation subsets: Sp = 76%, Sn = 70%, and Acc = 73.1%. Among the IFPTML nonlinear models, the k nearest neighbors (KNN) showed the best results with Sn = 99.2%, Sp = 95.5%, Acc = 97.4%, and Area Under Receiver Operating Characteristic (AUROC) = 0.998 in training sets. In the validation series, the Random Forest had the best results: Sn = 93.96% and Sp = 87.02% (AUROC = 0.945). The IFPTML linear and nonlinear models regarding the ADs vs MNs have good statistical parameters, and they could contribute toward finding new metabolic mutations in antibiotic resistance and reducing time/costs in antibacterial drug research.
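The sensitivity, specificity, and accuracy figures quoted in this abstract follow the usual confusion-matrix definitions, sketched below. The counts in the example are hypothetical and were chosen only so that the resulting rates resemble the reported LDA values; they are not the paper's data.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute sensitivity, specificity and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)              # Sn: fraction of positives recovered
    specificity = tn / (tn + fp)              # Sp: fraction of negatives recovered
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {"Sn": sensitivity, "Sp": specificity, "Acc": accuracy}


if __name__ == "__main__":
    # Hypothetical balanced set: 100 positive and 100 negative cases.
    print(classification_metrics(tp=70, fn=30, tn=76, fp=24))
```

With balanced classes, accuracy is simply the mean of Sn and Sp, which is why Sn = 70% and Sp = 76% give Acc = 73% in this toy example.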
Affiliation(s)
- Karel Diéguez-Santana
  - Department of Organic and Inorganic Chemistry, University of Basque Country UPV/EHU, 48940 Leioa, Spain
  - Universidad Regional Amazónica IKIAM, Tena, Napo 150150, Ecuador
- Gerardo M Casañola-Martin
  - Department of Coatings and Polymeric Materials, North Dakota State University, Fargo, North Dakota 58102, United States
  - Department of Systems and Computer Engineering, Carleton University, K1S5B6 Ottawa, Ontario, Canada
- Roldan Torres
  - Universidad Regional Amazónica IKIAM, Tena, Napo 150150, Ecuador
- Bakhtiyor Rasulev
  - Department of Coatings and Polymeric Materials, North Dakota State University, Fargo, North Dakota 58102, United States
- James R Green
  - Department of Systems and Computer Engineering, Carleton University, K1S5B6 Ottawa, Ontario, Canada
- Humberto González-Díaz
  - Department of Organic and Inorganic Chemistry, University of Basque Country UPV/EHU, 48940 Leioa, Spain
  - BIOFISIKA, Basque Center for Biophysics CSIC-UPVEH, 48940 Leioa, Spain
  - IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Biscay, Spain
9
Aliagas I, Gobbi A, Lee ML, Sellers BD. Comparison of logP and logD correction models trained with public and proprietary data sets. J Comput Aided Mol Des 2022; 36:253-262. [PMID: 35359246 DOI: 10.1007/s10822-022-00450-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2021] [Accepted: 03/15/2022] [Indexed: 10/18/2022]
Abstract
In drug discovery, partition and distribution coefficients, logP and logD for octanol/water, are widely used as metrics of the lipophilicity of molecules, which in turn has a strong influence on the bioactivity and bioavailability of potential drugs. There are a variety of established methods, mostly fragment- or atom-based, to calculate logP, while logD prediction generally relies on calculated logP and pKa for the estimation of neutral and ionized populations at a given pH. Algorithms such as ClogP have limitations that generally lead to systematic errors for chemically related molecules, while pKa estimation is generally more difficult due to the interplay of electronic, inductive and conjugation effects for ionizable moieties. We propose an integrated machine learning QSAR modeling approach to predict logD by training the model with experimental data while using ClogP and pKa predicted by commercial software as model descriptors. By optimizing the loss function for the ClogD calculated by the software, we build a correction model that incorporates both descriptors from the software and available experimental logD data. Additionally, we calculate logP from the logD model using the software-predicted pKa values. Here, we have trained models using publicly or commercially available logD data to show that this approach can improve on commercial software predictions of lipophilicity. When applied to other logD data sets, this approach extends the domain of applicability of logD and logP predictions over commercial software. The performance of these models compares favorably with that of models built with a larger set of proprietary logD data.
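The relationship between logP, pKa, and logD that this abstract relies on can be illustrated with the textbook expression for a monoprotic compound. This is a minimal sketch under the common assumption that only the neutral species partitions into octanol; it is not the correction model described in the paper, and the function names are made up for illustration.

```python
import math


def logd_monoprotic(logp: float, pka: float, ph: float = 7.4, acid: bool = True) -> float:
    """logD of a monoprotic acid or base, assuming only the neutral form partitions.

    acid:  logD = logP - log10(1 + 10**(pH - pKa))
    base:  logD = logP - log10(1 + 10**(pKa - pH))
    """
    exponent = (ph - pka) if acid else (pka - ph)
    return logp - math.log10(1.0 + 10.0 ** exponent)


def logp_from_logd(logd: float, pka: float, ph: float = 7.4, acid: bool = True) -> float:
    """Invert the expression above to recover logP from logD and pKa."""
    exponent = (ph - pka) if acid else (pka - ph)
    return logd + math.log10(1.0 + 10.0 ** exponent)


if __name__ == "__main__":
    # Hypothetical acid with logP = 3.0 and pKa = 4.4 at pH 7.4.
    logd = logd_monoprotic(logp=3.0, pka=4.4)
    print(round(logd, 3), round(logp_from_logd(logd, pka=4.4), 3))
```

Because the ionization term depends exponentially on pH minus pKa, a small pKa error shifts logD substantially, which is consistent with the abstract's point that pKa estimation is a key source of error.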
Affiliation(s)
- Ignacio Aliagas
  - Discovery Chemistry, Genentech Inc, 1 DNA Way, South San Francisco, CA, 94080, USA
- Alberto Gobbi
  - Discovery Chemistry, Genentech Inc, 1 DNA Way, South San Francisco, CA, 94080, USA
- Man-Ling Lee
  - Discovery Chemistry, Genentech Inc, 1 DNA Way, South San Francisco, CA, 94080, USA
- Benjamin D Sellers
  - Discovery Chemistry, Genentech Inc, 1 DNA Way, South San Francisco, CA, 94080, USA
10
Abstract
Artificial intelligence (AI) tools find increasing application in drug discovery supporting every stage of the Design-Make-Test-Analyse (DMTA) cycle. The main focus of this chapter is the application in molecular generation with the aid of deep neural networks (DNN). We present a historical overview of the main advances in the field. We analyze the concepts of distribution and goal-directed learning and then highlight some of the recent applications of generative models in drug design with a focus into research work from the biopharmaceutical industry. We present in some more detail REINVENT which is an open-source software developed within our group in AstraZeneca and the main platform for AI molecular design support for a number of medicinal chemistry projects in the company and we also demonstrate some of our work in library design. Finally, we present some of the main challenges in the application of AI in Drug Discovery and different approaches to respond to these challenges which define areas for current and future work.
11
Quevedo-Tumailli V, Ortega-Tenezaca B, González-Díaz H. IFPTML Mapping of Drug Graphs with Protein and Chromosome Structural Networks vs. Pre-Clinical Assay Information for Discovery of Antimalarial Compounds. Int J Mol Sci 2021; 22:13066. [PMID: 34884870 PMCID: PMC8657696 DOI: 10.3390/ijms222313066] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/18/2021] [Revised: 11/23/2021] [Accepted: 11/24/2021] [Indexed: 11/16/2022] Open
Abstract
Parasite species of the genus Plasmodium cause malaria, which remains a major global health problem due to parasite resistance to available antimalarial drugs and increasing treatment costs. Consequently, computational prediction of new antimalarial compounds with novel targets in the proteome of Plasmodium sp. is a very important goal for the pharmaceutical industry. We can expect that the success of a pre-clinical assay depends on the conditions of the assay per se, the chemical structure of the drug, the structure of the target protein, as well as on factors governing the expression of this protein in the proteome, such as gene (deoxyribonucleic acid, DNA) sequence and/or chromosome structure. However, there are no reports of computational models that consider all these factors simultaneously. Some of the difficulties for this kind of analysis are the dispersion of data in different datasets, the high heterogeneity of data, etc. In this work, we analyzed three databases, ChEMBL (Chemical database of the European Molecular Biology Laboratory), UniProt (Universal Protein Resource), and NCBI-GDV (National Center for Biotechnology Information, Genome Data Viewer), to achieve this goal. The ChEMBL dataset contains outcomes for 17,758 unique assays of potential antimalarial compounds, including numeric descriptors (variables) for the structure of compounds as well as a huge amount of information about the conditions of assays. The NCBI-GDV and UniProt datasets include the sequences of genes and proteins and their functions. In addition, we also created two partitions (cassayj = caj and cdataj = cdj) of categorical variables from the ChEMBL dataset. These partitions contain variables that encode information about experimental conditions of preclinical assays (caj) or about the nature and quality of data (cdj). 
These categorical variables include information about 22 parameters of biological activity (ca0), 28 target proteins (ca1), and 9 organisms of assay (ca2), etc. We also created another partition (cprotj = cpj) including categorical variables with biological information about the target proteins, genes, and chromosomes. These variables cover 32 genes (cp0), 10 chromosomes (cp1), gene orientation (cp2), and 31 protein functions (cp3). We used a Perturbation-Theory Machine Learning Information Fusion (IFPTML) algorithm to map all this information (from the three databases) and train a predictive model. Shannon's entropy measure Shk (numerical variables) was used to quantify the information about the structure of drugs, protein sequences, gene sequences, and chromosomes on the same information scale. Perturbation Theory Operators (PTOs) in the form of Moving Average (MA) operators were used to quantify perturbations (deviations) in the structural variables with respect to their expected values for different subsets (partitions) of categorical variables. We obtained three IFPTML models using General Discriminant Analysis (GDA), Classification Tree with Univariate Splits (CTUS), and Classification Tree with Linear Combinations (CTLC). The IFPTML-CTLC model showed the best performance, with Sensitivity Sn(%) = 83.6/85.1 and Specificity Sp(%) = 89.8/89.7 for training/validation sets, respectively. This model could become a useful tool for the optimization of preclinical assays of new antimalarial compounds vs. different proteins in the proteome of Plasmodium.
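One plausible reading of the Moving Average operators described in this abstract is that each structural variable is replaced by its deviation from the mean of all cases sharing the same categorical label. The sketch below implements only that reading; the function name and the toy values are assumptions, and the paper's actual operators may differ in detail.

```python
from collections import defaultdict


def moving_average_deviations(values, categories):
    """Return value - mean(values in the same category) for every case.

    `values` are numeric descriptors (for example an entropy measure) and
    `categories` are the matching categorical labels (for example the assay
    organism). Both sequences must have the same length.
    """
    if len(values) != len(categories):
        raise ValueError("values and categories must have the same length")

    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, cat in zip(values, categories):
        sums[cat] += value
        counts[cat] += 1

    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return [value - means[cat] for value, cat in zip(values, categories)]


if __name__ == "__main__":
    # Hypothetical descriptor values grouped by a hypothetical assay label.
    print(moving_average_deviations([1.0, 3.0, 10.0], ["a", "a", "b"]))
```

The resulting deviations can then serve as inputs to a classifier, so that the model sees how unusual a compound is within its own assay condition rather than its raw descriptor value.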
Affiliation(s)
- Viviana Quevedo-Tumailli
  - Grupo RNASA-IMEDIR, Department of Computer Science, University of A Coruña, 15071 A Coruña, Spain
  - Research Department, Puyo Campus, Universidad Estatal Amazónica, Puyo 160150, Ecuador
- Bernabe Ortega-Tenezaca
  - Grupo RNASA-IMEDIR, Department of Computer Science, University of A Coruña, 15071 A Coruña, Spain
  - Information and Communications Technology Management Department, Puyo Campus, Universidad Estatal Amazónica, Puyo 160150, Ecuador
- Humberto González-Díaz
  - Department of Organic and Inorganic Chemistry, University of the Basque Country UPV/EHU, 48940 Leioa, Spain
  - BIOFISIKA, Basque Centre for Biophysics, CSIC-UPV/EHU, 48940 Leioa, Spain
  - IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Spain
  - Correspondence: Tel.: +34-94-601-3547
12
Pietruś W, Kurczab R, Stumpfe D, Bojarski AJ, Bajorath J. Data-Driven Analysis of Fluorination of Ligands of Aminergic G Protein Coupled Receptors. Biomolecules 2021; 11:1647. [PMID: 34827645 PMCID: PMC8615825 DOI: 10.3390/biom11111647] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2021] [Revised: 11/04/2021] [Accepted: 11/05/2021] [Indexed: 11/16/2022] Open
Abstract
Currently, G protein-coupled receptors are the targets with the highest number of drugs in many therapeutic areas. Fluorination has become a common strategy in designing highly active biological compounds, as evidenced by the steadily increasing number of newly approved fluorine-containing drugs. Herein, we identified and analysed 1554 target-based FSAR sets (non-fluorinated compounds and their fluorinated analogues) in the ChEMBL database, comprising 966 unique non-fluorinated and 2457 unique fluorinated compounds active against 33 different aminergic GPCRs. Although a relatively small number of activity cliffs (defined as a pair of structurally similar compounds showing significant differences of activity, ΔpPot > 1.7) was found in FSAR sets, it is clear that appropriately introduced fluorine can increase ligand potency more than 50-fold. The analysis of matched molecular pair (MMP) networks indicated that the fluorination of the aromatic ring showed no clear trend towards a positive or negative effect on affinity; however, a favourable site for a positive potency effect of fluorination was the ortho position. Fluorination of aliphatic fragments more often led to a decrease in biological activity. The results may constitute rules of thumb for fluorination of aminergic receptor ligands and provide insights into the role of fluorine substitutions in medicinal chemistry.
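The ΔpPot threshold and the 50-fold figure in this abstract are two views of the same quantity. Assuming pPot denotes the negative decadic logarithm of a potency value expressed as a concentration (an assumption about notation, with K_weak and K_strong introduced here for the weaker and stronger binder), the link can be written as:

```latex
\Delta\mathrm{pPot}
  = \left|\mathrm{pPot}_1 - \mathrm{pPot}_2\right|
  = \log_{10}\!\left(\frac{K_{\mathrm{weak}}}{K_{\mathrm{strong}}}\right),
\qquad
\Delta\mathrm{pPot} > 1.7
  \;\Longleftrightarrow\;
  \frac{K_{\mathrm{weak}}}{K_{\mathrm{strong}}} > 10^{1.7} \approx 50.1
```

A difference of 1.7 log units therefore corresponds to a potency ratio of roughly 50.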
Affiliation(s)
- Wojciech Pietruś
- Department of Medicinal Chemistry, Maj Institute of Pharmacology, Polish Academy of Sciences, Smetna 12, 31-343 Krakow, Poland; (W.P.); (A.J.B.)
- Department of Life Science Informatics, LIMES Program Unit Chemical Biology and Medicinal Chemistry, B-IT, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany;
| | - Rafał Kurczab
- Department of Medicinal Chemistry, Maj Institute of Pharmacology, Polish Academy of Sciences, Smetna 12, 31-343 Krakow, Poland; (W.P.); (A.J.B.)
| | - Dagmar Stumpfe
- Department of Life Science Informatics, LIMES Program Unit Chemical Biology and Medicinal Chemistry, B-IT, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany;
| | - Andrzej J. Bojarski
- Department of Medicinal Chemistry, Maj Institute of Pharmacology, Polish Academy of Sciences, Smetna 12, 31-343 Krakow, Poland; (W.P.); (A.J.B.)
| | - Jürgen Bajorath
- Department of Life Science Informatics, LIMES Program Unit Chemical Biology and Medicinal Chemistry, B-IT, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany;
| |
|
13
|
Falaguera MJ, Mestres J. Congenericity of Claimed Compounds in Patent Applications. Molecules 2021; 26:5253. [PMID: 34500686 DOI: 10.3390/molecules26175253] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/23/2021] [Revised: 08/17/2021] [Accepted: 08/18/2021] [Indexed: 12/04/2022] Open
Abstract
A method is presented to analyze quantitatively the degree of congenericity of claimed compounds in patent applications. The approach successfully differentiates patents exemplified with highly congeneric compounds of a structurally compact and well defined chemical series from patents containing a more diverse set of compounds around a more vaguely described patent claim. An application to 750 common patents available in SureChEMBL, SureChEMBLccs and ChEMBL is presented and the congenericity of patent compounds in those different sources discussed.
|
14
|
Herrera-Ibatá DM. Machine Learning and Perturbation Theory Machine Learning (PTML) in Medicinal Chemistry, Biotechnology, and Nanotechnology. Curr Top Med Chem 2021; 21:649-660. [PMID: 33475073 DOI: 10.2174/1568026621666210121153413] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2020] [Revised: 12/18/2020] [Accepted: 12/21/2020] [Indexed: 11/22/2022]
Abstract
Recently, different authors have reported Perturbation Theory (PT) methods combined with machine learning (ML) to obtain PTML (PT + ML) models. They have applied PTML models to the study of different biological systems. Here we present a state-of-the-art review of the different applications of PTML models in Organic Synthesis, Medicinal Chemistry, Protein Research, and Technology. The aim of the models is to find relations between the molecular descriptors and the biological characteristics to predict key properties of new compounds. An area where ML has been very useful is the drug discovery process. The entire process of drug discovery generates large amounts of data, and it is also costly and time-consuming. ML offers the opportunity to analyze significant amounts of chemical data and to use the outcomes to find potential drug candidates.
Affiliation(s)
- Diana M Herrera-Ibatá
- Fundacion Universitaria Agraria de Colombia, Uniagraria, Facultad de Medicina Veterinaria, Bogota 111166, Colombia
| |
|
15
|
Sampaio-Dias IE, Rodríguez-Borges JE, Yáñez-Pérez V, Arrasate S, Llorente J, Brea JM, Bediaga H, Viña D, Loza MI, Caamaño O, García-Mera X, González-Díaz H. Synthesis, Pharmacological, and Biological Evaluation of 2-Furoyl-Based MIF-1 Peptidomimetics and the Development of a General-Purpose Model for Allosteric Modulators (ALLOPTML). ACS Chem Neurosci 2021; 12:203-215. [PMID: 33347281 DOI: 10.1021/acschemneuro.0c00687] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 12/27/2022] Open
Abstract
This work describes the synthesis and pharmacological evaluation of 2-furoyl-based Melanostatin (MIF-1) peptidomimetics as dopamine D2 modulating agents. Eight novel peptidomimetics were tested for their ability to enhance the maximal effect of tritiated N-propylapomorphine ([3H]-NPA) at D2 receptors (D2R). In this series, 2-furoyl-l-leucylglycinamide (6a) produced a statistically significant increase in the maximal [3H]-NPA response at 10 pM (11 ± 1%), comparable to the effect of MIF-1 (18 ± 9%) at the same concentration. This result supports previous evidence that the replacement of the proline residue by heteroaromatic scaffolds is tolerated at the allosteric binding site of MIF-1. Biological assays performed for peptidomimetic 6a using cortex neurons from 19-day-old Wistar-Kyoto rat embryos suggest that 6a displays no neurotoxicity up to 100 μM. Overall, the pharmacological and toxicological profile and the structural simplicity of 6a make this peptidomimetic a potential lead compound for further development and optimization, paving the way for the development of novel modulating agents of D2R suitable for the treatment of CNS-related diseases. Additionally, the pharmacological and biological data herein reported, along with >20,000 outcomes of preclinical assays, were used to seek a general model to predict the allosteric modulatory potential of molecular candidates for a myriad of target receptors, organisms, cell lines, and biological activity parameters based on perturbation theory (PT) ideas and machine learning (ML) techniques, abbreviated as ALLOPTML. By doing so, ALLOPTML shows high specificity Sp = 89.2/89.4%, sensitivity Sn = 71.3/72.2%, and accuracy Ac = 86.1/86.4% in training/validation series, respectively.
To the best of our knowledge, ALLOPTML is the first general-purpose chemoinformatic tool using a PTML-based model for the multioutput and multicondition prediction of allosteric compounds, which is expected to save both time and resources during the early drug discovery of allosteric modulators.
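The specificity (Sp), sensitivity (Sn), and accuracy (Ac) values quoted in this abstract are standard confusion-matrix statistics. As a minimal sketch of how such figures are usually derived, the snippet below computes them from counts of true and false positives and negatives; the function name and the example counts are hypothetical and do not come from the ALLOPTML study.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Return sensitivity, specificity and accuracy (as fractions) from confusion-matrix counts."""
    total = tp + fn + tn + fp
    if total == 0 or (tp + fn) == 0 or (tn + fp) == 0:
        raise ValueError("each class needs at least one case")
    return {
        "Sn": tp / (tp + fn),        # sensitivity: fraction of positives recovered
        "Sp": tn / (tn + fp),        # specificity: fraction of negatives recovered
        "Ac": (tp + tn) / total,     # accuracy: fraction of all cases classified correctly
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show the calculation.
    for name, value in classification_metrics(tp=50, fn=50, tn=90, fp=10).items():
        print(f"{name} = {100 * value:.1f}%")
```

When negatives outnumber positives, accuracy sits closer to specificity than to sensitivity, which is the pattern seen in the reported values.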
Affiliation(s)
- Ivo E. Sampaio-Dias
- LAQV/REQUIMTE, Dept. of Chemistry and Biochemistry, Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal
| | - José E. Rodríguez-Borges
- LAQV/REQUIMTE, Dept. of Chemistry and Biochemistry, Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal
| | - Víctor Yáñez-Pérez
- Dept. of Organic Chemistry II, University of Basque Country (UPV-EHU), 48940 Leioa, Spain
| | - Sonia Arrasate
- Dept. of Pharmacology, Faculty of Medicine and Nursing, University of Basque Country (UPV-EHU), 48940 Leioa, Spain
| | - Javier Llorente
- Dept. of Pharmacology, Faculty of Medicine and Nursing, University of Basque Country (UPV-EHU), 48940 Leioa, Spain
- Dept. of Pharmacology, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
| | - José M. Brea
- Innopharma Screening Platform, Biofarma Research group, Centre of Research in Molecular Medicine and Chronic Diseases CIMUS, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
| | - Harbil Bediaga
- Dept. of Organic Chemistry II, University of Basque Country (UPV-EHU), 48940 Leioa, Spain
- Dept. of Physical Chemistry, University of Basque Country (UPV-EHU), 48940 Leioa, Spain
| | - Dolores Viña
- Dept. of Pharmacology, Faculty of Pharmacy, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
- Centre of Research in Molecular Medicine and Chronic Diseases CIMUS, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
| | - María Isabel Loza
- Innopharma Screening Platform, Biofarma Research group, Centre of Research in Molecular Medicine and Chronic Diseases CIMUS, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
| | - Olga Caamaño
- Dept. of Organic Chemistry, Faculty of Pharmacy, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
| | - Xerardo García-Mera
- Dept. of Organic Chemistry, Faculty of Pharmacy, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
| | - Humberto González-Díaz
- Dept. of Organic Chemistry II, University of Basque Country (UPV-EHU), 48940 Leioa, Spain
- Basque Center for Biophysics (CSIC UPV/EHU), University of Basque Country (UPV-EHU), 48940 Leioa, Spain
- IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Spain
| |
|
16
|
Lin A, Baskin II, Marcou G, Horvath D, Beck B, Varnek A. Parallel Generative Topographic Mapping: An Efficient Approach for Big Data Handling. Mol Inform 2020; 39:e2000009. [PMID: 32347666 PMCID: PMC7757192 DOI: 10.1002/minf.202000009] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/16/2020] [Accepted: 04/10/2020] [Indexed: 11/12/2022]
Abstract
Generative Topographic Mapping (GTM) can be efficiently used to visualize, analyze and model large chemical data. The GTM manifold needs to span the chemical space deemed relevant for a given problem. Therefore, the Frame set (FS) of compounds used for the manifold construction must cover a given chemical space well. Intuitively, the FS size must rise with the size and diversity of the target library. At the same time, GTM training can be very slow or even become technically impossible at FS sizes of the order of 10^5 compounds, which is a very small number compared to today's commercially accessible compounds and, especially, to the theoretically feasible molecules. In order to solve this problem, we propose a Parallel GTM algorithm based on the merging of "intermediate" manifolds constructed in parallel for different subsets of molecules. An ensemble of these subsets forms a FS for the "final" manifold. In order to assess the efficiency of the new algorithm, 80 GTMs were built on FSs of different sizes ranging from 10 to 1.8 M compounds selected from the ChEMBL database. Each GTM was challenged to build classification models for up to 712 biological activities (depending on the FS size). With the novel parallel GTM procedure, we could thus cover the entire spectrum of possible FS sizes, whereas previous studies were forced to rely on the working hypothesis that FS sizes of a few thousand compounds are sufficient to describe the ChEMBL chemical space. In fact, this study formally proves this to be true: a FS containing only 5000 randomly picked compounds is sufficient to represent the entire ChEMBL collection (1.8 M molecules), in the sense that a further increase of FS compound numbers has no beneficial impact on the predictive propensity of the above-mentioned 712 activity classification models.
Parallel GTM may, however, be required to generate maps based on very large FS, that might improve chemical space cartography of big commercial and virtual libraries, approaching billions of compounds.
Affiliation(s)
- Arkadii Lin
- University of Strasbourg, Laboratory of Chemoinformatics, Faculty of Chemistry, 4, Blaise Pascal str., 67081 Strasbourg, France
| | - Igor I. Baskin
- Faculty of Physics, Lomonosov Moscow State University, 1/2, Leninskie Gory str., 119991 Moscow, Russia
| | - Gilles Marcou
- University of Strasbourg, Laboratory of Chemoinformatics, Faculty of Chemistry, 4, Blaise Pascal str., 67081 Strasbourg, France
| | - Dragos Horvath
- University of Strasbourg, Laboratory of Chemoinformatics, Faculty of Chemistry, 4, Blaise Pascal str., 67081 Strasbourg, France
| | - Bernd Beck
- Department of Medicinal Chemistry, Boehringer Ingelheim Pharma GmbH & Co. KG, 65, Birkendorfer str., 88397 Biberach an der Riss, Germany
| | - Alexandre Varnek
- University of Strasbourg, Laboratory of Chemoinformatics, Faculty of Chemistry, 4, Blaise Pascal str., 67081 Strasbourg, France
| |
|
17
|
Tuerkova A, Zdrazil B. A ligand-based computational drug repurposing pipeline using KNIME and Programmatic Data Access: case studies for rare diseases and COVID-19. J Cheminform 2020; 12:71. [PMID: 33250934 PMCID: PMC7686838 DOI: 10.1186/s13321-020-00474-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 07/06/2020] [Accepted: 11/09/2020] [Indexed: 01/01/2023] Open
Abstract
Biomedical information mining is increasingly recognized as a promising technique to accelerate drug discovery and development. Especially, integrative approaches which mine data from several (open) data sources have become more attractive with the increasing possibilities to programmatically access data through Application Programming Interfaces (APIs). The use of open data in conjunction with free, platform-independent analytic tools provides the additional advantage of flexibility, re-usability, and transparency. Here, we present a strategy for performing ligand-based in silico drug repurposing with the analytics platform KNIME. We demonstrate the usefulness of the developed workflow on the basis of two different use cases: a rare disease (here: Glucose Transporter Type 1 (GLUT-1) deficiency), and a new disease (here: COVID 19). The workflow includes a targeted download of data through web services, data curation, detection of enriched structural patterns, as well as substructure searches in DrugBank and a recently deposited data set of antiviral drugs provided by Chemical Abstracts Service. Developed workflows, tutorials with detailed step-by-step instructions, and the information gained by the analysis of data for GLUT-1 deficiency syndrome and COVID-19 are made freely available to the scientific community. The provided framework can be reused by researchers for other in silico drug repurposing projects, and it should serve as a valuable teaching resource for conveying integrative data mining strategies.
Affiliation(s)
- Alzbeta Tuerkova
- Department of Pharmaceutical Chemistry, Division of Drug Design and Medicinal Chemistry, University of Vienna, Althanstraße 14, 1090 Vienna, Austria
| | - Barbara Zdrazil
- Department of Pharmaceutical Chemistry, Division of Drug Design and Medicinal Chemistry, University of Vienna, Althanstraße 14, 1090 Vienna, Austria
| |
|
18
|
Bento AP, Hersey A, Félix E, Landrum G, Gaulton A, Atkinson F, Bellis LJ, De Veij M, Leach AR. An open source chemical structure curation pipeline using RDKit. J Cheminform 2020; 12:51. [PMID: 33431044 PMCID: PMC7458899 DOI: 10.1186/s13321-020-00456-1] [Citation(s) in RCA: 128] [Impact Index Per Article: 32.0] [Received: 06/11/2020] [Accepted: 08/24/2020] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND The ChEMBL database is one of a number of public databases that contain bioactivity data on small molecule compounds curated from diverse sources. Incoming compounds are typically not standardised according to consistent rules. In order to maintain the quality of the final database and to easily compare and integrate data on the same compound from different sources it is necessary for the chemical structures in the database to be appropriately standardised. RESULTS A chemical curation pipeline has been developed using the open source toolkit RDKit. It comprises three components: a Checker to test the validity of chemical structures and flag any serious errors; a Standardizer which formats compounds according to defined rules and conventions and a GetParent component that removes any salts and solvents from the compound to create its parent. This pipeline has been applied to the latest version of the ChEMBL database as well as uncurated datasets from other sources to test the robustness of the process and to identify common issues in database molecular structures. CONCLUSION All the components of the structure pipeline have been made freely available for other researchers to use and adapt for their own use. The code is available in a GitHub repository and it can also be accessed via the ChEMBL Beaker webservices. It has been used successfully to standardise the nearly 2 million compounds in the ChEMBL database and the compound validity checker has been used to identify compounds with the most serious issues so that they can be prioritised for manual curation.
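The GetParent step described in this abstract removes salts and solvents to leave the parent compound. The sketch below is not the published pipeline, which is built on RDKit and its own rules; it is a deliberately simplified, standard-library stand-in that keeps the dot-separated SMILES component with the most atoms. The function names are hypothetical, the largest-fragment rule is a swapped-in heuristic, and the atom counter assumes valid SMILES written with the organic subset or bracket atoms.

```python
import re

# One match per atom: bracket atoms first, then two-letter halogens,
# then the one-letter organic subset (aliphatic and aromatic).
_ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atom_count(fragment: str) -> int:
    """Count atom tokens in one SMILES fragment (bracket hydrogens are counted too)."""
    return len(_ATOM.findall(fragment))


def largest_fragment(smiles: str) -> str:
    """Return the dot-separated component with the most atoms (first one wins ties)."""
    fragments = [f for f in smiles.split(".") if f]
    if not fragments:
        raise ValueError("empty SMILES string")
    return max(fragments, key=heavy_atom_count)


if __name__ == "__main__":
    # Sodium acetate and an amine hydrochloride as hypothetical inputs.
    print(largest_fragment("CC(=O)[O-].[Na+]"))
    print(largest_fragment("Cl.CCN"))
```

A production workflow would rely on a cheminformatics toolkit instead, since a size-based rule cannot tell a large counter-ion from a small parent and does not neutralise charges.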
Affiliation(s)
- A Patrícia Bento
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, Cambridgeshire, UK
| | - Anne Hersey
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, Cambridgeshire, UK
| | - Eloy Félix
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, Cambridgeshire, UK
| | | | - Anna Gaulton
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, Cambridgeshire, UK
| | - Francis Atkinson
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, Cambridgeshire, UK
- The Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge, CB2 1EZ, UK
| | - Louisa J Bellis
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, Cambridgeshire, UK
- Department of Oncology, University of Cambridge, Cambridge, UK
| | - Marleen De Veij
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, Cambridgeshire, UK
| | - Andrew R Leach
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, Cambridgeshire, UK.
| |
|
19
|
Santana R, Zuluaga R, Gañán P, Arrasate S, Onieva E, Montemore MM, González-Díaz H. PTML Model for Selection of Nanoparticles, Anticancer Drugs, and Vitamins in the Design of Drug-Vitamin Nanoparticle Release Systems for Cancer Cotherapy. Mol Pharm 2020; 17:2612-2627. [PMID: 32459098 DOI: 10.1021/acs.molpharmaceut.0c00308] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 01/08/2023]
Abstract
Nanosystems are gaining momentum in pharmaceutical sciences because of the wide variety of possibilities for designing these systems to have specific functions. Specifically, studies of new cancer cotherapy drug-vitamin release nanosystems (DVRNs) including anticancer compounds and vitamins or vitamin derivatives have revealed encouraging results. However, the number of possible combinations of design and synthesis conditions is remarkably high. In addition, a large number of anticancer and vitamin derivatives have been already assayed, but a notably less number of cases of DVRNs were assayed as a whole (with the anticancer compound and the vitamin linked to them). Our approach combines with the perturbation theory and machine learning (PTML) model to predict the probability of obtaining an interesting DVRN by changing the anticancer compound and/or the vitamin present in a DVRN that is already tested for other anticancer compounds or vitamins that have not been tested yet as part of a DVRN. In a previous work, we built a linear PTML model useful for the design of these nanosystems. In doing so, we used information fusion (IF) techniques to carry out data enrichment of DVRN data compiled from the literature with the data for preclinical assays of vitamins from the ChEMBL database. The design features of DVRNs and the assay conditions of nanoparticles (NPs) and vitamins were included as multiplicative PT operators (PTOs) to the system, which indicates the importance of these variables. However, the previous work omitted experiments with nonlinear ML techniques and different types of PTOs such as metric-based PTOs. More importantly, the previous work does not consider the structure of the anticancer drug to be included in the new DVRNs. In this work, we are going to accomplish three main objectives (tasks). In the first task, we found a new model, alternative to the one published before, for the rational design of DVRNs using metric-based PTOs. 
The most accurate PTML model was the artificial neural network model, which showed values of specificity, sensitivity, and accuracy in the range of 90-95% in training and external validation series for more than 130,000 cases (DVRNs vs ChEMBL assays). Furthermore, in the second task, we used IF techniques to carry out data enrichment of our previous data set. In doing so, we constructed a new working data set of >970,000 cases with the data of preclinical assays of DVRNs, vitamins, and anticancer compounds from the ChEMBL database. All these assays have multiple continuous variables or descriptors dk and categorical variables cj (conditions of the assays) for drugs (dack, cacj), vitamins (dvk, cvj), and NPs (dnk, cnj). These data include >20,000 potential anticancer compounds with >270 protein targets (cac1), >580 assay cell organisms (cac2), and so forth. Furthermore, we include >36,000 assay vitamin derivatives in >6200 types of cells (c2vit), >120 assay organisms (c3vit), >60 assay strains (c4vit), and so forth. The enriched data set also contains >20 types of DVRNs (c5n) with 9 NP core materials (c4n), 8 synthesis methods (c7n), and so forth. We expressed all this information with PTOs and developed a qualitatively new PTML model that incorporates information of the anticancer drugs. This new model presents 96-97% of accuracy for training and external validation subsets. In the last task, we carried out a comparative study of ML and/or PTML models published and described how the models we are presenting cover the gap of knowledge in terms of drug delivery. In conclusion, we present here for the first time a multipurpose PTML model that is able to select NPs, anticancer compounds, and vitamins and their conditions of assay for DVRN design.
Affiliation(s)
- Ricardo Santana
- Department of Chemical and Biomolecular Engineering, Tulane University, 6823 St Charles Avenue, New Orleans, Louisiana 70118, United States
- University of Deusto, Avda. Universidades, 24, 48007 Bilbao, Spain
- Grupo de Investigación Sobre Nuevos Materiales, Facultad de Ingeniería Química, Universidad Pontificia Bolivariana, Circular 1 No. 70-01, 050031 Medellín, Colombia
| | - Robin Zuluaga
- Facultad de Ingeniería Agroindustrial, Universidad Pontificia Bolivariana, Circular 1 No. 70-01, 050031 Medellín, Colombia
| | - Piedad Gañán
- Grupo de Investigación Sobre Nuevos Materiales, Facultad de Ingeniería Química, Universidad Pontificia Bolivariana, Circular 1 No. 70-01, 050031 Medellín, Colombia
| | - Sonia Arrasate
- Department of Organic Chemistry II, University of Basque Country UPV/EHU, 48940 Leioa, Basque Country, Spain
| | - Enrique Onieva
- University of Deusto, Avda. Universidades, 24, 48007 Bilbao, Spain
| | - Matthew M Montemore
- Department of Chemical and Biomolecular Engineering, Tulane University, 6823 St Charles Avenue, New Orleans, Louisiana 70118, United States
| | - Humberto González-Díaz
- Department of Organic Chemistry II, University of Basque Country UPV/EHU, 48940 Leioa, Basque Country, Spain
- Basque Center for Biophysics, Spanish National Research Council (CSIC)-University of Basque Country UPV/EHU, 48940 Leioa, Basque Country, Spain
- Ikerbasque, Basque Foundation for Science, 48013 Bilbao, Basque Country, Spain
| |
|
20
|
Cortés-Ciriano I, Škuta C, Bender A, Svozil D. QSAR-derived affinity fingerprints (part 2): modeling performance for potency prediction. J Cheminform 2020; 12:41. [PMID: 33431016 PMCID: PMC7339533 DOI: 10.1186/s13321-020-00444-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 08/07/2019] [Accepted: 05/16/2020] [Indexed: 01/22/2023] Open
Abstract
Affinity fingerprints report the activity of small molecules across a set of assays, and thus make it possible to gather information about the bioactivities of structurally dissimilar compounds, where models based on chemical structure alone are often limited, and to model complex biological endpoints, such as human toxicity and in vitro cancer cell line sensitivity. Here, we propose to model in vitro compound activity using computationally predicted bioactivity profiles as compound descriptors. To this aim, we apply and validate a framework for the calculation of QSAR-derived affinity fingerprints (QAFFP) using a set of 1360 QSAR models generated using Ki, Kd, IC50 and EC50 data from the ChEMBL database. QAFFP thus represent a method to encode and relate compounds on the basis of their similarity in bioactivity space. To benchmark the predictive power of QAFFP we assembled IC50 data from the ChEMBL database for 18 diverse cancer cell lines widely used in preclinical drug discovery, and 25 diverse protein target data sets. This study complements part 1, where the performance of QAFFP in similarity searching, scaffold hopping, and bioactivity classification is evaluated. We show that using QAFFP as descriptors, despite their inherent noise, leads to errors in prediction on the test set in the ~0.65-0.95 pIC50 units range, which are comparable to the estimated uncertainty of bioactivity data in ChEMBL (0.76-1.00 pIC50 units). We find that the predictive power of QAFFP is slightly worse than that of Morgan2 fingerprints and 1D and 2D physicochemical descriptors, with an effect size in the 0.02-0.08 pIC50 units range. Including QSAR models with low predictive power in the generation of QAFFP does not lead to improved predictive power. Given that the QSAR models we used to compute the QAFFP were selected on the basis of data availability alone, we anticipate better modeling results for QAFFP generated using more diverse and biologically meaningful targets.
Data sets and Python code are publicly available at https://github.com/isidroc/QAFFP_regression .
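Prediction errors in this abstract are expressed in pIC50 units. As a minimal sketch of what that means, the snippet below converts IC50 values to pIC50 and computes a root-mean-square error between predicted and observed values. The abstract does not name the error metric, so RMSE is an assumption here, and the function names, the nM input unit, and the example numbers are hypothetical rather than taken from the linked repository.

```python
import math
from typing import Sequence


def pic50(ic50_nM: float) -> float:
    """pIC50 = -log10(IC50 in mol/L); for an input in nM this equals 9 - log10(IC50)."""
    if ic50_nM <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nM)


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root-mean-square error between two equally long sequences of pIC50 values."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    observed = [pic50(v) for v in (1000.0, 100.0, 10.0)]  # hypothetical IC50 values in nM
    predicted = [6.5, 6.8, 7.4]                            # hypothetical model outputs
    print(observed, round(rmse(predicted, observed), 2))
```

Because the scale is logarithmic, an error of about 1 pIC50 unit corresponds to a tenfold error in the underlying IC50.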
Affiliation(s)
- Isidro Cortés-Ciriano
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK
| | - Ctibor Škuta
- CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Institute of Molecular Genetics of the ASCR, v. v. i., Vídeňská 1083, 142 20, Prague, Czech Republic
| | - Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
| | - Daniel Svozil
- CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Institute of Molecular Genetics of the ASCR, v. v. i., Vídeňská 1083, 142 20, Prague, Czech Republic
- CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Technická 5, 166 28, Prague, Czech Republic
| |
|
21
|
Chávez-Hernández AL, Sánchez-Cruz N, Medina-Franco JL. A Fragment Library of Natural Products and its Comparative Chemoinformatic Characterization. Mol Inform 2020; 39:e2000050. [PMID: 32302465 DOI: 10.1002/minf.202000050] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 03/18/2020] [Accepted: 04/17/2020] [Indexed: 11/06/2022]
Abstract
We report a comprehensive fragment library with 205,903 fragments derived from the recently published Collection of Open Natural Products (COCONUT) data set with more than 400,000 non-redundant natural products. The natural products-based fragment library was compared with two other fragment libraries generated herein from ChEMBL (biologically relevant compounds) and Enamine-REAL (a large on-demand collection of synthetic compounds), both used as reference data sets with relevance in drug discovery. It was found that there is a large diversity of unique fragments derived from natural products and that the entire structures and fragments derived from natural products are more diverse and structurally complex than the two reference compound collections. During this work we introduced a novel visual representation of the chemical space based on the recently published concept of statistical-based database fingerprint. The compound and fragment libraries from natural products generated and analyzed in this work are freely available.
Affiliation(s)
- Ana L Chávez-Hernández
- Department of Pharmacy, School of Chemistry, Universidad Nacional Autónoma de México, Avenida Universidad 3000, Mexico City, 04510, Mexico
| | - Norberto Sánchez-Cruz
- Department of Pharmacy, School of Chemistry, Universidad Nacional Autónoma de México, Avenida Universidad 3000, Mexico City, 04510, Mexico
| | - José L Medina-Franco
- Department of Pharmacy, School of Chemistry, Universidad Nacional Autónoma de México, Avenida Universidad 3000, Mexico City, 04510, Mexico
| |
|
22
|
Sturm N, Mayr A, Le Van T, Chupakhin V, Ceulemans H, Wegner J, Golib-Dzib JF, Jeliazkova N, Vandriessche Y, Böhm S, Cima V, Martinovic J, Greene N, Vander Aa T, Ashby TJ, Hochreiter S, Engkvist O, Klambauer G, Chen H. Industry-scale application and evaluation of deep learning for drug target prediction. J Cheminform 2020; 12:26. [PMID: 33430964 PMCID: PMC7169028 DOI: 10.1186/s13321-020-00428-5] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 10/29/2019] [Accepted: 03/30/2020] [Indexed: 12/02/2022] Open
Abstract
Artificial intelligence (AI) is undergoing a revolution thanks to the breakthroughs of machine learning algorithms in computer vision, speech recognition, natural language processing and generative modelling. Recent works on publicly available pharmaceutical data showed that AI methods are highly promising for Drug Target prediction. However, the quality of public data might be different than that of industry data due to different labs reporting measurements, different measurement techniques, fewer samples and less diverse and specialized assays. As part of a European funded project (ExCAPE), that brought together expertise from pharmaceutical industry, machine learning, and high-performance computing, we investigated how well machine learning models obtained from public data can be transferred to internal pharmaceutical industry data. Our results show that machine learning models trained on public data can indeed maintain their predictive power to a large degree when applied to industry data. Moreover, we observed that deep learning derived machine learning models outperformed comparable models, which were trained by other machine learning algorithms, when applied to internal pharmaceutical company datasets. To our knowledge, this is the first large-scale study evaluating the potential of machine learning and especially deep learning directly at the level of industry-scale settings and moreover investigating the transferability of publicly learned target prediction models towards industrial bioactivity prediction pipelines.
Affiliation(s)
- Noé Sturm
  - Clinical Pharmacology and Safety Science, R&D BioPharmaceuticals, AstraZeneca, Pepparedsleden 1, 43183, Mölndal, Sweden
- Andreas Mayr
  - LIT AI Lab & Institute for Machine Learning, Johannes Kepler University Linz, Altenberger Str. 69, 4040, Linz, Austria
- Thanh Le Van
  - High-Dimensional Biology & Discovery Data Sciences, Discovery Sciences, Janssen Pharmaceutica, Turnhoutseweg 30, 2349, Beerse, Belgium
- Vladimir Chupakhin
  - High-Dimensional Biology & Discovery Data Sciences, Discovery Sciences, Janssen R&D, 1400 McKean Rd, Spring House, Pennsylvania, 19002, USA
- Hugo Ceulemans
  - High-Dimensional Biology & Discovery Data Sciences, Discovery Sciences, Janssen Pharmaceutica, Turnhoutseweg 30, 2349, Beerse, Belgium
- Joerg Wegner
  - High-Dimensional Biology & Discovery Data Sciences, Discovery Sciences, Janssen Pharmaceutica, Turnhoutseweg 30, 2349, Beerse, Belgium
- Jose-Felipe Golib-Dzib
  - High-Dimensional Biology & Discovery Data Sciences, Discovery Sciences, Janssen Cilag SA, Calle Río Jarama, 75A, 45007, Toledo, Spain
- Nina Jeliazkova
  - Ideaconsult Ltd., 4. Angel Kanchev Str., 1000, Sofia, Bulgaria
- Yves Vandriessche
  - Intel Corporation, Data Center Group, Veldkant 31, 2550, Kontich, Belgium
- Stanislav Böhm
  - IT4Innovations, VSB - Technical University of Ostrava, 17. Listopadu 2172/15, 70800, Ostrava-Poruba, Czech Republic
- Vojtech Cima
  - IT4Innovations, VSB - Technical University of Ostrava, 17. Listopadu 2172/15, 70800, Ostrava-Poruba, Czech Republic
- Jan Martinovic
  - IT4Innovations, VSB - Technical University of Ostrava, 17. Listopadu 2172/15, 70800, Ostrava-Poruba, Czech Republic
- Nigel Greene
  - Clinical Pharmacology and Safety Science, R&D BioPharmaceuticals, AstraZeneca, Pepparedsleden 1, 43183, Mölndal, Sweden
- Tom Vander Aa
  - Exascience Lab, Imec, Kapeldreef 75, 3001, Louvain, Belgium
- Thomas J Ashby
  - Exascience Lab, Imec, Kapeldreef 75, 3001, Louvain, Belgium
- Sepp Hochreiter
  - LIT AI Lab & Institute for Machine Learning, Johannes Kepler University Linz, Altenberger Str. 69, 4040, Linz, Austria
- Ola Engkvist
  - Hit Discovery, Discovery Sciences, R&D BioPharmaceuticals, AstraZeneca, Pepparedsleden 1, 43183, Mölndal, Sweden
- Günter Klambauer
  - LIT AI Lab & Institute for Machine Learning, Johannes Kepler University Linz, Altenberger Str. 69, 4040, Linz, Austria
- Hongming Chen
  - Hit Discovery, Discovery Sciences, R&D BioPharmaceuticals, AstraZeneca, Pepparedsleden 1, 43183, Mölndal, Sweden
|
23
|
Drakakis G, Cortés-Ciriano I, Alexander-Dann B, Bender A. Elucidating Compound Mechanism of Action and Predicting Cytotoxicity Using Machine Learning Approaches, Taking Prediction Confidence into Account. Curr Protoc Chem Biol 2020; 11:e73. [PMID: 31483099 DOI: 10.1002/cpch.73] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/06/2023]
Abstract
The modes of action (MoAs) of drugs frequently are unknown, because many are small molecules initially identified from phenotypic screens, giving rise to the need to elucidate their MoAs. In addition, the high attrition rate for candidate drugs in preclinical studies due to intolerable toxicity has motivated the development of computational approaches to predict drug candidate (cyto)toxicity as early as possible in the drug-discovery process. Here, we provide detailed instructions for capitalizing on bioactivity predictions to elucidate the MoAs of small molecules and infer their underlying phenotypic effects. We illustrate how these predictions can be used to infer the underlying antidepressive effects of marketed drugs. We also provide the necessary functionalities to model cytotoxicity data using single and ensemble machine-learning algorithms. Finally, we give detailed instructions on how to calculate confidence intervals for individual predictions using the conformal prediction framework. © 2019 by John Wiley & Sons, Inc.
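The conformal prediction step mentioned at the end of this abstract can be illustrated with a minimal sketch. It assumes the simplest split (inductive) variant for regression, with absolute residuals on a held-out calibration set as nonconformity scores; the function name and arguments are hypothetical, and this is not the authors' protocol.

```python
import math

def conformal_interval(calib_true, calib_pred, new_pred, alpha=0.2):
    """Split conformal interval for one new prediction (illustrative sketch).

    calib_true / calib_pred: measured and predicted values on a calibration
    set that was not used to train the model.
    alpha: tolerated error rate (0.2 targets roughly 80% coverage).
    """
    # Nonconformity score: absolute residual on each calibration compound.
    scores = sorted(abs(t - p) for t, p in zip(calib_true, calib_pred))
    n = len(scores)
    # Rank of the calibration score used as the interval half-width.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        raise ValueError("calibration set too small for this alpha")
    half_width = scores[k - 1]
    return new_pred - half_width, new_pred + half_width
```

The interval width here is the same for every compound; normalized variants that scale the residual by an estimated difficulty are a common refinement.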
Affiliation(s)
- Georgios Drakakis
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, United Kingdom
- Isidro Cortés-Ciriano
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, United Kingdom
- Ben Alexander-Dann
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, United Kingdom
- Andreas Bender
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, United Kingdom
|
24
|
Santana R, Zuluaga R, Gañán P, Arrasate S, Onieva Caracuel E, González-Díaz H. PTML Model of ChEMBL Compounds Assays for Vitamin Derivatives. ACS Comb Sci 2020; 22:129-141. [PMID: 32011854 DOI: 10.1021/acscombsci.9b00166] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 12/11/2022]
Abstract
Determining the biological activity of vitamin derivatives is needed given that organic synthesis of analogs of vitamins is an active field of interest for medicinal chemistry, pharmaceuticals, and food additives. Accordingly, scientists from different disciplines perform preclinical assays (nij) with a considerable combination of assay conditions (cj). Indeed, the ChEMBL platform contains a database that includes results from 36 220 different biological activity bioassays of 21 240 different vitamins and vitamin derivatives. These assays are heterogeneous in terms of their combinations of assay conditions cj. They are focused on >500 different biological activity parameters (c0), >340 different targets (c1), >6200 types of cell (c2), >120 organisms of assay (c3), and >60 assay strains (c4). The data include a total of >1850 niacin assays, >1580 tretinoin assays, >1580 retinol assays, 857 ascorbic acid assays, etc. Because combinatorial data of this complexity are difficult for researchers to assimilate, we propose to build a model by combining perturbation theory (PT) and machine learning (ML). Through this study, we propose a PTML (PT + ML) combinatorial model for ChEMBL results on the biological activity of vitamins and vitamin derivatives. The linear discriminant analysis (LDA) model presented the following results for training subset a: specificity (%) = 90.38, sensitivity (%) = 87.51, and accuracy (%) = 89.89. The model showed the following results for the external validation subset: specificity (%) = 90.58, sensitivity (%) = 87.72, and accuracy (%) = 90.09. Different types of linear and nonlinear PTML models, such as logistic regression (LR), classification tree (CT), naïve Bayes (NB), and random forest (RF), were applied to compare predictive capacity. The PTML-LDA model predicts more accurately when combinatorial descriptors are applied. In addition, a PCA experiment with chemical structure descriptors allowed us to characterize the high structural diversity of the chemical space studied. In any case, PTML models using chemical structure descriptors do not improve the performance of the PTML-LDA model based on ALOGP and PSA. We can conclude that the three-variable PTML-LDA model is a simplified and adaptable tool for predicting, across different experimental combinations, the biological activity of vitamin derivatives.
Affiliation(s)
- Ricardo Santana
  - DeustoTech-Fundación Deusto, Avda. Universidades, 24, 48007 Bilbao, Spain
  - Grupo de Investigación sobre Nuevos Materiales, Universidad Pontificia Bolivariana UPB, 050031, Medellín, Colombia
- Robin Zuluaga
  - Facultad de Ingeniería Agroindustrial, Universidad Pontificia Bolivariana UPB, 050031, Medellín, Colombia
- Piedad Gañán
  - Facultad de Ingeniería Química, Universidad Pontificia Bolivariana UPB, 050031, Medellín, Colombia
- Sonia Arrasate
  - Department of Organic Chemistry II, University of Basque Country UPV/EHU, 48940, Leioa, Spain
- Humbert González-Díaz
  - Department of Organic Chemistry II, University of Basque Country UPV/EHU, 48940, Leioa, Spain
  - IKERBASQUE, Basque Foundation for Science, 48011, Bilbao, Spain
|
25
|
Sarkar A. Enabling design of screening libraries for antibiotic discovery by modeling ChEMBL data. Eur J Pharm Sci 2020; 143:105166. [PMID: 31783159 DOI: 10.1016/j.ejps.2019.105166] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/22/2019] [Revised: 11/11/2019] [Accepted: 11/24/2019] [Indexed: 11/17/2022]
Abstract
It is critical to identify novel antibiotics. Yet, the scientific community has struggled in this pursuit because we do not understand which molecules will penetrate the bacterial outer envelope. In this work, we have identified a large dataset of compounds known to reach their targets in bacterial cells (penetrators) and compared them with molecules that do not (non-penetrators). Our dataset, extracted from the ChEMBL database, is a useful tool to guide the selection of molecules for antibiotic screening. Simple random forest classification models are able to correctly distinguish penetrators from non-penetrators. The model demonstrated ~87% accuracy, with high precision (~88%) and recall (~97%) in identifying penetrators of Gram-positive bacteria. A paucity of data for non-penetrators was a major hurdle to model-building; we observed a ~86% negative predictive value, but only a ~57% specificity. Accumulation of data on non-penetrators is therefore necessary. Data for Gram-negative bacteria were also sparse, but a larger fraction of these data represented non-penetrators. Correspondingly, the resultant models performed well in predicting those molecules that would fail to enter Gram-negative cells, but were relatively weaker in correctly predicting penetrators. A comparison of physicochemical properties of penetrators and non-penetrators suggests only marginal differences exist. Therefore, it may be difficult to identify overarching rules for generation of screening libraries for antibiotic discovery based on physicochemical properties alone. Instead, models such as ours should be of use. Our models are highly preliminary and based on phenotypic data, but a similar large dataset directly addressing accumulation of chemical matter in bacterial cells is currently unavailable. Hence, our models represent the cutting edge in design of screening libraries for antibiotic discovery until appropriate data can be compiled.
Affiliation(s)
- Aurijit Sarkar
  - Department of Basic Pharmaceutical Sciences, Fred Wilson School of Pharmacy, High Point University, One University Pkwy, High Point NC 27268 USA
|
26
|
Diez-Alarcia R, Yáñez-Pérez V, Muneta-Arrate I, Arrasate S, Lete E, Meana JJ, González-Díaz H. Big Data Challenges Targeting Proteins in GPCR Signaling Pathways; Combining PTML-ChEMBL Models and [35S]GTPγS Binding Assays. ACS Chem Neurosci 2019; 10:4476-4491. [PMID: 31618004 DOI: 10.1021/acschemneuro.9b00302] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Indexed: 12/18/2022] Open
Abstract
G-protein-coupled receptors (GPCRs), also known as 7-transmembrane receptors, are the single largest class of drug targets. Consequently, a large number of preclinical assays having GPCRs as molecular targets have been released to public sources like the Chemical European Molecular Biology Laboratory (ChEMBL) database. These data are also very complex, covering changes in drug chemical structure and assay conditions like c0 = activity parameter (Ki, IC50, etc.), c1 = target protein, c2 = cell line, c3 = assay organism, etc., which makes these databases difficult to analyze and places them at the border of a Big Data challenge. One of the aims of this work is to develop a computational model able to predict new GPCR-targeting drugs taking into consideration multiple conditions of assay. Another objective is to perform new predictive and experimental studies of selective 5-HT2A receptor agonists, antagonists, or inverse agonists in humans, comparing the results with those from the literature. In this work, we combined Perturbation Theory (PT) and Machine Learning (ML) to seek a general PTML model for this data set. We analyzed 343 738 unique compounds with 812 072 end points (assay outcomes), with 185 different experimental parameters, 592 protein targets, 51 cell lines, and/or 55 organisms (species). The best PTML linear model found has only three input variables and correctly predicted 56 202/58 653 positive outcomes (sensitivity = 95.8%) and 470 230/550 401 control cases (specificity = 85.4%) in the training series. The model also correctly predicted 18 732/19 549 (95.8%) of positive outcomes and 156 739/183 469 (85.4%) of control cases in the external validation series. To illustrate its practical use, we used the model to predict the outcomes of six different 5-HT2A receptor drugs, namely, TCB-2, DOI, DOB, altanserin, pimavanserin, and nelotanserin, in a very large number of different pharmacological assays. 5-HT2A receptors are altered in schizophrenia and represent a drug target for antipsychotic therapeutic activity. The model correctly predicted 93.83% (76 of 81) of the experimental results for these compounds reported in ChEMBL. Moreover, [35S]GTPγS binding assays were performed experimentally with the same six drugs with the aim of determining their potency and efficacy in the modulation of G-proteins in human brain tissue. The antagonist ketanserin was included as an inactive drug with demonstrated affinity for 5-HT2A/C receptors. Our results demonstrate that some of these drugs, previously described as serotonin 5-HT2A receptor agonists, antagonists, or inverse agonists, are not so specific and show different intrinsic activity to that previously reported. Overall, this work opens a new gate for the prediction of GPCR-targeting compounds.
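The sensitivity and specificity figures quoted in this abstract can be rechecked directly from the reported counts. The short sketch below does only that arithmetic; the helper name `rate` is hypothetical and nothing here reproduces the PTML model itself.

```python
def rate(hits: int, total: int) -> float:
    """Return hits/total as a percentage rounded to one decimal place."""
    return round(100.0 * hits / total, 1)

# Counts as reported in the abstract above.
# Sensitivity = correctly predicted positives / all positives.
train_sensitivity = rate(56_202, 58_653)
valid_sensitivity = rate(18_732, 19_549)
# Specificity = correctly predicted controls / all controls.
train_specificity = rate(470_230, 550_401)
valid_specificity = rate(156_739, 183_469)

print(train_sensitivity, train_specificity)  # 95.8 85.4
print(valid_sensitivity, valid_specificity)  # 95.8 85.4
```

That training and external validation agree to one decimal place is what the abstract presents as evidence that the three-variable model does not overfit.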
Affiliation(s)
- Rebeca Diez-Alarcia
  - Centro de Investigación Biomédica en Red en Salud Mental, 48940 Leioa, Spain
- J. Javier Meana
  - Centro de Investigación Biomédica en Red en Salud Mental, 48940 Leioa, Spain
- Humbert González-Díaz
  - Biophysics Institute, CSIC-UPV/EHU, University of the Basque Country UPV/EHU, Leioa, 48940, Spain
  - IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Spain
|
27
|
Vásquez-Domínguez E, Armijos-Jaramillo VD, Tejera E, González-Díaz H. Multioutput Perturbation-Theory Machine Learning (PTML) Model of ChEMBL Data for Antiretroviral Compounds. Mol Pharm 2019; 16:4200-4212. [PMID: 31426639 DOI: 10.1021/acs.molpharmaceut.9b00538] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Indexed: 12/22/2022]
Abstract
Retroviral infections, such as HIV, remain diseases with no cure. Defining the target proteins of new antiretroviral compounds is a major goal for medicine and pharmaceutical chemistry. ChEMBL holds a complex data set with Big Data features that is hard to organize. The large number of characteristics described makes the information difficult to analyze when predicting new drug candidates for retroviral infections. For this reason, we propose to develop a new predictive model combining perturbation theory (PT) bases and machine learning (ML) modeling to create a new tool that can take advantage of all the available information. The PTML model proposed in this work for the ChEMBL data set of preclinical experimental assays for antiretroviral compounds consists of a linear equation with four variables. The PT operators used are founded on multicondition moving averages, combining different features and simplifying the management of all the data. More than 140 000 preclinical assays for 56 105 compounds with different characteristics or experimental conditions have been carried out and can be found in the ChEMBL database, covering combinations with 359 biological activity parameters (c0), 55 protein accessions (c1), 83 cell lines (c2), 64 organisms of assay (c3), and 773 subtypes or strains. We have included 150 148 preclinical experimental assays for HIV virus, 1188 for HTLV virus, 84 for simian immunodeficiency virus, 370 for murine leukemia virus, 119 for Rous sarcoma virus, 1581 for MMTV, etc. We also included 5277 assays for hepatitis B virus. The developed PTML model reached considerable values in sensitivity (73.05% for training and 73.10% for validation), specificity (86.61% for training and 87.17% for validation), and accuracy (75.84% for training and 75.98% for validation). We also compared alternative PTML models with different PT operators such as covariance, moments, and exponential terms. Finally, we compared ML models from the literature with our PTML model and with artificial neural network (ANN) nonlinear models. We conclude that this PTML model is the first one to consider multiple characteristics of preclinical experimental antiretroviral assays combined, generating a simple, useful, and adaptable instrument that could reduce time and costs in antiretroviral drug research.
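A minimal sketch of what a moving-average PT operator can look like follows. It assumes the form commonly described for PTML models, in which a compound descriptor is replaced by its deviation from the mean of that descriptor over all assays sharing one condition; the abstract does not give the exact operator, so the function name, the input layout, and the single-condition simplification are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean

def moving_average_deviation(records):
    """Deviation of each descriptor value from its per-condition mean.

    records: list of (descriptor_value, condition_label) pairs, where the
    label stands for one assay condition (e.g. the activity parameter c0).
    Returns D_i - <D>_c for every record, in input order.
    """
    by_condition = defaultdict(list)
    for value, condition in records:
        by_condition[condition].append(value)
    # Average descriptor value over all assays run under each condition.
    condition_mean = {c: mean(v) for c, v in by_condition.items()}
    return [value - condition_mean[condition] for value, condition in records]

# Hypothetical descriptor values measured under two activity parameters.
assays = [(2.0, "IC50"), (4.0, "IC50"), (1.0, "Ki"), (5.0, "Ki")]
deviations = moving_average_deviation(assays)
print(deviations)  # [-1.0, 1.0, -2.0, 2.0]
```

A multicondition operator would repeat this for each condition (c0, c1, c2, ...) and feed the resulting deviations into the linear equation as input variables.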
Affiliation(s)
- Emilia Vásquez-Domínguez
  - Department of Organic Chemistry II, University of Basque Country UPV/EHU, 48940 Leioa, Spain
  - Faculty of Engineering and Applied Sciences-Biotechnology, Universidad de Las Américas (UDLA), 170125 Quito, Ecuador
- Vinicio Danilo Armijos-Jaramillo
  - Faculty of Engineering and Applied Sciences-Biotechnology, Universidad de Las Américas (UDLA), 170125 Quito, Ecuador
  - Bio-chemioinformatics group, Universidad de Las Américas (UDLA), 170125 Quito, Ecuador
- Eduardo Tejera
  - Faculty of Engineering and Applied Sciences-Biotechnology, Universidad de Las Américas (UDLA), 170125 Quito, Ecuador
  - Bio-chemioinformatics group, Universidad de Las Américas (UDLA), 170125 Quito, Ecuador
- Humbert González-Díaz
  - Department of Organic Chemistry II, University of Basque Country UPV/EHU, 48940 Leioa, Spain
  - IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Spain
|
28
|
Liang L, Ma C, Du T, Zhao Y, Zhao X, Liu M, Wang Z, Lin J. Bioactivity-explorer: a web application for interactive visualization and exploration of bioactivity data. J Cheminform 2019; 11:47. [PMID: 31292807 PMCID: PMC6617623 DOI: 10.1186/s13321-019-0370-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 03/25/2019] [Accepted: 07/02/2019] [Indexed: 12/29/2022] Open
Abstract
To better leverage the accumulated bioactivity data in the ChEMBL database, we have developed Bioactivity-explorer, a web application for interactive visualization and exploration of the large-scale bioactivity data in ChEMBL. Mining and integration of the Therapeutic Target Database disease-target mapping into the ChEMBL database has enabled Bioactivity-explorer to include 493,430 scaffolds, 31,400,000 matched molecular pairs, 1,330,220 target-target interactions in terms of shared active compounds, 4,526,718 target-target interactions in terms of shared active scaffolds, 97,041,700 molecule-molecule interactions and 14,974 disease-target mappings. This web tool is available at http://cadd.pharmacy.nankai.edu.cn/b17r. The source code of the front end and back end, released under the MIT license, can be found on GitHub.
Affiliation(s)
- Lu Liang
  - State Key Laboratory of Medicinal Chemical Biology, College of Pharmacy and Tianjin Key Laboratory of Molecular Drug Research, Nankai University, Haihe Education Park, 38 Tongyan Road, Tianjin, 300353, China
- Chunfeng Ma
  - State Key Laboratory of Medicinal Chemical Biology, College of Pharmacy and Tianjin Key Laboratory of Molecular Drug Research, Nankai University, Haihe Education Park, 38 Tongyan Road, Tianjin, 300353, China
  - Platform of Pharmaceutical Intelligence, Tianjin International Joint Academy of Biomedicine, Tianjin, 300000, China
- Tengfei Du
  - State Key Laboratory of Medicinal Chemical Biology, College of Pharmacy and Tianjin Key Laboratory of Molecular Drug Research, Nankai University, Haihe Education Park, 38 Tongyan Road, Tianjin, 300353, China
- Yufei Zhao
  - State Key Laboratory of Medicinal Chemical Biology, College of Pharmacy and Tianjin Key Laboratory of Molecular Drug Research, Nankai University, Haihe Education Park, 38 Tongyan Road, Tianjin, 300353, China
  - Platform of Pharmaceutical Intelligence, Tianjin International Joint Academy of Biomedicine, Tianjin, 300000, China
- Xiaoyong Zhao
  - State Key Laboratory of Medicinal Chemical Biology, College of Pharmacy and Tianjin Key Laboratory of Molecular Drug Research, Nankai University, Haihe Education Park, 38 Tongyan Road, Tianjin, 300353, China
- Mengmeng Liu
  - State Key Laboratory of Medicinal Chemical Biology, College of Pharmacy and Tianjin Key Laboratory of Molecular Drug Research, Nankai University, Haihe Education Park, 38 Tongyan Road, Tianjin, 300353, China
  - Platform of Pharmaceutical Intelligence, Tianjin International Joint Academy of Biomedicine, Tianjin, 300000, China
- Zhonghua Wang
  - Tianjin Institute of Industrial Biotechnology, Biodesign Center, Chinese Academy of Sciences, Tianjin, China
- Jianping Lin
  - State Key Laboratory of Medicinal Chemical Biology, College of Pharmacy and Tianjin Key Laboratory of Molecular Drug Research, Nankai University, Haihe Education Park, 38 Tongyan Road, Tianjin, 300353, China
  - Tianjin Institute of Industrial Biotechnology, Biodesign Center, Chinese Academy of Sciences, Tianjin, China
  - Platform of Pharmaceutical Intelligence, Tianjin International Joint Academy of Biomedicine, Tianjin, 300000, China
|
29
|
Cortés-Ciriano I, Bender A. KekuleScope: prediction of cancer cell line sensitivity and compound potency using convolutional neural networks trained on compound images. J Cheminform 2019; 11:41. [PMID: 31218493 PMCID: PMC6582521 DOI: 10.1186/s13321-019-0364-5] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 02/19/2019] [Accepted: 06/09/2019] [Indexed: 02/08/2023] Open
Abstract
The application of convolutional neural networks (ConvNets) to harness high-content screening images or 2D compound representations is gaining increasing attention in drug discovery. However, existing applications often require large data sets for training, or sophisticated pretraining schemes. Here, we show using 33 IC50 data sets from ChEMBL 23 that the in vitro activity of compounds on cancer cell lines and protein targets can be accurately predicted on a continuous scale from their Kekulé structure representations alone by extending existing architectures (AlexNet, DenseNet-201, ResNet152 and VGG-19), which were pretrained on unrelated image data sets. We show that the predictive power of the generated models, which require only standard 2D compound representations as input, is comparable to that of Random Forest (RF) models and fully-connected Deep Neural Networks trained on circular (Morgan) fingerprints. Notably, including additional fully-connected layers further increases the predictive power of the ConvNets by up to 10%. Analysis of the predictions generated by RF models and ConvNets shows that simply averaging the output of the RF models and ConvNets yields significantly lower prediction errors for multiple data sets than either model alone, although the effect size is small. This indicates that the features extracted by the convolutional layers of the ConvNets provide predictive signal complementary to Morgan fingerprints. Lastly, we show that multi-task ConvNets trained on compound images make it possible to model COX isoform selectivity on a continuous scale with prediction errors comparable to the uncertainty of the data. Overall, in this work we present a set of ConvNet architectures for the prediction of compound activity from Kekulé structure representations with state-of-the-art performance, which require no generation of compound descriptors or use of sophisticated image processing techniques. The code needed to reproduce the results presented in this study and all the data sets are provided at https://github.com/isidroc/kekulescope.
Affiliation(s)
- Isidro Cortés-Ciriano
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW UK
- Andreas Bender
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW UK
|
30
|
Hu Y, Bajorath J. SAR Matrix Method for Large-Scale Analysis of Compound Structure-Activity Relationships and Exploration of Multitarget Activity Spaces. Methods Mol Biol 2019; 1825:339-352. [PMID: 30334212 DOI: 10.1007/978-1-4939-8639-2_11] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 04/16/2023]
Abstract
As the number of compounds and the volume of bioactivity data rapidly grow, advanced computational methods are required to study structure-activity relationships (SARs) on a large scale. Herein, the SAR matrix (SARM) methodology is described that was designed to systematically extract structural relationships between bioactive compounds from large databases, explore structure-activity relationships, and navigate multitarget activity spaces, which is one of the core tasks in chemogenomics. In addition, the SARM approach was designed to visualize structural and structure-activity relationships, which is often of critical importance for making this information available in an intuitive form for practical applications.
Affiliation(s)
- Ye Hu
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany
- Jürgen Bajorath
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany
|
31
|
Lin A, Horvath D, Marcou G, Beck B, Varnek A. Multi-task generative topographic mapping in virtual screening. J Comput Aided Mol Des 2019; 33:331-343. [PMID: 30739238 DOI: 10.1007/s10822-019-00188-x] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 09/15/2018] [Accepted: 02/02/2019] [Indexed: 12/16/2022]
Abstract
The previously reported procedure to generate "universal" Generative Topographic Maps (GTMs) of the drug-like chemical space is in practice a multi-task learning process, in which both operational GTM parameters (example: map grid size) and hyperparameters (key example: the molecular descriptor space to be used) are chosen by an evolutionary process in order to fit and select "universal" GTM manifolds. After selection (a one-time task aimed at optimizing the compromise in terms of neighborhood behavior compliance over a large pool of various biological targets), the manifolds are ready to provide "fit-free" predictive models for any further use. Using any structure-activity set, irrespective of whether the associated target served at the map-fitting stage or not, generating or "coloring" a property landscape enables predicting the property for any external molecule, with zero additional fittable parameters involved. While previous works have signaled the excellent behavior of such models in aggressive three-fold cross-validation assessments of their predictive power, the present work explores their behavior in Virtual Screening (VS), here simulated by means of external DUD ligand and decoy series that are fully disjoint from the ChEMBL-extracted landscape coloring sets. Beyond the rather robust results of the universal GTM manifolds in this challenge, it could be shown that the descriptor spaces selected by the evolutionary multi-task learner were intrinsically able to serve as an excellent support for many other VS procedures, ranging from parameter-free similarity searching, to local (target-specific) GTM models, to parameter-rich, nonlinear Random Forest and Neural Network approaches.
Affiliation(s)
- Arkadii Lin
  - Laboratory of Chemoinformatics, Faculty of Chemistry, University of Strasbourg, 4, Blaise Pascal Str., 67081, Strasbourg, France
  - Department of Medicinal Chemistry, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorferstrasse 65, 88397, Biberach an der Riss, Germany
- Dragos Horvath
  - Laboratory of Chemoinformatics, Faculty of Chemistry, University of Strasbourg, 4, Blaise Pascal Str., 67081, Strasbourg, France
- Gilles Marcou
  - Laboratory of Chemoinformatics, Faculty of Chemistry, University of Strasbourg, 4, Blaise Pascal Str., 67081, Strasbourg, France
- Bernd Beck
  - Department of Medicinal Chemistry, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorferstrasse 65, 88397, Biberach an der Riss, Germany
- Alexandre Varnek
  - Laboratory of Chemoinformatics, Faculty of Chemistry, University of Strasbourg, 4, Blaise Pascal Str., 67081, Strasbourg, France
|
32
|
Bosc N, Atkinson F, Felix E, Gaulton A, Hersey A, Leach AR. Large scale comparison of QSAR and conformal prediction methods and their applications in drug discovery. J Cheminform 2019; 11:4. [PMID: 30631996 PMCID: PMC6690068 DOI: 10.1186/s13321-018-0325-4] [Citation(s) in RCA: 70] [Impact Index Per Article: 14.0] [Received: 09/14/2018] [Accepted: 12/24/2018] [Indexed: 12/22/2022] Open
Abstract
Structure–activity relationship modelling is frequently used in the early stage of drug discovery to assess the activity of a compound on one or several targets, and can also be used to assess the interaction of compounds with liability targets. QSAR models have been used for these and related applications over many years, with good success. Conformal prediction is a relatively new QSAR approach that provides information on the certainty of a prediction, and so helps in decision-making. However, it is not always clear how best to make use of this additional information. In this article, we describe a case study that directly compares conformal prediction with traditional QSAR methods for large-scale predictions of target-ligand binding. The ChEMBL database was used to extract a data set comprising data from 550 human protein targets with different bioactivity profiles. For each target, a QSAR model and a conformal predictor were trained and their results compared. The models were then evaluated on new data published since the original models were built to simulate a “real world” application. The comparative study highlights the similarities between the two techniques but also some differences that it is important to bear in mind when the methods are used in practical drug discovery applications.
Affiliation(s)
- Nicolas Bosc
- Chemogenomics Team, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
| | - Francis Atkinson
- Chemogenomics Team, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Eloy Felix
- Chemogenomics Team, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Anna Gaulton
- Chemogenomics Team, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Anne Hersey
- Chemogenomics Team, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Andrew R Leach
- Chemogenomics Team, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| |
|
33
|
Ferreira da Costa J, Silva D, Caamaño O, Brea JM, Loza MI, Munteanu CR, Pazos A, García-Mera X, González-Díaz H. Perturbation Theory/Machine Learning Model of ChEMBL Data for Dopamine Targets: Docking, Synthesis, and Assay of New l-Prolyl-l-leucyl-glycinamide Peptidomimetics. ACS Chem Neurosci 2018; 9:2572-2587. [PMID: 29791132 DOI: 10.1021/acschemneuro.8b00083] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Indexed: 12/14/2022] Open
Abstract
Predicting drug-protein interactions (DPIs) for target proteins involved in dopamine pathways is a very important goal in medicinal chemistry. We can tackle this problem using Molecular Docking or Machine Learning (ML) models for one specific protein. Unfortunately, these models fail to account for large and complex big data sets of preclinical assays reported in public databases. This includes multiple conditions of assays, such as different experimental parameters, biological assays, target proteins, cell lines, organism of the target, or organism of assay. On the other hand, perturbation theory (PT) models allow us to predict the properties of a query compound or molecular system in experimental assays with multiple boundary conditions based on a previously known case of reference. In this work, we report the first PTML (PT + ML) study of a large ChEMBL data set of preclinical assays of compounds targeting dopamine pathway proteins. The best PTML model found predicts 50000 cases with accuracy of 70-91% in training and external validation series. We also compared the linear PTML model with alternative PTML models trained with multiple nonlinear methods (artificial neural network (ANN), Random Forest, Deep Learning, etc.). Some of the nonlinear methods outperform the linear model but at the cost of a notable increment of the complexity of the model. We illustrated the practical use of the new model with a proof-of-concept theoretical-experimental study. We reported for the first time the organic synthesis, chemical characterization, and pharmacological assay of a new series of l-prolyl-l-leucyl-glycinamide (PLG) peptidomimetic compounds. In addition, we performed a molecular docking study for some of these compounds with the software Vina AutoDock. The work ends with a PTML model predictive study of the outcomes of the new compounds in a large number of assays. 
Therefore, this study offers a new computational methodology for predicting the outcome for any compound in new assays. This PTML method focuses on the prediction with a simple linear model of multiple pharmacological parameters (IC50, EC50, Ki, etc.) for compounds in assays involving different cell lines used, organisms of the protein target, or organism of assay for proteins in the dopamine pathway.
Affiliation(s)
- Joana Ferreira da Costa
- Department of Organic Chemistry, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
| | - David Silva
- Department of Organic Chemistry, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
| | - Olga Caamaño
- Department of Organic Chemistry, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
| | - José M. Brea
- CIMUS, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
- Department of Pharmacology, Pharmacy and Pharmaceutical Technology, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
| | - Maria Isabel Loza
- CIMUS, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
- Department of Pharmacology, Pharmacy and Pharmaceutical Technology, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
| | - Cristian R. Munteanu
- Instituto de Investigacion Biomedica de A Coruña (INIBIC), Complexo Hospitalario Universitario de A Coruña (CHUAC), A Coruña, 15006, Spain
| | - Alejandro Pazos
- Instituto de Investigacion Biomedica de A Coruña (INIBIC), Complexo Hospitalario Universitario de A Coruña (CHUAC), A Coruña, 15006, Spain
- Computer Science Department, Faculty of Computer Science, University of A Coruna, 15071 A Coruña, Spain
| | - Xerardo García-Mera
- Department of Organic Chemistry, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
| | - Humbert González-Díaz
- Department of Organic Chemistry II, University of Basque Country UPV/EHU, 48940 Leioa, Spain
- IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Spain
| |
|
34
|
Abstract
Determining the target proteins of new anticancer compounds is a very important task in Medicinal Chemistry. In this sense, chemists carry out preclinical assays with a high number of combinations of experimental conditions (c_j). In fact, the ChEMBL database contains outcomes of 65 534 different anticancer activity preclinical assays for 35 565 different chemical compounds (1.84 assays per compound). These assays cover different combinations of c_j formed from >70 different biological activity parameters (c_0), >300 different drug targets (c_1), >230 cell lines (c_2), and 5 organisms of assay (c_3) or organisms of the target (c_4). It includes a total of 45 833 assays in leukemia, 6227 assays in breast cancer, 2499 assays in ovarian cancer, 3499 in colon cancer, 3159 in lung cancer, 2750 in prostate cancer, 601 in melanoma, etc. This is a very complex data set with multiple Big Data features. These data are hard for researchers to rationalize in order to extract useful relationships and predict new compounds. In this context, we propose to combine perturbation theory (PT) ideas and machine learning (ML) modeling to solve this combinatorial-like problem. In this work, we report a PTML (PT + ML) model for a ChEMBL data set of preclinical assays of anticancer compounds. This is a simple linear model with only three variables. The model presented values of area under receiver operating curve = AUROC = 0.872, specificity = Sp(%) = 90.2, sensitivity = Sn(%) = 70.6, and overall accuracy = Ac(%) = 87.7 in the training series. The model also has Sp(%) = 90.1, Sn(%) = 71.4, and Ac(%) = 87.8 in the external validation series. The model uses PT operators based on multicondition moving averages to capture all the complexity of the data set. We also compared the model with nonlinear artificial neural network (ANN) models, obtaining similar results.
This confirms the hypothesis of a linear relationship between the PT operators and the classification as anticancer compounds in different combinations of assay conditions. Last, we compared the model with other PTML models reported in the literature, concluding that this is the only PTML model able to predict activity against multiple types of cancer. This model is a simple but versatile tool for the prediction of the targets of anticancer compounds, taking into consideration multiple combinations of experimental conditions in preclinical assays.
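For readers unfamiliar with the Sn, Sp, and Ac statistics quoted above, the sketch below shows how they are conventionally derived from a confusion matrix, and re-checks the "1.84 assays per compound" figure from the two totals given in the abstract. The confusion-matrix counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sensitivity, specificity and overall accuracy, each as a percentage."""
    return {
        "Sn": 100.0 * tp / (tp + fn),                    # true-positive rate
        "Sp": 100.0 * tn / (tn + fp),                    # true-negative rate
        "Ac": 100.0 * (tp + tn) / (tp + fn + tn + fp),   # all correct calls
    }


# Hypothetical counts: 100 active and 900 inactive cases.
metrics = classification_metrics(tp=70, fn=30, tn=810, fp=90)
print(metrics)

# Totals quoted in the abstract: 65 534 assays for 35 565 compounds.
assays_per_compound = round(65534 / 35565, 2)
print(assays_per_compound)
```

With far more inactive than active cases, overall accuracy sits close to specificity, which is the same pattern as the reported values (Ac near Sp, both well above Sn).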
Affiliation(s)
- Harbil Bediaga
- Department of Organic Chemistry II, University of Basque Country UPV/EHU, 48940, Leioa, Spain
| | - Sonia Arrasate
- Department of Organic Chemistry II, University of Basque Country UPV/EHU, 48940, Leioa, Spain
| | - Humbert González-Díaz
- Department of Organic Chemistry II, University of Basque Country UPV/EHU, 48940, Leioa, Spain
- IKERBASQUE, Basque Foundation for Science, 48011, Bilbao, Spain
| |
|
35
|
Lagunin AA, Romanova MA, Zadorozhny AD, Kurilenko NS, Shilov BV, Pogodin PV, Ivanov SM, Filimonov DA, Poroikov VV. Comparison of Quantitative and Qualitative (Q)SAR Models Created for the Prediction of Ki and IC50 Values of Antitarget Inhibitors. Front Pharmacol 2018; 9:1136. [PMID: 30364128 PMCID: PMC6192375 DOI: 10.3389/fphar.2018.01136] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 05/02/2018] [Accepted: 09/18/2018] [Indexed: 12/20/2022] Open
Abstract
Estimation of interaction of drug-like compounds with antitargets is important for the assessment of possible toxic effects during drug development. Publicly available online databases provide data on the experimental results of chemical interactions with antitargets, which can be used for the creation of (Q)SAR models. The structures and experimental Ki and IC50 values for compounds tested on the inhibition of 30 antitargets from the ChEMBL 20 database were used. Data sets with Ki and IC50 values including more than 100 compounds were created for each antitarget. The (Q)SAR models were created by GUSAR software using quantitative neighborhoods of atoms (QNA), multilevel neighborhoods of atoms (MNA) descriptors, and self-consistent regression. The accuracy of (Q)SAR models was validated by the fivefold cross-validation procedure. The balanced accuracy was higher for qualitative SAR models (0.80 and 0.81 for Ki and IC50 values, respectively) than for quantitative QSAR models (0.73 and 0.76 for Ki and IC50 values, respectively). In most cases, sensitivity was higher for SAR models than for QSAR models, but specificity was higher for QSAR models. The mean R² and RMSE were 0.64 and 0.77 for Ki values and 0.59 and 0.73 for IC50 values, respectively. The number of compounds falling within the applicability domain was higher for SAR models than for the test sets.
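The three statistics reported above can be sketched as follows. This is an illustrative calculation only: the abstract does not state which definition of R² was used, so the coefficient-of-determination form below is an assumption, and the example values are hypothetical.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-square error between measured and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - residual / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of sensitivity and specificity; robust to class imbalance."""
    return (sensitivity + specificity) / 2.0


# Hypothetical measured and predicted activities on a log scale (e.g. pKi).
measured = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.5, 7.5, 7.5]

print(rmse(measured, predicted))
print(r_squared(measured, predicted))
print(balanced_accuracy(0.9, 0.7))
```

Balanced accuracy is the natural way to compare a regression-derived classifier with a categorical one, because both can be reduced to a sensitivity and a specificity once an activity threshold is fixed.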
Affiliation(s)
- Alexey A. Lagunin
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
- Department of Bioinformatics, Pirogov Russian National Research Medical University, Moscow, Russia
| | - Maria A. Romanova
- Department of Bioinformatics, Pirogov Russian National Research Medical University, Moscow, Russia
| | - Anton D. Zadorozhny
- Department of Bioinformatics, Pirogov Russian National Research Medical University, Moscow, Russia
| | - Natalia S. Kurilenko
- Department of Bioinformatics, Pirogov Russian National Research Medical University, Moscow, Russia
| | - Boris V. Shilov
- Department of Bioinformatics, Pirogov Russian National Research Medical University, Moscow, Russia
| | - Pavel V. Pogodin
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
| | - Sergey M. Ivanov
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
- Department of Bioinformatics, Pirogov Russian National Research Medical University, Moscow, Russia
| | - Dmitry A. Filimonov
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
| | | |
|
36
|
Fukunishi Y, Yamashita Y, Mashimo T, Nakamura H. Prediction of Protein-compound Binding Energies from Known Activity Data: Docking-score-based Method and its Applications. Mol Inform 2018; 37:e1700120. [PMID: 29442436 PMCID: PMC6055825 DOI: 10.1002/minf.201700120] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 09/27/2017] [Accepted: 01/22/2018] [Indexed: 12/18/2022]
Abstract
We used protein-compound docking simulations to develop a structure-based quantitative structure-activity relationship (QSAR) model. The prediction model used docking scores as descriptors. The binding free energy was approximated by a weighted average of docking scores for multiple proteins. This approximation was based on a pharmacophore model of receptor pockets and compounds. The weights of the docking scores were restricted to small values to avoid unrealistic weights by a regularization term. Additional outlier elimination improved the results. We applied this method to two groups of targets. The first target was the kinase family. The cross-validation results of 107 kinase proteins showed that the RMSE of predicted binding free energies was 1.1 kcal/mol. The second target was the matrix metalloproteinase (MMP) family, which has been difficult for docking programs. MMPs require metal-binding groups in their inhibitor structures in many cases. A quantum effect contributes to the metal-ligand interaction. Despite this difficulty, the present method worked well for the MMPs. This method showed that the RMSE of predicted binding free energies was 1.1 kcal/mol. In comparison, with the original docking method the RMSE was 1.7 kcal/mol. The results suggest that the present QSAR model should be applied to general target proteins.
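The combination of docking scores with a regularization term that keeps the weights small can be illustrated with a toy ridge-style fit. This is a sketch under stated assumptions and not the authors' method: the abstract does not specify the form of the regularizer, so an L2 penalty is assumed; only two reference proteins are used so that the closed-form solution fits in a few lines; all scores and energies are invented.

```python
import math
from typing import List, Tuple


def ridge_two_weights(
    scores: List[Tuple[float, float]], energies: List[float], l2: float
) -> Tuple[float, float]:
    """Closed-form solution of (X^T X + l2 * I) w = X^T y for two descriptors."""
    a = sum(x1 * x1 for x1, _ in scores) + l2
    b = sum(x1 * x2 for x1, x2 in scores)
    d = sum(x2 * x2 for _, x2 in scores) + l2
    r1 = sum(x1 * y for (x1, _), y in zip(scores, energies))
    r2 = sum(x2 * y for (_, x2), y in zip(scores, energies))
    det = a * d - b * b
    return ((d * r1 - b * r2) / det, (a * r2 - b * r1) / det)


def predict(weights: Tuple[float, float], score_pair: Tuple[float, float]) -> float:
    """Predicted binding free energy as a weighted sum of docking scores."""
    return weights[0] * score_pair[0] + weights[1] * score_pair[1]


# Hypothetical docking scores of four compounds against two proteins (kcal/mol)
# and hypothetical experimental binding free energies for the target of interest.
docking_scores = [(-8.0, -6.0), (-7.0, -9.0), (-5.0, -5.5), (-9.5, -7.0)]
binding_energies = [-7.2, -8.1, -5.3, -8.4]

unregularized = ridge_two_weights(docking_scores, binding_energies, l2=0.0)
regularized = ridge_two_weights(docking_scores, binding_energies, l2=10.0)

print(unregularized, math.hypot(*unregularized))
print(regularized, math.hypot(*regularized))
print(predict(regularized, (-8.5, -7.5)))
```

Raising `l2` shrinks the weight vector toward zero, which is the behaviour the abstract describes as avoiding unrealistic weights; the price is a slightly worse fit on the training compounds.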
Affiliation(s)
- Yoshifumi Fukunishi
- Molecular Profiling Research Center for Drug Discovery (molprof), National Institute of Advanced Industrial Science and Technology (AIST), 2-3-26 Aomi, Koto-ku, Tokyo 135-0064, Japan
| | - Yasunobu Yamashita
- Technology Research Association for Next-Generation Natural Products Chemistry, 2-3-26 Aomi, Koto-ku, Tokyo 135-0064, Japan
| | - Tadaaki Mashimo
- Technology Research Association for Next-Generation Natural Products Chemistry, 2-3-26 Aomi, Koto-ku, Tokyo 135-0064, Japan
- IMSBIO Co., Ltd., Owl Tower, 4-21-1 Higashi-Ikebukuro, Toshima-ku, Tokyo 170-0013, Japan
| | - Haruki Nakamura
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka 565-0871, Japan
| |
|
37
|
Pogodin PV, Lagunin AA, Rudik AV, Filimonov DA, Druzhilovskiy DS, Nicklaus MC, Poroikov VV. How to Achieve Better Results Using PASS-Based Virtual Screening: Case Study for Kinase Inhibitors. Front Chem 2018; 6:133. [PMID: 29755970 PMCID: PMC5935003 DOI: 10.3389/fchem.2018.00133] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 01/29/2018] [Accepted: 04/09/2018] [Indexed: 12/16/2022] Open
Abstract
Discovery of new pharmaceutical substances is currently boosted by the possibility of utilization of the Synthetically Accessible Virtual Inventory (SAVI) library, which includes about 283 million molecules, each annotated with a proposed synthetic one-step route from commercially available starting materials. The SAVI database is well-suited for ligand-based methods of virtual screening to select molecules for experimental testing. In this study, we compare the performance of three approaches for the analysis of structure-activity relationships that differ in their criteria for selecting of "active" and "inactive" compounds included in the training sets. PASS (Prediction of Activity Spectra for Substances), which is based on a modified Naïve Bayes algorithm, was applied since it had been shown to be robust and to provide good predictions of many biological activities based on just the structural formula of a compound even if the information in the training set is incomplete. We used different subsets of kinase inhibitors for this case study because many data are currently available on this important class of drug-like molecules. Based on the subsets of kinase inhibitors extracted from the ChEMBL 20 database we performed the PASS training, and then applied the model to ChEMBL 23 compounds not yet present in ChEMBL 20 to identify novel kinase inhibitors. As one may expect, the best prediction accuracy was obtained if only the experimentally confirmed active and inactive compounds for distinct kinases in the training procedure were used. However, for some kinases, reasonable results were obtained even if we used merged training sets, in which we designated as inactives the compounds not tested against the particular kinase. Thus, depending on the availability of data for a particular biological activity, one may choose the first or the second approach for creating ligand-based computational tools to achieve the best possible results in virtual screening.
Affiliation(s)
- Pavel V. Pogodin
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
| | - Alexey A. Lagunin
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
- Department of Bioinformatics, Medical-Biological Department, Pirogov Russian National Research Medical University, Moscow, Russia
| | - Anastasia V. Rudik
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
| | - Dmitry A. Filimonov
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
| | | | - Mark C. Nicklaus
- Computer-Aided Drug Design Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, NIH, NCI-Frederick, Frederick, MD, United States
| | | |
|
38
|
Fedoros EI, Orlov AA, Zherebker A, Gubareva EA, Maydin MA, Konstantinov AI, Krasnov KA, Karapetian RN, Izotova EI, Pigarev SE, Panchenko AV, Tyndyk ML, Osolodkin DI, Nikolaev EN, Perminova IV, Anisimov VN. Novel water-soluble lignin derivative BP-Cx-1: identification of components and screening of potential targets in silico and in vitro. Oncotarget 2018; 9:18578-18593. [PMID: 29719628 PMCID: PMC5915095 DOI: 10.18632/oncotarget.24990] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 12/15/2017] [Accepted: 12/16/2017] [Indexed: 11/25/2022] Open
Abstract
Identification of molecular targets and mechanism of action is always a challenge, in particular for natural compounds, owing to their inherent chemical complexity. BP-Cx-1 is a water-soluble modification of hydrolyzed lignin used as the platform for a portfolio of innovative pharmacological products aimed at therapy and supportive care of oncological patients. The present study describes a new approach, which combines in vitro screening of potential molecular targets for BP-Cx-1 using the Diversity Profile - P9 panel by Eurofins Cerep (France) with an in silico search for possible active components in ChEMBL, a manually curated chemical database of bioactive molecules with drug-like properties. The results of the diversity assay demonstrate that BP-Cx-1 has multiple biological effects on neurotransmitter receptors, ligand-gated ion channels and transporters. Of particular importance, most of the identified molecular targets are involved in modulation of inflammation and immune response and might be related to tumorigenesis. Characterization of the molecular composition of BP-Cx-1 with Fourier Transform Ion Cyclotron Resonance Mass Spectrometry and subsequent identification of possible active components by searching for molecular matches in silico in ChEMBL indicated polyphenolic components, namely flavonoids, sapogenins, and phenanthrenes, as the major carriers of biological activity of BP-Cx-1. In vitro and in silico target screening yielded overlapping lists of proteins: adenosine receptors, dopamine receptor DRD4, glucocorticoid receptor, serotonin receptor 5-HT1, prostaglandin receptors, muscarinic cholinergic receptor, GABAA receptor. The pleiotropic molecular activities of polyphenolic components are beneficial in the treatment of multifactorial disorders such as diseases associated with chronic inflammation and cancer.
Affiliation(s)
- Elena I Fedoros
- N.N. Petrov National Medical Research Center of Oncology, Saint-Petersburg 197758, Russia; Nobel LTD, Saint-Petersburg 192012, Russia
| | - Alexey A Orlov
- Department of Chemistry, Lomonosov Moscow State University, Moscow 119991, Russia
| | - Alexander Zherebker
- Department of Chemistry, Lomonosov Moscow State University, Moscow 119991, Russia; Skolkovo Institute of Science and Technology, Skolkovo 143025, Russia
| | - Ekaterina A Gubareva
- N.N. Petrov National Medical Research Center of Oncology, Saint-Petersburg 197758, Russia
| | - Mikhail A Maydin
- N.N. Petrov National Medical Research Center of Oncology, Saint-Petersburg 197758, Russia
| | | | - Konstantin A Krasnov
- Institute of Toxicology, Federal Medical-Biological Agency, Saint-Petersburg 192019, Russia
| | | | | | | | - Andrey V Panchenko
- N.N. Petrov National Medical Research Center of Oncology, Saint-Petersburg 197758, Russia
| | - Margarita L Tyndyk
- N.N. Petrov National Medical Research Center of Oncology, Saint-Petersburg 197758, Russia
| | - Dmitry I Osolodkin
- Institute of Poliomyelitis and Viral Encephalitides, Chumakov FSC R&D IBP RAS, Moscow 108819, Russia; Sechenov First Moscow State Medical University, Moscow 119991, Russia
| | - Evgeny N Nikolaev
- Skolkovo Institute of Science and Technology, Skolkovo 143025, Russia; Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, Moscow 119334, Russia; Orekhovich Institute of Biomedical Chemistry, Russian Academy of Medical Sciences, Moscow 119121, Russia
| | - Irina V Perminova
- Department of Chemistry, Lomonosov Moscow State University, Moscow 119991, Russia
| | - Vladimir N Anisimov
- N.N. Petrov National Medical Research Center of Oncology, Saint-Petersburg 197758, Russia
| |
|
39
|
Abstract
Chemogenomics is a comparatively nascent branch dealing with the effects of drugs and chemicals on molecular level systems. With the emergence of this new epoch, the quantity of data sources is also increasing at an unprecedented rate. Despite the plethora of databases, the variation in bioactivity measurement as well as bias toward specific protein studies, varied computational procedures and redundant information make data mining tedious, especially for newcomers in the field. In this chapter, we give an overview of hands-on data collection and domains of applicability from some useful Web-based chemogenomic resources that are accessible with nothing more than a Web browser. This overview can assist users in acquiring chemogenomic datasets for their project at hand.
Affiliation(s)
- Rasel Al Mahmud
- Department of Radiation Genetics, Graduate School of Medicine, Kyoto University, Kyoto, Japan
| | - Rifat Ara Najnin
- Department of Radiation Genetics, Graduate School of Medicine, Kyoto University, Kyoto, Japan.
| | - Ahsan Habib Polash
- Department of Radiation Genetics, Graduate School of Medicine, Kyoto University, Kyoto, Japan
| |
|
40
|
Ong E, Xie J, Ni Z, Liu Q, Sarntivijai S, Lin Y, Cooper D, Terryn R, Stathias V, Chung C, Schürer S, He Y. Ontological representation, integration, and analysis of LINCS cell line cells and their cellular responses. BMC Bioinformatics 2017; 18:556. [PMID: 29322930 PMCID: PMC5763302 DOI: 10.1186/s12859-017-1981-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 12/13/2022] Open
Abstract
Background: Aiming to understand cellular responses to different perturbations, the NIH Common Fund Library of Integrated Network-based Cellular Signatures (LINCS) program involves many institutes and laboratories working on over a thousand cell lines. The community-based Cell Line Ontology (CLO) is selected as the default ontology for LINCS cell line representation and integration. Results: CLO has consistently represented all 1097 LINCS cell lines and included information extracted from the LINCS Data Portal and ChEMBL. Using MCF 10A cell line cells as an example, we demonstrated how to ontologically model LINCS cellular signatures such as their non-tumorigenic epithelial cell type, three-dimensional growth, latrunculin-A-induced actin depolymerization and apoptosis, and cell line transfection. A CLO subset view of LINCS cell lines, named LINCS-CLOview, was generated to support systematic LINCS cell line analysis and queries. In summary, LINCS cell lines are currently associated with 43 cell types, 131 tissues and organs, and 121 cancer types. The LINCS-CLOview information can be queried using SPARQL scripts. Conclusions: CLO was used to support ontological representation, integration, and analysis of over a thousand LINCS cell line cells and their cellular responses.
Affiliation(s)
- Edison Ong
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Jiangan Xie
- Unit of Laboratory Animal Medicine and Department of Microbiology and Immunology, University of Michigan, Ann Arbor, MI, USA
| | - Zhaohui Ni
- Unit of Laboratory Animal Medicine and Department of Microbiology and Immunology, University of Michigan, Ann Arbor, MI, USA
| | - Qingping Liu
- Unit of Laboratory Animal Medicine and Department of Microbiology and Immunology, University of Michigan, Ann Arbor, MI, USA
| | - Sirarat Sarntivijai
- Samples, Phenotypes and Ontologies Team, European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Hinxton, Cambridge, UK
| | - Yu Lin
- Department of Molecular and Cellular Pharmacology, University of Miami, Miami, FL, USA
| | - Daniel Cooper
- Department of Molecular and Cellular Pharmacology, University of Miami, Miami, FL, USA; BD2K LINCS Data Coordination and Integration Center, University of Miami, Miami, FL, USA
| | - Raymond Terryn
- Department of Molecular and Cellular Pharmacology, University of Miami, Miami, FL, USA; BD2K LINCS Data Coordination and Integration Center, University of Miami, Miami, FL, USA
| | - Vasileios Stathias
- Department of Molecular and Cellular Pharmacology, University of Miami, Miami, FL, USA; BD2K LINCS Data Coordination and Integration Center, University of Miami, Miami, FL, USA
| | - Caty Chung
- BD2K LINCS Data Coordination and Integration Center, University of Miami, Miami, FL, USA; Center for Computational Science, University of Miami, Miami, FL, USA
| | - Stephan Schürer
- Department of Molecular and Cellular Pharmacology, University of Miami, Miami, FL, USA; BD2K LINCS Data Coordination and Integration Center, University of Miami, Miami, FL, USA; Center for Computational Science, University of Miami, Miami, FL, USA
| | - Yongqun He
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Unit of Laboratory Animal Medicine and Department of Microbiology and Immunology, University of Michigan, Ann Arbor, MI, USA
| |
|
41
|
Lenselink EB, Ten Dijke N, Bongers B, Papadatos G, van Vlijmen HWT, Kowalczyk W, IJzerman AP, van Westen GJP. Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set. J Cheminform 2017; 9:45. [PMID: 29086168 PMCID: PMC5555960 DOI: 10.1186/s13321-017-0232-0] [Citation(s) in RCA: 165] [Impact Index Per Article: 23.6] [Received: 04/18/2017] [Accepted: 07/31/2017] [Indexed: 11/10/2022] Open
Abstract
The increase of publicly available bioactivity data in recent years has fueled and catalyzed research in chemogenomics, data mining, and modeling approaches. As a direct result, over the past few years a multitude of different methods have been reported and evaluated, such as target fishing, nearest neighbor similarity-based methods, and Quantitative Structure Activity Relationship (QSAR)-based protocols. However, such studies are typically conducted on different datasets, using different validation strategies, and different metrics. In this study, different methods were compared using one single standardized dataset obtained from ChEMBL, which is made available to the public, using standardized metrics (BEDROC and Matthews Correlation Coefficient). Specifically, the performance of Naïve Bayes, Random Forests, Support Vector Machines, Logistic Regression, and Deep Neural Networks was assessed using QSAR and proteochemometric (PCM) methods. All methods were validated using both a random split validation and a temporal validation, with the latter being a more realistic benchmark of expected prospective execution. Deep Neural Networks are the top performing classifiers, highlighting the added value of Deep Neural Networks over other more conventional methods. Moreover, the best method ('DNN_PCM') performed significantly better at almost one standard deviation higher than the mean performance. Furthermore, Multi-task and PCM implementations were shown to improve performance over single task Deep Neural Networks. Conversely, target prediction performed almost two standard deviations under the mean performance. Random Forests, Support Vector Machines, and Logistic Regression performed around mean performance. Finally, using an ensemble of DNNs, alongside additional tuning, enhanced the relative performance by another 27% (compared with unoptimized 'DNN_PCM'). 
Here, a standardized set to test and evaluate different machine learning algorithms in the context of multi-task learning is offered by providing the data and the protocols.
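Two ingredients of the benchmark described above, the Matthews Correlation Coefficient and a temporal validation split, can be sketched in a few lines. This is an illustration rather than the published protocol: the record layout, the `year` field, and the cutoff year are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC from confusion-matrix counts; 0.0 when a marginal total is empty."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


def temporal_split(
    records: List[Dict], cutoff_year: int
) -> Tuple[List[Dict], List[Dict]]:
    """Train on records published before the cutoff, test on the rest."""
    train = [r for r in records if r["year"] < cutoff_year]
    test = [r for r in records if r["year"] >= cutoff_year]
    return train, test


# Hypothetical bioactivity records with a publication year attached.
records = [
    {"compound": "A", "year": 2009, "active": True},
    {"compound": "B", "year": 2011, "active": False},
    {"compound": "C", "year": 2013, "active": True},
    {"compound": "D", "year": 2014, "active": False},
]
train, test = temporal_split(records, cutoff_year=2013)
print([r["compound"] for r in train], [r["compound"] for r in test])

print(matthews_corrcoef(tp=50, tn=50, fp=0, fn=0))    # perfect classifier
print(matthews_corrcoef(tp=25, tn=25, fp=25, fn=25))  # chance-level classifier
```

A temporal split mimics prospective use because the model never sees chemistry published after the cutoff, whereas a random split lets close analogues from the same paper land on both sides.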
Affiliation(s)
- Eelke B Lenselink
- Division of Medicinal Chemistry, Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, P.O. Box 9502, 2300 RA, Leiden, The Netherlands
| | - Niels Ten Dijke
- Leiden Institute of Advanced Computer Science, Leiden University, P.O. Box 9512, 2300 RA, Leiden, The Netherlands
| | - Brandon Bongers
- Division of Medicinal Chemistry, Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, P.O. Box 9502, 2300 RA, Leiden, The Netherlands
| | - George Papadatos
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, UK; GlaxoSmithKline, Medicines Research Centre, Gunnels Wood Road, Stevenage, Herts, SG1 2NY, UK
| | - Herman W T van Vlijmen
- Division of Medicinal Chemistry, Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, P.O. Box 9502, 2300 RA, Leiden, The Netherlands
| | - Wojtek Kowalczyk
- Leiden Institute of Advanced Computer Science, Leiden University, P.O. Box 9512, 2300 RA, Leiden, The Netherlands
| | - Adriaan P IJzerman
- Division of Medicinal Chemistry, Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, P.O. Box 9502, 2300 RA, Leiden, The Netherlands
| | - Gerard J P van Westen
- Division of Medicinal Chemistry, Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, P.O. Box 9502, 2300 RA, Leiden, The Netherlands.
| |
|
42
|
Nowotka MM, Gaulton A, Mendez D, Bento AP, Hersey A, Leach A. Using ChEMBL web services for building applications and data processing workflows relevant to drug discovery. Expert Opin Drug Discov 2017; 12:757-767. [PMID: 28602100 DOI: 10.1080/17460441.2017.1339032] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Indexed: 01/31/2023]
Abstract
Introduction: ChEMBL is a manually curated database of bioactivity data on small drug-like molecules, used by drug discovery scientists. Among many access methods, a REST API provides programmatic access, allowing the remote retrieval of ChEMBL data and its integration into other applications. This approach allows scientists to move from a world where they go to the ChEMBL web site to search for relevant data, to one where ChEMBL data can be simply integrated into their everyday tools and work environment. Areas covered: This review highlights some of the audiences who may benefit from using the ChEMBL API, and the goals they can address, through the description of several use cases. The examples cover a team communication tool (Slack), a data analytics platform (KNIME), batch job management software (Luigi) and Rich Internet Applications. Expert opinion: The advent of web technologies, cloud computing and microservice-oriented architectures has made REST APIs an essential ingredient of modern software development models. The widespread availability of tools consuming RESTful resources has made them useful for many groups of users. The ChEMBL API is a valuable resource of drug discovery bioactivity data for professional chemists, chemistry students, data scientists, scientific and web developers.
Affiliation(s)
- Michał M Nowotka
- European Molecular Biology Laboratory - European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
| | - Anna Gaulton
- European Molecular Biology Laboratory - European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
| | - David Mendez
- European Molecular Biology Laboratory - European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
| | - A Patricia Bento
- European Molecular Biology Laboratory - European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
| | - Anne Hersey
- European Molecular Biology Laboratory - European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
| | - Andrew Leach
- European Molecular Biology Laboratory - European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
|
43
|
Senger S. Assessment of the significance of patent-derived information for the early identification of compound-target interaction hypotheses. J Cheminform 2017; 9:26. [PMID: 29086108 PMCID: PMC5400772 DOI: 10.1186/s13321-017-0214-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 12/02/2016] [Accepted: 04/13/2017] [Indexed: 11/16/2022] Open
Abstract
Background: Patents are an important source of information for effective decision making in drug discovery. Encouragingly, freely accessible patent-chemistry databases are now in the public domain. However, at present there is still a wide gap between relatively low-coverage, high-quality manually curated data sources and high-coverage data sources that use text mining and automated extraction of chemical structures. To secure much needed funding for further research and an improved infrastructure, hard evidence is required to demonstrate the significance of patent-derived information in drug discovery. Surprisingly little such evidence has been reported so far. To address this, the present study attempts to quantify the relevance of patents for formulating and substantiating hypotheses for compound–target interactions. Results: A manually curated set of 130 compound–target interaction pairs annotated with what are considered to be the earliest patent and publication has been produced. The analysis of this set revealed that in stark contrast to what has been reported for novel chemical structures, only about 10% of the compound–target interaction pairs could be found in publications in the scientific literature within one year of being reported in patents. The average delay across all interaction pairs is close to 4 years. In an attempt to benchmark current capabilities, it was also examined how much of the benefit of using patent-derived information can be retained when a bioannotated version of SureChEMBL is used as secondary source for the patent literature. Encouragingly, this approach found the patents in the annotated set for 72% of the compound–target interaction pairs. Similarly, the effect of using the bioactivity database ChEMBL as secondary source for the scientific literature was studied. Here, the publications from the annotated set were only found for 46% of the compound–target interaction pairs.
Conclusion: Patent-derived information is a significant enabler for formulating compound–target interaction hypotheses even in cases where the respective interaction is later reported in the scientific literature. The findings of this study clearly highlight the significance of future investments in the development and provision of databases and tools that will allow scientists to search patent information in a comprehensive, reliable, and efficient manner. Electronic supplementary material: The online version of this article (doi:10.1186/s13321-017-0214-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Stefan Senger
- GlaxoSmithKline, Stevenage, Hertfordshire, SG1 2NY, UK.
|
44
|
Fukunishi Y, Yamasaki S, Yasumatsu I, Takeuchi K, Kurosawa T, Nakamura H. Quantitative Structure-activity Relationship (QSAR) Models for Docking Score Correction. Mol Inform 2017; 36:1600013. [PMID: 28001004 PMCID: PMC5297997 DOI: 10.1002/minf.201600013] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 01/27/2016] [Accepted: 04/01/2016] [Indexed: 01/26/2023]
Abstract
In order to improve docking score correction, we developed several structure-based quantitative structure-activity relationship (QSAR) models by protein-drug docking simulations and applied these models to public affinity data. The prediction models used descriptor-based regression, and the compound descriptor was a set of docking scores against multiple (∼600) proteins including nontargets. The binding free energy that corresponded to the docking score was approximated by a weighted average of docking scores for multiple proteins, and we tried linear, weighted linear and polynomial regression models considering the compound similarities. In addition, we tried a combination of these regression models for individual data sets such as IC50, Ki, and % inhibition values. The cross-validation results showed that the weighted linear model was more accurate than the simple linear regression model. Thus, the QSAR approaches based on the affinity data of public databases should improve docking scores.
Affiliation(s)
- Yoshifumi Fukunishi
- Molecular Profiling Research Center for Drug Discovery (molprof), National Institute of Advanced Industrial Science and Technology (AIST), 2-3-26, Aomi, Koto-ku, Tokyo, 135-0064, Japan
| | - Satoshi Yamasaki
- Technology Research Association for Next-Generation Natural Products Chemistry, 2-3-26, Aomi, Koto-ku, Tokyo, 135-0064, Japan
| | - Isao Yasumatsu
- Technology Research Association for Next-Generation Natural Products Chemistry, 2-3-26, Aomi, Koto-ku, Tokyo, 135-0064, Japan
- Daiichi Sankyo RD Novare Co., Ltd., 1-16-13, Kita-Kasai, Edogawa-ku, Tokyo, 134-8630, Japan
| | - Koh Takeuchi
- Molecular Profiling Research Center for Drug Discovery (molprof), National Institute of Advanced Industrial Science and Technology (AIST), 2-3-26, Aomi, Koto-ku, Tokyo, 135-0064, Japan
| | - Takashi Kurosawa
- Technology Research Association for Next-Generation Natural Products Chemistry, 2-3-26, Aomi, Koto-ku, Tokyo, 135-0064, Japan
- Hitachi Solutions East Japan, 12-1 Ekimaehoncho, Kawasaki-ku, Kanagawa, 210-0007, Japan
| | - Haruki Nakamura
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka, 565-0871, Japan
|
45
|
Dieguez-Santana K, Pham-The H, Villegas-Aguilar PJ, Le-Thi-Thu H, Castillo-Garit JA, Casañola-Martin GM. Prediction of acute toxicity of phenol derivatives using multiple linear regression approach for Tetrahymena pyriformis contaminant identification in a median-size database. Chemosphere 2016; 165:434-441. [PMID: 27668720 DOI: 10.1016/j.chemosphere.2016.09.041] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 07/26/2016] [Revised: 09/10/2016] [Accepted: 09/12/2016] [Indexed: 06/06/2023]
Abstract
In this article, the modeling of growth-inhibitory activity against Tetrahymena pyriformis is described. The 0-2D Dragon descriptors, which are based on structural aspects, are mainly used to gain some knowledge of the factors influencing aquatic toxicity. The modeling uses an enlarged dataset of phenol derivatives, described for the first time and composed of 358 chemicals, which exceeds in size the previous datasets of about one hundred compounds. Moreover, the model evaluation parameters in training, prediction and validation are adequate and comparable with those of previous works. The more influential descriptors included in the model are: X3A, MWC02, MWC10 and piPC03, with positive contributions to the dependent variable; and MWC09, piPC02 and TPC, with negative contributions. In a next step, a median-size database of nearly 8000 phenolic compounds extracted from ChEMBL was evaluated with the developed quantitative structure-toxicity relationship (QSTR) model, providing some clues (SARs) for the identification of ecotoxicological compounds. The outcome of this report is very useful for screening chemical databases to find the compounds responsible for aquatic contamination in the biomarker used in the current work.
Affiliation(s)
- Karel Dieguez-Santana
- Universidad Estatal Amazónica, Facultad de Ingeniería Ambiental, Paso Lateral Km 21/2 Via Napo, Puyo, Ecuador.
| | - Hai Pham-The
- Hanoi University of Pharmacy, 13-15 Le Thanh Tong, Hoan Kiem, Hanoi, Viet Nam
| | | | - Huong Le-Thi-Thu
- School of Medicine and Pharmacy, Vietnam National University, Hanoi (VNU) 144 Xuan Thuy, Cau Giay, Hanoi, Viet Nam
| | - Juan A Castillo-Garit
- Unidad de Toxicologia Experimental, Universidad de Ciencias Médicas Dr. Serafin Ruiz de Zárate Ruiz Santa Clara, 50200, Villa Clara, Cuba
| | - Gerardo M Casañola-Martin
- Universidad Estatal Amazónica, Facultad de Ingeniería Ambiental, Paso Lateral Km 21/2 Via Napo, Puyo, Ecuador; Hanoi University of Pharmacy, 13-15 Le Thanh Tong, Hoan Kiem, Hanoi, Viet Nam; Unidad de Investigación de Diseño de Fármacos y Conectividad Molecular, Departamento de Química Física, Facultad de Farmacia, Universitat de València, Spain.
|
46
|
Abstract
Three (3) different methods (logistic regression, covariate shift and k-NN) were applied to five (5) internal datasets and one (1) external, publicly available dataset where covariate shift existed. In all cases, k-NN’s performance was inferior to either logistic regression or covariate shift. Surprisingly, there was no obvious advantage for using covariate shift to reweight the training data in the examined datasets.
Affiliation(s)
| | | | - Brian Goldman
- Modeling & Informatics, Vertex Pharmaceuticals, Boston, MA, USA
|
47
|
Abstract
For the generation of contemporary databases of bioactive compounds, activity information is usually extracted from the scientific literature. However, when activity data are analyzed, source publications are typically no longer taken into consideration. Therefore, compound activity data selected from ChEMBL were traced back to thousands of original publications, activity records including compound, assay, and target information were systematically generated, and their distributions across the literature were determined. In addition, publications were categorized on the basis of activity records. Furthermore, compound promiscuity, defined as the ability of small molecules to specifically interact with multiple target proteins, was analyzed in light of publication statistics, thus adding another layer of information to promiscuity assessment. It was shown that the degree of compound promiscuity was not influenced by increasing numbers of source publications. Rather, most non-promiscuous as well as promiscuous compounds, regardless of their degree of promiscuity, originated from single publications, which emerged as a characteristic feature of the medicinal chemistry literature.
Affiliation(s)
- Ye Hu
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Bonn, D-53113, Germany
| | - Jürgen Bajorath
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Bonn, D-53113, Germany
|
48
|
Ebejer JP, Charlton MH, Finn PW. Are the physicochemical properties of antibacterial compounds really different from other drugs? J Cheminform 2016; 8:30. [PMID: 27274770 PMCID: PMC4891840 DOI: 10.1186/s13321-016-0143-5] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 02/01/2016] [Accepted: 05/25/2016] [Indexed: 01/12/2023] Open
Abstract
Background: It is now widely recognized that there is an urgent need for new antibacterial drugs, with novel mechanisms of action, to combat the rise of multi-drug resistant bacteria. However, few new compounds are reaching the market. Antibacterial drug discovery projects often succeed in identifying potent molecules in biochemical assays but have been beset by difficulties in obtaining antibacterial activity. A commonly held view, based on analysis of marketed antibacterial compounds, is that antibacterial drugs possess very different physicochemical properties to other drugs, and that this profile is required for antibacterial activity. Results: We have re-examined this issue by performing a cheminformatics analysis of the literature data available in the ChEMBL database. The physicochemical properties of compounds with a recorded activity in an antibacterial assay were calculated and compared to two other datasets extracted from ChEMBL, marketed antibacterials and drugs marketed for other therapeutic indications. The chemical class of the compounds and Gram-negative/Gram-positive profile were also investigated. This analysis shows that compounds with antibacterial activity have physicochemical property profiles very similar to other drug classes. Conclusions: The observation that many current antibacterial drugs lie in regions of physicochemical property space far from conventional small molecule therapeutics is correct. However, the inference that a compound must lie in one of these “outlier” regions in order to possess antibacterial activity is not supported by our analysis. Electronic supplementary material: The online version of this article (doi:10.1186/s13321-016-0143-5) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Jean-Paul Ebejer
- InhibOx Limited, Oxford Centre for Innovation, New Road, Oxford, OX1 1BY, UK; Centre for Molecular Medicine and Biobanking, University of Malta, Msida, MSD 2080, Malta
| | - Michael H Charlton
- InhibOx Limited, Oxford Centre for Innovation, New Road, Oxford, OX1 1BY, UK
| | - Paul W Finn
- InhibOx Limited, Oxford Centre for Innovation, New Road, Oxford, OX1 1BY, UK; University of Buckingham, Hunter Street, Buckingham, MK18 1EG, UK
|
49
|
Pogodin PV, Lagunin AA, Filimonov DA, Poroikov VV. PASS Targets: Ligand-based multi-target computational system based on a public data and naïve Bayes approach. SAR QSAR Environ Res 2015; 26:783-793. [PMID: 26305108 DOI: 10.1080/1062936x.2015.1078407] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Indexed: 06/04/2023]
Abstract
Estimation of interactions between drug-like compounds and drug targets is very important for drug discovery and toxicity assessment. Using data extracted from the 19th version of the ChEMBL database (https://www.ebi.ac.uk/chembl) as a training set and a Bayesian-like method realized in PASS software (http://www.way2drug.com/PASSOnline), we developed a computational tool for the prediction of interactions between protein targets and drug-like compounds. After training, PASS Targets became able to predict interactions of drug-like compounds with 2507 protein targets from different organisms based on analysis of structure-activity relationships for 589,107 different chemical compounds. The prediction accuracy, estimated as AUC ROC calculated by the leave-one-out cross-validation and 20-fold cross-validation procedures, was about 96%. Average AUC ROC value was about 90% for the external test set from approximately 700 known drugs interacting with 206 protein targets.
Affiliation(s)
- P V Pogodin
- Department for Bioinformatics; Institute of Biomedical Chemistry, Pirogov Russian National Research Medical University, Moscow, Russia
- Medico-Biological Faculty, Pirogov Russian National Research Medical University, Moscow, Russia
| | - A A Lagunin
- Department for Bioinformatics; Institute of Biomedical Chemistry, Pirogov Russian National Research Medical University, Moscow, Russia
- Medico-Biological Faculty, Pirogov Russian National Research Medical University, Moscow, Russia
| | - D A Filimonov
- Department for Bioinformatics; Institute of Biomedical Chemistry, Pirogov Russian National Research Medical University, Moscow, Russia
| | - V V Poroikov
- Department for Bioinformatics; Institute of Biomedical Chemistry, Pirogov Russian National Research Medical University, Moscow, Russia
- Medico-Biological Faculty, Pirogov Russian National Research Medical University, Moscow, Russia
|