Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

Total Articles

49
(from Reference Citation Analysis)

Article PDFs (22)

Cited by > 0 (40)

Searched Name

ChEMBL

Citation Analysis

Esaki T, Yonezawa T, Ikeda K. A new workflow for the effective curation of membrane permeability data from open ADME information. J Cheminform 2024;16:30. [PMID: 38481269 PMCID: PMC10938840 DOI: 10.1186/s13321-024-00826-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Accepted: 03/10/2024] [Indexed: 03/17/2024]

Abstract

Membrane permeability is an in vitro parameter that represents the apparent permeability (Papp) of a compound, and is a key absorption, distribution, metabolism, and excretion parameter in drug development. Although the Caco-2 cell lines are the most used cell lines to measure Papp, other cell lines, such as the Madin-Darby Canine Kidney (MDCK), LLC-Pig Kidney 1 (LLC-PK1), and Ralph Russ Canine Kidney (RRCK) cell lines, can also be used to estimate Papp. Therefore, constructing in silico models for Papp estimation using the MDCK, LLC-PK1, and RRCK cell lines requires collecting extensive amounts of in vitro Papp data. An open database offers extensive measurements of various compounds covering a vast chemical space; however, concerns were reported on the use of data published in open databases without the appropriate accuracy and quality checks. Ensuring the quality of datasets for training in silico models is critical because artificial intelligence (AI, including deep learning) was used to develop models to predict various pharmacokinetic properties, and data quality affects the performance of these models. Hence, careful curation of the collected data is imperative. Herein, we developed a new workflow that supports automatic curation of Papp data measured in the MDCK, LLC-PK1, and RRCK cell lines collected from ChEMBL using KNIME. The workflow consisted of four main phases. Data were extracted from ChEMBL and filtered to identify the target protocols. A total of 1661 high-quality entries were retained after checking 436 articles. The workflow is freely available, can be updated, and has high reusability. Our study provides a novel approach for data quality analysis and accelerates the development of helpful in silico models for effective drug discovery. Scientific Contribution: The cost of building highly accurate predictive models can be significantly reduced by automating the collection of reliable measurement data. 
Our tool reduces the time and effort required for data collection and will enable researchers to focus on constructing high-performance in silico models for other types of analysis. To the best of our knowledge, no such tool is available in the literature.


Gonzalez-Ponce K, Horta Andrade C, Hunter F, Kirchmair J, Martinez-Mayorga K, Medina-Franco JL, Rarey M, Tropsha A, Varnek A, Zdrazil B. School of cheminformatics in Latin America. J Cheminform 2023;15:82. [PMID: 37726809 PMCID: PMC10507835 DOI: 10.1186/s13321-023-00758-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2023] [Accepted: 09/10/2023] [Indexed: 09/21/2023]

Padalino G, Coghlan A, Pagliuca G, Forde-Thomas JE, Berriman M, Hoffmann KF. Using ChEMBL to Complement Schistosome Drug Discovery. Pharmaceutics 2023;15:1359. [PMID: 37242601 DOI: 10.3390/pharmaceutics15051359] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 03/22/2023] [Revised: 04/25/2023] [Accepted: 04/26/2023] [Indexed: 05/28/2023]

Aldahish A, Balaji P, Vasudevan R, Kandasamy G, James JP, Prabahar K. Elucidating the Potential Inhibitor against Type 2 Diabetes Mellitus Associated Gene of GLUT4. J Pers Med 2023;13:660. [PMID: 37109046 PMCID: PMC10146764 DOI: 10.3390/jpm13040660] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2023] [Revised: 04/02/2023] [Accepted: 04/10/2023] [Indexed: 04/29/2023]

Oleneva P, Zabolotna Y, Horvath D, Marcou G, Bonachera F, Varnek A. French dispatch: GTM-based analysis of the Chimiothèque Nationale Chemical Space. Mol Inform 2023;42:e2200208. [PMID: 36604304 DOI: 10.1002/minf.202200208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2022] [Revised: 12/29/2022] [Accepted: 01/05/2023] [Indexed: 01/07/2023]

Pietruś W, Kurczab R, Warszycki D, Bojarski AJ, Bajorath J. Isomeric Activity Cliffs-A Case Study for Fluorine Substitution of Aminergic G Protein-Coupled Receptor Ligands. Molecules 2023;28:490. [PMID: 36677547 DOI: 10.3390/molecules28020490] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/18/2022] [Revised: 12/30/2022] [Accepted: 01/01/2023] [Indexed: 01/06/2023]

Isigkeit L, Merk D. Compilation of Custom Compound/Bioactivity Datasets from Public Repositories. Methods Mol Biol 2023;2706:25-50. [PMID: 37558939 DOI: 10.1007/978-1-0716-3397-7_3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/11/2023]

Diéguez-Santana K, Casañola-Martin GM, Torres R, Rasulev B, Green JR, González-Díaz H. Machine Learning Study of Metabolic Networks vs ChEMBL Data of Antibacterial Compounds. Mol Pharm 2022;19:2151-2163. [PMID: 35671399 PMCID: PMC9986951 DOI: 10.1021/acs.molpharmaceut.2c00029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]

Aliagas I, Gobbi A, Lee ML, Sellers BD. Comparison of logP and logD correction models trained with public and proprietary data sets. J Comput Aided Mol Des 2022;36:253-262. [PMID: 35359246 DOI: 10.1007/s10822-022-00450-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2021] [Accepted: 03/15/2022] [Indexed: 10/18/2022]

Patronov A, Papadopoulos K, Engkvist O. Has Artificial Intelligence Impacted Drug Discovery? Methods Mol Biol 2022;2390:153-176. [PMID: 34731468 DOI: 10.1007/978-1-0716-1787-8_6] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/13/2022]

Quevedo-Tumailli V, Ortega-Tenezaca B, González-Díaz H. IFPTML Mapping of Drug Graphs with Protein and Chromosome Structural Networks vs. Pre-Clinical Assay Information for Discovery of Antimalarial Compounds. Int J Mol Sci 2021;22:13066. [PMID: 34884870 PMCID: PMC8657696 DOI: 10.3390/ijms222313066] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/18/2021] [Revised: 11/23/2021] [Accepted: 11/24/2021] [Indexed: 11/16/2022]

Abstract

The parasite species of genus Plasmodium causes Malaria, which remains a major global health problem due to parasite resistance to available Antimalarial drugs and increasing treatment costs. Consequently, computational prediction of new Antimalarial compounds with novel targets in the proteome of Plasmodium sp. is a very important goal for the pharmaceutical industry. We can expect that the success of the pre-clinical assay depends on the conditions of the assay per se, the chemical structure of the drug, the structure of the target protein, as well as on factors governing the expression of this protein in the proteome such as gene (Deoxyribonucleic acid, DNA) sequence and/or chromosome structure. However, there are no reports of computational models that consider all these factors simultaneously. Some of the difficulties for this kind of analysis are the dispersion of data in different datasets, the high heterogeneity of data, etc. In this work, we analyzed three databases, ChEMBL (Chemical database of the European Molecular Biology Laboratory), UniProt (Universal Protein Resource), and NCBI-GDV (National Center for Biotechnology Information Genome Data Viewer), to achieve this goal. The ChEMBL dataset contains outcomes for 17,758 unique assays of potential Antimalarial compounds including numeric descriptors (variables) for the structure of compounds as well as a huge amount of information about the conditions of assays. The NCBI-GDV and UniProt datasets include the sequence of genes, proteins, and their functions. In addition, we also created two partitions (c_assayj = c_aj and c_dataj = c_dj) of categorical variables from the ChEMBL dataset. These partitions contain variables that encode information about experimental conditions of preclinical assays (c_aj) or about the nature and quality of data (c_dj). 
These categorical variables include information about 22 parameters of biological activity (c_a0), 28 target proteins (c_a1), and 9 organisms of assay (c_a2), etc. We also created another partition (c_protj = c_pj) including categorical variables with biological information about the target proteins, genes, and chromosomes. These variables cover 32 genes (c_p0), 10 chromosomes (c_p1), gene orientation (c_p2), and 31 protein functions (c_p3). We used a Perturbation-Theory Machine Learning Information Fusion (IFPTML) algorithm to fuse all this information (from three databases) and train a predictive model. Shannon’s entropy measure Sh_k (numerical variables) was used to quantify the information about the structure of drugs, protein sequences, gene sequences, and chromosomes in the same information scale. Perturbation Theory Operators (PTOs) with the form of Moving Average (MA) operators have been used to quantify perturbations (deviations) in the structural variables with respect to their expected values for different subsets (partitions) of categorical variables. We obtained three IFPTML models using General Discriminant Analysis (GDA), Classification Tree with Univariate Splits (CTUS), and Classification Tree with Linear Combinations (CTLC). The IFPTML-CTLC presented the best performance with Sensitivity Sn(%) = 83.6/85.1, and Specificity Sp(%) = 89.8/89.7 for training/validation sets, respectively. This model could become a useful tool for the optimization of preclinical assays of new Antimalarial compounds vs. different proteins in the proteome of Plasmodium.
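
The Moving Average operators mentioned in this abstract can be illustrated with a minimal sketch, assuming the simplest possible form: the deviation of a numeric descriptor from its mean over all cases that share the same categorical condition. The function name, the descriptor values, and the condition labels below are hypothetical and are not taken from the article.

```python
from collections import defaultdict


def moving_average_deviations(values, labels):
    """Return value - mean(values with the same label) for every case.

    Assumed simplification of a Moving Average perturbation operator:
    each descriptor is compared with its expected value inside the
    subset (partition) defined by one categorical variable.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, label in zip(values, labels):
        sums[label] += value
        counts[label] += 1
    means = {label: sums[label] / counts[label] for label in sums}
    return [value - means[label] for value, label in zip(values, labels)]


# Hypothetical entropy-like descriptors for four compounds, grouped by
# a hypothetical assay condition (activity parameter).
descriptors = [2.0, 4.0, 1.0, 3.0]
conditions = ["IC50", "IC50", "Ki", "Ki"]
print(moving_average_deviations(descriptors, conditions))
# [-1.0, 1.0, -1.0, 1.0]
```

In a model of this kind, such deviations would serve as input variables alongside the raw descriptors; how the published model combines them is not specified in the abstract.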


Pietruś W, Kurczab R, Stumpfe D, Bojarski AJ, Bajorath J. Data-Driven Analysis of Fluorination of Ligands of Aminergic G Protein Coupled Receptors. Biomolecules 2021;11:1647. [PMID: 34827645 PMCID: PMC8615825 DOI: 10.3390/biom11111647] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2021] [Revised: 11/04/2021] [Accepted: 11/05/2021] [Indexed: 11/16/2022]

Falaguera MJ, Mestres J. Congenericity of Claimed Compounds in Patent Applications. Molecules 2021;26:5253. [PMID: 34500686 DOI: 10.3390/molecules26175253] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/23/2021] [Revised: 08/17/2021] [Accepted: 08/18/2021] [Indexed: 12/04/2022]

Herrera-Ibatá DM. Machine Learning and Perturbation Theory Machine Learning (PTML) in Medicinal Chemistry, Biotechnology, and Nanotechnology. Curr Top Med Chem 2021;21:649-660. [PMID: 33475073 DOI: 10.2174/1568026621666210121153413] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2020] [Revised: 12/18/2020] [Accepted: 12/21/2020] [Indexed: 11/22/2022]

Sampaio-Dias IE, Rodríguez-Borges JE, Yáñez-Pérez V, Arrasate S, Llorente J, Brea JM, Bediaga H, Viña D, Loza MI, Caamaño O, García-Mera X, González-Díaz H. Synthesis, Pharmacological, and Biological Evaluation of 2-Furoyl-Based MIF-1 Peptidomimetics and the Development of a General-Purpose Model for Allosteric Modulators (ALLOPTML). ACS Chem Neurosci 2021;12:203-215. [PMID: 33347281 DOI: 10.1021/acschemneuro.0c00687] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 12/27/2022]

Abstract

This work describes the synthesis and pharmacological evaluation of 2-furoyl-based Melanostatin (MIF-1) peptidomimetics as dopamine D₂ modulating agents. Eight novel peptidomimetics were tested for their ability to enhance the maximal effect of tritiated N-propylapomorphine ([³H]-NPA) at D₂ receptors (D₂R). In this series, 2-furoyl-l-leucylglycinamide (6a) produced a statistically significant increase in the maximal [³H]-NPA response at 10 pM (11 ± 1%), comparable to the effect of MIF-1 (18 ± 9%) at the same concentration. This result supports previous evidence that the replacement of the proline residue by heteroaromatic scaffolds is tolerated at the allosteric binding site of MIF-1. Biological assays performed for peptidomimetic 6a using cortex neurons from 19-day-old Wistar-Kyoto rat embryos suggest that 6a displays no neurotoxicity up to 100 μM. Overall, the pharmacological and toxicological profile and the structural simplicity of 6a make this peptidomimetic a potential lead compound for further development and optimization, paving the way for the development of novel modulating agents of D₂R suitable for the treatment of CNS-related diseases. Additionally, the pharmacological and biological data herein reported, along with >20 000 outcomes of preclinical assays, were used to seek a general model to predict the allosteric modulatory potential of molecular candidates for a myriad of target receptors, organisms, cell lines, and biological activity parameters based on perturbation theory (PT) ideas and machine learning (ML) techniques, abbreviated as ALLOPTML. By doing so, ALLOPTML shows high specificity Sp = 89.2/89.4%, sensitivity Sn = 71.3/72.2%, and accuracy Ac = 86.1%/86.4% in training/validation series, respectively. 
To the best of our knowledge, ALLOPTML is the first general-purpose chemoinformatic tool using a PTML-based model for the multioutput and multicondition prediction of allosteric compounds, which is expected to save both time and resources during the early drug discovery of allosteric modulators.
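
The specificity (Sp), sensitivity (Sn), and accuracy (Ac) figures quoted in this abstract follow the standard confusion-matrix definitions. A minimal sketch of those definitions is given below; the function name and the toy labels are hypothetical and do not reproduce the ALLOPTML data.

```python
def classification_metrics(y_true, y_pred):
    """Compute Sn, Sp and Ac (in %) for binary labels coded as 1/0.

    Sn = TP / (TP + FN), Sp = TN / (TN + FP), Ac = (TP + TN) / N.
    Assumes both classes occur at least once in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return {
        "Sn": 100.0 * tp / (tp + fn),
        "Sp": 100.0 * tn / (tn + fp),
        "Ac": 100.0 * (tp + tn) / len(pairs),
    }


# Toy example: four active (1) and four inactive (0) cases.
observed = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0]
print(classification_metrics(observed, predicted))
# {'Sn': 75.0, 'Sp': 100.0, 'Ac': 87.5}
```

Reporting the three values separately for training and validation series, as the abstract does, helps reveal whether a model favours the majority class.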


Affiliation(s)

Ivo E. Sampaio-Dias LAQV/REQUIMTE, Dept. of Chemistry and Biochemistry, Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal
José E. Rodríguez-Borges LAQV/REQUIMTE, Dept. of Chemistry and Biochemistry, Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal
Víctor Yáñez-Pérez Dept. of Organic Chemistry II, University of Basque Country (UPV-EHU), 48940 Leioa, Spain
Sonia Arrasate Dept. of Pharmacology, Faculty of Medicine and Nursing, University of Basque Country (UPV-EHU), 48940 Leioa, Spain
Javier Llorente Dept. of Pharmacology, Faculty of Medicine and Nursing, University of Basque Country (UPV-EHU), 48940 Leioa, Spain; Dept. of Pharmacology, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
José M. Brea Innopharma Screening Platform, Biofarma Research group, Centre of Research in Molecular Medicine and Chronic Diseases CIMUS, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
Harbil Bediaga Dept. of Organic Chemistry II, University of Basque Country (UPV-EHU), 48940 Leioa, Spain; Dept. of Physical Chemistry, University of Basque Country (UPV-EHU), 48940 Leioa, Spain
Dolores Viña Dept. of Pharmacology, Faculty of Pharmacy, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain; Centre of Research in Molecular Medicine and Chronic Diseases CIMUS, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
María Isabel Loza Innopharma Screening Platform, Biofarma Research group, Centre of Research in Molecular Medicine and Chronic Diseases CIMUS, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
Olga Caamaño Dept. of Organic Chemistry, Faculty of Pharmacy, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
Xerardo García-Mera Dept. of Organic Chemistry, Faculty of Pharmacy, University of Santiago de Compostela, 15782 Santiago de Compostela, Spain
Humberto González-Díaz Dept. of Organic Chemistry II, University of Basque Country (UPV-EHU), 48940 Leioa, Spain; Basque Center for Biophysics (CSIC UPV/EHU), University of Basque Country (UPV-EHU), 48940 Leioa, Spain; IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Spain


Lin A, Baskin II, Marcou G, Horvath D, Beck B, Varnek A. Parallel Generative Topographic Mapping: An Efficient Approach for Big Data Handling. Mol Inform 2020;39:e2000009. [PMID: 32347666 PMCID: PMC7757192 DOI: 10.1002/minf.202000009] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/16/2020] [Accepted: 04/10/2020] [Indexed: 11/12/2022]

Abstract

Generative Topographic Mapping (GTM) can be efficiently used to visualize, analyze and model large chemical data. The GTM manifold needs to span the chemical space deemed relevant for a given problem. Therefore, the Frame set (FS) of compounds used for the manifold construction must well cover a given chemical space. Intuitively, the FS size must rise with the size and diversity of the target library. At the same time, the GTM training can be very slow or even become technically impossible at FS sizes of the order of 10⁵ compounds - which is a very small number compared to today's commercially accessible compounds, and, especially, to the theoretically feasible molecules. In order to solve this problem, we propose a Parallel GTM algorithm based on the merging of "intermediate" manifolds constructed in parallel for different subsets of molecules. An ensemble of these subsets forms a FS for the "final" manifold. In order to assess the efficiency of the new algorithm, 80 GTMs were built on the FSs of different sizes ranging from 10 to 1.8 M compounds selected from the ChEMBL database. Each GTM was challenged to build classification models for up to 712 biological activities (depending on the FS size). With the novel parallel GTM procedure, we could thus cover the entire spectrum of possible FS sizes, whereas previous studies were forced to rely on the working hypothesis that FS sizes of a few thousand compounds are sufficient to describe the ChEMBL chemical space. In fact, this study formally proves this to be true: a FS containing only 5000 randomly picked compounds is sufficient to represent the entire ChEMBL collection (1.8 M molecules), in the sense that a further increase of FS compound numbers has no beneficial impact on the predictive propensity of the above-mentioned 712 activity classification models. 
Parallel GTM may, however, be required to generate maps based on very large FS, that might improve chemical space cartography of big commercial and virtual libraries, approaching billions of compounds.


Tuerkova A, Zdrazil B. A ligand-based computational drug repurposing pipeline using KNIME and Programmatic Data Access: case studies for rare diseases and COVID-19. J Cheminform 2020;12:71. [PMID: 33250934 PMCID: PMC7686838 DOI: 10.1186/s13321-020-00474-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 07/06/2020] [Accepted: 11/09/2020] [Indexed: 01/01/2023]

Bento AP, Hersey A, Félix E, Landrum G, Gaulton A, Atkinson F, Bellis LJ, De Veij M, Leach AR. An open source chemical structure curation pipeline using RDKit. J Cheminform 2020;12:51. [PMID: 33431044 PMCID: PMC7458899 DOI: 10.1186/s13321-020-00456-1] [Citation(s) in RCA: 128] [Impact Index Per Article: 32.0] [Received: 06/11/2020] [Accepted: 08/24/2020] [Indexed: 11/13/2022]

Santana R, Zuluaga R, Gañán P, Arrasate S, Onieva E, Montemore MM, González-Díaz H. PTML Model for Selection of Nanoparticles, Anticancer Drugs, and Vitamins in the Design of Drug-Vitamin Nanoparticle Release Systems for Cancer Cotherapy. Mol Pharm 2020;17:2612-2627. [PMID: 32459098 DOI: 10.1021/acs.molpharmaceut.0c00308] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 01/08/2023]

Abstract

Nanosystems are gaining momentum in pharmaceutical sciences because of the wide variety of possibilities for designing these systems to have specific functions. Specifically, studies of new cancer cotherapy drug-vitamin release nanosystems (DVRNs) including anticancer compounds and vitamins or vitamin derivatives have revealed encouraging results. However, the number of possible combinations of design and synthesis conditions is remarkably high. In addition, a large number of anticancer and vitamin derivatives have been already assayed, but a notably less number of cases of DVRNs were assayed as a whole (with the anticancer compound and the vitamin linked to them). Our approach combines with the perturbation theory and machine learning (PTML) model to predict the probability of obtaining an interesting DVRN by changing the anticancer compound and/or the vitamin present in a DVRN that is already tested for other anticancer compounds or vitamins that have not been tested yet as part of a DVRN. In a previous work, we built a linear PTML model useful for the design of these nanosystems. In doing so, we used information fusion (IF) techniques to carry out data enrichment of DVRN data compiled from the literature with the data for preclinical assays of vitamins from the ChEMBL database. The design features of DVRNs and the assay conditions of nanoparticles (NPs) and vitamins were included as multiplicative PT operators (PTOs) to the system, which indicates the importance of these variables. However, the previous work omitted experiments with nonlinear ML techniques and different types of PTOs such as metric-based PTOs. More importantly, the previous work does not consider the structure of the anticancer drug to be included in the new DVRNs. In this work, we are going to accomplish three main objectives (tasks). In the first task, we found a new model, alternative to the one published before, for the rational design of DVRNs using metric-based PTOs. 
The most accurate PTML model was the artificial neural network model, which showed values of specificity, sensitivity, and accuracy in the range of 90-95% in training and external validation series for more than 130,000 cases (DVRNs vs ChEMBL assays). Furthermore, in the second task, we used IF techniques to carry out data enrichment of our previous data set. In doing so, we constructed a new working data set of >970,000 cases with the data of preclinical assays of DVRNs, vitamins, and anticancer compounds from the ChEMBL database. All these assays have multiple continuous variables or descriptors d_k and categorical variables c_j (conditions of the assays) for drugs (d_ack, c_acj), vitamins (d_vk, c_vj), and NPs (d_nk, c_nj). These data include >20,000 potential anticancer compounds with >270 protein targets (c_ac1), >580 assay cell organisms (c_ac2), and so forth. Furthermore, we include >36,000 assay vitamin derivatives in >6200 types of cells (c_2vit), >120 assay organisms (c_3vit), >60 assay strains (c_4vit), and so forth. The enriched data set also contains >20 types of DVRNs (c_5n) with 9 NP core materials (c_4n), 8 synthesis methods (c_7n), and so forth. We expressed all this information with PTOs and developed a qualitatively new PTML model that incorporates information of the anticancer drugs. This new model presents 96-97% of accuracy for training and external validation subsets. In the last task, we carried out a comparative study of ML and/or PTML models published and described how the models we are presenting cover the gap of knowledge in terms of drug delivery. In conclusion, we present here for the first time a multipurpose PTML model that is able to select NPs, anticancer compounds, and vitamins and their conditions of assay for DVRN design.


Cortés-Ciriano I, Škuta C, Bender A, Svozil D. QSAR-derived affinity fingerprints (part 2): modeling performance for potency prediction. J Cheminform 2020;12:41. [PMID: 33431016 PMCID: PMC7339533 DOI: 10.1186/s13321-020-00444-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 08/07/2019] [Accepted: 05/16/2020] [Indexed: 01/22/2023]

Abstract

Affinity fingerprints report the activity of small molecules across a set of assays, and thus permit to gather information about the bioactivities of structurally dissimilar compounds, where models based on chemical structure alone are often limited, and model complex biological endpoints, such as human toxicity and in vitro cancer cell line sensitivity. Here, we propose to model in vitro compound activity using computationally predicted bioactivity profiles as compound descriptors. To this aim, we apply and validate a framework for the calculation of QSAR-derived affinity fingerprints (QAFFP) using a set of 1360 QSAR models generated using K_i, K_d, IC₅₀ and EC₅₀ data from ChEMBL database. QAFFP thus represent a method to encode and relate compounds on the basis of their similarity in bioactivity space. To benchmark the predictive power of QAFFP we assembled IC₅₀ data from ChEMBL database for 18 diverse cancer cell lines widely used in preclinical drug discovery, and 25 diverse protein target data sets. This study complements part 1 where the performance of QAFFP in similarity searching, scaffold hopping, and bioactivity classification is evaluated. Despite being inherently noisy, we show that using QAFFP as descriptors leads to errors in prediction on the test set in the ~ 0.65-0.95 pIC₅₀ units range, which are comparable to the estimated uncertainty of bioactivity data in ChEMBL (0.76-1.00 pIC₅₀ units). We find that the predictive power of QAFFP is slightly worse than that of Morgan2 fingerprints and 1D and 2D physicochemical descriptors, with an effect size in the 0.02-0.08 pIC₅₀ units range. Including QSAR models with low predictive power in the generation of QAFFP does not lead to improved predictive power. Given that the QSAR models we used to compute the QAFFP were selected on the basis of data availability alone, we anticipate better modeling results for QAFFP generated using more diverse and biologically meaningful targets. 
Data sets and Python code are publicly available at https://github.com/isidroc/QAFFP_regression .
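
The prediction errors in this abstract are expressed in pIC₅₀ units. As a minimal sketch of what that scale means, assuming IC₅₀ values reported in nanomolar and a root-mean-square error as the error measure (the abstract does not name the exact metric), the conversion and the error can be computed as follows. The function names and numbers are illustrative only and are not taken from the QAFFP code base.

```python
import math


def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    return 9.0 - math.log10(ic50_nm)


def rmse(observed, predicted):
    """Root-mean-square error between two equally long sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# An IC50 of 1000 nM (1 micromolar) corresponds to pIC50 = 6.
print(pic50_from_nm(1000.0))

# Hypothetical observed vs. predicted pIC50 values.
print(rmse([6.0, 7.0], [6.5, 6.5]))
```

On this logarithmic scale, an error of one pIC₅₀ unit corresponds to a tenfold difference in IC₅₀, which is why errors of 0.65-0.95 units are compared against the experimental uncertainty of the source data.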


Chávez-Hernández AL, Sánchez-Cruz N, Medina-Franco JL. A Fragment Library of Natural Products and its Comparative Chemoinformatic Characterization. Mol Inform 2020;39:e2000050. [PMID: 32302465 DOI: 10.1002/minf.202000050] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 03/18/2020] [Accepted: 04/17/2020] [Indexed: 11/06/2022]

Sturm N, Mayr A, Le Van T, Chupakhin V, Ceulemans H, Wegner J, Golib-Dzib JF, Jeliazkova N, Vandriessche Y, Böhm S, Cima V, Martinovic J, Greene N, Vander Aa T, Ashby TJ, Hochreiter S, Engkvist O, Klambauer G, Chen H. Industry-scale application and evaluation of deep learning for drug target prediction. J Cheminform 2020;12:26. [PMID: 33430964 PMCID: PMC7169028 DOI: 10.1186/s13321-020-00428-5] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 10/29/2019] [Accepted: 03/30/2020] [Indexed: 12/02/2022]

Affiliation(s)

Noé Sturm Clinical Pharmacology and Safety Science, R&D BioPharmaceuticals, AstraZeneca, Pepparedsleden 1, 43183, Mölndal, Sweden
Andreas Mayr LIT AI Lab & Institute for Machine Learning, Johannes Kepler University Linz, Altenberger Str. 69, 4040, Linz, Austria
Thanh Le Van High-Dimensional Biology & Discovery Data Sciences, Discovery Sciences, Janssen Pharmaceutica, Turnhoutseweg 30, 2349, Beerse, Belgium
Vladimir Chupakhin High-Dimensional Biology & Discovery Data Sciences, Discovery Sciences, Janssen R&D, 1400 McKean Rd, Spring House, Pennsylvania, 19002, USA
Hugo Ceulemans High-Dimensional Biology & Discovery Data Sciences, Discovery Sciences, Janssen Pharmaceutica, Turnhoutseweg 30, 2349, Beerse, Belgium
Joerg Wegner High-Dimensional Biology & Discovery Data Sciences, Discovery Sciences, Janssen Pharmaceutica, Turnhoutseweg 30, 2349, Beerse, Belgium
Jose-Felipe Golib-Dzib High-Dimensional Biology & Discovery Data Sciences, Discovery Sciences, Janssen Cilag SA, Calle Río Jarama, 75A, 45007, Toledo, Spain
Nina Jeliazkova Ideaconsult Ltd., 4. Angel Kanchev Str., 1000, Sofia, Bulgaria
Yves Vandriessche Intel Corporation, Data Center Group, Veldkant 31, 2550, Kontich, Belgium
Stanislav Böhm IT4Innovations, VSB - Technical University of Ostrava, 17. Listopadu 2172/15, 70800, Ostrava-Poruba, Czech Republic
Vojtech Cima IT4Innovations, VSB - Technical University of Ostrava, 17. Listopadu 2172/15, 70800, Ostrava-Poruba, Czech Republic
Jan Martinovic IT4Innovations, VSB - Technical University of Ostrava, 17. Listopadu 2172/15, 70800, Ostrava-Poruba, Czech Republic
Nigel Greene Clinical Pharmacology and Safety Science, R&D BioPharmaceuticals, AstraZeneca, Pepparedsleden 1, 43183, Mölndal, Sweden
Tom Vander Aa Exascience Lab, Imec, Kapeldreef 75, 3001, Louvain, Belgium
Thomas J Ashby Exascience Lab, Imec, Kapeldreef 75, 3001, Louvain, Belgium
Sepp Hochreiter LIT AI Lab & Institute for Machine Learning, Johannes Kepler University Linz, Altenberger Str. 69, 4040, Linz, Austria
Ola Engkvist Hit Discovery, Discovery Sciences, R&D BioPharmaceuticals, AstraZeneca, Pepparedsleden 1, 43183, Mölndal, Sweden
Günter Klambauer LIT AI Lab & Institute for Machine Learning, Johannes Kepler University Linz, Altenberger Str. 69, 4040, Linz, Austria.
Hongming Chen Hit Discovery, Discovery Sciences, R&D BioPharmaceuticals, AstraZeneca, Pepparedsleden 1, 43183, Mölndal, Sweden.

Drakakis G, Cortés-Ciriano I, Alexander-Dann B, Bender A. Elucidating Compound Mechanism of Action and Predicting Cytotoxicity Using Machine Learning Approaches, Taking Prediction Confidence into Account. Curr Protoc Chem Biol 2020;11:e73. [PMID: 31483099 DOI: 10.1002/cpch.73] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/06/2023]

Santana R, Zuluaga R, Gañán P, Arrasate S, Onieva Caracuel E, González-Díaz H. PTML Model of ChEMBL Compounds Assays for Vitamin Derivatives. ACS Comb Sci 2020;22:129-141. [PMID: 32011854 DOI: 10.1021/acscombsci.9b00166] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 12/11/2022]

Abstract

Determining the biological activity of vitamin derivatives is needed given that organic synthesis of analogs of vitamins is an active field of interest for medicinal chemistry, pharmaceuticals, and food additives. Accordingly, scientists from different disciplines perform preclinical assays (n_ij) with a considerable combination of assay conditions (c_j). Indeed, the ChEMBL platform contains a database that includes results from 36 220 different biological activity bioassays of 21 240 different vitamins and vitamin derivatives. These assays are heterogeneous in terms of their combinations of c_j. They are focused on >500 different biological activity parameters (c₀), >340 different targets (c₁), >6200 types of cell (c₂), >120 organisms of assay (c₃), and >60 assay strains (c₄). The data set includes a total of >1850 niacin assays, >1580 tretinoin assays, >1580 retinol assays, 857 ascorbic acid assays, etc. Because this combinatorial data is difficult for researchers to assimilate, we propose to build a model by combining perturbation theory (PT) and machine learning (ML). Through this study, we propose a PTML (PT + ML) combinatorial model for ChEMBL results on the biological activity of vitamins and vitamin derivatives. The linear discriminant analysis (LDA) model presented the following results for the training subset: specificity (%) = 90.38, sensitivity (%) = 87.51, and accuracy (%) = 89.89. The model showed the following results for the external validation subset: specificity (%) = 90.58, sensitivity (%) = 87.72, and accuracy (%) = 90.09. Different types of linear and nonlinear PTML models, such as logistic regression (LR), classification tree (CT), naïve Bayes (NB), and random forest (RF), were applied to compare their predictive capacity. The PTML-LDA model predicts with more accuracy by applying combinatorial descriptors.
In addition, a PCA experiment with chemical structure descriptors allowed us to characterize the high structural diversity of the chemical space studied. In any case, PTML models using chemical structure descriptors do not improve the performance of the PTML-LDA model based on ALOGP and PSA. We can conclude that the three-variable PTML-LDA model is a simplified and adaptable tool for predicting the biological activity of vitamin derivatives under different combinations of experimental conditions.

Sarkar A. Enabling design of screening libraries for antibiotic discovery by modeling ChEMBL data. Eur J Pharm Sci 2020;143:105166. [PMID: 31783159 DOI: 10.1016/j.ejps.2019.105166] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/22/2019] [Revised: 11/11/2019] [Accepted: 11/24/2019] [Indexed: 11/17/2022]

Abstract

It is critical to identify novel antibiotics. Yet, the scientific community has struggled in this pursuit because we do not understand which molecules will penetrate the bacterial outer envelope. In this work, we have identified a large dataset of compounds known to reach their targets in bacterial cells (penetrators) and compared them with molecules that do not (non-penetrators). Our dataset, extracted from the ChEMBL database, is a useful tool to guide the selection of molecules for antibiotic screening. Simple random forest classification models are able to distinguish penetrators from non-penetrators. The model demonstrated ~87% accuracy, with high precision (~88%) and recall (~97%) in identifying penetrators of Gram-positive bacteria. A paucity of data for non-penetrators was a major hurdle to model-building; we observed a ~86% negative predictive value, but only a ~57% specificity. Accumulation of data on non-penetrators is therefore necessary. Data for Gram-negative bacteria were also sparse, but a larger fraction of these data represented non-penetrators. Correspondingly, the resultant models performed well in predicting those molecules that would fail to enter Gram-negative cells, but were relatively weaker in correctly predicting penetrators. A comparison of physicochemical properties of penetrators and non-penetrators suggests only marginal differences exist. Therefore, it may be difficult to identify overarching rules for generation of screening libraries for antibiotic discovery based on physicochemical properties alone. Instead, models such as ours should be of use. Our models are highly preliminary and based on phenotypic data, but a similar large dataset directly addressing accumulation of chemical matter in bacterial cells is currently unavailable. Hence, our models represent the cutting edge in design of screening libraries for antibiotic discovery until appropriate data can be compiled.
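The abstract reports accuracy, precision, recall, negative predictive value, and specificity side by side. The sketch below shows how these five quantities follow from one confusion matrix and why high recall can coexist with low specificity when negatives are scarce. The function name and the counts are illustrative assumptions; they are not the paper's data.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp/fn count true positives and missed positives (here: penetrators),
    tn/fp count true negatives and false alarms (here: non-penetrators).
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),    # positive predictive value
        "recall": tp / (tp + fn),       # sensitivity
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Hypothetical, imbalanced test set: 100 penetrators, 30 non-penetrators.
metrics = classification_metrics(tp=97, fp=13, tn=17, fn=3)
```

With only 30 negatives in this made-up example, 13 false positives are enough to pull specificity below 60% even though almost every penetrator is recovered.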

Diez-Alarcia R, Yáñez-Pérez V, Muneta-Arrate I, Arrasate S, Lete E, Meana JJ, González-Díaz H. Big Data Challenges Targeting Proteins in GPCR Signaling Pathways; Combining PTML-ChEMBL Models and [³⁵S]GTPγS Binding Assays. ACS Chem Neurosci 2019;10:4476-4491. [PMID: 31618004 DOI: 10.1021/acschemneuro.9b00302] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Indexed: 12/18/2022] Open

Abstract

G-protein-coupled receptors (GPCRs), also known as 7-transmembrane receptors, are the single largest class of drug targets. Consequently, a large number of preclinical assays having GPCRs as molecular targets have been released to public sources like the Chemical European Molecular Biology Laboratory (ChEMBL) database. These data are also very complex, covering changes in drug chemical structure and assay conditions like c₀ = activity parameter (K_i, IC₅₀, etc.), c₁ = target protein, c₂ = cell line, c₃ = assay organism, etc., which makes these databases difficult to analyze and places them on the border of a Big Data challenge. One of the aims of this work is to develop a computational model able to predict new GPCR-targeting drugs taking into consideration multiple conditions of assay. Another objective is to perform new predictive and experimental studies of selective 5-HT2A receptor agonists, antagonists, or inverse agonists in humans, comparing the results with those from the literature. In this work, we combined Perturbation Theory (PT) and Machine Learning (ML) to seek a general PTML model for this data set. We analyzed 343 738 unique compounds with 812 072 end points (assay outcomes), with 185 different experimental parameters, 592 protein targets, 51 cell lines, and/or 55 organisms (species). The best PTML linear model found has only three input variables and predicted 56 202/58 653 positive outcomes (sensitivity = 95.8%) and 470 230/550 401 control cases (specificity = 85.4%) in the training series. The model also correctly predicted 18 732/19 549 (95.8%) of positive outcomes and 156 739/183 469 (85.4%) of control cases in the external validation series. To illustrate its practical use, we used the model to predict the outcomes of six different 5-HT2A receptor drugs, namely, TCB-2, DOI, DOB, altanserin, pimavanserin, and nelotanserin, in a very large number of different pharmacological assays.
5-HT2A receptors are altered in schizophrenia and represent a drug target for antipsychotic therapeutic activity. The model correctly predicted 93.83% (76 of 86) experimental results for these compounds reported in ChEMBL. Moreover, [³⁵S]GTPγS binding assays were performed experimentally with the same six drugs with the aim of determining their potency and efficacy in the modulation of G-proteins in human brain tissue. The antagonist ketanserin was included as an inactive drug with demonstrated affinity for 5-HT2A/C receptors. Our results demonstrate that some of these drugs, previously described as serotonin 5-HT2A receptor agonists, antagonists, or inverse agonists, are not so specific and show different intrinsic activity to that previously reported. Overall, this work opens a new gate for the prediction of GPCR-targeting compounds.
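The sensitivity and specificity percentages for the training and validation series can be recomputed directly from the counts quoted in the abstract. The short sketch below does that arithmetic; the dictionary layout and variable names are illustrative choices, while the counts themselves are taken from the text above.

```python
# (correctly predicted cases, total cases) as quoted in the abstract.
reported_counts = {
    "training sensitivity": (56202, 58653),
    "training specificity": (470230, 550401),
    "validation sensitivity": (18732, 19549),
    "validation specificity": (156739, 183469),
}

# Express each ratio as a percentage rounded to one decimal place.
percentages = {
    name: round(100 * correct / total, 1)
    for name, (correct, total) in reported_counts.items()
}

for name, value in percentages.items():
    print(f"{name}: {value}%")
```

The training and validation figures agree to one decimal place, which is the pattern the abstract relies on when it treats the model as stable across the two series.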

Vásquez-Domínguez E, Armijos-Jaramillo VD, Tejera E, González-Díaz H. Multioutput Perturbation-Theory Machine Learning (PTML) Model of ChEMBL Data for Antiretroviral Compounds. Mol Pharm 2019;16:4200-4212. [PMID: 31426639 DOI: 10.1021/acs.molpharmaceut.9b00538] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Indexed: 12/22/2022]

Abstract

Retroviral infections, such as HIV, remain diseases with no cure. Defining the target proteins of new antiretroviral compounds is a major goal for medicine and pharmaceutical chemistry. ChEMBL holds a complex data set with Big Data features that is hard to organize. The large number of characteristics described makes this information difficult to analyze in order to predict new drug candidates for retroviral infections. For this reason, we propose to develop a new predictive model combining perturbation theory (PT) principles and machine learning (ML) modeling to create a new tool that can take advantage of all the available information. The PTML model proposed in this work for the ChEMBL data set of preclinical experimental assays for antiretroviral compounds consists of a linear equation with four variables. The PT operators used are based on multicondition moving averages, which combine different features and make the data easier to manage. More than 140 000 preclinical assays for 56 105 compounds with different characteristics or experimental conditions have been carried out and can be found in the ChEMBL database, covering combinations with 359 biological activity parameters (c₀), 55 protein accessions (c₁), 83 cell lines (c₂), 64 organisms of assay (c₃), and 773 subtypes or strains. We have included 150 148 preclinical experimental assays for HIV virus, 1188 for HTLV virus, 84 for simian immunodeficiency virus, 370 for murine leukemia virus, 119 for Rous sarcoma virus, 1581 for MMTV, etc. We also included 5277 assays for hepatitis B virus. The developed PTML model reached considerable values in sensitivity (73.05% for training and 73.10% for validation), specificity (86.61% for training and 87.17% for validation), and accuracy (75.84% for training and 75.98% for validation). We also compared alternative PTML models with different PT operators such as covariance, moments, and exponential terms.
Finally, we compared our PTML model with ML models from the literature and with nonlinear artificial neural network (ANN) models. We conclude that this PTML model is the first one to consider multiple characteristics of preclinical experimental antiretroviral assays combined, generating a simple, useful, and adaptable instrument, which could reduce time and costs in antiretroviral drug research.
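The abstract states that the PT operators are based on multicondition moving averages but does not give their form. One common reading, assumed here purely for illustration, is that each compound descriptor is expressed as its deviation from the mean of that descriptor over all cases sharing a given assay condition. The function name, the field names (`logp`, `c0`, `c2`), and the data below are hypothetical.

```python
from collections import defaultdict


def moving_average_deviations(cases, descriptor, conditions):
    """For every case and every condition column, return the descriptor value
    minus the mean descriptor value of all cases sharing that condition level."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for case in cases:
        for cond in conditions:
            key = (cond, case[cond])
            sums[key] += case[descriptor]
            counts[key] += 1

    deviations = []
    for case in cases:
        row = {}
        for cond in conditions:
            key = (cond, case[cond])
            row[cond] = case[descriptor] - sums[key] / counts[key]
        deviations.append(row)
    return deviations


# Hypothetical cases: one descriptor and two assay conditions each.
cases = [
    {"logp": 2.0, "c0": "IC50", "c2": "cell_A"},
    {"logp": 4.0, "c0": "IC50", "c2": "cell_B"},
    {"logp": 3.0, "c0": "EC50", "c2": "cell_A"},
]
deviations = moving_average_deviations(cases, "logp", ["c0", "c2"])
```

Under this assumption, a linear model is then fitted on the deviation terms rather than on the raw descriptor, so the same compound contributes differently depending on the assay conditions it was tested under.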

Liang L, Ma C, Du T, Zhao Y, Zhao X, Liu M, Wang Z, Lin J. Bioactivity-explorer: a web application for interactive visualization and exploration of bioactivity data. J Cheminform 2019;11:47. [PMID: 31292807 PMCID: PMC6617623 DOI: 10.1186/s13321-019-0370-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 03/25/2019] [Accepted: 07/02/2019] [Indexed: 12/29/2022] Open

Affiliation(s)

Lu Liang State Key Laboratory of Medicinal Chemical Biology, College of Pharmacy and Tianjin Key Laboratory of Molecular Drug Research, Nankai University, Haihe Education Park, 38 Tongyan Road, Tianjin, 300353, China
Chunfeng Ma State Key Laboratory of Medicinal Chemical Biology, College of Pharmacy and Tianjin Key Laboratory of Molecular Drug Research, Nankai University, Haihe Education Park, 38 Tongyan Road, Tianjin, 300353, China.,Platform of Pharmaceutical Intelligence, Tianjin International Joint Academy of Biomedicine, Tianjin, 300000, China
Tengfei Du State Key Laboratory of Medicinal Chemical Biology, College of Pharmacy and Tianjin Key Laboratory of Molecular Drug Research, Nankai University, Haihe Education Park, 38 Tongyan Road, Tianjin, 300353, China
Yufei Zhao State Key Laboratory of Medicinal Chemical Biology, College of Pharmacy and Tianjin Key Laboratory of Molecular Drug Research, Nankai University, Haihe Education Park, 38 Tongyan Road, Tianjin, 300353, China.,Platform of Pharmaceutical Intelligence, Tianjin International Joint Academy of Biomedicine, Tianjin, 300000, China
Xiaoyong Zhao State Key Laboratory of Medicinal Chemical Biology, College of Pharmacy and Tianjin Key Laboratory of Molecular Drug Research, Nankai University, Haihe Education Park, 38 Tongyan Road, Tianjin, 300353, China
Mengmeng Liu State Key Laboratory of Medicinal Chemical Biology, College of Pharmacy and Tianjin Key Laboratory of Molecular Drug Research, Nankai University, Haihe Education Park, 38 Tongyan Road, Tianjin, 300353, China.,Platform of Pharmaceutical Intelligence, Tianjin International Joint Academy of Biomedicine, Tianjin, 300000, China
Zhonghua Wang Tianjin Institute of Industrial Biotechnology, Biodesign Center, Chinese Academy of Sciences, Tianjin, China.
Jianping Lin State Key Laboratory of Medicinal Chemical Biology, College of Pharmacy and Tianjin Key Laboratory of Molecular Drug Research, Nankai University, Haihe Education Park, 38 Tongyan Road, Tianjin, 300353, China.,Tianjin Institute of Industrial Biotechnology, Biodesign Center, Chinese Academy of Sciences, Tianjin, China.,Platform of Pharmaceutical Intelligence, Tianjin International Joint Academy of Biomedicine, Tianjin, 300000, China.

Cortés-Ciriano I, Bender A. KekuleScope: prediction of cancer cell line sensitivity and compound potency using convolutional neural networks trained on compound images. J Cheminform 2019;11:41. [PMID: 31218493 PMCID: PMC6582521 DOI: 10.1186/s13321-019-0364-5] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 02/19/2019] [Accepted: 06/09/2019] [Indexed: 02/08/2023] Open

Abstract

The application of convolutional neural networks (ConvNets) to harness high-content screening images or 2D compound representations is gaining increasing attention in drug discovery. However, existing applications often require large data sets for training, or sophisticated pretraining schemes. Here, we show using 33 IC₅₀ data sets from ChEMBL 23 that the in vitro activity of compounds on cancer cell lines and protein targets can be accurately predicted on a continuous scale from their Kekulé structure representations alone by extending existing architectures (AlexNet, DenseNet-201, ResNet152 and VGG-19), which were pretrained on unrelated image data sets. We show that the predictive power of the generated models, which require only standard 2D compound representations as input, is comparable to that of Random Forest (RF) models and fully-connected Deep Neural Networks trained on circular (Morgan) fingerprints. Notably, including additional fully-connected layers further increases the predictive power of the ConvNets by up to 10%. Analysis of the predictions generated by RF models and ConvNets shows that simply averaging the output of the RF models and ConvNets gives significantly lower errors in prediction for multiple data sets than those obtained with either model alone, although the effect size is small. This indicates that the features extracted by the convolutional layers of the ConvNets provide predictive signal complementary to Morgan fingerprints. Lastly, we show that multi-task ConvNets trained on compound images make it possible to model COX isoform selectivity on a continuous scale with errors in prediction comparable to the uncertainty of the data. Overall, in this work we present a set of ConvNet architectures for the prediction of compound activity from their Kekulé structure representations with state-of-the-art performance, which require no generation of compound descriptors or use of sophisticated image processing techniques.
The code needed to reproduce the results presented in this study and all the data sets are provided at https://github.com/isidroc/kekulescope.
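The abstract notes that simply averaging Random Forest and ConvNet outputs lowered the prediction error. A minimal sketch of that averaging step follows, using RMSE as an assumed error measure and made-up pIC₅₀ values chosen so that the two models err in opposite directions. None of the names or numbers come from the KekuleScope repository.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def average_predictions(preds_a, preds_b):
    """Unweighted mean of two models' predictions, compound by compound."""
    return [(a + b) / 2 for a, b in zip(preds_a, preds_b)]


# Hypothetical pIC50 values for three compounds.
observed = [6.0, 7.0, 8.0]
rf_predictions = [6.4, 7.4, 8.4]        # this model overestimates
convnet_predictions = [5.8, 6.8, 7.8]   # this model underestimates
ensemble = average_predictions(rf_predictions, convnet_predictions)

rf_error = rmse(observed, rf_predictions)
convnet_error = rmse(observed, convnet_predictions)
ensemble_error = rmse(observed, ensemble)
```

Averaging only helps when the two models' errors are not strongly correlated, which is consistent with the abstract's reading that the convolutional features carry signal complementary to Morgan fingerprints.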

Hu Y, Bajorath J. SAR Matrix Method for Large-Scale Analysis of Compound Structure-Activity Relationships and Exploration of Multitarget Activity Spaces. Methods Mol Biol 2019;1825:339-352. [PMID: 30334212 DOI: 10.1007/978-1-4939-8639-2_11] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 04/16/2023]

Lin A, Horvath D, Marcou G, Beck B, Varnek A. Multi-task generative topographic mapping in virtual screening. J Comput Aided Mol Des 2019;33:331-343. [PMID: 30739238 DOI: 10.1007/s10822-019-00188-x] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 09/15/2018] [Accepted: 02/02/2019] [Indexed: 12/16/2022]

Bosc N, Atkinson F, Felix E, Gaulton A, Hersey A, Leach AR. Large scale comparison of QSAR and conformal prediction methods and their applications in drug discovery. J Cheminform 2019;11:4. [PMID: 30631996 PMCID: PMC6690068 DOI: 10.1186/s13321-018-0325-4] [Citation(s) in RCA: 70] [Impact Index Per Article: 14.0] [Received: 09/14/2018] [Accepted: 12/24/2018] [Indexed: 12/22/2022] Open

Ferreira da Costa J, Silva D, Caamaño O, Brea JM, Loza MI, Munteanu CR, Pazos A, García-Mera X, González-Díaz H. Perturbation Theory/Machine Learning Model of ChEMBL Data for Dopamine Targets: Docking, Synthesis, and Assay of New l-Prolyl-l-leucyl-glycinamide Peptidomimetics. ACS Chem Neurosci 2018;9:2572-2587. [PMID: 29791132 DOI: 10.1021/acschemneuro.8b00083] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Indexed: 12/14/2022] Open

Abstract

Predicting drug-protein interactions (DPIs) for target proteins involved in dopamine pathways is a very important goal in medicinal chemistry. We can tackle this problem using Molecular Docking or Machine Learning (ML) models for one specific protein. Unfortunately, these models fail to account for large and complex big data sets of preclinical assays reported in public databases. These data sets include multiple conditions of assay, such as different experimental parameters, biological assays, target proteins, cell lines, organism of the target, or organism of assay. On the other hand, perturbation theory (PT) models allow us to predict the properties of a query compound or molecular system in experimental assays with multiple boundary conditions based on a previously known case of reference. In this work, we report the first PTML (PT + ML) study of a large ChEMBL data set of preclinical assays of compounds targeting dopamine pathway proteins. The best PTML model found predicts 50000 cases with accuracy of 70-91% in training and external validation series. We also compared the linear PTML model with alternative PTML models trained with multiple nonlinear methods (artificial neural network (ANN), Random Forest, Deep Learning, etc.). Some of the nonlinear methods outperform the linear model but at the cost of a notable increase in the complexity of the model. We illustrated the practical use of the new model with a proof-of-concept theoretical-experimental study. We reported for the first time the organic synthesis, chemical characterization, and pharmacological assay of a new series of l-prolyl-l-leucyl-glycinamide (PLG) peptidomimetic compounds. In addition, we performed a molecular docking study for some of these compounds with the software AutoDock Vina. The work ends with a PTML model predictive study of the outcomes of the new compounds in a large number of assays.
Therefore, this study offers a new computational methodology for predicting the outcome for any compound in new assays. This PTML method uses a simple linear model to predict multiple pharmacological parameters (IC₅₀, EC₅₀, K_i, etc.) for compounds in assays involving different cell lines, organisms of the protein target, or organisms of assay for proteins in the dopamine pathway.

Bediaga H, Arrasate S, González-Díaz H. PTML Combinatorial Model of ChEMBL Compounds Assays for Multiple Types of Cancer. ACS Comb Sci 2018;20:621-632. [PMID: 30240186 DOI: 10.1021/acscombsci.8b00090] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Indexed: 11/29/2022]

Abstract

Determining the target proteins of new anticancer compounds is a very important task in Medicinal Chemistry. In this sense, chemists carry out preclinical assays with a high number of combinations of experimental conditions (c_j). In fact, the ChEMBL database contains outcomes of 65 534 different anticancer activity preclinical assays for 35 565 different chemical compounds (1.84 assays per compound). These assays cover different combinations of c_j formed from >70 different biological activity parameters (c₀), >300 different drug targets (c₁), >230 cell lines (c₂), and 5 organisms of assay (c₃) or organisms of the target (c₄). It includes a total of 45 833 assays in leukemia, 6227 assays in breast cancer, 2499 assays in ovarian cancer, 3499 in colon cancer, 3159 in lung cancer, 2750 in prostate cancer, 601 in melanoma, etc. This is a very complex data set with multiple Big Data features. It is hard for researchers to rationalize these data in order to extract useful relationships and predict new compounds. In this context, we propose to combine perturbation theory (PT) ideas and machine learning (ML) modeling to solve this combinatorial-like problem. In this work, we report a PTML (PT + ML) model for a ChEMBL data set of preclinical assays of anticancer compounds. This is a simple linear model with only three variables. The model presented values of area under receiver operating curve = AUROC = 0.872, specificity = Sp(%) = 90.2, sensitivity = Sn(%) = 70.6, and overall accuracy = Ac(%) = 87.7 in training series. The model also has Sp(%) = 90.1, Sn(%) = 71.4, and Ac(%) = 87.8 in external validation series. The model uses PT operators based on multicondition moving averages to capture all the complexity of the data set. We also compared the model with nonlinear artificial neural network (ANN) models, obtaining similar results.
This confirms the hypothesis of a linear relationship between the PT operators and the classification as anticancer compounds in different combinations of assay conditions. Last, we compared the model with other PTML models reported in the literature, concluding that this is the only PTML model able to predict activity against multiple types of cancer. This model is a simple but versatile tool for the prediction of the targets of anticancer compounds taking into consideration multiple combinations of experimental conditions in preclinical assays.

Lagunin AA, Romanova MA, Zadorozhny AD, Kurilenko NS, Shilov BV, Pogodin PV, Ivanov SM, Filimonov DA, Poroikov VV. Comparison of Quantitative and Qualitative (Q)SAR Models Created for the Prediction of K_i and IC₅₀ Values of Antitarget Inhibitors. Front Pharmacol 2018;9:1136. [PMID: 30364128 PMCID: PMC6192375 DOI: 10.3389/fphar.2018.01136] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 05/02/2018] [Accepted: 09/18/2018] [Indexed: 12/20/2022] Open

Fukunishi Y, Yamashita Y, Mashimo T, Nakamura H. Prediction of Protein-compound Binding Energies from Known Activity Data: Docking-score-based Method and its Applications. Mol Inform 2018;37:e1700120. [PMID: 29442436 PMCID: PMC6055825 DOI: 10.1002/minf.201700120] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 09/27/2017] [Accepted: 01/22/2018] [Indexed: 12/18/2022]

Pogodin PV, Lagunin AA, Rudik AV, Filimonov DA, Druzhilovskiy DS, Nicklaus MC, Poroikov VV. How to Achieve Better Results Using PASS-Based Virtual Screening: Case Study for Kinase Inhibitors. Front Chem 2018;6:133. [PMID: 29755970 PMCID: PMC5935003 DOI: 10.3389/fchem.2018.00133] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 01/29/2018] [Accepted: 04/09/2018] [Indexed: 12/16/2022] Open

Abstract

Discovery of new pharmaceutical substances is currently boosted by the availability of the Synthetically Accessible Virtual Inventory (SAVI) library, which includes about 283 million molecules, each annotated with a proposed synthetic one-step route from commercially available starting materials. The SAVI database is well-suited for ligand-based methods of virtual screening to select molecules for experimental testing. In this study, we compare the performance of three approaches for the analysis of structure-activity relationships that differ in their criteria for selecting the "active" and "inactive" compounds included in the training sets. PASS (Prediction of Activity Spectra for Substances), which is based on a modified Naïve Bayes algorithm, was applied since it had been shown to be robust and to provide good predictions of many biological activities based on just the structural formula of a compound even if the information in the training set is incomplete. We used different subsets of kinase inhibitors for this case study because many data are currently available on this important class of drug-like molecules. Based on the subsets of kinase inhibitors extracted from the ChEMBL 20 database we performed the PASS training, and then applied the model to ChEMBL 23 compounds not yet present in ChEMBL 20 to identify novel kinase inhibitors. As one may expect, the best prediction accuracy was obtained if only the experimentally confirmed active and inactive compounds for distinct kinases were used in the training procedure. However, for some kinases, reasonable results were obtained even if we used merged training sets, in which we designated as inactives the compounds not tested against the particular kinase. Thus, depending on the availability of data for a particular biological activity, one may choose the first or the second approach for creating ligand-based computational tools to achieve the best possible results in virtual screening.
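Two data-handling steps in this abstract lend themselves to a small sketch: selecting compounds that appear in a newer database release but not in an older one, and building a "merged" training set in which compounds not tested against a kinase are designated inactive. The function names and compound identifiers below are hypothetical, and the sketch makes no claim about how PASS or the authors' pipeline implements either step.

```python
def novel_compounds(newer_release, older_release):
    """Identifiers present in the newer release but absent from the older one."""
    return sorted(set(newer_release) - set(older_release))


def merged_training_labels(all_compounds, confirmed_actives):
    """Label every compound for one kinase: True if experimentally confirmed
    active, False otherwise (confirmed inactive or simply never tested)."""
    actives = set(confirmed_actives)
    return {compound: compound in actives for compound in all_compounds}


# Hypothetical identifiers standing in for two database releases.
older = ["CPD-1", "CPD-2", "CPD-3"]
newer = ["CPD-1", "CPD-2", "CPD-3", "CPD-4", "CPD-5"]
evaluation_set = novel_compounds(newer, older)

# Hypothetical merged training set for a single kinase.
labels = merged_training_labels(older, confirmed_actives=["CPD-2"])
```

The merged labeling trades label quality for coverage: untested compounds may in fact be active, which is one reason the abstract reports the best accuracy when only experimentally confirmed actives and inactives are used.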

Fedoros EI, Orlov AA, Zherebker A, Gubareva EA, Maydin MA, Konstantinov AI, Krasnov KA, Karapetian RN, Izotova EI, Pigarev SE, Panchenko AV, Tyndyk ML, Osolodkin DI, Nikolaev EN, Perminova IV, Anisimov VN. Novel water-soluble lignin derivative BP-Cx-1: identification of components and screening of potential targets in silico and in vitro. Oncotarget 2018;9:18578-18593. [PMID: 29719628 PMCID: PMC5915095 DOI: 10.18632/oncotarget.24990] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 12/15/2017] [Accepted: 12/16/2017] [Indexed: 11/25/2022] Open

Abstract

Identification of molecular targets and mechanism of action is always a challenge, in particular for natural compounds, due to their inherent chemical complexity. BP-Cx-1 is a water-soluble modification of hydrolyzed lignin used as the platform for a portfolio of innovative pharmacological products aimed at therapy and supportive care of oncological patients. The present study describes a new approach, which combines in vitro screening of potential molecular targets for BP-Cx-1 using the Diversity Profile - P9 panel by Eurofins Cerep (France) with an in silico search for possible active components in ChEMBL, a manually curated chemical database of bioactive molecules with drug-like properties. The results of the diversity assay demonstrate that BP-Cx-1 has multiple biological effects on neurotransmitter receptors, ligand-gated ion channels and transporters. Of particular importance is that the major part of the identified molecular targets are involved in modulation of inflammation and immune response and might be related to tumorigenesis. Characterization of the molecular composition of BP-Cx-1 with Fourier Transform Ion Cyclotron Resonance Mass Spectrometry and subsequent identification of possible active components by searching for molecular matches in silico in ChEMBL indicated polyphenolic components, namely flavonoids, sapogenins, and phenanthrenes, as the major carriers of the biological activity of BP-Cx-1. In vitro and in silico target screening yielded overlapping lists of proteins: adenosine receptors, dopamine receptor DRD4, glucocorticoid receptor, serotonin receptor 5-HT1, prostaglandin receptors, muscarinic cholinergic receptor, and GABAA receptor. The pleiotropic molecular activities of polyphenolic components are beneficial in the treatment of multifactorial disorders such as diseases associated with chronic inflammation and cancer.

Affiliation(s)

Elena I Fedoros N.N. Petrov National Medical Research Center of Oncology, Saint-Petersburg 197758, Russia.,Nobel LTD, Saint-Petersburg 192012, Russia
Alexey A Orlov Department of Chemistry, Lomonosov Moscow State University, Moscow 119991, Russia
Alexander Zherebker Department of Chemistry, Lomonosov Moscow State University, Moscow 119991, Russia.,Skolkovo Institute of Science and Technology, Skolkovo 143025, Russia
Ekaterina A Gubareva N.N. Petrov National Medical Research Center of Oncology, Saint-Petersburg 197758, Russia
Mikhail A Maydin N.N. Petrov National Medical Research Center of Oncology, Saint-Petersburg 197758, Russia
Andrey I Konstantinov Department of Chemistry, Lomonosov Moscow State University, Moscow 119991, Russia
Konstantin A Krasnov Institute of Toxicology, Federal Medical-Biological Agency, Saint-Petersburg 192019, Russia
Ruben N Karapetian CHEMDIV LTD, Moscow District, Khimki 141400, Russia
Ekaterina I Izotova Nobel LTD, Saint-Petersburg 192012, Russia
Sergey E Pigarev Nobel LTD, Saint-Petersburg 192012, Russia
Andrey V Panchenko N.N. Petrov National Medical Research Center of Oncology, Saint-Petersburg 197758, Russia
Margarita L Tyndyk N.N. Petrov National Medical Research Center of Oncology, Saint-Petersburg 197758, Russia
Dmitry I Osolodkin Institute of Poliomyelitis and Viral Encephalitides, Chumakov FSC R&D IBP RAS, Moscow 108819, Russia.,Sechenov First Moscow State Medical University, Moscow 119991, Russia
Evgeny N Nikolaev Skolkovo Institute of Science and Technology, Skolkovo 143025, Russia.,Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, Moscow 119334, Russia.,Orekhovich Institute of Biomedical Chemistry, Russian Academy of Medical Sciences, Moscow 119121, Russia
Irina V Perminova Department of Chemistry, Lomonosov Moscow State University, Moscow 119991, Russia
Vladimir N Anisimov N.N. Petrov National Medical Research Center of Oncology, Saint-Petersburg 197758, Russia

Al Mahmud R, Najnin RA, Polash AH. A Survey of Web-Based Chemogenomic Data Resources. Methods Mol Biol 2018;1825:3-62. [PMID: 30334202 DOI: 10.1007/978-1-4939-8639-2_1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]

Ong E, Xie J, Ni Z, Liu Q, Sarntivijai S, Lin Y, Cooper D, Terryn R, Stathias V, Chung C, Schürer S, He Y. Ontological representation, integration, and analysis of LINCS cell line cells and their cellular responses. BMC Bioinformatics 2017;18:556. [PMID: 29322930 PMCID: PMC5763302 DOI: 10.1186/s12859-017-1981-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022] Open

Affiliation(s)

Edison Ong Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
Jiangan Xie Unit of Laboratory Animal Medicine and Department of Microbiology and Immunology, University of Michigan, Ann Arbor, MI, USA
Zhaohui Ni Unit of Laboratory Animal Medicine and Department of Microbiology and Immunology, University of Michigan, Ann Arbor, MI, USA
Qingping Liu Unit of Laboratory Animal Medicine and Department of Microbiology and Immunology, University of Michigan, Ann Arbor, MI, USA
Sirarat Sarntivijai Samples, Phenotypes and Ontologies Team, European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Hinxton, Cambridge, UK
Yu Lin Department of Molecular and Cellular Pharmacology, University of Miami, Miami, FL, USA
Daniel Cooper Department of Molecular and Cellular Pharmacology, University of Miami, Miami, FL, USA.,BD2K LINCS Data Coordination and Integration Center, University of Miami, Miami, FL, USA
Raymond Terryn Department of Molecular and Cellular Pharmacology, University of Miami, Miami, FL, USA.,BD2K LINCS Data Coordination and Integration Center, University of Miami, Miami, FL, USA
Vasileios Stathias Department of Molecular and Cellular Pharmacology, University of Miami, Miami, FL, USA.,BD2K LINCS Data Coordination and Integration Center, University of Miami, Miami, FL, USA
Caty Chung BD2K LINCS Data Coordination and Integration Center, University of Miami, Miami, FL, USA.,Center for Computational Science, University of Miami, Miami, FL, USA
Stephan Schürer Department of Molecular and Cellular Pharmacology, University of Miami, Miami, FL, USA.,BD2K LINCS Data Coordination and Integration Center, University of Miami, Miami, FL, USA.,Center for Computational Science, University of Miami, Miami, FL, USA
Yongqun He Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.,Unit of Laboratory Animal Medicine and Department of Microbiology and Immunology, University of Michigan, Ann Arbor, MI, USA

Lenselink EB, Ten Dijke N, Bongers B, Papadatos G, van Vlijmen HWT, Kowalczyk W, IJzerman AP, van Westen GJP. Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set. J Cheminform 2017;9:45. [PMID: 29086168 PMCID: PMC5555960 DOI: 10.1186/s13321-017-0232-0] [Citation(s) in RCA: 165] [Impact Index Per Article: 23.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2017] [Accepted: 07/31/2017] [Indexed: 11/10/2022] Open

Abstract

The increase of publicly available bioactivity data in recent years has fueled and catalyzed research in chemogenomics, data mining, and modeling approaches. As a direct result, over the past few years a multitude of different methods have been reported and evaluated, such as target fishing, nearest neighbor similarity-based methods, and Quantitative Structure Activity Relationship (QSAR)-based protocols. However, such studies are typically conducted on different datasets, using different validation strategies, and different metrics. In this study, different methods were compared using one single standardized dataset obtained from ChEMBL, which is made available to the public, using standardized metrics (BEDROC and Matthews Correlation Coefficient). Specifically, the performance of Naïve Bayes, Random Forests, Support Vector Machines, Logistic Regression, and Deep Neural Networks was assessed using QSAR and proteochemometric (PCM) methods. All methods were validated using both a random split validation and a temporal validation, with the latter being a more realistic benchmark of expected prospective execution. Deep Neural Networks are the top performing classifiers, highlighting the added value of Deep Neural Networks over other more conventional methods. Moreover, the best method ('DNN_PCM') performed significantly better at almost one standard deviation higher than the mean performance. Furthermore, Multi-task and PCM implementations were shown to improve performance over single task Deep Neural Networks. Conversely, target prediction performed almost two standard deviations under the mean performance. Random Forests, Support Vector Machines, and Logistic Regression performed around mean performance. Finally, using an ensemble of DNNs, alongside additional tuning, enhanced the relative performance by another 27% (compared with unoptimized 'DNN_PCM'). 
Here, a standardized set to test and evaluate different machine learning algorithms in the context of multi-task learning is offered by providing the data and the protocols.

Nowotka MM, Gaulton A, Mendez D, Bento AP, Hersey A, Leach A. Using ChEMBL web services for building applications and data processing workflows relevant to drug discovery. Expert Opin Drug Discov 2017;12:757-767. [PMID: 28602100 DOI: 10.1080/17460441.2017.1339032] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/31/2023]

Senger S. Assessment of the significance of patent-derived information for the early identification of compound-target interaction hypotheses. J Cheminform 2017;9:26. [PMID: 29086108 PMCID: PMC5400772 DOI: 10.1186/s13321-017-0214-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2016] [Accepted: 04/13/2017] [Indexed: 11/16/2022] Open

Abstract

Background

Patents are an important source of information for effective decision making in drug discovery. Encouragingly, freely accessible patent-chemistry databases are now in the public domain. However, at present there is still a wide gap between relatively low-coverage, high-quality manually curated data sources and high-coverage data sources that use text mining and automated extraction of chemical structures. To secure much-needed funding for further research and an improved infrastructure, hard evidence is required to demonstrate the significance of patent-derived information in drug discovery. Surprisingly little such evidence has been reported so far. To address this, the present study attempts to quantify the relevance of patents for formulating and substantiating hypotheses for compound–target interactions.

Results

A manually-curated set of 130 compound–target interaction pairs annotated with what are considered to be the earliest patent and publication has been produced. The analysis of this set revealed that in stark contrast to what has been reported for novel chemical structures, only about 10% of the compound–target interaction pairs could be found in publications in the scientific literature within one year of being reported in patents. The average delay across all interaction pairs is close to 4 years. In an attempt to benchmark current capabilities, it was also examined how much of the benefit of using patent-derived information can be retained when a bioannotated version of SureChEMBL is used as secondary source for the patent literature. Encouragingly, this approach found the patents in the annotated set for 72% of the compound–target interaction pairs. Similarly, the effect of using the bioactivity database ChEMBL as secondary source for the scientific literature was studied. Here, the publications from the annotated set were only found for 46% of the compound–target interaction pairs.

Conclusion

Patent-derived information is a significant enabler for formulating compound–target interaction hypotheses even in cases where the respective interaction is later reported in the scientific literature. The findings of this study clearly highlight the significance of future investments in the development and provision of databases and tools that will allow scientists to search patent information in a comprehensive, reliable, and efficient manner.

Electronic supplementary material

The online version of this article (doi:10.1186/s13321-017-0214-2) contains supplementary material, which is available to authorized users.

Fukunishi Y, Yamasaki S, Yasumatsu I, Takeuchi K, Kurosawa T, Nakamura H. Quantitative Structure-activity Relationship (QSAR) Models for Docking Score Correction. Mol Inform 2017;36:1600013. [PMID: 28001004 PMCID: PMC5297997 DOI: 10.1002/minf.201600013] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2016] [Accepted: 04/01/2016] [Indexed: 01/26/2023]

Dieguez-Santana K, Pham-The H, Villegas-Aguilar PJ, Le-Thi-Thu H, Castillo-Garit JA, Casañola-Martin GM. Prediction of acute toxicity of phenol derivatives using multiple linear regression approach for Tetrahymena pyriformis contaminant identification in a median-size database. Chemosphere 2016;165:434-441. [PMID: 27668720 DOI: 10.1016/j.chemosphere.2016.09.041] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/26/2016] [Revised: 09/10/2016] [Accepted: 09/12/2016] [Indexed: 06/06/2023]

McGaughey G, Walters WP, Goldman B. Understanding covariate shift in model performance. F1000Res 2016;5. [PMID: 27803797 PMCID: PMC5070592 DOI: 10.12688/f1000research.8317.3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 08/22/2016] [Indexed: 11/20/2022] Open

Hu Y, Bajorath J. Analyzing compound activity records and promiscuity degrees in light of publication statistics. F1000Res 2016;5. [PMID: 27347396 PMCID: PMC4916991 DOI: 10.12688/f1000research.8792.2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 08/05/2016] [Indexed: 11/25/2022] Open

Ebejer JP, Charlton MH, Finn PW. Are the physicochemical properties of antibacterial compounds really different from other drugs? J Cheminform 2016;8:30. [PMID: 27274770 PMCID: PMC4891840 DOI: 10.1186/s13321-016-0143-5] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2016] [Accepted: 05/25/2016] [Indexed: 01/12/2023] Open

Pogodin PV, Lagunin AA, Filimonov DA, Poroikov VV. PASS Targets: Ligand-based multi-target computational system based on a public data and naïve Bayes approach. SAR QSAR Environ Res 2015;26:783-793. [PMID: 26305108 DOI: 10.1080/1062936x.2015.1078407] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/04/2023]