1
|
Noga M, Jurowski K. Toxicity of Bromo-DragonFLY as a New Psychoactive Substance: Application of In Silico Methods for the Prediction of Key Toxicological Parameters Important to Clinical and Forensic Toxicology. Chem Res Toxicol 2024. [PMID: 39119730 DOI: 10.1021/acs.chemrestox.4c00105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/10/2024]
Abstract
Bromo-DragonFLY is a synthetic new psychoactive substance (NPS) that has gained attention due to its powerful and long-lasting hallucinogenic effects, legal status, and widespread availability. This study aimed to use various in silico toxicology methods to predict key toxicological parameters for Bromo-DragonFLY, including acute toxicity (LD50), genotoxicity, cardiotoxicity, health effects, and the potential for endocrine disruption. The results indicate significant acute toxicity with noticeable variations across different species, a low likelihood of genotoxic potential suggesting potential DNA damage, and a notable risk of cardiotoxicity associated with inhibition of the hERG channel. Evaluation of endocrine disruption suggests a low probability of Bromo-DragonFLY interacting with the estrogen receptor α (ER-α), indicating minimal estrogenic activity. These insights from in silico investigations are important for advancing our understanding of this NPS in forensic and clinical toxicology. These initial toxicological examinations establish a foundation for future research efforts and contribute to developing risk assessment and management strategies for using and misusing NPS.
Affiliation(s)
- Maciej Noga
- Department of Regulatory and Forensic Toxicology, Institute of Medical Expertises in Łódź, Ul. Aleksandrowska 67/93, 91-205 Łódź, Poland
| | - Kamil Jurowski
- Department of Regulatory and Forensic Toxicology, Institute of Medical Expertises in Łódź, Ul. Aleksandrowska 67/93, 91-205 Łódź, Poland
- Laboratory of Innovative Toxicological Research and Analyzes, Institute of Medical Studies, Medical College, Rzeszów University, Al. Mjr. W. Kopisto 2a, 35-959 Rzeszów, Poland
| |
|
2
|
Kim S, Yu B, Li Q, Bolton EE. PubChem synonym filtering process using crowdsourcing. J Cheminform 2024; 16:69. [PMID: 38880887 PMCID: PMC11181558 DOI: 10.1186/s13321-024-00868-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2024] [Accepted: 06/09/2024] [Indexed: 06/18/2024]
Abstract
PubChem ( https://pubchem.ncbi.nlm.nih.gov ) is a public chemical information resource containing more than 100 million unique chemical structures. One of the most requested tasks in PubChem and other chemical databases is to search chemicals by name (also commonly called a "chemical synonym"). PubChem performs this task by looking up chemical synonym-structure associations provided by individual depositors to PubChem. In addition, these synonyms are used for many purposes, including creating links between chemicals and PubMed articles (using Medical Subject Headings (MeSH) terms). However, these depositor-provided name-structure associations are subject to substantial discrepancies within and between depositors, making it difficult to unambiguously map a chemical name to a specific chemical structure. The present paper describes PubChem's crowdsourcing-based synonym filtering strategy, which resolves inter- and intra-depositor discrepancies in synonym-structure associations as well as in the chemical-MeSH associations. The PubChem synonym filtering process was developed based on the analysis of four crowd-voting strategies, which differ in the consistency threshold value employed (60% vs 70%) and how to resolve intra-depositor discrepancies (a single vote vs. multiple votes per depositor) prior to inter-depositor crowd-voting. The agreement of voting was determined at six levels of chemical equivalency, which considers varying isotopic composition, stereochemistry, and connectivity of chemical structures and their primary components. While all four strategies showed comparable results, Strategy I (one vote per depositor with a 60% consistency threshold) resulted in the most synonyms assigned to a single chemical structure as well as the most synonym-structure associations disambiguated at the six chemical equivalency contexts. 
Based on the results of this study, Strategy I was implemented in PubChem's filtering process that cleans up synonym-structure associations as well as chemical-MeSH associations. This consistency-based filtering process is designed to look for a consensus in name-structure associations but cannot attest to their correctness. As a result, it can fail to recognize correct name-structure associations (or incorrect ones), for example, when a synonym is provided by only one depositor or when many contributors are incorrect. However, this filtering process is an important starting point for quality control in name-structure associations in large chemical databases like PubChem.
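The voting scheme described in this abstract can be illustrated with a minimal sketch. It assumes that each depositor's own submissions for a synonym are first collapsed to a single vote by simple majority, and that a synonym is accepted when the leading structure receives at least the threshold share of depositor votes; both details, as well as the function name `filter_synonyms` and the tuple layout of the input, are illustrative assumptions rather than PubChem's actual implementation.

```python
from collections import Counter, defaultdict


def filter_synonyms(associations, threshold=0.6):
    """Return {synonym: structure_id} for synonyms that reach consensus.

    associations: iterable of (depositor, synonym, structure_id) tuples.
    threshold: minimum share of depositor votes the top structure needs
               (0.6 mirrors the 60% value named in the abstract).
    """
    # Step 1 (assumed rule): count what each depositor submitted per synonym.
    per_depositor = defaultdict(Counter)
    for depositor, synonym, structure in associations:
        per_depositor[(synonym, depositor)][structure] += 1

    # Step 2: one vote per depositor, cast for that depositor's majority structure.
    votes = defaultdict(Counter)
    for (synonym, _depositor), counts in per_depositor.items():
        ranked = counts.most_common(2)
        # Assumption: a depositor tied between two structures casts no vote.
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue
        votes[synonym][ranked[0][0]] += 1

    # Step 3: keep the synonym only if the top structure meets the threshold.
    accepted = {}
    for synonym, counts in votes.items():
        structure, top = counts.most_common(1)[0]
        if top / sum(counts.values()) >= threshold:
            accepted[synonym] = structure
    return accepted


# Toy data with made-up depositors and structure identifiers.
example = [
    ("A", "aspirin", 2244),
    ("B", "aspirin", 2244),
    ("C", "aspirin", 999),
    ("A", "foo", 1),
    ("B", "foo", 2),
]
print(filter_synonyms(example))
```

In this toy input, two of three depositors agree on "aspirin", which clears a 60% threshold but not a 70% one, while "foo" is split evenly and is dropped. A synonym supplied by a single depositor passes trivially, which matches the limitation the abstract notes.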
Affiliation(s)
- Sunghwan Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
| | - Bo Yu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
| | - Qingliang Li
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
| | - Evan E Bolton
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.
| |
|
3
|
Eriksen CA, Andersen JL, Fagerberg R, Merkle D. Toward the Reconciliation of Inconsistent Molecular Structures from Biochemical Databases. J Comput Biol 2024; 31:498-512. [PMID: 38758924 DOI: 10.1089/cmb.2024.0520] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/19/2024]
Abstract
Information on the structure of molecules, retrieved via biochemical databases, plays a pivotal role in various disciplines, including metabolomics, systems biology, and drug discovery. No such database can be complete and it is often necessary to incorporate data from several sources. However, the molecular structure for a given compound is not necessarily consistent between databases. This article presents StructRecon, a novel tool for resolving unique molecular structures from database identifiers. Currently, identifiers from BiGG, ChEBI, Escherichia coli Metabolome Database (ECMDB), MetaNetX, and PubChem are supported. StructRecon traverses the cross-links between entries in different databases to construct what we call identifier graphs. The goal of these graphs is to offer a more complete view of the total information available on a given compound across all the supported databases. To reconcile discrepancies met during the traversal of the databases, we develop an extensible model for molecular structure supporting multiple independent levels of detail, which allows standardization of the structure to be applied iteratively. In some cases, our standardization approach results in multiple candidate structures for a given compound, in which case a random walk-based algorithm is used to select the most likely structure among incompatible alternatives. As a case study, we applied StructRecon to the EColiCore2 model. We found at least one structure for 98.66% of its compounds, which is more than twice as many as possible when using the databases in more standard ways not considering the complex network of cross-database references captured by our identifier graphs. StructRecon is open-source and modular, which enables support for more databases in the future.
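The idea of traversing cross-links to build an identifier graph can be sketched as a breadth-first search. This is only an illustration of the traversal step: the dictionary `cross_links`, the function name `identifier_graph`, and the placeholder identifiers are hypothetical, and the structure standardization and random-walk selection described in the abstract are not covered.

```python
from collections import deque


def identifier_graph(start, cross_links):
    """Collect every identifier reachable from `start` via cross-references.

    cross_links: dict mapping an identifier to the identifiers it links to
                 in other databases (a stand-in for real database lookups).
    """
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in cross_links.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Placeholder identifiers, not real database entries.
links = {
    "bigg:a": ["metanetx:b"],
    "metanetx:b": ["chebi:c", "bigg:a"],
    "pubchem:d": ["chebi:e"],
}
print(sorted(identifier_graph("bigg:a", links)))
```

Starting from one identifier, the search gathers all entries connected through any chain of cross-references, which is how more structure candidates can be reached than by querying a single database directly.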
Affiliation(s)
- Casper Asbjørn Eriksen
- Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
| | - Jakob Lykke Andersen
- Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
| | - Rolf Fagerberg
- Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
| | - Daniel Merkle
- Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
- Faculty of Technology, Bielefeld University, Bielefeld, Germany
| |
|
4
|
Mansouri K, Moreira-Filho JT, Lowe CN, Charest N, Martin T, Tkachenko V, Judson R, Conway M, Kleinstreuer NC, Williams AJ. Free and open-source QSAR-ready workflow for automated standardization of chemical structures in support of QSAR modeling. J Cheminform 2024; 16:19. [PMID: 38378618 PMCID: PMC10880251 DOI: 10.1186/s13321-024-00814-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2023] [Accepted: 02/10/2024] [Indexed: 02/22/2024]
Abstract
The rapid increase of publicly available chemical structures and associated experimental data presents a valuable opportunity to build robust QSAR models for applications in different fields. However, the common concern is the quality of both the chemical structure information and associated experimental data. This is especially true when those data are collected from multiple sources as chemical substance mappings can contain many duplicate structures and molecular inconsistencies. Such issues can impact the resulting molecular descriptors and their mappings to experimental data and, subsequently, the quality of the derived models in terms of accuracy, repeatability, and reliability. Herein we describe the development of an automated workflow to standardize chemical structures according to a set of standard rules and generate two and/or three-dimensional "QSAR-ready" forms prior to the calculation of molecular descriptors. The workflow was designed in the KNIME workflow environment and consists of three high-level steps. First, a structure encoding is read, and then the resulting in-memory representation is cross-referenced with any existing identifiers for consistency. Finally, the structure is standardized using a series of operations including desalting, stripping of stereochemistry (for two-dimensional structures), standardization of tautomers and nitro groups, valence correction, neutralization when possible, and then removal of duplicates. This workflow was initially developed to support collaborative modeling QSAR projects to ensure consistency of the results from the different participants. It was then updated and generalized for other modeling applications. This included modification of the "QSAR-ready" workflow to generate "MS-ready structures" to support the generation of substance mappings and searches for software applications related to non-targeted analysis mass spectrometry. 
Both QSAR and MS-ready workflows are freely available in KNIME, via standalone versions on GitHub, and as docker container resources for the scientific community. Scientific contribution: This work pioneers an automated workflow in KNIME, systematically standardizing chemical structures to ensure their readiness for QSAR modeling and broader scientific applications. By addressing data quality concerns through desalting, stereochemistry stripping, and normalization, it optimizes molecular descriptors' accuracy and reliability. The freely available resources in KNIME, GitHub, and docker containers democratize access, benefiting collaborative research and advancing diverse modeling endeavors in chemistry and mass spectrometry.
Affiliation(s)
- Kamel Mansouri
- National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, National Institute of Environmental Health Sciences, Research Triangle Park, NC, 27709, USA.
| | - José T Moreira-Filho
- National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, National Institute of Environmental Health Sciences, Research Triangle Park, NC, 27709, USA
| | - Charles N Lowe
- Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
| | - Nathaniel Charest
- Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
| | - Todd Martin
- Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
| | - Richard Judson
- Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
| | - Mike Conway
- National Institute of Environmental Health Sciences, Research Triangle Park, NC, 27709, USA
| | - Nicole C Kleinstreuer
- National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, National Institute of Environmental Health Sciences, Research Triangle Park, NC, 27709, USA
| | - Antony J Williams
- Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
| |
|
5
|
On the ability of machine learning methods to discover novel scaffolds. J Mol Model 2022; 29:22. [PMID: 36574054 DOI: 10.1007/s00894-022-05359-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2022] [Accepted: 10/21/2022] [Indexed: 12/28/2022]
Abstract
The recent advances in the application of machine learning to drug discovery have made it a 'hot topic' for research, with hundreds of academic groups and companies integrating machine learning into their drug discovery projects. Nevertheless, there remains great uncertainty regarding the most appropriate ways to evaluate the relative performance of these powerful methods against more traditional cheminformatics approaches, and many pitfalls remain for the unwary. In 2020, researchers at MIT (Stokes et al., Cell 180(4), 688-702, 2020) reported the discovery of a new compound with antibacterial activity, halicin, through the use of a neural network machine learning method. A robust ability to identify new active chemotypes through computational methods would be very useful. In this study, we have used the Stokes et al. dataset to compare the performance of this method to two other approaches, Mapping of Activity Through Dichotomic Scores (MADS) by Todeschini et al. (J Chemom 32(4):e2994, 2018) and Random Matrix Theory (RMT) by Lee et al. (Proc Natl Acad Sci 116(9):3373-3378, 2019). Our results demonstrate that all three methods are capable of predicting halicin as an active antibacterial compound, but that this result is dependent on the dataset composition, pre-processing and the molecular fingerprint used. We have further assessed overall performance as determined by several performance metrics. We also investigated the scaffold hopping potential of the methods by modifying the dataset by removal of the β-lactam and fluoroquinolone chemotypes. MADS and RMT are able to identify actives in the test set that contained these substructures. This ability arises because of high scoring fragments of the withheld chemotypes that are in common with other active antibiotic classes. Interestingly, MADS is relatively better compared to the other two methods based on general predictive performance.
|
6
|
Li L, Zhang Z, Men Y, Baskaran S, Sangion A, Wang S, Arnot JA, Wania F. Retrieval, Selection, and Evaluation of Chemical Property Data for Assessments of Chemical Emissions, Fate, Hazard, Exposure, and Risks. ACS Environ Au 2022; 2:376-395. [PMID: 37101455 PMCID: PMC10125307 DOI: 10.1021/acsenvironau.2c00010] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 03/02/2022] [Revised: 07/01/2022] [Accepted: 07/05/2022] [Indexed: 04/28/2023]
Abstract
Reliable chemical property data are the key to defensible and unbiased assessments of chemical emissions, fate, hazard, exposure, and risks. However, the retrieval, evaluation, and use of reliable chemical property data can often be a formidable challenge for chemical assessors and model users. This comprehensive review provides practical guidance for use of chemical property data in chemical assessments. We assemble available sources for obtaining experimentally derived and in silico predicted property data; we also elaborate strategies for evaluating and curating the obtained property data. We demonstrate that both experimentally derived and in silico predicted property data can be subject to considerable uncertainty and variability. Chemical assessors are encouraged to use property data derived through the harmonization of multiple carefully selected experimental data if a sufficient number of reliable laboratory measurements is available or through the consensus consolidation of predictions from multiple in silico tools if the data pool from laboratory measurements is not adequate.
Affiliation(s)
- Li Li
- School of Public Health, University of Nevada Reno, Reno, Nevada 89557, United States
- Phone: +1 (775) 682 7077
| | - Zhizhen Zhang
- School of Public Health, University of Nevada Reno, Reno, Nevada 89557, United States
| | - Yujie Men
- Department of Chemical & Environmental Engineering, University of California Riverside, Riverside, California 92521, United States
| | - Sivani Baskaran
- Department of Physical and Environmental Sciences, University of Toronto Scarborough, Toronto, Ontario M1C 1A4, Canada
| | - Alessandro Sangion
- Department of Physical and Environmental Sciences, University of Toronto Scarborough, Toronto, Ontario M1C 1A4, Canada
- ARC Arnot Research & Consulting, Toronto, Ontario M4M 1W4, Canada
| | - Shenghong Wang
- School of Public Health, University of Nevada Reno, Reno, Nevada 89557, United States
| | - Jon A. Arnot
- Department of Physical and Environmental Sciences, University of Toronto Scarborough, Toronto, Ontario M1C 1A4, Canada
- ARC Arnot Research & Consulting, Toronto, Ontario M4M 1W4, Canada
- Department of Pharmacology and Toxicology, University of Toronto, Toronto, Ontario M5S 1A8, Canada
| | - Frank Wania
- Department of Physical and Environmental Sciences, University of Toronto Scarborough, Toronto, Ontario M1C 1A4, Canada
| |
|
7
|
Baig MH, Ahmad K, Moon JS, Park SY, Ho Lim J, Chun HJ, Qadri AF, Hwang YC, Jan AT, Ahmad SS, Ali S, Shaikh S, Lee EJ, Choi I. Myostatin and its Regulation: A Comprehensive Review of Myostatin Inhibiting Strategies. Front Physiol 2022; 13:876078. [PMID: 35812316 PMCID: PMC9259834 DOI: 10.3389/fphys.2022.876078] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 02/16/2022] [Accepted: 06/06/2022] [Indexed: 12/12/2022]
Abstract
Myostatin (MSTN) is a well-reported negative regulator of muscle growth and a member of the transforming growth factor (TGF) family. MSTN has important functions in skeletal muscle (SM), and its crucial involvement in several disorders has made it an important therapeutic target. Several strategies, ranging from the use of natural compounds to inhibitory peptides, are being used to inhibit the activity of MSTN. This review delivers an overview of the current state of knowledge about SM and myogenesis, with particular emphasis on the structural characteristics and regulatory functions of MSTN during myogenesis and its involvement in various muscle-related disorders. In addition, we review the diverse approaches used to inhibit the activity of MSTN, especially in silico approaches to the screening of natural compounds and the design of novel short peptides derived from proteins that typically interact with MSTN.
Affiliation(s)
- Mohammad Hassan Baig
- Department of Family Medicine, Gangnam Severance Hospital, Yonsei University College of Medicine, Seoul, South Korea
| | - Khurshid Ahmad
- Department of Medical Biotechnology, Yeungnam University, Gyeongsan, South Korea
- Research Institute of Cell Culture, Yeungnam University, Gyeongsan, South Korea
| | - Jun Sung Moon
- Department of Internal Medicine, College of Medicine, Yeungnam University, Daegu, South Korea
| | - So-Young Park
- Department of Physiology, College of Medicine, Yeungnam University, Daegu, South Korea
| | - Jeong Ho Lim
- Department of Medical Biotechnology, Yeungnam University, Gyeongsan, South Korea
- Research Institute of Cell Culture, Yeungnam University, Gyeongsan, South Korea
| | - Hee Jin Chun
- Department of Medical Biotechnology, Yeungnam University, Gyeongsan, South Korea
- Research Institute of Cell Culture, Yeungnam University, Gyeongsan, South Korea
| | - Afsha Fatima Qadri
- Department of Medical Biotechnology, Yeungnam University, Gyeongsan, South Korea
| | - Ye Chan Hwang
- Department of Medical Biotechnology, Yeungnam University, Gyeongsan, South Korea
| | - Arif Tasleem Jan
- School of Biosciences and Biotechnology, Baba Ghulam Shah Badshah University, Rajouri, India
| | - Syed Sayeed Ahmad
- Department of Medical Biotechnology, Yeungnam University, Gyeongsan, South Korea
| | - Shahid Ali
- Department of Medical Biotechnology, Yeungnam University, Gyeongsan, South Korea
| | - Sibhghatulla Shaikh
- Department of Medical Biotechnology, Yeungnam University, Gyeongsan, South Korea
| | - Eun Ju Lee
- Department of Medical Biotechnology, Yeungnam University, Gyeongsan, South Korea
- Research Institute of Cell Culture, Yeungnam University, Gyeongsan, South Korea
- *Correspondence: Eun Ju Lee; Inho Choi
| | - Inho Choi
- Department of Medical Biotechnology, Yeungnam University, Gyeongsan, South Korea
- Research Institute of Cell Culture, Yeungnam University, Gyeongsan, South Korea
- *Correspondence: Eun Ju Lee; Inho Choi
| |
|
8
|
He K. Pharmacological affinity fingerprints derived from bioactivity data for the identification of designer drugs. J Cheminform 2022; 14:35. [PMID: 35672835 PMCID: PMC9171973 DOI: 10.1186/s13321-022-00607-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 02/02/2022] [Accepted: 05/05/2022] [Indexed: 12/15/2022]
Abstract
Facing the continuous emergence of new psychoactive substances (NPS) and their threat to public health, more effective methods for NPS prediction and identification are critical. In this study, the pharmacological affinity fingerprints (Ph-fp) of NPS compounds were predicted by Random Forest classification models using bioactivity data from the ChEMBL database. The binary Ph-fp is the vector consisting of a compound's activity against a list of molecular targets reported to be responsible for the pharmacological effects of NPS. Their performance in similarity searching and unsupervised clustering was assessed and compared to 2D structure fingerprints Morgan and MACCS (1024-bits ECFP4 and 166-bits SMARTS-based MACCS implementation of RDKit). The performance in retrieving compounds according to their pharmacological categorizations is influenced by the predicted active assay counts in Ph-fp and the choice of similarity metric. Overall, the comparative unsupervised clustering analysis suggests the use of a classification model with Morgan fingerprints as input for the construction of Ph-fp. This combination gives satisfactory clustering performance based on external and internal clustering validation indices.
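Similarity searching over binary fingerprints such as the Ph-fp described here can be illustrated with the Tanimoto coefficient. The abstract does not say which similarity metrics were compared, so Tanimoto is used below only as one commonly used choice; the function name and the toy bit vectors are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length binary fingerprints.

    Each fingerprint is a sequence of 0/1 values, e.g. one bit per
    molecular target (1 = predicted active, 0 = predicted inactive).
    """
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)    # bits set in both
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)   # bits set in at least one
    # Convention chosen here: two all-zero fingerprints score 0.0.
    return both / either if either else 0.0


# Toy fingerprints over four hypothetical targets.
query = [1, 1, 0, 0]
candidate = [1, 0, 1, 0]
print(tanimoto(query, candidate))
```

Because the score depends on how many bits are set, fingerprints with very few predicted active targets tend to give coarse similarity values, which is consistent with the abstract's remark that retrieval performance is influenced by the predicted active assay counts.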
Affiliation(s)
- Kedan He
- Physical Sciences, Eastern Connecticut State University, 83 Windham St, Willimantic, CT, 06226, USA.
| |
|
9
|
Dolciami D, Villasclaras-Fernandez E, Kannas C, Meniconi M, Al-Lazikani B, Antolin AA. canSAR chemistry registration and standardization pipeline. J Cheminform 2022; 14:28. [PMID: 35643512 PMCID: PMC9148294 DOI: 10.1186/s13321-022-00606-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 02/04/2022] [Accepted: 04/04/2022] [Indexed: 11/10/2022]
Abstract
Background
Integration of medicinal chemistry data from numerous public resources is an increasingly important part of academic drug discovery and translational research because it can bring a wealth of important knowledge related to compounds in one place. However, different data sources can report the same or related compounds in various forms (e.g., tautomers, racemates, etc.), thus highlighting the need of organising related compounds in hierarchies that alert the user on important bioactivity data that may be relevant. To generate these compound hierarchies, we have developed and implemented canSARchem, a new compound registration and standardization pipeline as part of the canSAR public knowledgebase. canSARchem builds on previously developed ChEMBL and PubChem pipelines and is developed using KNIME. We describe the pipeline which we make publicly available, and we provide examples on the strengths and limitations of the use of hierarchies for bioactivity data exploration. Finally, we identify canonicalization enrichment in FDA-approved drugs, illustrating the benefits of our approach.
Results
We created a chemical registration and standardization pipeline in KNIME and made it freely available to the research community. The pipeline consists of five steps to register the compounds and create the compounds’ hierarchy: 1. Structure checker, 2. Standardization, 3. Generation of canonical tautomers and representative structures, 4. Salt strip, and 5. Generation of abstract structure to generate the compound hierarchy. Unlike ChEMBL’s RDKit pipeline, we carry out compound canonicalization ahead of getting the parent structure, similar to PubChem’s OpenEye pipeline. canSARchem has a lower rejection rate compared to both PubChem and ChEMBL. We use our pipeline to assess the impact of grouping the compounds in hierarchies for bioactivity data exploration. We find that FDA-approved drugs show statistically significant sensitivity to canonicalization compared to the majority of bioactive compounds which demonstrates the importance of this step.
Conclusions
We use canSARchem to standardize all the compounds uploaded in canSAR (> 3 million) enabling efficient data integration and the rapid identification of alternative compound forms with useful bioactivity data. Comparison with PubChem and ChEMBL pipelines evidenced comparable performances in compound standardization, but only PubChem and canSAR canonicalize tautomers and canSAR has a slightly lower rejection rate. Our results highlight the importance of compound hierarchies for bioactivity data exploration. We make canSARchem available under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0) at https://gitlab.icr.ac.uk/cansar-public/compound-registration-pipeline.
|
10
|
Jacobs A, Williams D, Hickey K, Patrick N, Williams AJ, Chalk S, McEwen L, Willighagen E, Walker M, Bolton E, Sinclair G, Sanford A. CAS Common Chemistry in 2021: Expanding Access to Trusted Chemical Information for the Scientific Community. J Chem Inf Model 2022; 62:2737-2743. [PMID: 35559614 PMCID: PMC9199008 DOI: 10.1021/acs.jcim.2c00268] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Indexed: 11/29/2022]
Abstract
CAS Common Chemistry (https://commonchemistry.cas.org/) is an open web resource that provides access to reliable chemical substance information for the scientific community. Having served millions of visitors since its creation in 2009, the resource was extensively updated in 2021 with significant enhancements. The underlying dataset was expanded from 8000 to 500,000 chemical substances and includes additional associated information, such as basic properties and computer-readable chemical structure information. New use cases are supported with enhanced search capabilities and an integrated application programming interface. Reusable licensing of the content is provided through a Creative Commons Attribution-Non-Commercial (CC-BY-NC 4.0) license allowing other public resources to integrate the data into their systems. This paper provides an overview of the enhancements to data and functionality, discusses the benefits of the contribution to the chemistry community, and summarizes recent progress in leveraging this resource to strengthen other information sources.
Affiliation(s)
- Andrea Jacobs
- CAS, 2540 Olentangy River Rd, Columbus, Ohio 43202, United States
| | - Dustin Williams
- CAS, 2540 Olentangy River Rd, Columbus, Ohio 43202, United States
| | - Katherine Hickey
- CAS, 2540 Olentangy River Rd, Columbus, Ohio 43202, United States
| | - Nathan Patrick
- CAS, 2540 Olentangy River Rd, Columbus, Ohio 43202, United States
| | - Antony J Williams
- Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency (U.S. EPA), Research Triangle Park, North Carolina 27711, United States
| | - Stuart Chalk
- Department of Chemistry, University of North Florida, Jacksonville, Florida 32224, United States
| | - Leah McEwen
- Physical Sciences Library, Cornell University, Ithaca, New York 14853, United States
| | - Egon Willighagen
- Department of Bioinformatics - BiGCaT, Maastricht University, 6229 ER Maastricht, The Netherlands
| | - Martin Walker
- Department of Chemistry, SUNY Potsdam, 44 Pierrepont Ave., Potsdam, New York 13676, United States
| | - Evan Bolton
- Department of Health and Human Services, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, Maryland 20894, United States
| | - Gabriel Sinclair
- Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency (U.S. EPA), Research Triangle Park, North Carolina 27711, United States
| | - Adam Sanford
- CAS, 2540 Olentangy River Rd, Columbus, Ohio 43202, United States
| |
|
11
|
Talley KR, White R, Wunder N, Eash M, Schwarting M, Evenson D, Perkins JD, Tumas W, Munch K, Phillips C, Zakutayev A. Research data infrastructure for high-throughput experimental materials science. Patterns (N Y) 2021; 2:100373. [PMID: 34950901 PMCID: PMC8672147 DOI: 10.1016/j.patter.2021.100373] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/23/2021] [Revised: 08/13/2021] [Accepted: 09/30/2021] [Indexed: 11/26/2022]
Abstract
The High-Throughput Experimental Materials Database (HTEM-DB, htem.nrel.gov) is a repository of inorganic thin-film materials data collected during combinatorial experiments at the National Renewable Energy Laboratory (NREL). This data asset is enabled by NREL's Research Data Infrastructure (RDI), a set of custom data tools that collect, process, and store experimental data and metadata. Here, we describe the experimental data flow from the RDI to the HTEM-DB to illustrate the strategies and best practices currently used for materials data at NREL. Integration of the data tools with experimental instruments establishes a data communication pipeline between experimental researchers and data scientists. This work motivates the creation of similar workflows at other institutions to aggregate valuable data and increase their usefulness for future machine learning studies. In turn, such data-driven studies can greatly accelerate the pace of discovery and design in the materials science domain.
- Automated curation of experimental materials data
- Integration of data tools into the experimental laboratory
- Simple, effective, and flexible data archival system
- Collection of metadata for enhanced total data value
For machine learning to make significant contributions to a scientific domain, algorithms must ingest and learn from high-quality, large-volume datasets. The Research Data Infrastructure (RDI) that feeds the High-Throughput Experimental Materials Database (HTEM-DB, htem.nrel.gov) provides such a dataset from existing experimental data streams at the National Renewable Energy Laboratory (NREL). The described methods for curating experimental data can be applied to other materials research laboratory settings, paving the way for increased application of machine learning to materials science. In turn, the resulting new materials and new knowledge will benefit society by advancing new technologies in energy, fuels, computing, security, and other important areas.
Affiliation(s)
- Kevin R Talley
- Materials, Chemical and Computational Science Directorate, National Renewable Energy Laboratory, Golden, CO 80401, USA
- Robert White
- Materials, Chemical and Computational Science Directorate, National Renewable Energy Laboratory, Golden, CO 80401, USA
- Nick Wunder
- Materials, Chemical and Computational Science Directorate, National Renewable Energy Laboratory, Golden, CO 80401, USA
- Matthew Eash
- Materials, Chemical and Computational Science Directorate, National Renewable Energy Laboratory, Golden, CO 80401, USA
- Marcus Schwarting
- Materials, Chemical and Computational Science Directorate, National Renewable Energy Laboratory, Golden, CO 80401, USA
- Dave Evenson
- Materials, Chemical and Computational Science Directorate, National Renewable Energy Laboratory, Golden, CO 80401, USA
- John D Perkins
- Materials, Chemical and Computational Science Directorate, National Renewable Energy Laboratory, Golden, CO 80401, USA
- William Tumas
- Materials, Chemical and Computational Science Directorate, National Renewable Energy Laboratory, Golden, CO 80401, USA
- Kristin Munch
- Materials, Chemical and Computational Science Directorate, National Renewable Energy Laboratory, Golden, CO 80401, USA
- Caleb Phillips
- Materials, Chemical and Computational Science Directorate, National Renewable Energy Laboratory, Golden, CO 80401, USA
- Andriy Zakutayev
- Materials, Chemical and Computational Science Directorate, National Renewable Energy Laboratory, Golden, CO 80401, USA
12
Santana K, do Nascimento LD, Lima e Lima A, Damasceno V, Nahum C, Braga RC, Lameira J. Applications of Virtual Screening in Bioprospecting: Facts, Shifts, and Perspectives to Explore the Chemo-Structural Diversity of Natural Products. Front Chem 2021; 9:662688. [PMID: 33996755 PMCID: PMC8117418 DOI: 10.3389/fchem.2021.662688] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 02/01/2021] [Accepted: 02/25/2021] [Indexed: 12/22/2022] Open
Abstract
Natural products are continually explored in the development of new bioactive compounds with industrial applications, attracting the attention of scientific research efforts due to their pharmacophore-like structures, pharmacokinetic properties, and unique chemical space. The systematic search for natural sources to obtain valuable molecules to develop products with commercial value and industrial purposes remains the most challenging task in bioprospecting. Virtual screening strategies have innovated the discovery of novel bioactive molecules assessing in silico large compound libraries, favoring the analysis of their chemical space, pharmacodynamics, and their pharmacokinetic properties, thus leading to the reduction of financial efforts, infrastructure, and time involved in the process of discovering new chemical entities. Herein, we discuss the computational approaches and methods developed to explore the chemo-structural diversity of natural products, focusing on the main paradigms involved in the discovery and screening of bioactive compounds from natural sources, placing particular emphasis on artificial intelligence, cheminformatics methods, and big data analyses.
Affiliation(s)
- Kauê Santana
- Instituto de Biodiversidade, Universidade Federal do Oeste do Pará, Santarém, Brazil
- Anderson Lima e Lima
- Instituto de Ciências Exatas e Naturais, Universidade Federal do Pará, Belém, Brazil
- Vinícius Damasceno
- Instituto de Ciências Exatas e Naturais, Universidade Federal do Pará, Belém, Brazil
- Claudio Nahum
- Instituto de Ciências Exatas e Naturais, Universidade Federal do Pará, Belém, Brazil
- Jerônimo Lameira
- Instituto de Ciências Biológicas, Universidade Federal do Pará, Belém, Brazil
13
Bugeac CA, Ancuceanu R, Dinu M. QSAR Models for Active Substances against Pseudomonas aeruginosa Using Disk-Diffusion Test Data. Molecules 2021; 26:1734. [PMID: 33808845 PMCID: PMC8003670 DOI: 10.3390/molecules26061734] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/04/2021] [Revised: 03/14/2021] [Accepted: 03/15/2021] [Indexed: 12/02/2022] Open
Abstract
Pseudomonas aeruginosa is a Gram-negative bacillus included among the six “ESKAPE” microbial species with an outstanding ability to “escape” currently used antibiotics, and developing new antibiotics against it is of the highest priority. Whereas minimum inhibitory concentration (MIC) values against Pseudomonas aeruginosa have previously been used for QSAR model development, disk diffusion results (inhibition zones) have apparently not been used for this purpose in the literature, and we decided to explore their use. We developed multiple QSAR methods using several machine learning algorithms (support vector classifier, K nearest neighbors, random forest classifier, decision tree classifier, AdaBoost classifier, logistic regression and naïve Bayes classifier). We used four sets of molecular descriptors and fingerprints and three different methods of data balancing, together with the “native” data set. In total, 32 models were built for each set of descriptors or fingerprints and balancing method, of which 28 were selected and stacked to create meta-models. In terms of balanced accuracy, the best performance was provided by KNN, logistic regression and decision tree classifier, but the ensemble method had slightly superior results in nested cross-validation.
Affiliation(s)
- Cosmin Alexandru Bugeac
- Faculty of Pharmacy, Carol Davila University of Medicine and Pharmacy, 6 Traian Vuia Street, Sector 2, 020956 Bucharest, Romania
- Robert Ancuceanu
- Department of Pharmaceutical Botany and Cell Biology, Faculty of Pharmacy, Carol Davila University of Medicine and Pharmacy, 6 Traian Vuia Street, Sector 2, 020956 Bucharest, Romania
- Correspondence:
- Mihaela Dinu
- Department of Pharmaceutical Botany and Cell Biology, Faculty of Pharmacy, Carol Davila University of Medicine and Pharmacy, 6 Traian Vuia Street, Sector 2, 020956 Bucharest, Romania
14
Hu B, Lin A, Brinson LC. ChemProps: A RESTful API enabled database for composite polymer name standardization. J Cheminform 2021; 13:22. [PMID: 33712066 PMCID: PMC7955638 DOI: 10.1186/s13321-021-00502-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/10/2020] [Accepted: 03/01/2021] [Indexed: 11/24/2022] Open
Abstract
The inconsistency of polymer indexing caused by the lack of uniformity in the expression of polymer names is a major challenge for widespread use of polymer-related data resources and limits broad application of materials informatics for innovation in broad classes of polymer science and polymer-based materials. The current solution of using a variety of different chemical identifiers has proven insufficient to address the challenge and is not intuitive for researchers. This work proposes a multi-algorithm-based mapping methodology entitled ChemProps that is optimized to solve the polymer indexing issue with an easy-to-update design both in depth and in width. A RESTful API is enabled for lightweight data exchange and easy integration across data systems. A weight factor is assigned to each algorithm to generate scores for candidate chemical names, and the weight factors are optimized to maximize the minimum value of the score difference between the ground truth chemical name and the other candidate chemical names. Ten-fold validation is utilized on the 160 training data points to prevent overfitting issues. The obtained set of weight factors achieves a 100% test accuracy on the 54 test data points. The weight factors will evolve as ChemProps grows. With ChemProps, other polymer databases can remove duplicate entries and enable a more accurate “search by SMILES” function by using ChemProps as a common name-to-SMILES translator through API calls. ChemProps is also an excellent tool for auto-populating polymer properties thanks to its easy-to-update design.
Affiliation(s)
- Bingyin Hu
- Department of Mechanical Engineering and Materials Science, Duke University, Durham, NC, 27708, USA
- Anqi Lin
- Department of Mechanical Engineering and Materials Science, Duke University, Durham, NC, 27708, USA
- L Catherine Brinson
- Department of Mechanical Engineering and Materials Science, Duke University, Durham, NC, 27708, USA
15
Costanzi S, Slavick CK, Hutcheson BO, Koblentz GD, Cupitt RT. Lists of Chemical Warfare Agents and Precursors from International Nonproliferation Frameworks: Structural Annotation and Chemical Fingerprint Analysis. J Chem Inf Model 2020; 60:4804-4816. [DOI: 10.1021/acs.jcim.0c00896] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 12/13/2022]
Affiliation(s)
- Stefano Costanzi
- Department of Chemistry, American University, 4400 Massachusetts Avenue, NW, Washington, District of Columbia 20016, United States
- Charlotte K. Slavick
- Department of Chemistry, American University, 4400 Massachusetts Avenue, NW, Washington, District of Columbia 20016, United States
- Brent O. Hutcheson
- Department of Chemistry, American University, 4400 Massachusetts Avenue, NW, Washington, District of Columbia 20016, United States
- Gregory D. Koblentz
- Schar School of Policy and Government, George Mason University, 3351 Fairfax Drive, Arlington, Virginia 22201, United States
- Richard T. Cupitt
- Stimson Center, 1211 Connecticut Avenue, NW, Washington, District of Columbia 20036, United States
16
Baker CM, Kidley NJ, Papachristos K, Hotson M, Carson R, Gravestock D, Pouliot M, Harrison J, Dowling A. Tautomer Standardization in Chemical Databases: Deriving Business Rules from Quantum Chemistry. J Chem Inf Model 2020; 60:3781-3791. [PMID: 32644790 DOI: 10.1021/acs.jcim.0c00232] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 01/18/2023]
Abstract
Databases of small, potentially bioactive molecules are ubiquitous across the industry and academia. Designed such that each unique compound should appear only once, the multiplicity of ways in which many compounds can be represented means that these databases require methods for standardizing the representation of chemistry. This is commonly achieved through the use of "Chemistry Business Rules", sets of predefined rules that describe the "house style" of the database in question. At Syngenta, the historical approach to the design of chemistry business rules has been to focus on consistency of representation, with chemical relevance given secondary consideration. In this work, we overturn that convention. Through the use of quantum chemistry calculations, we define a set of chemistry business rules for tautomer standardization that reproduces gas-phase energetic preferences. We go on to show that, compared to our historic approach, this method yields tautomers that are in better agreement with those observed experimentally in condensed phases and that are better suited for use in predictive models.
Affiliation(s)
- Christopher M Baker
- Syngenta, Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, U.K.
- Nathan J Kidley
- Syngenta, Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, U.K.
- Matthew Hotson
- Syngenta, Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, U.K.
- Rob Carson
- Syngenta, Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, U.K.
- David Gravestock
- Syngenta, Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, U.K.
- Martin Pouliot
- Syngenta Crop Protection, Schaffhauserstrasse, Stein CH-4332, Switzerland
- Jim Harrison
- Datacraft Technologies, 110 Parkwood Place, Anstead, QLD 4070, Australia
- Alan Dowling
- Syngenta, Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, U.K.
17
Ambure P, Cordeiro MNDS. Importance of Data Curation in QSAR Studies Especially While Modeling Large-Size Datasets. Methods in Pharmacology and Toxicology 2020. [DOI: 10.1007/978-1-0716-0150-1_5] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/16/2022]
18
Toukach PV, Egorova KS. New Features of Carbohydrate Structure Database Notation (CSDB Linear), As Compared to Other Carbohydrate Notations. J Chem Inf Model 2019; 60:1276-1289. [DOI: 10.1021/acs.jcim.9b00744] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 11/30/2022]
Affiliation(s)
- Philip V. Toukach
- N.D. Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Leninsky prospect 47, Moscow, Russia 119991
- National Research University Higher School of Economics, Myasnitskaya 20, Moscow, Russia 101000
- Ksenia S. Egorova
- N.D. Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Leninsky prospect 47, Moscow, Russia 119991
19
Pham N, van Heck RGA, van Dam JCJ, Schaap PJ, Saccenti E, Suarez-Diez M. Consistency, Inconsistency, and Ambiguity of Metabolite Names in Biochemical Databases Used for Genome-Scale Metabolic Modelling. Metabolites 2019; 9:E28. [PMID: 30736318 PMCID: PMC6409771 DOI: 10.3390/metabo9020028] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 12/28/2018] [Revised: 01/24/2019] [Accepted: 01/31/2019] [Indexed: 12/22/2022] Open
Abstract
Genome-scale metabolic models (GEMs) are manually curated repositories describing the metabolic capabilities of an organism. GEMs have been successfully used in different research areas, ranging from systems medicine to biotechnology. However, the different naming conventions (namespaces) of databases used to build GEMs limit model reusability and prevent the integration of existing models. This problem is known in the GEM community, but its extent has not been analyzed in depth. In this study, we investigate the name ambiguity and the multiplicity of non-systematic identifiers and we highlight the (in)consistency in their use in 11 biochemical databases of biochemical reactions and the problems that arise when mapping between different namespaces and databases. We found that such inconsistencies can be as high as 83.1%, thus emphasizing the need for strategies to deal with these issues. Currently, manual verification of the mappings appears to be the only solution to remove inconsistencies when combining models. Finally, we discuss several possible approaches to facilitate (future) unambiguous mapping.
Affiliation(s)
- Nhung Pham
- Laboratory of Systems and Synthetic Biology, Wageningen University & Research, 6708 WE, Wageningen, The Netherlands
- Ruben G A van Heck
- Laboratory of Systems and Synthetic Biology, Wageningen University & Research, 6708 WE, Wageningen, The Netherlands
- Jesse C J van Dam
- Laboratory of Systems and Synthetic Biology, Wageningen University & Research, 6708 WE, Wageningen, The Netherlands
- Peter J Schaap
- Laboratory of Systems and Synthetic Biology, Wageningen University & Research, 6708 WE, Wageningen, The Netherlands
- Edoardo Saccenti
- Laboratory of Systems and Synthetic Biology, Wageningen University & Research, 6708 WE, Wageningen, The Netherlands
- Maria Suarez-Diez
- Laboratory of Systems and Synthetic Biology, Wageningen University & Research, 6708 WE, Wageningen, The Netherlands
20
Sobus JR, Wambaugh JF, Isaacs KK, Williams AJ, McEachran AD, Richard AM, Grulke CM, Ulrich EM, Rager JE, Strynar MJ, Newton SR. Integrating tools for non-targeted analysis research and chemical safety evaluations at the US EPA. Journal of Exposure Science & Environmental Epidemiology 2018; 28:411-426. [PMID: 29288256 PMCID: PMC6661898 DOI: 10.1038/s41370-017-0012-y] [Citation(s) in RCA: 136] [Impact Index Per Article: 22.7] [Received: 05/27/2017] [Revised: 08/04/2017] [Accepted: 08/25/2017] [Indexed: 05/18/2023]
Abstract
Tens of thousands of chemicals are registered in the U.S. for use in countless processes and products. Recent evidence suggests that many of these chemicals are measurable in environmental and/or biological systems, indicating the potential for widespread exposures. Traditional public health research tools, including in vivo studies and targeted analytical chemistry methods, have been unable to meet the needs of screening programs designed to evaluate chemical safety. As such, new tools have been developed to enable rapid assessment of potentially harmful chemical exposures and their attendant biological responses. One group of tools, known as "non-targeted analysis" (NTA) methods, allows the rapid characterization of thousands of never-before-studied compounds in a wide variety of environmental, residential, and biological media. This article discusses current applications of NTA methods, challenges to their effective use in chemical screening studies, and ways in which shared resources (e.g., chemical standards, databases, model predictions, and media measurements) can advance their use in risk-based chemical prioritization. A brief review is provided of resources and projects within EPA's Office of Research and Development (ORD) that provide benefit to, and receive benefits from, NTA research endeavors. A summary of EPA's Non-Targeted Analysis Collaborative Trial (ENTACT) is also given, which makes direct use of ORD resources to benefit the global NTA research community. Finally, a research framework is described that shows how NTA methods will bridge chemical prioritization efforts within ORD. This framework exists as a guide for institutions seeking to understand the complexity of chemical exposures, and the impact of these exposures on living systems.
Affiliation(s)
- Jon R Sobus
- U.S. Environmental Protection Agency, Office of Research and Development, National Exposure Research Laboratory, 109 T.W. Alexander Drive, Research Triangle Park, NC, 27709, USA
- John F Wambaugh
- U.S. Environmental Protection Agency, Office of Research and Development, National Center for Computational Toxicology, 109 T.W. Alexander Drive, Research Triangle Park, NC, 27709, USA
- Kristin K Isaacs
- U.S. Environmental Protection Agency, Office of Research and Development, National Exposure Research Laboratory, 109 T.W. Alexander Drive, Research Triangle Park, NC, 27709, USA
- Antony J Williams
- U.S. Environmental Protection Agency, Office of Research and Development, National Center for Computational Toxicology, 109 T.W. Alexander Drive, Research Triangle Park, NC, 27709, USA
- Andrew D McEachran
- Oak Ridge Institute for Science and Education (ORISE) Participant, 109 T.W. Alexander Drive, Research Triangle Park, NC, 27709, USA
- Ann M Richard
- U.S. Environmental Protection Agency, Office of Research and Development, National Center for Computational Toxicology, 109 T.W. Alexander Drive, Research Triangle Park, NC, 27709, USA
- Christopher M Grulke
- U.S. Environmental Protection Agency, Office of Research and Development, National Center for Computational Toxicology, 109 T.W. Alexander Drive, Research Triangle Park, NC, 27709, USA
- Elin M Ulrich
- U.S. Environmental Protection Agency, Office of Research and Development, National Exposure Research Laboratory, 109 T.W. Alexander Drive, Research Triangle Park, NC, 27709, USA
- Julia E Rager
- Oak Ridge Institute for Science and Education (ORISE) Participant, 109 T.W. Alexander Drive, Research Triangle Park, NC, 27709, USA
- ToxStrategies, Inc., 9390 Research Blvd., Suite 100, Austin, TX, 78759, USA
- Mark J Strynar
- U.S. Environmental Protection Agency, Office of Research and Development, National Exposure Research Laboratory, 109 T.W. Alexander Drive, Research Triangle Park, NC, 27709, USA
- Seth R Newton
- U.S. Environmental Protection Agency, Office of Research and Development, National Exposure Research Laboratory, 109 T.W. Alexander Drive, Research Triangle Park, NC, 27709, USA
21
McEachran AD, Mansouri K, Grulke C, Schymanski EL, Ruttkies C, Williams AJ. "MS-Ready" structures for non-targeted high-resolution mass spectrometry screening studies. J Cheminform 2018; 10:45. [PMID: 30167882 PMCID: PMC6117229 DOI: 10.1186/s13321-018-0299-2] [Citation(s) in RCA: 54] [Impact Index Per Article: 9.0] [Received: 05/16/2018] [Accepted: 08/21/2018] [Indexed: 02/05/2023] Open
Abstract
Chemical database searching has become a fixture in many non-targeted identification workflows based on high-resolution mass spectrometry (HRMS). However, the form of a chemical structure observed in HRMS does not always match the form stored in a database (e.g., the neutral form versus a salt; one component of a mixture rather than the mixture form used in a consumer product). Linking the form of a structure observed via HRMS to its related form(s) within a database will enable the return of all relevant variants of a structure, as well as the related metadata, in a single query. A Konstanz Information Miner (KNIME) workflow has been developed to produce structural representations observed using HRMS ("MS-Ready structures") and link them to those stored in a database. These MS-Ready structures, and associated mappings to the full chemical representations, are surfaced via the US EPA's Chemistry Dashboard (https://comptox.epa.gov/dashboard/). This article describes the workflow for the generation and linking of ~700,000 MS-Ready structures (derived from ~760,000 original structures) as well as download, search and export capabilities to serve structure identification using HRMS. The importance of this form of structural representation for HRMS is demonstrated with several examples, including integration with the in silico fragmentation software application MetFrag. The structures, search, download and export functionality are all available through the CompTox Chemistry Dashboard, while the MetFrag implementation can be viewed at https://msbi.ipb-halle.de/MetFragBeta/.
Affiliation(s)
- Andrew D. McEachran
- Oak Ridge Institute for Science and Education (ORISE) Research Participation Program, U.S. Environmental Protection Agency, 109 T.W. Alexander Dr., Research Triangle Park, NC 27711 USA
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Mail Drop D143-02, 109 T.W. Alexander Dr., Research Triangle Park, NC 27711 USA
- Kamel Mansouri
- Oak Ridge Institute for Science and Education (ORISE) Research Participation Program, U.S. Environmental Protection Agency, 109 T.W. Alexander Dr., Research Triangle Park, NC 27711 USA
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Mail Drop D143-02, 109 T.W. Alexander Dr., Research Triangle Park, NC 27711 USA
- Present Address: Integrated Laboratory Systems, Inc., 601 Keystone Dr., Morrisville, NC 27650 USA
- Chris Grulke
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Mail Drop D143-02, 109 T.W. Alexander Dr., Research Triangle Park, NC 27711 USA
- Emma L. Schymanski
- Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 6, avenue du Swing, 4367 Belvaux, Luxembourg
- Christoph Ruttkies
- Department of Stress and Development Biology, Leibniz Institute of Plant Biochemistry (IPB), Weinberg 3, 06120 Halle (Saale), Germany
- Antony J. Williams
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Mail Drop D143-02, 109 T.W. Alexander Dr., Research Triangle Park, NC 27711 USA
22
Basith S, Cui M, Macalino SJY, Park J, Clavio NAB, Kang S, Choi S. Exploring G Protein-Coupled Receptors (GPCRs) Ligand Space via Cheminformatics Approaches: Impact on Rational Drug Design. Front Pharmacol 2018; 9:128. [PMID: 29593527 PMCID: PMC5854945 DOI: 10.3389/fphar.2018.00128] [Citation(s) in RCA: 79] [Impact Index Per Article: 13.2] [Received: 12/08/2017] [Accepted: 02/06/2018] [Indexed: 01/14/2023] Open
Abstract
The primary goal of rational drug discovery is the identification of selective ligands which act on single or multiple drug targets to achieve the desired clinical outcome through the exploration of total chemical space. To identify such desired compounds, computational approaches are necessary in predicting their drug-like properties. G Protein-Coupled Receptors (GPCRs) represent one of the largest and most important integral membrane protein families. These receptors serve as increasingly attractive drug targets due to their relevance in the treatment of various diseases, such as inflammatory disorders, metabolic imbalances, cardiac disorders, cancer, and monogenic disorders. In the last decade, multitudes of three-dimensional (3D) structures were solved for diverse GPCRs, leading this period to be called the "golden age for GPCR structural biology." Moreover, accumulation of data about the chemical properties of GPCR ligands has garnered much interest toward the exploration of GPCR chemical space. Due to the steady increase in the structural, ligand, and functional data of GPCRs, several cheminformatics approaches have been implemented in the GPCR drug discovery pipeline. In this review, we mainly focus on the cheminformatics-based paradigms in GPCR drug discovery. We provide a comprehensive view on the ligand- and structure-based cheminformatics approaches, which are best illustrated via GPCR case studies. Furthermore, we discuss the integrated approach, i.e., an appropriate combination of ligand-based and structure-based knowledge, which is emerging as a promising strategy for cheminformatics-based GPCR drug design.
Affiliation(s)
- Soosung Kang
- College of Pharmacy and Graduate School of Pharmaceutical Sciences, Ewha Womans University, Seoul, South Korea
- Sun Choi
- College of Pharmacy and Graduate School of Pharmaceutical Sciences, Ewha Womans University, Seoul, South Korea
23
Mansouri K, Grulke CM, Judson RS, Williams AJ. OPERA models for predicting physicochemical properties and environmental fate endpoints. J Cheminform 2018. [PMID: 29520515 PMCID: PMC5843579 DOI: 10.1186/s13321-018-0263-1] [Citation(s) in RCA: 271] [Impact Index Per Article: 45.2] [Indexed: 12/24/2022] Open
Abstract
The collection of chemical structure information and associated experimental data for quantitative structure–activity/property relationship (QSAR/QSPR) modeling is facilitated by an increasing number of public databases containing large amounts of useful data. However, the performance of QSAR models highly depends on the quality of the data and modeling methodology used. This study aims to develop robust QSAR/QSPR models for chemical properties of environmental interest that can be used for regulatory purposes. This study primarily uses data from the publicly available PHYSPROP database consisting of a set of 13 common physicochemical and environmental fate properties. These datasets have undergone extensive curation using an automated workflow to select only high-quality data, and the chemical structures were standardized prior to calculation of the molecular descriptors. The modeling procedure was developed based on the five Organization for Economic Cooperation and Development (OECD) principles for QSAR models. A weighted k-nearest neighbor approach was adopted using a minimum number of required descriptors calculated using PaDEL, an open-source software. The genetic algorithms selected only the most pertinent and mechanistically interpretable descriptors (2–15, with an average of 11 descriptors). The sizes of the modeled datasets varied from 150 chemicals for biodegradability half-life to 14,050 chemicals for logP, with an average of 3222 chemicals across all endpoints. The optimal models were built on randomly selected training sets (75%) and validated using fivefold cross-validation (CV) and test sets (25%). The CV Q² of the models varied from 0.72 to 0.95, with an average of 0.86, and the test R² varied from 0.71 to 0.96, with an average of 0.82. Modeling and performance details are described in QSAR model reporting format and were validated by the European Commission’s Joint Research Center to be OECD compliant.
All models are freely available as an open-source, command-line application called OPEn structure–activity/property Relationship App (OPERA). OPERA models were applied to more than 750,000 chemicals to produce freely available predicted data on the U.S. Environmental Protection Agency’s CompTox Chemistry Dashboard.
Affiliation(s)
- Kamel Mansouri
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
- Oak Ridge Institute for Science and Education, 1299 Bethel Valley Road, Oak Ridge, TN, 37830, USA
- ScitoVation LLC, 6 Davis Drive, Research Triangle Park, NC, 27709, USA
- Chris M Grulke
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
- Richard S Judson
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
- Antony J Williams
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
|
24
|
Patel M, Chilton ML, Sartini A, Gibson L, Barber C, Covey-Crump L, Przybylak KR, Cronin MTD, Madden JC. Assessment and Reproducibility of Quantitative Structure–Activity Relationship Models by the Nonexpert. J Chem Inf Model 2018; 58:673-682. [DOI: 10.1021/acs.jcim.7b00523] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Indexed: 12/20/2022]
Affiliation(s)
- Mukesh Patel
- Lhasa Limited, Granary Wharf House, 2 Canal Wharf, Leeds LS11 5PS, England
| | - Martyn L. Chilton
- Lhasa Limited, Granary Wharf House, 2 Canal Wharf, Leeds LS11 5PS, England
| | - Andrea Sartini
- Lhasa Limited, Granary Wharf House, 2 Canal Wharf, Leeds LS11 5PS, England
| | - Laura Gibson
- Lhasa Limited, Granary Wharf House, 2 Canal Wharf, Leeds LS11 5PS, England
| | - Chris Barber
- Lhasa Limited, Granary Wharf House, 2 Canal Wharf, Leeds LS11 5PS, England
| | - Liz Covey-Crump
- Lhasa Limited, Granary Wharf House, 2 Canal Wharf, Leeds LS11 5PS, England
| | - Katarzyna R. Przybylak
- School of Pharmacy and Chemistry, Liverpool John Moores University, Byrom Street, Liverpool L3 3AF, England
| | - Mark T. D. Cronin
- School of Pharmacy and Chemistry, Liverpool John Moores University, Byrom Street, Liverpool L3 3AF, England
| | - Judith C. Madden
- School of Pharmacy and Chemistry, Liverpool John Moores University, Byrom Street, Liverpool L3 3AF, England
|
25
|
Filimonov D, Druzhilovskiy D, Lagunin A, Gloriozova T, Rudik A, Dmitriev A, Pogodin P, Poroikov V. Computer-aided prediction of biological activity spectra for chemical compounds: opportunities and limitations. Biomedical Chemistry: Research and Methods 2018. [DOI: 10.18097/bmcrm00004] [Citation(s) in RCA: 77] [Impact Index Per Article: 12.8] [Indexed: 11/23/2022]
Abstract
An essential characteristic of chemical compounds is their biological activity, since it can become the basis for using a substance therapeutically or, on the contrary, limit its practical application because of side effects and toxicity. Computer assessment of biological activity spectra makes it possible to determine the most promising directions for studying the pharmacological action of particular substances, and to filter out potentially dangerous molecules at the early stages of research. For more than 25 years, we have been developing and improving the computer program PASS (Prediction of Activity Spectra for Substances), designed to predict the biological activity spectrum of a substance from the structural formula of its molecules. The prediction is carried out by analyzing structure-activity relationships for the training set, which currently contains information on structures and known biological activities for more than one million molecules. The structure of an organic compound is represented in PASS using Multilevel Neighborhoods of Atoms descriptors; the activity prediction for new compounds is performed by a naive Bayes classifier using the structure-activity relationships determined by the analysis of the training set. We have created and improved both local versions of the PASS program and freely available web resources based on PASS (http://www.way2drug.com). 
They predict several thousand biological activities (pharmacological effects, molecular mechanisms of action, specific toxicity and adverse effects, interaction with unwanted targets, metabolism and action on molecular transport), cytotoxicity for tumor and non-tumor cell lines, carcinogenicity, induced changes of gene expression profiles, metabolic sites of the major enzymes of the first and second phases of xenobiotic biotransformation, and whether a compound is a substrate and/or metabolite of metabolic enzymes. The web resource Way2Drug is used by over 18,000 researchers from more than 90 countries, who have obtained over 600,000 predictions and published about 500 papers describing the results. Analysis of the published works shows that in some cases the interpretation of the prediction results presented by the authors of these publications requires correction. In this work, we provide the theoretical basis and consider, using particular examples, the opportunities and limitations of computer-aided prediction of biological activity spectra.
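The abstract attributes PASS predictions to a naive Bayes classifier over substructure descriptors. The sketch below shows only the generic textbook Bernoulli naive Bayes with Laplace smoothing, assuming binary presence/absence descriptors; it is not the PASS algorithm, and the descriptor strings and function names are hypothetical placeholders rather than real Multilevel Neighborhoods of Atoms descriptors.

```python
import math

def train_bernoulli_nb(samples, labels, alpha=1.0):
    """Fit a Bernoulli naive Bayes model.

    samples: list of sets, each holding the descriptor IDs present in one molecule.
    labels:  list of class labels, one per molecule.
    alpha:   Laplace smoothing constant (assumed value).
    """
    vocab = set().union(*samples)
    model = {}
    for cls in set(labels):
        rows = [s for s, lab in zip(samples, labels) if lab == cls]
        log_prior = math.log(len(rows) / len(labels))
        # Smoothed probability that each descriptor is present in this class.
        prob = {
            d: (sum(d in s for s in rows) + alpha) / (len(rows) + 2 * alpha)
            for d in vocab
        }
        model[cls] = (log_prior, prob)
    return vocab, model

def predict_log_scores(vocab, model, sample):
    """Return the unnormalized log-posterior of each class for one molecule."""
    scores = {}
    for cls, (log_prior, prob) in model.items():
        total = log_prior
        for d in vocab:
            p = prob[d]
            total += math.log(p) if d in sample else math.log(1.0 - p)
        scores[cls] = total
    return scores

# Hypothetical toy training set: descriptor strings are placeholders only.
samples = [{"C-N", "C=O"}, {"C-N"}, {"C-Br"}, {"C-Br", "C-O"}]
labels = ["active", "active", "inactive", "inactive"]

vocab, model = train_bernoulli_nb(samples, labels)
scores = predict_log_scores(vocab, model, {"C-N"})
print(max(scores, key=scores.get))
```

The "naive" assumption is that descriptors are independent given the class, so the per-descriptor log-probabilities simply add; this keeps training and prediction cheap even for very large training sets.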
Affiliation(s)
| | | | - A.A. Lagunin
- Institute of Biomedical Chemistry; Pirogov Russian National Research Medical University, Moscow, Russia
| | | | - A.V. Rudik
- Institute of Biomedical Chemistry, Moscow, Russia
| | | | - P.V. Pogodin
- Institute of Biomedical Chemistry, Moscow, Russia
|
26
|
Olier I, Sadawi N, Bickerton GR, Vanschoren J, Grosan C, Soldatova L, King RD. Meta-QSAR: a large-scale application of meta-learning to drug design and discovery. Mach Learn 2017; 107:285-311. [PMID: 31997851 PMCID: PMC6956898 DOI: 10.1007/s10994-017-5685-x] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 05/09/2016] [Accepted: 10/04/2017] [Indexed: 11/03/2022]
Abstract
We investigate the learning of quantitative structure–activity relationships (QSARs) as a case study of meta-learning. This application area is of the highest societal importance, as it is a key step in the development of new medicines. The standard QSAR learning problem is: given a target (usually a protein) and a set of chemical compounds (small molecules) with associated bioactivities (e.g. inhibition of the target), learn a predictive mapping from molecular representation to activity. Although almost every type of machine learning method has been applied to QSAR learning, there is no agreed single best way of learning QSARs, and therefore the problem area is well suited to meta-learning. We first carried out the most comprehensive comparison to date of machine learning methods for QSAR learning: 18 regression methods and 3 molecular representations, applied to more than 2700 QSAR problems. (These results have been made publicly available on OpenML and represent a valuable resource for testing novel meta-learning methods.) We then investigated the utility of algorithm selection for QSAR problems. We found that this meta-learning approach outperformed the best individual QSAR learning method (random forests using a molecular fingerprint representation) by up to 13%, on average. We conclude that meta-learning outperforms base-learning methods for QSAR learning, and as this investigation is one of the most extensive comparisons of base and meta-learning methods ever made, it provides evidence for the general effectiveness of meta-learning over base-learning.
Affiliation(s)
- Ivan Olier
- Manchester Metropolitan University, Manchester, UK; University of Manchester, Manchester, UK
| | - Noureddin Sadawi
- Imperial College London, London, UK; Brunel University London, London, UK
| | | | | | - Crina Grosan
- Brunel University London, London, UK; Babes-Bolyai University, Cluj-Napoca, Romania
| | - Larisa Soldatova
- Brunel University London, London, UK; Goldsmiths, University of London, London, UK
|
27
|
Korotcov A, Tkachenko V, Russo DP, Ekins S. Comparison of Deep Learning With Multiple Machine Learning Methods and Metrics Using Diverse Drug Discovery Data Sets. Mol Pharm 2017; 14:4462-4475. [PMID: 29096442 PMCID: PMC5741413 DOI: 10.1021/acs.molpharmaceut.7b00578] [Citation(s) in RCA: 184] [Impact Index Per Article: 26.3] [Indexed: 11/30/2022]
Abstract
Machine learning methods have been applied to many data sets in pharmaceutical research for several decades. The relative ease and availability of fingerprint-type molecular descriptors paired with Bayesian methods resulted in the widespread use of this approach for a diverse array of end points relevant to drug discovery. Deep learning is the latest machine learning algorithm attracting attention for many pharmaceutical applications, from docking to virtual screening. Deep learning is based on an artificial neural network with multiple hidden layers and has found considerable traction in many artificial intelligence applications. We have previously suggested the need for a comparison of different machine learning methods with deep learning across an array of varying data sets that is applicable to pharmaceutical research. End points relevant to pharmaceutical research include absorption, distribution, metabolism, excretion, and toxicity (ADME/Tox) properties, as well as activity against pathogens and drug discovery data sets. In this study, we have used data sets for solubility, probe-likeness, hERG, KCNQ1, bubonic plague, Chagas, tuberculosis, and malaria to compare different machine learning methods using FCFP6 fingerprints. These data sets represent whole cell screens, individual proteins, physicochemical properties as well as a data set with a complex end point. Our aim was to assess whether deep learning offered any improvement in testing when assessed using an array of metrics including AUC, F1 score, Cohen's kappa, Matthews correlation coefficient and others. Based on ranked normalized scores for the metrics or data sets, Deep Neural Networks (DNN) ranked higher than SVM, which in turn was ranked higher than all the other machine learning methods. Visualizing these properties for training and test sets using radar-type plots indicates when models are inferior or perhaps overtrained. 
These results also suggest the need for assessing deep learning further using multiple metrics with much larger scale comparisons, prospective testing as well as assessment of different fingerprints and DNN architectures beyond those used.
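Three of the metrics named in this abstract (F1 score, Cohen's kappa, and the Matthews correlation coefficient) can be computed from a binary confusion matrix using their standard definitions. The sketch below is an illustration only; the function names and the toy labels are assumptions, and AUC is omitted because it needs ranked scores rather than hard class labels.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def f1_score(tp, tn, fp, fn):
    """Harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def cohen_kappa(tp, tn, fp, fn):
    """Agreement between prediction and truth, corrected for chance agreement."""
    n = tp + tn + fp + fn
    p_obs = (tp + tn) / n
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp) if p_exp != 1 else 0.0

def matthews_corrcoef(tp, tn, fp, fn):
    """Correlation between predicted and true binary labels."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy example: 4 actives and 4 inactives, one error in each class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(f1_score(*counts), cohen_kappa(*counts), matthews_corrcoef(*counts))
# prints 0.75 0.5 0.5
```

Reporting several metrics side by side matters because they respond differently to class imbalance: F1 ignores true negatives, whereas kappa and MCC use all four cells of the confusion matrix.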
Affiliation(s)
- Alexandru Korotcov
- Science Data Software, LLC, 14914 Bradwill Court, Rockville, MD 20850, USA
| | - Valery Tkachenko
- Science Data Software, LLC, 14914 Bradwill Court, Rockville, MD 20850, USA
| | - Daniel P Russo
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC 27606, USA
- The Rutgers Center for Computational and Integrative Biology, Camden, NJ, 08102, USA
| | - Sean Ekins
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC 27606, USA
|
28
|
Low YS, Daugherty AC, Schroeder EA, Chen W, Seto T, Weber S, Lim M, Hastie T, Mathur M, Desai M, Farrington C, Radin AA, Sirota M, Kenkare P, Thompson CA, Yu PP, Gomez SL, Sledge GW, Kurian AW, Shah NH. Synergistic drug combinations from electronic health records and gene expression. J Am Med Inform Assoc 2017; 24:565-576. [PMID: 27940607 PMCID: PMC6080645 DOI: 10.1093/jamia/ocw161] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 12/12/2022] Open
Abstract
Objective: Using electronic health records (EHRs) and biomolecular data, we sought to discover drug pairs with synergistic repurposing potential. EHRs provide real-world treatment and outcome patterns, while complementary biomolecular data, including disease-specific gene expression and drug-protein interactions, provide mechanistic understanding. Method: We applied Group Lasso INTERaction NETwork (glinternet), an overlap group lasso penalty on a logistic regression model, with pairwise interactions to identify variables and interacting drug pairs associated with reduced 5-year mortality using EHRs of 9945 breast cancer patients. We identified differentially expressed genes from 14 case-control human breast cancer gene expression datasets and integrated them with drug-protein networks. Drugs in the network were scored according to their association with breast cancer individually or in pairs. Lastly, we determined whether synergistic drug pairs found in the EHRs were enriched among synergistic drug pairs from gene-expression data using a method similar to gene set enrichment analysis. Results: From EHRs, we discovered 3 drug-class pairs associated with lower mortality: anti-inflammatories and hormone antagonists, anti-inflammatories and lipid modifiers, and lipid modifiers and obstructive airway drugs. The first 2 pairs were also enriched among pairs discovered using gene expression data and are supported by molecular interactions in drug-protein networks and preclinical and epidemiologic evidence. Conclusions: This is a proof-of-concept study demonstrating that a combination of complementary data sources, such as EHRs and gene expression, can corroborate discoveries and provide mechanistic insight into drug synergism for repurposing.
Affiliation(s)
- Yen S Low
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
| | | | | | - William Chen
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
| | - Tina Seto
- Clinical Informatics, Stanford University
| | | | - Michael Lim
- Department of Statistics, Stanford University
| | - Trevor Hastie
- Department of Statistics, Stanford University; Department of Health Research and Policy, Stanford University
| | - Maya Mathur
- Quantitative Sciences Unit, Stanford University
| | | | | | | | | | - Pragati Kenkare
- Palo Alto Medical Foundation Research Institute, Palo Alto, CA, USA
| | | | - Peter P Yu
- Palo Alto Medical Foundation Research Institute, Palo Alto, CA, USA
| | - Scarlett L Gomez
- Department of Health Research and Policy, Stanford University; Cancer Prevention Institute of California, Fremont, CA, USA
| | - George W Sledge
- Division of Oncology, Department of Medicine, Stanford University
| | - Allison W Kurian
- Department of Health Research and Policy, Stanford University; Division of Oncology, Department of Medicine, Stanford University
| | - Nigam H Shah
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
|
29
|
Williams AJ, Grulke CM, Edwards J, McEachran AD, Mansouri K, Baker NC, Patlewicz G, Shah I, Wambaugh JF, Judson RS, Richard AM. The CompTox Chemistry Dashboard: a community data resource for environmental chemistry. J Cheminform 2017; 9:61. [PMID: 29185060 PMCID: PMC5705535 DOI: 10.1186/s13321-017-0247-6] [Citation(s) in RCA: 568] [Impact Index Per Article: 81.1] [Received: 09/30/2017] [Accepted: 11/18/2017] [Indexed: 11/10/2022] Open
Abstract
Despite an abundance of online databases providing access to chemical data, there is increasing demand for high-quality, structure-curated, open data to meet the various needs of the environmental sciences and computational toxicology communities. The U.S. Environmental Protection Agency's (EPA) web-based CompTox Chemistry Dashboard is addressing these needs by integrating diverse types of relevant domain data through a cheminformatics layer, built upon a database of curated substances linked to chemical structures. These data include physicochemical, environmental fate and transport, exposure, usage, in vivo toxicity, and in vitro bioassay data, surfaced through an integration hub with link-outs to additional EPA data and public domain online resources. Batch searching allows for direct chemical identifier (ID) mapping and downloading of multiple data streams in several different formats. This facilitates fast access to available structure, property, toxicity, and bioassay data for collections of chemicals (hundreds to thousands at a time). Advanced search capabilities are available to support, for example, non-targeted analysis and identification of chemicals using mass spectrometry. The contents of the chemistry database, presently containing ~ 760,000 substances, are available as public domain data for download. The chemistry content underpinning the Dashboard has been aggregated over the past 15 years by both manual and auto-curation techniques within EPA's DSSTox project. DSSTox chemical content is subject to strict quality controls to enforce consistency among chemical substance-structure identifiers, as well as list curation review to ensure accurate linkages of DSSTox substances to chemical lists and associated data. The Dashboard, publicly launched in April 2016, has expanded considerably in content and user traffic over the past year. 
It is continuously evolving with the growth of DSSTox into high-interest or data-rich domains of interest to EPA, such as chemicals on the Toxic Substances Control Act listing, while providing the user community with a flexible and dynamic web-based platform for integration, processing, visualization and delivery of data and resources. The Dashboard provides support for a broad array of research and regulatory programs across the worldwide community of toxicologists and environmental scientists.
Affiliation(s)
- Antony J. Williams
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC USA
| | - Christopher M. Grulke
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC USA
| | - Jeff Edwards
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC USA
| | | | - Kamel Mansouri
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC USA
- Oak Ridge Institute for Science and Education, Oak Ridge, TN USA
- ScitoVation LLC, Research Triangle Park, NC USA
| | | | - Grace Patlewicz
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC USA
| | - Imran Shah
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC USA
| | - John F. Wambaugh
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC USA
| | - Richard S. Judson
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC USA
| | - Ann M. Richard
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC USA
|
30
|
Liu J, Patlewicz G, Williams AJ, Thomas RS, Shah I. Predicting Organ Toxicity Using in Vitro Bioactivity Data and Chemical Structure. Chem Res Toxicol 2017; 30:2046-2059. [PMID: 28768096 DOI: 10.1021/acs.chemrestox.7b00084] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Indexed: 12/22/2022]
Abstract
Animal testing alone cannot practically evaluate the health hazard posed by tens of thousands of environmental chemicals. Computational approaches making use of high-throughput experimental data may provide more efficient means to predict chemical toxicity. Here, we use a supervised machine learning strategy to systematically investigate the relative importance of study type, machine learning algorithm, and type of descriptor on predicting in vivo repeat-dose toxicity at the organ level. A total of 985 compounds were represented using chemical structural descriptors, ToxPrint chemotype descriptors, and bioactivity descriptors from ToxCast in vitro high-throughput screening assays. Using ToxRefDB, a total of 35 target organ outcomes were identified that contained at least 100 chemicals (50 positive and 50 negative). Supervised machine learning was performed using Naïve Bayes, k-nearest neighbor, random forest, classification and regression trees, and support vector classification approaches. Model performance was assessed based on F1 scores using 5-fold cross-validation with balanced bootstrap replicates. Fixed effects modeling showed the variance in F1 scores was explained mostly by target organ outcome, followed by descriptor type, machine learning algorithm, and interactions between these three factors. A combination of bioactivity and chemical structure or chemotype descriptors was the most predictive. Model performance improved with more chemicals (up to a maximum of 24%), and these gains were correlated (ρ = 0.92) with the number of chemicals. Overall, the results demonstrate that a combination of bioactivity and chemical descriptors can accurately predict a range of target organ toxicity outcomes in repeat-dose studies, but specific experimental and methodologic improvements may increase predictivity.
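The evaluation scheme mentioned in this abstract, 5-fold cross-validation with balanced bootstrap replicates, can be sketched as follows. The abstract does not give the exact procedure, so the order of operations here (split into folds first, then draw a class-balanced bootstrap sample from each training part) is an assumption, as are the function names, the replicate size, and the toy labels.

```python
import random

def k_fold_splits(n_items, k, rng):
    """Yield (train, test) index lists for k-fold cross-validation."""
    shuffled = list(range(n_items))
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def balanced_bootstrap(indices, labels, n_per_class, rng):
    """Sample, with replacement, n_per_class positives and n_per_class negatives."""
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    return rng.choices(pos, k=n_per_class) + rng.choices(neg, k=n_per_class)

# Toy outcome: 60 positive and 140 negative chemicals (hypothetical counts).
labels = [1] * 60 + [0] * 140
rng = random.Random(0)

for train, test in k_fold_splits(len(labels), 5, rng):
    replicate = balanced_bootstrap(train, labels, 50, rng)
    # A classifier would be fit on `replicate` and scored (e.g. by F1) on `test`.
    print(len(replicate), len(test))
```

Drawing the bootstrap sample only from the training part keeps the held-out fold free of duplicated training chemicals, which would otherwise inflate the score.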
Affiliation(s)
- Jie Liu
- Department of Information Science, University of Arkansas at Little Rock, Arkansas 72204, United States; Oak Ridge Institute for Science Education, National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, Durham, North Carolina 27711, United States
| | - Grace Patlewicz
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, Durham, North Carolina 27711, United States
| | - Antony J Williams
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, Durham, North Carolina 27711, United States
| | - Russell S Thomas
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, Durham, North Carolina 27711, United States
| | - Imran Shah
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, Durham, North Carolina 27711, United States
|
31
|
Bolgár B, Antal P. VB-MK-LMF: fusion of drugs, targets and interactions using variational Bayesian multiple kernel logistic matrix factorization. BMC Bioinformatics 2017; 18:440. [PMID: 28978313 PMCID: PMC5628496 DOI: 10.1186/s12859-017-1845-z] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 06/30/2017] [Accepted: 09/21/2017] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND: Computational fusion approaches to drug-target interaction (DTI) prediction, capable of utilizing multiple sources of background knowledge, were reported to achieve superior predictive performance in multiple studies. Other studies showed that specificities of the DTI task, such as weighting the observations and focusing the side information, are also vital for reaching top performance. METHOD: We present Variational Bayesian Multiple Kernel Logistic Matrix Factorization (VB-MK-LMF), which unifies the advantages of (1) multiple kernel learning, (2) weighted observations, (3) graph Laplacian regularization, and (4) explicit modeling of probabilities of binary drug-target interactions. RESULTS: VB-MK-LMF achieves significantly better predictive performance in standard benchmarks compared to state-of-the-art methods, which can be traced back to multiple factors. The systematic evaluation of the effect of multiple kernels confirms their benefits, but also highlights the limitations of linear kernel combinations, already recognized in other fields. The analysis of the effect of prior kernels using varying sample sizes sheds light on the balance of data and knowledge in DTI tasks and on the rate at which the effect of priors vanishes. This also shows the existence of "small sample size" regions where using side information offers significant gains. Alongside favorable predictive performance, a notable property of MF methods is that they provide a unified space for drugs and targets using latent representations. Compared to earlier studies, the dimensionality of this space proved to be surprisingly low, which makes the latent representations constructed by VB-MK-LMF especially well-suited for visual analytics. The probabilistic nature of the predictions allows the calculation of the expected values of hits in functionally relevant sets, which we demonstrate by predicting drug promiscuity. 
The variational Bayesian approximation is also implemented for general-purpose graphics processing units, yielding significantly improved computational time. CONCLUSION: In standard benchmarks, VB-MK-LMF shows significantly improved predictive performance in a wide range of settings. Beyond these benchmarks, another contribution of our work is highlighting and providing estimates for further pharmaceutically relevant quantities, such as promiscuity, druggability and total number of interactions.
Affiliation(s)
- Bence Bolgár
- Department of Measurement and Information Systems, Budapest University of Technology and Economics, Magyar tudósok krt. 2., Budapest, 1117 Hungary
| | - Péter Antal
- Department of Measurement and Information Systems, Budapest University of Technology and Economics, Magyar tudósok krt. 2., Budapest, 1117 Hungary
|
32
|
Koutsoukas A, Monaghan KJ, Li X, Huan J. Deep-learning: investigating deep neural networks hyper-parameters and comparison of performance to shallow methods for modeling bioactivity data. J Cheminform 2017; 9:42. [PMID: 29086090 PMCID: PMC5489441 DOI: 10.1186/s13321-017-0226-y] [Citation(s) in RCA: 112] [Impact Index Per Article: 16.0] [Received: 09/30/2016] [Accepted: 05/27/2017] [Indexed: 01/03/2023] Open
Abstract
Background: In recent years, research in artificial neural networks has resurged, now under the deep-learning umbrella, and grown extremely popular. The recently reported success of DL techniques in crowd-sourced QSAR and predictive toxicology competitions has showcased these methods as powerful tools in drug-discovery and toxicology research. The aim of this work was twofold: first, a large number of hyper-parameter configurations were explored to investigate how they affect the performance of DNNs and to provide starting points for tuning DNNs; second, their performance was compared to popular methods widely employed in the field of cheminformatics, namely Naïve Bayes, k-nearest neighbor, random forest and support vector machines. Moreover, the robustness of machine learning methods to different levels of artificially introduced noise was assessed. The open-source Caffe deep-learning framework and modern NVidia GPU units were utilized to carry out this study, allowing a large number of DNN configurations to be explored. Results: We show that feed-forward deep neural networks are capable of achieving strong classification performance and outperform shallow methods across diverse activity classes when optimized. Hyper-parameters that were found to play a critical role are the activation function, dropout regularization, number of hidden layers and number of neurons. Compared with the other methods, tuned DNNs performed statistically significantly better (p value <0.01, Wilcoxon statistical test). On average, DNNs achieved an MCC 0.149 higher than NB, 0.092 higher than kNN, 0.052 higher than SVM with a linear kernel, 0.021 higher than RF and 0.009 higher than SVM with a radial basis function kernel. 
When exploring robustness to noise, non-linear methods were found to perform well at low levels of noise (lower than or equal to 20%); at higher levels of noise (higher than 30%), however, the Naïve Bayes method was found to perform well and, at the highest noise level (50%), even to outperform more sophisticated methods across several datasets. Electronic supplementary material: The online version of this article (doi:10.1186/s13321-017-0226-y) contains supplementary material, which is available to authorized users.
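A robustness test of this kind needs training labels corrupted at a controlled rate. The abstract does not say how the noise was introduced, so the sketch below assumes the simplest scheme, flipping a fixed fraction of binary labels chosen uniformly at random; the function name and the toy labels are hypothetical.

```python
import random

def add_label_noise(labels, noise_level, seed=0):
    """Return a copy of binary 0/1 labels with a fraction `noise_level` flipped.

    Assumed scheme: the flipped positions are drawn uniformly without replacement,
    so exactly round(noise_level * len(labels)) labels change.
    """
    rng = random.Random(seed)
    n_flip = int(round(noise_level * len(labels)))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]

# Toy training labels: 50 actives and 50 inactives.
labels = [0, 1] * 50

for level in (0.0, 0.2, 0.5):
    noisy = add_label_noise(labels, level)
    changed = sum(a != b for a, b in zip(labels, noisy))
    # A model would be retrained on `noisy` and evaluated on clean test labels.
    print(level, changed)
```

Only the training labels should be corrupted; scoring against clean test labels is what makes the comparison across noise levels meaningful. At 50% flipped binary labels the training signal is close to random, which is consistent with simpler models holding up comparatively well there.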
Affiliation(s)
- Alexios Koutsoukas
- Department of Electrical Engineering and Computer Sciences, University of Kansas, Lawrence, KS, 66047-7621, USA
| | - Keith J Monaghan
- Department of Electrical Engineering and Computer Sciences, University of Kansas, Lawrence, KS, 66047-7621, USA
| | - Xiaoli Li
- Department of Electrical Engineering and Computer Sciences, University of Kansas, Lawrence, KS, 66047-7621, USA
| | - Jun Huan
- Department of Electrical Engineering and Computer Sciences, University of Kansas, Lawrence, KS, 66047-7621, USA.
|
33
|
Gally JM, Bourg S, Do QT, Aci-Sèche S, Bonnet P. VSPrep: A General KNIME Workflow for the Preparation of Molecules for Virtual Screening. Mol Inform 2017; 36. [PMID: 28586180 DOI: 10.1002/minf.201700023] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 02/01/2017] [Accepted: 05/05/2017] [Indexed: 12/27/2022]
Abstract
Over the past decades, virtual screening has proved itself to be a valuable asset for identifying new bioactive compounds. The vast majority of commonly used techniques can be described in three steps: pre-processing the dataset, i.e. small molecules (ligands) and possibly larger ones (receptors); executing the method; and finally analysing the results. Hence, the preparation of ligands is a critical step for the success of commonly used virtual screening approaches such as protein-ligand docking, similarity or pharmacophore search. We present here a new workflow, VSPrep, for the pre-processing of small molecules; it is based on tools freely accessible to academics and is integrated within the KNIME platform. It can be used to perform several chemoinformatics tasks such as molecular database cleaning, tautomer and stereoisomer enumeration, focused library design and conformer generation. Additionally, graphical reports of the results are provided to the user as a convenient analysis tool.
Affiliation(s)
- José-Manuel Gally
- Institut de Chimie Organique et Analytique (ICOA), Université d'Orléans et CNRS, UMR7311, BP 6759, 55067, Orléans, France
| | - Stéphane Bourg
- Institut de Chimie Organique et Analytique (ICOA), Université d'Orléans et CNRS, UMR7311, BP 6759, 55067, Orléans, France
| | - Quoc-Tuan Do
- Greenpharma SAS., 3, allée du Titane, 45100, Orléans, France
| | - Samia Aci-Sèche
- Institut de Chimie Organique et Analytique (ICOA), Université d'Orléans et CNRS, UMR7311, BP 6759, 55067, Orléans, France
| | - Pascal Bonnet
- Institut de Chimie Organique et Analytique (ICOA), Université d'Orléans et CNRS, UMR7311, BP 6759, 55067, Orléans, France
|
34
|
|
35
|
Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017; 117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 111] [Impact Index Per Article: 15.9] [Indexed: 01/03/2023]
Abstract
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
Affiliation(s)
- Martin Krallinger
- Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, C/Melchor Fernández Almagro 3, Madrid E-28029, Spain
- Obdulia Rabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Anália Lourenço
- ESEI - Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, Ourense E-32004, Spain; Centro de Investigaciones Biomédicas (Centro Singular de Investigación de Galicia), Campus Universitario Lagoas-Marcosende, Vigo E-36310, Spain; CEB-Centre of Biological Engineering, University of Minho, Campus de Gualtar, Braga 4710-057, Portugal
- Julen Oyarzabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Alfonso Valencia
- Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), C/Jordi Girona, 29-31, Barcelona E-08034, Spain; Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona, C/ Baldiri Reixac 10, Barcelona E-08028, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig de Lluís Companys 23, Barcelona E-08010, Spain
|
36
|
Card ML, Gomez-Alvarez V, Lee WH, Lynch DG, Orentas NS, Lee MT, Wong EM, Boethling RS. History of EPI Suite™ and future perspectives on chemical property estimation in US Toxic Substances Control Act new chemical risk assessments. Environ Sci Process Impacts 2017; 19:203-212. [PMID: 28275775 DOI: 10.1039/c7em00064b] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Indexed: 05/21/2023]
Abstract
Chemical property estimation is a key component in many industrial, academic, and regulatory activities, including in the risk assessment associated with the approximately 1000 new chemical pre-manufacture notices the United States Environmental Protection Agency (US EPA) receives annually. The US EPA evaluates fate, exposure and toxicity under the 1976 Toxic Substances Control Act (amended by the 2016 Frank R. Lautenberg Chemical Safety for the 21st Century Act), which does not require test data with new chemical applications. Though the submission of data is not required, the US EPA has, over the past 40 years, occasionally received chemical-specific data with pre-manufacture notices. The US EPA has been actively using this and publicly available data to develop and refine predictive computerized models, most of which are housed in EPI Suite™, to estimate chemical properties used in the risk assessment of new chemicals. The US EPA develops and uses models based on (quantitative) structure-activity relationships ([Q]SARs) to estimate critical parameters. As in any evolving field, (Q)SARs have experienced successes, suffered failures, and responded to emerging trends. Correlations of a chemical structure with its properties or biological activity were first demonstrated in the late 19th century and today have been encapsulated in a myriad of quantitative and qualitative SARs. The development and proliferation of the personal computer in the late 20th century gave rise to a quickly increasing number of property estimation models, and continually improved computing power and connectivity among researchers via the internet are enabling the development of increasingly complex models.
Affiliation(s)
- Marcella L Card
- United States Environmental Protection Agency Office of Pollution Prevention and Toxics, Washington, DC 20004, USA.
- Vicente Gomez-Alvarez
- United States Environmental Protection Agency Office of Pollution Prevention and Toxics, Washington, DC 20004, USA.
- Wen-Hsiung Lee
- United States Environmental Protection Agency Office of Pollution Prevention and Toxics, Washington, DC 20004, USA.
- David G Lynch
- United States Environmental Protection Agency Office of Pollution Prevention and Toxics, Washington, DC 20004, USA.
- Nerija S Orentas
- United States Environmental Protection Agency Office of Pollution Prevention and Toxics, Washington, DC 20004, USA.
- Mari Titcombe Lee
- United States Environmental Protection Agency Office of Pollution Prevention and Toxics, Washington, DC 20004, USA.
- Edmund M Wong
- United States Environmental Protection Agency Office of Pollution Prevention and Toxics, Washington, DC 20004, USA.
|
37
|
Mansouri K, Grulke CM, Richard AM, Judson RS, Williams AJ. An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling. SAR QSAR Environ Res 2016; 27:939-965. [PMID: 27885862 DOI: 10.1080/1062936x.2016.1253611] [Citation(s) in RCA: 81] [Impact Index Per Article: 10.1] [Received: 09/03/2016] [Accepted: 10/24/2016] [Indexed: 05/18/2023]
Abstract
The increasing availability of large collections of chemical structures and associated experimental data provides an opportunity to build robust QSAR models for applications in different fields. One common concern is the quality of both the chemical structure information and associated experimental data. Here we describe the development of an automated KNIME workflow to curate and correct errors in the structure and identity of chemicals using the publicly available PHYSPROP physicochemical properties and environmental fate datasets. The workflow first assembles structure-identity pairs using up to four provided chemical identifiers, including chemical name, CASRNs, SMILES, and MolBlock. Problems detected included errors and mismatches in chemical structure formats, identifiers and various structure validation issues, including hypervalency and stereochemistry descriptions. Subsequently, a machine learning procedure was applied to evaluate the impact of this curation process. The performance of QSAR models built on only the highest-quality subset of the original dataset was compared with the larger curated and corrected dataset. The latter showed statistically improved predictive performance. The final workflow was used to curate the full list of PHYSPROP datasets, and is being made publicly available for further usage and integration by the scientific community.
Affiliation(s)
- K Mansouri
- Oak Ridge Institute for Science and Education (ORISE), Oak Ridge, TN, USA
- US Environmental Protection Agency, Office of Research and Development, National Center for Computational Toxicology, Research Triangle Park, NC, USA
- C M Grulke
- US Environmental Protection Agency, Office of Research and Development, National Center for Computational Toxicology, Research Triangle Park, NC, USA
- A M Richard
- US Environmental Protection Agency, Office of Research and Development, National Center for Computational Toxicology, Research Triangle Park, NC, USA
- R S Judson
- US Environmental Protection Agency, Office of Research and Development, National Center for Computational Toxicology, Research Triangle Park, NC, USA
- A J Williams
- US Environmental Protection Agency, Office of Research and Development, National Center for Computational Toxicology, Research Triangle Park, NC, USA
|
38
|
Ekins S. The Next Era: Deep Learning in Pharmaceutical Research. Pharm Res 2016; 33:2594-603. [PMID: 27599991 DOI: 10.1007/s11095-016-2029-7] [Citation(s) in RCA: 99] [Impact Index Per Article: 12.4] [Received: 07/10/2016] [Accepted: 08/23/2016] [Indexed: 01/22/2023]
Abstract
Over the past decade we have witnessed the increasing sophistication of machine learning algorithms applied in daily use from internet searches, voice recognition, social network software to machine vision software in cameras, phones, robots and self-driving cars. Pharmaceutical research has also seen its fair share of machine learning developments. For example, applying such methods to mine the growing datasets that are created in drug discovery not only enables us to learn from the past but to predict a molecule's properties and behavior in future. The latest machine learning algorithm garnering significant attention is deep learning, which is an artificial neural network with multiple hidden layers. Publications over the last 3 years suggest that this algorithm may have advantages over previous machine learning methods and offer a slight but discernable edge in predictive performance. The time has come for a balanced review of this technique but also to apply machine learning methods such as deep learning across a wider array of endpoints relevant to pharmaceutical research for which the datasets are growing such as physicochemical property prediction, formulation prediction, absorption, distribution, metabolism, excretion and toxicity (ADME/Tox), target prediction and skin permeation, etc. We also show that there are many potential applications of deep learning beyond cheminformatics. It will be important to perform prospective testing (which has been carried out rarely to date) in order to convince skeptics that there will be benefits from investing in this technique.
Affiliation(s)
- Sean Ekins
- Collaborations Pharmaceuticals, Inc, 5616 Hilltop Needmore Road, Fuquay-Varina, North Carolina, 27526, USA; Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, California, 94010, USA.
|
39
|
Richard AM, Judson RS, Houck KA, Grulke CM, Volarath P, Thillainadarajah I, Yang C, Rathman J, Martin MT, Wambaugh JF, Knudsen TB, Kancherla J, Mansouri K, Patlewicz G, Williams AJ, Little SB, Crofton KM, Thomas RS. ToxCast Chemical Landscape: Paving the Road to 21st Century Toxicology. Chem Res Toxicol 2016; 29:1225-51. [PMID: 27367298 DOI: 10.1021/acs.chemrestox.6b00135] [Citation(s) in RCA: 386] [Impact Index Per Article: 48.3] [Indexed: 11/27/2022]
Abstract
The U.S. Environmental Protection Agency's (EPA) ToxCast program is testing a large library of Agency-relevant chemicals using in vitro high-throughput screening (HTS) approaches to support the development of improved toxicity prediction models. Launched in 2007, Phase I of the program screened 310 chemicals, mostly pesticides, across hundreds of ToxCast assay end points. In Phase II, the ToxCast library was expanded to 1878 chemicals, culminating in the public release of screening data at the end of 2013. Subsequent expansion in Phase III has resulted in more than 3800 chemicals actively undergoing ToxCast screening, 96% of which are also being screened in the multi-Agency Tox21 project. The chemical library underpinning these efforts plays a central role in defining the scope and potential application of ToxCast HTS results. The history of the phased construction of EPA's ToxCast library is reviewed, followed by a survey of the library contents from several different vantage points. CAS Registry Numbers are used to assess ToxCast library coverage of important toxicity, regulatory, and exposure inventories. Structure-based representations of ToxCast chemicals are then used to compute physicochemical properties, substructural features, and structural alerts for toxicity and biotransformation. Cheminformatics approaches using these varied representations are applied to defining the boundaries of HTS testability, evaluating chemical diversity, and comparing the ToxCast library to potential target application inventories, such as used in EPA's Endocrine Disruption Screening Program (EDSP). Through several examples, the ToxCast chemical library is demonstrated to provide comprehensive coverage of the knowledge domains and target inventories of potential interest to EPA. Furthermore, the varied representations and approaches presented here define local chemistry domains potentially worthy of further investigation (e.g., not currently covered in the testing library or defined by toxicity "alerts") to strategically support data mining and predictive toxicology modeling moving forward.
Affiliation(s)
- Ann M Richard
- National Center for Computational Toxicology, Office of Research & Development, U.S. Environmental Protection Agency, Mail Code B205-01, Research Triangle Park, Durham, North Carolina 27711, United States
- Richard S Judson
- National Center for Computational Toxicology, Office of Research & Development, U.S. Environmental Protection Agency, Mail Code B205-01, Research Triangle Park, Durham, North Carolina 27711, United States
- Keith A Houck
- National Center for Computational Toxicology, Office of Research & Development, U.S. Environmental Protection Agency, Mail Code B205-01, Research Triangle Park, Durham, North Carolina 27711, United States
- Christopher M Grulke
- National Center for Computational Toxicology, Office of Research & Development, U.S. Environmental Protection Agency, Mail Code B205-01, Research Triangle Park, Durham, North Carolina 27711, United States
- Patra Volarath
- Center for Food Safety and Nutrition, U.S. Food and Drug Administration, 5100 Paint Branch Parkway, College Park, Maryland 20740, United States
- Inthirany Thillainadarajah
- Senior Environmental Employment Program, U.S. Environmental Protection Agency, Research Triangle Park, Durham, North Carolina 27711, United States
- Chihae Yang
- Molecular Networks GmbH, Henkestraße 91, 91052 Erlangen, Germany; Altamira, LLC, 1455 Candlewood Drive, Columbus, Ohio 43235, United States
- James Rathman
- Altamira, LLC, 1455 Candlewood Drive, Columbus, Ohio 43235, United States; Department of Chemical and Biomolecular Engineering, The Ohio State University, 151 W. Woodruff Avenue, Columbus, Ohio 43210, United States
- Matthew T Martin
- National Center for Computational Toxicology, Office of Research & Development, U.S. Environmental Protection Agency, Mail Code B205-01, Research Triangle Park, Durham, North Carolina 27711, United States
- John F Wambaugh
- National Center for Computational Toxicology, Office of Research & Development, U.S. Environmental Protection Agency, Mail Code B205-01, Research Triangle Park, Durham, North Carolina 27711, United States
- Thomas B Knudsen
- National Center for Computational Toxicology, Office of Research & Development, U.S. Environmental Protection Agency, Mail Code B205-01, Research Triangle Park, Durham, North Carolina 27711, United States
- Jayaram Kancherla
- ORISE Fellow, U.S. Environmental Protection Agency, Research Triangle Park, Durham, North Carolina 27711, United States
- Kamel Mansouri
- ORISE Fellow, U.S. Environmental Protection Agency, Research Triangle Park, Durham, North Carolina 27711, United States
- Grace Patlewicz
- National Center for Computational Toxicology, Office of Research & Development, U.S. Environmental Protection Agency, Mail Code B205-01, Research Triangle Park, Durham, North Carolina 27711, United States
- Antony J Williams
- National Center for Computational Toxicology, Office of Research & Development, U.S. Environmental Protection Agency, Mail Code B205-01, Research Triangle Park, Durham, North Carolina 27711, United States
- Stephen B Little
- National Center for Computational Toxicology, Office of Research & Development, U.S. Environmental Protection Agency, Mail Code B205-01, Research Triangle Park, Durham, North Carolina 27711, United States
- Kevin M Crofton
- National Center for Computational Toxicology, Office of Research & Development, U.S. Environmental Protection Agency, Mail Code B205-01, Research Triangle Park, Durham, North Carolina 27711, United States
- Russell S Thomas
- National Center for Computational Toxicology, Office of Research & Development, U.S. Environmental Protection Agency, Mail Code B205-01, Research Triangle Park, Durham, North Carolina 27711, United States
|
40
|
Croset S, Rupp J, Romacker M. Flexible data integration and curation using a graph-based approach. Bioinformatics 2016; 32:918-25. [PMID: 26556384 DOI: 10.1093/bioinformatics/btv644] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 06/05/2015] [Accepted: 10/21/2015] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: The increasing diversity of data available to the biomedical scientist holds promise for better understanding of diseases and discovery of new treatments for patients. In order to provide a complete picture of a biomedical question, data from many different origins needs to be combined into a unified representation. During this data integration process, inevitable errors and ambiguities present in the initial sources compromise the quality of the resulting data warehouse, and greatly diminish the scientific value of the content. Expensive and time-consuming manual curation is then required to improve the quality of the information. However, it becomes increasingly difficult to dedicate and optimize the resources for data integration projects as available repositories are growing both in size and in number every day. RESULTS: We present a new generic methodology to identify problematic records, causing what we describe as 'data hairball' structures. The approach is graph-based and relies on two metrics traditionally used in social sciences: the graph density and the betweenness centrality. We evaluate and discuss these measures and show their relevance for flexible, optimized and automated data curation and linkage. The methodology focuses on information coherence and correctness to improve the scientific meaningfulness of data integration endeavors, such as knowledge bases and large data warehouses. CONTACT: samuel.croset@roche.com. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Samuel Croset
- Roche Innovation Center Basel, F. Hoffmann-La Roche AG, CH-4070 Basel, Switzerland
- Joachim Rupp
- Roche Innovation Center Basel, F. Hoffmann-La Roche AG, CH-4070 Basel, Switzerland
- Martin Romacker
- Roche Innovation Center Basel, F. Hoffmann-La Roche AG, CH-4070 Basel, Switzerland
|
41
|
Ball N, Cronin MTD, Shen J, Blackburn K, Booth ED, Bouhifd M, Donley E, Egnash L, Hastings C, Juberg DR, Kleensang A, Kleinstreuer N, Kroese ED, Lee AC, Luechtefeld T, Maertens A, Marty S, Naciff JM, Palmer J, Pamies D, Penman M, Richarz AN, Russo DP, Stuard SB, Patlewicz G, van Ravenzwaay B, Wu S, Zhu H, Hartung T. Toward Good Read-Across Practice (GRAP) guidance. ALTEX 2016; 33:149-66. [PMID: 26863606 PMCID: PMC5581000 DOI: 10.14573/altex.1601251] [Citation(s) in RCA: 116] [Impact Index Per Article: 14.5] [Received: 01/21/2016] [Accepted: 02/11/2016] [Indexed: 12/04/2022]
Abstract
Grouping of substances and utilizing read-across of data within those groups represents an important data gap filling technique for chemical safety assessments. Categories/analogue groups are typically developed based on structural similarity and, increasingly often, also on mechanistic (biological) similarity. While read-across can play a key role in complying with legislation such as the European REACH regulation, the lack of consensus regarding the extent and type of evidence necessary to support it often hampers its successful application and acceptance by regulatory authorities. Despite a potentially broad user community, expertise is still concentrated across a handful of organizations and individuals. In order to facilitate the effective use of read-across, this document presents the state of the art, summarizes insights learned from reviewing ECHA published decisions regarding the relative successes/pitfalls surrounding read-across under REACH, and compiles the relevant activities and guidance documents. Special emphasis is given to the available existing tools and approaches, an analysis of ECHA's published final decisions associated with all levels of compliance checks and testing proposals, the consideration and expression of uncertainty, the use of biological support data, and the impact of the ECHA Read-Across Assessment Framework (RAAF) published in 2015.
Affiliation(s)
- Mark T D Cronin
- School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, Liverpool, UK
- Jie Shen
- Research Institute for Fragrance Materials, Inc. Woodcliff Lake, NJ, USA
- Ewan D Booth
- Syngenta Ltd, Jealott's Hill International Research Centre, Bracknell, Berkshire, UK
- Mounir Bouhifd
- Johns Hopkins Bloomberg School of Public Health, Center for Alternatives to Animal Testing (CAAT), Baltimore, MD, USA
- Laura Egnash
- Stemina Biomarker Discovery Inc., Madison, WI, USA
- Charles Hastings
- BASF SE, Ludwigshafen am Rhein, Germany, and Research Triangle Park, NC, USA
- Andre Kleensang
- Johns Hopkins Bloomberg School of Public Health, Center for Alternatives to Animal Testing (CAAT), Baltimore, MD, USA
- Nicole Kleinstreuer
- National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, National Institute of Environmental Health Sciences, Research Triangle Park, NC, USA
- E Dinant Kroese
- Risk Analysis for Products in Development, TNO Zeist, The Netherlands
- Adam C Lee
- DuPont Haskell Global Centers for Health and Environmental Sciences, Newark, DE, USA
- Thomas Luechtefeld
- Johns Hopkins Bloomberg School of Public Health, Center for Alternatives to Animal Testing (CAAT), Baltimore, MD, USA
- Alexandra Maertens
- Johns Hopkins Bloomberg School of Public Health, Center for Alternatives to Animal Testing (CAAT), Baltimore, MD, USA
- Sue Marty
- The Dow Chemical Company, Midland, MI, USA
- David Pamies
- Johns Hopkins Bloomberg School of Public Health, Center for Alternatives to Animal Testing (CAAT), Baltimore, MD, USA
- Andrea-Nicole Richarz
- School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, Liverpool, UK
- Daniel P Russo
- Department of Chemistry and Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, USA
- Grace Patlewicz
- US EPA/ORD, National Center for Computational Toxicology, Research Triangle Park, NC, USA
- Shengde Wu
- The Procter and Gamble Co., Cincinnati, OH, USA
- Hao Zhu
- Department of Chemistry and Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, USA
- Thomas Hartung
- Johns Hopkins Bloomberg School of Public Health, Center for Alternatives to Animal Testing (CAAT), Baltimore, MD, USA; University of Konstanz, CAAT-Europe, Konstanz, Germany
|
42
|
Akhondi SA, Muresan S, Williams AJ, Kors JA. Ambiguity of non-systematic chemical identifiers within and between small-molecule databases. J Cheminform 2015; 7:54. [PMID: 26579214 PMCID: PMC4646925 DOI: 10.1186/s13321-015-0102-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 06/17/2015] [Accepted: 10/30/2015] [Indexed: 11/18/2022] Open
Abstract
Background: A wide range of chemical compound databases are currently available for pharmaceutical research. To retrieve compound information, including structures, researchers can query these chemical databases using non-systematic identifiers. These are source-dependent identifiers (e.g., brand names, generic names), which are usually assigned to the compound at the point of registration. The correctness of non-systematic identifiers (i.e., whether an identifier matches the associated structure) can only be assessed manually, which is cumbersome, but it is possible to automatically check their ambiguity (i.e., whether an identifier matches more than one structure). In this study we have quantified the ambiguity of non-systematic identifiers within and between eight widely used chemical databases. We also studied the effect of chemical structure standardization on reducing the ambiguity of non-systematic identifiers. Results: The ambiguity of non-systematic identifiers within databases varied from 0.1 to 15.2% (median 2.5%). Standardization reduced the ambiguity only to a small extent for most databases. A wide range of ambiguity existed for non-systematic identifiers that are shared between databases (17.7–60.2%, median of 40.3%). Removing stereochemistry information provided the largest reduction in ambiguity across databases (median reduction 13.7 percentage points). Conclusions: Ambiguity of non-systematic identifiers within chemical databases is generally low, but ambiguity of non-systematic identifiers that are shared between databases is high. Chemical structure standardization reduces the ambiguity to a limited extent. Our findings can help to improve database integration, curation, and maintenance.
Affiliation(s)
- Saber A Akhondi
- Department of Medical Informatics, Erasmus University Medical Centre, P.O. Box 2040, 3000 CA Rotterdam, The Netherlands
- Sorel Muresan
- Food Control Department, Banat University of Agricultural Sciences and Veterinary Medicine, Calea Aradului 119, 300645 Timisoara, Romania
- Jan A Kors
- Department of Medical Informatics, Erasmus University Medical Centre, P.O. Box 2040, 3000 CA Rotterdam, The Netherlands
|
43
|
Hersey A, Chambers J, Bellis L, Patrícia Bento A, Gaulton A, Overington JP. Chemical databases: curation or integration by user-defined equivalence? Drug Discov Today Technol 2015; 14:17-24. [PMID: 26194583 PMCID: PMC6294287 DOI: 10.1016/j.ddtec.2015.01.005] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Received: 08/20/2014] [Revised: 01/15/2015] [Accepted: 01/16/2015] [Indexed: 11/30/2022]
Abstract
There is a wealth of valuable chemical information in publicly available databases for use by scientists undertaking drug discovery. However, finite curation resources, limitations of chemical structure software, and differences in individual database applications mean that exact chemical structure equivalence between databases is unlikely to ever be a reality. The ability to identify compound equivalence has been made significantly easier by the use of the International Chemical Identifier (InChI), a non-proprietary line notation for describing a chemical structure. More importantly, advances in methods to identify compounds that are the same at various levels of similarity, such as those containing the same parent component or having the same connectivity, are now enabling related compounds to be linked between databases where the structure matches are not exact.
Affiliation(s)
- Anne Hersey
- European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.
- Jon Chambers
- European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- Louisa Bellis
- European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- A Patrícia Bento
- European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- Anna Gaulton
- European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- John P Overington
- European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
|
44
|
Tarasova OA, Urusova AF, Filimonov DA, Nicklaus MC, Zakharov AV, Poroikov VV. QSAR Modeling Using Large-Scale Databases: Case Study for HIV-1 Reverse Transcriptase Inhibitors. J Chem Inf Model 2015; 55:1388-99. [PMID: 26046311 DOI: 10.1021/acs.jcim.5b00019] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Indexed: 11/30/2022]
Abstract
Large-scale databases are important sources of training sets for various QSAR modeling approaches. Generally, these databases contain information extracted from different sources. This variety of sources can produce inconsistency in the data, defined as sometimes widely diverging activity results for the same compound against the same target. Because such inconsistency can reduce the accuracy of predictive models built from these data, we are addressing the question of how best to use data from publicly and commercially accessible databases to create accurate and predictive QSAR models. We investigate the suitability of commercially and publicly available databases to QSAR modeling of antiviral activity (HIV-1 reverse transcriptase (RT) inhibition). We present several methods for the creation of modeling (i.e., training and test) sets from two, either commercially or freely available, databases: Thomson Reuters Integrity and ChEMBL. We found that the typical predictivities of QSAR models obtained using these different modeling set compilation methods differ significantly from each other. The best results were obtained using training sets compiled for compounds tested using only one method and material (i.e., a specific type of biological assay). Compound sets aggregated by target only typically yielded poorly predictive models. We discuss the possibility of "mix-and-matching" assay data across aggregating databases such as ChEMBL and Integrity and their current severe limitations for this purpose. One of them is the general lack of complete and semantic/computer-parsable descriptions of assay methodology carried by these databases that would allow one to determine mix-and-matchability of result sets at the assay level.
Collapse
Affiliation(s)
- Olga A Tarasova
- †Institute of Biochemical Chemistry, 10-8, Pogodinskaya St., 119121, Moscow, Russia
| | - Aleksandra F Urusova
- †Institute of Biochemical Chemistry, 10-8, Pogodinskaya St., 119121, Moscow, Russia
| | - Dmitry A Filimonov
- †Institute of Biochemical Chemistry, 10-8, Pogodinskaya St., 119121, Moscow, Russia
| | - Marc C Nicklaus
- ‡CADD Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, DHHS, NCI-Frederick, 376 Boyles St., Frederick, Maryland 21702, United States
| | - Alexey V Zakharov
- ‡CADD Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, DHHS, NCI-Frederick, 376 Boyles St., Frederick, Maryland 21702, United States
| | - Vladimir V Poroikov
- †Institute of Biochemical Chemistry, 10-8, Pogodinskaya St., 119121, Moscow, Russia
|
45
|
Ai N, Fan X, Ekins S. In silico methods for predicting drug-drug interactions with cytochrome P-450s, transporters and beyond. Adv Drug Deliv Rev 2015; 86:46-60. [PMID: 25796619 DOI: 10.1016/j.addr.2015.03.006] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 09/19/2014] [Revised: 01/05/2015] [Accepted: 03/11/2015] [Indexed: 12/13/2022]
Abstract
Drug-drug interactions (DDIs) are associated with severe adverse effects that may lead to the patient requiring alternative therapeutics and could ultimately lead to drug withdrawal from the market if they are severe. To prevent the occurrence of DDI in the clinic, experimental systems to evaluate drug interaction have been integrated into the various stages of the drug discovery and development process. A large body of knowledge about DDI has also accumulated through these studies and pharmacovigilance systems. Much of this work to date has focused on the drug metabolizing enzymes such as cytochrome P-450s as well as drug transporters, ion channels and occasionally other proteins. This combined knowledge provides a foundation for a hypothesis-driven in silico approach, using either cheminformatics or physiologically based pharmacokinetics (PK) modeling methods to assess DDI potential. Here we review recent advances in these approaches with emphasis on hypothesis-driven mechanistic models for important protein targets involved in PK-based DDI. Recent efforts with other informatics approaches to detect DDI are highlighted. Besides DDI, we also briefly introduce drug interactions with other substances, such as Traditional Chinese Medicines, to illustrate how in silico modeling can be useful in this domain. We also summarize valuable data sources and web-based tools that are available for DDI prediction. Finally, we explore the challenges faced by in silico approaches for predicting DDI and propose future directions to make these computational models more reliable, accurate, and publicly accessible.
Affiliation(s)
- Ni Ai
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, 866 Yuhangtang Road, Hangzhou, Zhejiang 310058, PR China
| | - Xiaohui Fan
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, 866 Yuhangtang Road, Hangzhou, Zhejiang 310058, PR China.
| | - Sean Ekins
- Collaborations in Chemistry, 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526, USA.
|
46
|
Karapetyan K, Batchelor C, Sharpe D, Tkachenko V, Williams AJ. The Chemical Validation and Standardization Platform (CVSP): large-scale automated validation of chemical structure datasets. J Cheminform 2015; 7:30. [PMID: 26155308 PMCID: PMC4494041 DOI: 10.1186/s13321-015-0072-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 10/28/2014] [Accepted: 04/28/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND There are presently hundreds of online databases hosting millions of chemical compounds and associated data. As a result of the number of cheminformatics software tools that can be used to produce the data, subtle differences between the various cheminformatics platforms, as well as the naivety of the software users, there are a myriad of issues that can exist with chemical structure representations online. In order to help facilitate validation and standardization of chemical structure datasets from various sources we have delivered a freely available internet-based platform to the community for the processing of chemical compound datasets. RESULTS The chemical validation and standardization platform (CVSP) both validates and standardizes chemical structure representations according to sets of systematic rules. The chemical validation algorithms detect issues with submitted molecular representations using pre-defined or user-defined dictionary-based molecular patterns that are chemically suspicious or potentially requiring manual review. Each identified issue is assigned one of three levels of severity - Information, Warning, and Error - in order to conveniently inform the user of the need to browse and review subsets of their data. The validation process includes validation of atoms and bonds (e.g., making aware of query atoms and bonds), valences, and stereo. The standard form of submission of collections of data, the SDF file, allows the user to map the data fields to predefined CVSP fields for the purpose of cross-validating associated SMILES and InChIs with the connection tables contained within the SDF file. This platform has been applied to the analysis of a large number of data sets prepared for deposition to our ChemSpider database and in preparation of data for the Open PHACTS project. 
In this work we review the results of the automated validation of the DrugBank dataset, a popular drug and drug target database utilized by the community, and the ChEMBL 17 data set. The CVSP web site is located at http://cvsp.chemspider.com/. CONCLUSION A platform for the validation and standardization of chemical structure representations of various formats has been developed and made available to the community to assist and encourage the processing of chemical structure files to produce more homogeneous compound representations for exchange and interchange between online databases. While the CVSP platform is designed with flexibility inherent to the rules that can be used for processing the data, we have produced a recommended rule set based on our own experiences with large data sets such as DrugBank, ChEMBL, and data sets from ChemSpider.
Affiliation(s)
- Karen Karapetyan
- Royal Society of Chemistry, US Office, 904 Tamaras Circle, Wake Forest, NC 27587 USA
| | - Colin Batchelor
- Thomas Graham House, Science Park, 290 Milton Road, Cambridge, UK
| | - David Sharpe
- Thomas Graham House, Science Park, 290 Milton Road, Cambridge, UK
| | - Valery Tkachenko
- Royal Society of Chemistry, US Office, 904 Tamaras Circle, Wake Forest, NC 27587 USA
| | - Antony J Williams
- Royal Society of Chemistry, US Office, 904 Tamaras Circle, Wake Forest, NC 27587 USA
- Environmental Protection Agency, Research Triangle Park, NC USA
|
47
|
Warr WA. Many InChIs and quite some feat. J Comput Aided Mol Des 2015; 29:681-94. [PMID: 26081259 DOI: 10.1007/s10822-015-9854-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 04/28/2015] [Accepted: 06/10/2015] [Indexed: 12/14/2022]
Affiliation(s)
- Wendy A Warr
- Wendy Warr & Associates, Holmes Chapel, Crewe, Cheshire, CW4 7HZ, UK
|
48
|
Brito-Sánchez Y, Marrero-Ponce Y, Barigye SJ, Yaber-Goenaga I, Morell Pérez C, Le-Thi-Thu H, Cherkasov A. Towards Better BBB Passage Prediction Using an Extensive and Curated Data Set. Mol Inform 2015; 34:308-30. [PMID: 27490276 DOI: 10.1002/minf.201400118] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Received: 08/29/2014] [Accepted: 01/20/2015] [Indexed: 12/25/2022]
Abstract
In the present report, the challenging task of drug delivery across the blood-brain barrier (BBB) is addressed via a computational approach. The BBB passage was modeled using classification and regression schemes on a novel extensive and curated data set (the largest to the best of our knowledge) in terms of log BB. Prior to the model development, steps of data analysis that comprise chemical data curation, structural, cutoff and cluster analysis (CA) were conducted. Linear Discriminant Analysis (LDA) and Multiple Linear Regression (MLR) were used to fit classification and correlation functions. The best LDA-based model showed overall accuracies over 85 % and 83 % for the training and test sets, respectively. An MLR-based model that explains more than 69 % of the variance in the experimental log BB was also developed. A brief and general interpretation of the proposed models allowed an estimation of how 'near' our computational approach is to the factors that determine the passage of molecules through the BBB. In a final effort, some popular and powerful Machine Learning methods were considered. Comparable or similar performance was observed with respect to the simpler linear techniques. Most of the compounds with anomalous behavior were put aside into a set denoted as the controversial set, and a discussion regarding these compounds is provided. Finally, our results were compared with methodologies previously reported in the literature, showing comparable to better results. The results could represent useful tools, available to and reproducible by the whole scientific community, in the early stages of neuropharmaceutical drug discovery/development projects.
Affiliation(s)
- Yoan Brito-Sánchez
- Vancouver Prostate Centre, University of British Columbia, Vancouver, British Columbia, V6H 3Z6, Canada; Unit of Computer-Aided Molecular "Biosilico" Discovery and Bioinformatic Research, International Network (CAMD-BIR International Network), Los Laureles L76MD, Nuevo Bosque, 130015, Cartagena de Indias, Bolivar, Colombia. Homepage: http://www.uv.es/yoma/ Homepage: http://sites.google.com/site/ymponce/home
| | - Yovani Marrero-Ponce
- Unit of Computer-Aided Molecular "Biosilico" Discovery and Bioinformatic Research, International Network (CAMD-BIR International Network), Los Laureles L76MD, Nuevo Bosque, 130015, Cartagena de Indias, Bolivar, Colombia. Homepage: http://www.uv.es/yoma/ Homepage: http://sites.google.com/site/ymponce/home; Grupo de Investigación en Estudios Químicos y Biológicos, Facultad de Ciencias Básicas, Universidad Tecnológica de Bolívar, Parque Industrial y Tecnológico Carlos Vélez Pombo Km 1 vía Turbaco, 130010, Cartagena de Indias, Bolívar, Colombia; Facultad de Química Farmacéutica, Universidad de Cartagena, Cartagena de Indias, Bolívar, Colombia
| | - Stephen J Barigye
- Unit of Computer-Aided Molecular "Biosilico" Discovery and Bioinformatic Research, International Network (CAMD-BIR International Network), Los Laureles L76MD, Nuevo Bosque, 130015, Cartagena de Indias, Bolivar, Colombia. Homepage: http://www.uv.es/yoma/ Homepage: http://sites.google.com/site/ymponce/home; Department of Chemistry, Federal University of Lavras, P.O. Box 3037, 37200-000, Lavras, MG, Brazil
| | - Iván Yaber-Goenaga
- Grupo de Investigación en Estudios Químicos y Biológicos, Facultad de Ciencias Básicas, Universidad Tecnológica de Bolívar, Parque Industrial y Tecnológico Carlos Vélez Pombo Km 1 vía Turbaco, 130010, Cartagena de Indias, Bolívar, Colombia
| | - Carlos Morell Pérez
- Center of Studies on Informatics, Universidad "Marta Abreu" de Las Villas, Santa Clara, 54830, Villa Clara, Cuba
| | - Huong Le-Thi-Thu
- School of Medicine and Pharmacy, Vietnam National University, Hanoi (VNU) 144 Xuan Thuy, CauGiay, Hanoi, Vietnam
| | - Artem Cherkasov
- Vancouver Prostate Centre, University of British Columbia, Vancouver, British Columbia, V6H 3Z6, Canada
|
49
|
Clark AM, Williams AJ, Ekins S. Machines first, humans second: on the importance of algorithmic interpretation of open chemistry data. J Cheminform 2015; 7:9. [PMID: 25798198 PMCID: PMC4369291 DOI: 10.1186/s13321-015-0057-7] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 11/24/2014] [Accepted: 02/23/2015] [Indexed: 11/12/2022] Open
Abstract
The current rise in the use of open lab notebook techniques means that there are an increasing number of scientists who make chemical information freely and openly available to the entire community as a series of micropublications that are released shortly after the conclusion of each experiment. We propose that this trend be accompanied by a thorough examination of data sharing priorities. We argue that the most significant immediate beneficiary of open data is in fact chemical algorithms, which are capable of absorbing vast quantities of data, and using it to present concise insights to working chemists, on a scale that could not be achieved by traditional publication methods. Making this goal practically achievable will require a paradigm shift in the way individual scientists translate their data into digital form, since most contemporary methods of data entry are designed for presentation to humans rather than consumption by machine learning algorithms. We discuss some of the complex issues involved in fixing current methods, as well as some of the immediate benefits that can be gained when open data is published correctly using unambiguous machine readable formats. Lab notebook entries must target both visualisation by scientists and use by machine learning algorithms.
Affiliation(s)
- Alex M Clark
- Molecular Materials Informatics, 1900 St. Jacques #302, Montreal, H3J 2S1, QC Canada
| | - Antony J Williams
- Royal Society of Chemistry, 904 Tamaras Circle, Wake Forest, NC 27587 USA
| | - Sean Ekins
- Collaborations in Chemistry, 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526 USA ; Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, CA 94010 USA
|
50
|
Ekins S, Litterman NK, Arnold RJG, Burgess RW, Freundlich JS, Gray SJ, Higgins JJ, Langley B, Willis DE, Notterpek L, Pleasure D, Sereda MW, Moore A. A brief review of recent Charcot-Marie-Tooth research and priorities. F1000Res 2015; 4:53. [PMID: 25901280 PMCID: PMC4392824 DOI: 10.12688/f1000research.6160.1] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Accepted: 02/24/2015] [Indexed: 12/14/2022] Open
Abstract
This brief review of current research progress on Charcot-Marie-Tooth (CMT) disease is a summary of discussions initiated at the Hereditary Neuropathy Foundation (HNF) scientific advisory board meeting on November 7, 2014. It covers recent published and unpublished in vitro and in vivo research. We discuss recent promising preclinical work for CMT1A, the development of new biomarkers, the characterization of different animal models, and the analysis of the frequency of gene mutations in patients with CMT. We also describe how progress in related fields may benefit CMT therapeutic development, including the potential of gene therapy and stem cell research. We also discuss the potential to assess and improve the quality of life of CMT patients. This summary of CMT research identifies some of the gaps which may have an impact on upcoming clinical trials. We provide some priorities for CMT research and areas which HNF can support. The goal of this review is to inform the scientific community about ongoing research and to avoid unnecessary overlap, while also highlighting areas ripe for further investigation. The general collaborative approach we have taken may be useful for other rare neurological diseases.
Affiliation(s)
- Sean Ekins
- Hereditary Neuropathy Foundation, New York, NY, 10016, USA ; Collaborations in Chemistry, Fuquay Varina, NC, 27526, USA ; Collaborative Drug Discovery, Burlingame, CA, 94010, USA
| | | | - Renée J G Arnold
- Arnold Consultancy & Technology LLC, New York, NY, 10023, USA ; Master of Public Health Program, Mount Sinai School of Medicine, New York, NY, 10029, USA ; Quorum Consulting, Inc, San Francisco, CA, 94104, USA
| | - Robert W Burgess
- The Jackson Laboratory, Bar Harbor, ME, 04609, USA
| | - Joel S Freundlich
- Department of Medicine, Center for Emerging and Reemerging Pathogens, Rutgers University - New Jersey Medical School, Newark, NJ, 07103, USA
| | - Steven J Gray
- Gene Therapy Center and Dept. of Ophthalmology, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599-7352, USA
| | | | - Brett Langley
- Burke-Cornell Medical Research Institute, White Plains, NY, 10605, USA ; Department of Neurology and Neuroscience, Weill Medical College of Cornell University, New York, NY, 10065, USA
| | - Dianna E Willis
- Burke-Cornell Medical Research Institute, White Plains, NY, 10605, USA
| | - Lucia Notterpek
- Department of Neuroscience, College of Medicine, McKnight Brain Institute, University of Florida, Gainesville, FL, 32611, USA
| | - David Pleasure
- Institute for Pediatric Regenerative Medicine, University of California Davis, School of Medicine, Sacramento, CA, 95817, USA ; Department of Neurology, University of California, Davis, School of Medicine, c/o Shriners Hospital, Sacramento, CA, 95817, USA
| | - Michael W Sereda
- Department of Neurogenetics, Max Planck Institute (MPI) of Experimental Medicine, Göttingen, 37075, Germany ; Department of Clinical Neurophysiology, University Medical Center (UMG), Göttingen, D-37075, Germany
| | - Allison Moore
- Hereditary Neuropathy Foundation, New York, NY, 10016, USA
|