1
Mansouri K, Moreira-Filho JT, Lowe CN, Charest N, Martin T, Tkachenko V, Judson R, Conway M, Kleinstreuer NC, Williams AJ. Free and open-source QSAR-ready workflow for automated standardization of chemical structures in support of QSAR modeling. J Cheminform 2024; 16:19. [PMID: 38378618 PMCID: PMC10880251 DOI: 10.1186/s13321-024-00814-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2023] [Accepted: 02/10/2024] [Indexed: 02/22/2024] [Open Access]
Abstract
The rapid increase of publicly available chemical structures and associated experimental data presents a valuable opportunity to build robust QSAR models for applications in different fields. However, a common concern is the quality of both the chemical structure information and the associated experimental data. This is especially true when those data are collected from multiple sources, as chemical substance mappings can contain many duplicate structures and molecular inconsistencies. Such issues can impact the resulting molecular descriptors and their mappings to experimental data and, subsequently, the quality of the derived models in terms of accuracy, repeatability, and reliability. Herein we describe the development of an automated workflow to standardize chemical structures according to a set of standard rules and generate two- and/or three-dimensional "QSAR-ready" forms prior to the calculation of molecular descriptors. The workflow was designed in the KNIME workflow environment and consists of three high-level steps. First, a structure encoding is read, and then the resulting in-memory representation is cross-referenced with any existing identifiers for consistency. Finally, the structure is standardized using a series of operations including desalting, stripping of stereochemistry (for two-dimensional structures), standardization of tautomers and nitro groups, valence correction, neutralization when possible, and then removal of duplicates. This workflow was initially developed to support collaborative QSAR modeling projects and to ensure consistency of the results from the different participants. It was then updated and generalized for other modeling applications. This included modification of the "QSAR-ready" workflow to generate "MS-ready structures" to support the generation of substance mappings and searches for software applications related to non-targeted analysis mass spectrometry.
Both QSAR and MS-ready workflows are freely available in KNIME, via standalone versions on GitHub, and as docker container resources for the scientific community. Scientific contribution: This work pioneers an automated workflow in KNIME, systematically standardizing chemical structures to ensure their readiness for QSAR modeling and broader scientific applications. By addressing data quality concerns through desalting, stereochemistry stripping, and normalization, it optimizes molecular descriptors' accuracy and reliability. The freely available resources in KNIME, GitHub, and docker containers democratize access, benefiting collaborative research and advancing diverse modeling endeavors in chemistry and mass spectrometry.
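The order of operations described in this abstract (desalt, strip stereochemistry, remove duplicates) can be illustrated with a toy, string-level sketch. This is not the KNIME workflow itself: the function names are hypothetical, the "largest fragment" heuristic and the character stripping are assumptions made for illustration, and tautomer standardization, valence correction, and neutralization are omitted because they need a cheminformatics toolkit that actually parses the structure.

```python
def desalt(smiles: str) -> str:
    """Keep the largest '.'-separated component (assumed heuristic: most letters)."""
    fragments = smiles.split(".")
    return max(fragments, key=lambda frag: sum(ch.isalpha() for ch in frag))


def strip_stereo(smiles: str) -> str:
    """Drop SMILES stereo markers (@, /, \\) to obtain a two-dimensional form."""
    for marker in ("@", "/", "\\"):
        smiles = smiles.replace(marker, "")
    return smiles


def qsar_ready(smiles_list):
    """Desalt, strip stereochemistry, then remove duplicates, preserving input order."""
    seen = set()
    unique = []
    for smiles in smiles_list:
        standardized = strip_stereo(desalt(smiles))
        if standardized not in seen:
            seen.add(standardized)
            unique.append(standardized)
    return unique


# Hypothetical input: sodium acetate, acetate, and both enantiomers of alanine.
records = ["CC(=O)[O-].[Na+]", "CC(=O)[O-]", "C[C@H](N)C(=O)O", "C[C@@H](N)C(=O)O"]
print(qsar_ready(records))
```

With this input the four records collapse to two entries, because the salt and the parent anion, and the two enantiomers, become identical strings. Plain string comparison only catches duplicates that happen to be written the same way; a real pipeline would compare a canonical representation produced by a toolkit.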
Affiliation(s)
- Kamel Mansouri
  - National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, National Institute of Environmental Health Sciences, Research Triangle Park, NC, 27709, USA
- José T Moreira-Filho
  - National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, National Institute of Environmental Health Sciences, Research Triangle Park, NC, 27709, USA
- Charles N Lowe
  - Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
- Nathaniel Charest
  - Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
- Todd Martin
  - Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
- Richard Judson
  - Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
- Mike Conway
  - National Institute of Environmental Health Sciences, Research Triangle Park, NC, 27709, USA
- Nicole C Kleinstreuer
  - National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, National Institute of Environmental Health Sciences, Research Triangle Park, NC, 27709, USA
- Antony J Williams
  - Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
2
Bento AP, Hersey A, Félix E, Landrum G, Gaulton A, Atkinson F, Bellis LJ, De Veij M, Leach AR. An open source chemical structure curation pipeline using RDKit. J Cheminform 2020; 12:51. [PMID: 33431044 PMCID: PMC7458899 DOI: 10.1186/s13321-020-00456-1] [Citation(s) in RCA: 144] [Impact Index Per Article: 36.0] [Received: 06/11/2020] [Accepted: 08/24/2020] [Indexed: 11/13/2022] [Open Access]
Abstract
BACKGROUND: The ChEMBL database is one of a number of public databases that contain bioactivity data on small molecule compounds curated from diverse sources. Incoming compounds are typically not standardised according to consistent rules. In order to maintain the quality of the final database and to easily compare and integrate data on the same compound from different sources, it is necessary for the chemical structures in the database to be appropriately standardised. RESULTS: A chemical curation pipeline has been developed using the open source toolkit RDKit. It comprises three components: a Checker to test the validity of chemical structures and flag any serious errors; a Standardizer which formats compounds according to defined rules and conventions; and a GetParent component that removes any salts and solvents from the compound to create its parent. This pipeline has been applied to the latest version of the ChEMBL database as well as uncurated datasets from other sources to test the robustness of the process and to identify common issues in database molecular structures. CONCLUSION: All the components of the structure pipeline have been made freely available for other researchers to use and adapt for their own use. The code is available in a GitHub repository and it can also be accessed via the ChEMBL Beaker webservices. It has been used successfully to standardise the nearly 2 million compounds in the ChEMBL database, and the compound validity checker has been used to identify compounds with the most serious issues so that they can be prioritised for manual curation.
Affiliation(s)
- A Patrícia Bento
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, Cambridgeshire, UK
- Anne Hersey
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, Cambridgeshire, UK
- Eloy Félix
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, Cambridgeshire, UK
- Anna Gaulton
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, Cambridgeshire, UK
- Francis Atkinson
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, Cambridgeshire, UK
  - The Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge, CB2 1EZ, UK
- Louisa J Bellis
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, Cambridgeshire, UK
  - Department of Oncology, University of Cambridge, Cambridge, UK
- Marleen De Veij
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, Cambridgeshire, UK
- Andrew R Leach
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, CB10 1SD, Cambridgeshire, UK
3
Baker CM, Kidley NJ, Papachristos K, Hotson M, Carson R, Gravestock D, Pouliot M, Harrison J, Dowling A. Tautomer Standardization in Chemical Databases: Deriving Business Rules from Quantum Chemistry. J Chem Inf Model 2020; 60:3781-3791. [PMID: 32644790 DOI: 10.1021/acs.jcim.0c00232] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 01/18/2023]
Abstract
Databases of small, potentially bioactive molecules are ubiquitous across the industry and academia. Designed such that each unique compound should appear only once, the multiplicity of ways in which many compounds can be represented means that these databases require methods for standardizing the representation of chemistry. This is commonly achieved through the use of "Chemistry Business Rules", sets of predefined rules that describe the "house style" of the database in question. At Syngenta, the historical approach to the design of chemistry business rules has been to focus on consistency of representation, with chemical relevance given secondary consideration. In this work, we overturn that convention. Through the use of quantum chemistry calculations, we define a set of chemistry business rules for tautomer standardization that reproduces gas-phase energetic preferences. We go on to show that, compared to our historic approach, this method yields tautomers that are in better agreement with those observed experimentally in condensed phases and that are better suited for use in predictive models.
Affiliation(s)
- Christopher M Baker
  - Syngenta, Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, U.K.
- Nathan J Kidley
  - Syngenta, Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, U.K.
- Matthew Hotson
  - Syngenta, Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, U.K.
- Rob Carson
  - Syngenta, Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, U.K.
- David Gravestock
  - Syngenta, Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, U.K.
- Martin Pouliot
  - Syngenta Crop Protection, Schaffhauserstrasse, Stein CH-4332, Switzerland
- Jim Harrison
  - Datacraft Technologies, 110 Parkwood Place, Anstead, QLD 4070, Australia
- Alan Dowling
  - Syngenta, Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, U.K.
4
Dalecki AG, Zorn KM, Clark AM, Ekins S, Narmore WT, Tower N, Rasmussen L, Bostwick R, Kutsch O, Wolschendorf F. High-throughput screening and Bayesian machine learning for copper-dependent inhibitors of Staphylococcus aureus. Metallomics 2020; 11:696-706. [PMID: 30839007 DOI: 10.1039/c8mt00342d] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Indexed: 12/20/2022]
Abstract
One potential source of new antibacterials is the probing of existing chemical libraries for copper-dependent inhibitors (CDIs), i.e., molecules with antibiotic activity only in the presence of copper. Recently, our group demonstrated that previously unknown staphylococcal CDIs were frequently present in a small pilot screen. Here, we report the outcome of a larger industrial anti-staphylococcal screen consisting of 40,771 compounds assayed in parallel, both in standard and in copper-supplemented media. Ultimately, 483 had confirmed copper-dependent IC50 values under 50 μM. Sphere-exclusion clustering revealed that these hits were largely dominated by sulfur-containing motifs, including benzimidazole-2-thiones, thiadiazines, thiazoline formamides, triazino-benzimidazoles, and pyridinyl thieno-pyrimidines. Structure-activity relationship analysis of the pyridinyl thieno-pyrimidines generated multiple improved CDIs, with activity likely dependent on ligand/ion coordination. Molecular fingerprint-based Bayesian classification models were built using Discovery Studio and Assay Central, a new platform for sharing and distributing cheminformatic models in a portable format, based on open-source tools. Finally, we used the latter model to evaluate a library of FDA-approved drugs for copper-dependent activity in silico. Two anti-helminths, albendazole and thiabendazole, scored highly and are known to coordinate copper ions, further validating the model's applicability.
Affiliation(s)
- Alex G Dalecki
  - Department of Medicine, Division of Infectious Diseases, University of Alabama at Birmingham, BBRB 562, 845 19th St S, Birmingham, AL 35294, USA
5
Fantke P, Aurisano N, Provoost J, Karamertzanis PG, Hauschild M. Toward effective use of REACH data for science and policy. Environ Int 2020; 135:105336. [PMID: 31884133 DOI: 10.1016/j.envint.2019.105336] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 06/18/2019] [Revised: 11/15/2019] [Accepted: 11/15/2019] [Indexed: 06/10/2023]
Affiliation(s)
- Peter Fantke
  - Quantitative Sustainability Assessment, Department of Technology, Management and Economics, Technical University of Denmark, Produktionstorvet 424, 2800 Kgs. Lyngby, Denmark
- Nicolò Aurisano
  - Quantitative Sustainability Assessment, Department of Technology, Management and Economics, Technical University of Denmark, Produktionstorvet 424, 2800 Kgs. Lyngby, Denmark
- Jeroen Provoost
  - Computational Assessment Unit, Directorate of Prioritisation and Integration, European Chemicals Agency, Annankatu 18, 00121 Helsinki, Finland
- Panagiotis G Karamertzanis
  - Computational Assessment Unit, Directorate of Prioritisation and Integration, European Chemicals Agency, Annankatu 18, 00121 Helsinki, Finland
- Michael Hauschild
  - Quantitative Sustainability Assessment, Department of Technology, Management and Economics, Technical University of Denmark, Produktionstorvet 424, 2800 Kgs. Lyngby, Denmark
6
Grulke CM, Williams AJ, Thillanadarajah I, Richard AM. EPA's DSSTox database: History of development of a curated chemistry resource supporting computational toxicology research. Comput Toxicol 2019; 12:100096. [PMID: 33426407 PMCID: PMC7787967 DOI: 10.1016/j.comtox.2019.100096] [Citation(s) in RCA: 94] [Impact Index Per Article: 18.8] [Indexed: 12/13/2022]
Abstract
The US Environmental Protection Agency's (EPA) Distributed Structure-Searchable Toxicity (DSSTox) database, launched publicly in 2004, currently exceeds 875 K substances spanning hundreds of lists of interest to EPA and environmental researchers. From its inception, DSSTox has focused curation efforts on resolving chemical identifier errors and conflicts in the public domain towards the goal of assigning accurate chemical structures to data and lists of importance to the environmental research and regulatory community. Accurate structure-data associations, in turn, are necessary inputs to structure-based predictive models supporting hazard and risk assessments. In 2014, the legacy, manually curated DSSTox_V1 content was migrated to a MySQL data model, with modern cheminformatics tools supporting both manual and automated curation processes to increase efficiencies. This was followed by sequential auto-loads of filtered portions of three public datasets: EPA's Substance Registry Services (SRS), the National Library of Medicine's ChemID, and PubChem. This process was constrained by a key requirement of uniquely mapped identifiers (i.e., CAS RN, name and structure) for each substance, rejecting content where any two identifiers were conflicted either within or across datasets. This rejected content highlighted the degree of conflicting, inaccurate substance-structure ID mappings in the public domain, ranging from 12% (within EPA SRS) to 49% (across ChemID and PubChem). Substances successfully added to DSSTox from each auto-load were assigned to one of five qc_levels, conveying curator confidence in each dataset. This process enabled a significant expansion of DSSTox content to provide better coverage of the chemical landscape of interest to environmental scientists, while retaining focus on the accuracy of substance-structure-data associations. 
Currently, DSSTox serves as the core foundation of EPA's CompTox Chemicals Dashboard [https://comptox.epa.gov/dashboard], which provides public access to DSSTox content in support of a broad range of modeling and research activities within EPA and, increasingly, across the field of computational toxicology.
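The "uniquely mapped identifiers" requirement in this abstract can be sketched as a simple consistency filter. The function name, the tuple layout, and the example records below are assumptions for illustration only; the actual DSSTox loading process is not described at this level of detail in the abstract.

```python
from collections import defaultdict


def split_conflicted(records):
    """Split (casrn, name, structure) tuples into accepted and rejected lists.

    A record is rejected when any one of its identifiers is also seen
    paired with a different value of another identifier.
    """
    fields = range(3)
    # (i, j, value of field i) -> every value of field j seen alongside it
    seen = defaultdict(set)
    for rec in records:
        for i in fields:
            for j in fields:
                if i != j:
                    seen[(i, j, rec[i])].add(rec[j])

    accepted, rejected = [], []
    for rec in records:
        conflicted = any(
            len(seen[(i, j, rec[i])]) > 1
            for i in fields
            for j in fields
            if i != j
        )
        (rejected if conflicted else accepted).append(rec)
    return accepted, rejected


# Illustrative records: the same CAS RN is attached to two different structures.
records = [
    ("50-00-0", "formaldehyde", "C=O"),
    ("71-43-2", "benzene", "c1ccccc1"),
    ("71-43-2", "benzene", "C1CCCCC1"),
]
accepted, rejected = split_conflicted(records)
print(len(accepted), "accepted;", len(rejected), "rejected")
```

Both records sharing the conflicted CAS RN are rejected rather than one being chosen arbitrarily, which mirrors the abstract's description of rejecting content where any two identifiers conflict.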
Affiliation(s)
- Christopher M Grulke
  - National Center for Computational Toxicology, Office of Research & Development, US Environmental Protection Agency, Mail Drop D143-02, Research Triangle Park, NC 27711, USA
- Antony J Williams
  - National Center for Computational Toxicology, Office of Research & Development, US Environmental Protection Agency, Mail Drop D143-02, Research Triangle Park, NC 27711, USA
- Inthirany Thillanadarajah
  - Senior Environmental Employment Program, US Environmental Protection Agency, Research Triangle Park, NC 27711, USA
- Ann M Richard
  - National Center for Computational Toxicology, Office of Research & Development, US Environmental Protection Agency, Mail Drop D143-02, Research Triangle Park, NC 27711, USA
7
Lane T, Russo DP, Zorn KM, Clark AM, Korotcov A, Tkachenko V, Reynolds RC, Perryman AL, Freundlich JS, Ekins S. Comparing and Validating Machine Learning Models for Mycobacterium tuberculosis Drug Discovery. Mol Pharm 2018; 15:4346-4360. [PMID: 29672063 PMCID: PMC6167198 DOI: 10.1021/acs.molpharmaceut.8b00083] [Citation(s) in RCA: 64] [Impact Index Per Article: 10.7] [Indexed: 12/31/2022]
Abstract
Tuberculosis is a global health dilemma. In 2016, the WHO reported 10.4 million cases and 1.7 million deaths. The need to develop new treatments for those infected with Mycobacterium tuberculosis (Mtb) has led to many large-scale phenotypic screens and many thousands of new active compounds identified in vitro. However, with limited funding, efforts to discover new active molecules against Mtb need to be more efficient. Several computational machine learning approaches have been shown to have good enrichment and hit rates. We have curated small molecule Mtb data and developed new models with a total of 18,886 molecules with activity cutoffs of 10 μM, 1 μM, and 100 nM. These data sets were used to evaluate different machine learning methods (including deep learning) and metrics and to generate predictions for additional molecules published in 2017. One Mtb model, a combined in vitro and in vivo data Bayesian model at a 100 nM activity cutoff, yielded the following metrics for 5-fold cross validation: accuracy = 0.88, precision = 0.22, recall = 0.91, specificity = 0.88, kappa = 0.31, and MCC = 0.41. We have also curated an evaluation set (n = 153 compounds) published in 2017, and when used to test our model, it showed comparable statistics (accuracy = 0.83, precision = 0.27, recall = 1.00, specificity = 0.81, kappa = 0.36, and MCC = 0.47). We have also compared these models with additional machine learning algorithms, showing that Bayesian machine learning models constructed with literature Mtb data generated by different laboratories generally were equivalent to or outperformed deep neural networks with external test sets. Finally, we have also compared our training and test sets to show they were suitably diverse and different in order to represent useful evaluation sets. Such Mtb machine learning models could help prioritize compounds for testing in vitro and in vivo.
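The six statistics quoted in this abstract are all functions of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function name and the confusion-matrix counts are hypothetical (the abstract does not report the underlying counts), chosen only to show how a rare active class can give high accuracy and recall alongside low precision.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Return common binary-classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero; no guarding is done in this sketch.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called sensitivity
    specificity = tn / (tn + fp)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "kappa": kappa,
        "mcc": mcc,
    }


# Hypothetical imbalanced screen: 10 actives among 300 compounds.
imbalanced = classification_metrics(tp=9, fp=30, fn=1, tn=260)
for name, value in imbalanced.items():
    print(f"{name}: {value:.2f}")
```

Because false positives are counted against a very small number of true positives, precision falls well below accuracy in the imbalanced example, which is why kappa and MCC are often reported next to accuracy for screening data.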
Affiliation(s)
- Thomas Lane
  - Collaborations Pharmaceuticals, Inc., Main Campus Drive, Lab 3510 Raleigh, NC 27606, USA
  - Department of Biochemistry and Biophysics, University of North Carolina, Chapel Hill, NC 27599, USA
- Daniel P. Russo
  - Collaborations Pharmaceuticals, Inc., Main Campus Drive, Lab 3510 Raleigh, NC 27606, USA
  - The Rutgers Center for Computational and Integrative Biology, Camden, NJ, 08102, USA
- Kimberley M. Zorn
  - Collaborations Pharmaceuticals, Inc., Main Campus Drive, Lab 3510 Raleigh, NC 27606, USA
- Alex M. Clark
  - Molecular Materials Informatics, Inc., 1900 St. Jacques #302, Montreal H3J 2S1, Quebec, Canada
- Alexandru Korotcov
  - Science Data Software, LLC, 14914 Bradwill Court, Rockville, MD 20850, USA
- Valery Tkachenko
  - Science Data Software, LLC, 14914 Bradwill Court, Rockville, MD 20850, USA
- Robert C. Reynolds
  - Department of Medicine, Division of Hematology and Oncology, University of Alabama at Birmingham, NP 2540 J, 1720 2nd Avenue South, Birmingham, AL 35294-3300, USA
- Alexander L. Perryman
  - Department of Pharmacology, Physiology and Neuroscience, Rutgers University-New Jersey Medical School, Newark, New Jersey 07103, USA
- Joel S. Freundlich
  - Department of Pharmacology, Physiology and Neuroscience, Rutgers University-New Jersey Medical School, Newark, New Jersey 07103, USA
  - Division of Infectious Diseases, Department of Medicine, and the Ruy V. Lourenço Center for the Study of Emerging and Re-emerging Pathogens, Rutgers University–New Jersey Medical School, Newark, New Jersey 07103, USA
- Sean Ekins
  - Collaborations Pharmaceuticals, Inc., Main Campus Drive, Lab 3510 Raleigh, NC 27606, USA
8
McEachran AD, Mansouri K, Grulke C, Schymanski EL, Ruttkies C, Williams AJ. "MS-Ready" structures for non-targeted high-resolution mass spectrometry screening studies. J Cheminform 2018; 10:45. [PMID: 30167882 PMCID: PMC6117229 DOI: 10.1186/s13321-018-0299-2] [Citation(s) in RCA: 54] [Impact Index Per Article: 9.0] [Received: 05/16/2018] [Accepted: 08/21/2018] [Indexed: 02/05/2023] [Open Access]
Abstract
Chemical database searching has become a fixture in many non-targeted identification workflows based on high-resolution mass spectrometry (HRMS). However, the form of a chemical structure observed in HRMS does not always match the form stored in a database (e.g., the neutral form versus a salt; one component of a mixture rather than the mixture form used in a consumer product). Linking the form of a structure observed via HRMS to its related form(s) within a database will enable the return of all relevant variants of a structure, as well as the related metadata, in a single query. A Konstanz Information Miner (KNIME) workflow has been developed to produce structural representations observed using HRMS ("MS-Ready structures") and link them to those stored in a database. These MS-Ready structures, and associated mappings to the full chemical representations, are surfaced via the US EPA's Chemistry Dashboard (https://comptox.epa.gov/dashboard/). This article describes the workflow for the generation and linking of ~700,000 MS-Ready structures (derived from ~760,000 original structures) as well as download, search and export capabilities to serve structure identification using HRMS. The importance of this form of structural representation for HRMS is demonstrated with several examples, including integration with the in silico fragmentation software application MetFrag. The structures, search, download and export functionality are all available through the CompTox Chemistry Dashboard, while the MetFrag implementation can be viewed at https://msbi.ipb-halle.de/MetFragBeta/.
Affiliation(s)
- Andrew D. McEachran
  - Oak Ridge Institute for Science and Education (ORISE) Research Participation Program, U.S. Environmental Protection Agency, 109 T.W. Alexander Dr., Research Triangle Park, NC 27711 USA
  - National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Mail Drop D143-02, 109 T.W. Alexander Dr., Research Triangle Park, NC 27711 USA
- Kamel Mansouri
  - Oak Ridge Institute for Science and Education (ORISE) Research Participation Program, U.S. Environmental Protection Agency, 109 T.W. Alexander Dr., Research Triangle Park, NC 27711 USA
  - National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Mail Drop D143-02, 109 T.W. Alexander Dr., Research Triangle Park, NC 27711 USA
  - Present Address: Integrated Laboratory Systems, Inc., 601 Keystone Dr., Morrisville, NC 27650 USA
- Chris Grulke
  - National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Mail Drop D143-02, 109 T.W. Alexander Dr., Research Triangle Park, NC 27711 USA
- Emma L. Schymanski
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 6, avenue du Swing, 4367 Belvaux, Luxembourg
- Christoph Ruttkies
  - Department of Stress and Development Biology, Leibniz Institute of Plant Biochemistry (IPB), Weinberg 3, 06120 Halle (Saale), Germany
- Antony J. Williams
  - National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Mail Drop D143-02, 109 T.W. Alexander Dr., Research Triangle Park, NC 27711 USA
9
Brown N, Cambruzzi J, Cox PJ, Davies M, Dunbar J, Plumbley D, Sellwood MA, Sim A, Williams-Jones BI, Zwierzyna M, Sheppard DW. Big Data in Drug Discovery. Prog Med Chem 2018; 57:277-356. [PMID: 29680150 DOI: 10.1016/bs.pmch.2017.12.003] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Indexed: 12/02/2022]
Abstract
Interpretation of Big Data in the drug discovery community should enhance project timelines and reduce clinical attrition through improved early decision making. The issues we encounter start with the sheer volume of data and how we first ingest it before building an infrastructure to house it to make use of the data in an efficient and productive way. There are many problems associated with the data itself including general reproducibility, but often, it is the context surrounding an experiment that is critical to success. Help, in the form of artificial intelligence (AI), is required to understand and translate the context. On the back of natural language processing pipelines, AI is also used to prospectively generate new hypotheses by linking data together. We explain Big Data from the context of biology, chemistry and clinical trials, showcasing some of the impressive public domain sources and initiatives now available for interrogation.
Affiliation(s)
- Aaron Sim
  - BenevolentAI, London, United Kingdom
- Magdalena Zwierzyna
  - BenevolentAI, London, United Kingdom
  - Institute of Cardiovascular Science, University College London, London, United Kingdom
10
Willighagen EL, Mayfield JW, Alvarsson J, Berg A, Carlsson L, Jeliazkova N, Kuhn S, Pluskal T, Rojas-Chertó M, Spjuth O, Torrance G, Evelo CT, Guha R, Steinbeck C. The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. J Cheminform 2017; 9:33. [PMID: 29086040 PMCID: PMC5461230 DOI: 10.1186/s13321-017-0220-4] [Citation(s) in RCA: 210] [Impact Index Per Article: 30.0] [Received: 10/25/2016] [Accepted: 05/16/2017] [Indexed: 12/15/2022] [Open Access]
Abstract
Background: The Chemistry Development Kit (CDK) is a widely used open source cheminformatics toolkit, providing data structures to represent chemical concepts along with methods to manipulate such structures and perform computations on them. The library implements a wide variety of cheminformatics algorithms ranging from chemical structure canonicalization to molecular descriptor calculations and pharmacophore perception. It is used in drug discovery, metabolomics, and toxicology. Over the last 10 years, however, the code base has grown significantly, resulting in many complex interdependencies among components and poor performance of many algorithms. Results: We report improvements to the CDK v2.0 since the v1.2 release series, specifically addressing the increased functional complexity and poor performance. We first summarize the addition of new functionality, such as atom typing and molecular formula handling, and improvement to existing functionality that has led to significantly better performance for substructure searching, molecular fingerprints, and rendering of molecules. Second, we outline how the CDK has evolved with respect to quality control and the approaches we have adopted to ensure stability, including a code review mechanism. Conclusions: This paper highlights our continued efforts to provide a community driven, open source cheminformatics library, and shows that such collaborative projects can thrive over extended periods of time, resulting in a high-quality and performant library. By taking advantage of community support and contributions, we show that an open source cheminformatics project can act as a peer reviewed publishing platform for scientific computing software. Graphical abstract: CDK 2.0 provides new features and improved performance.
Affiliation(s)
- Egon L Willighagen
  - Department of Bioinformatics - BiGCaT, NUTRIM, Maastricht University, 6200 MD, Maastricht, The Netherlands
- Jonathan Alvarsson
  - Department of Pharmaceutical Biosciences, Uppsala University, 751 24, Uppsala, Sweden
- Arvid Berg
  - Department of Pharmaceutical Biosciences, Uppsala University, 751 24, Uppsala, Sweden
- Lars Carlsson
  - AstraZeneca, Innovative Medicines & Early Development, Quantitative Biology, Mölndal, Sweden
- Stefan Kuhn
  - Department of Informatics, University of Leicester, Leicester, UK
- Tomáš Pluskal
  - Whitehead Institute for Biomedical Research, 455 Main Street, Cambridge, MA, 02142, USA
- Ola Spjuth
  - Department of Pharmaceutical Biosciences, Uppsala University, 751 24, Uppsala, Sweden
- Chris T Evelo
  - Department of Bioinformatics - BiGCaT, NUTRIM, Maastricht University, 6200 MD, Maastricht, The Netherlands
- Rajarshi Guha
  - National Center for Advancing Translational Sciences, 9800 Medical Center Drive, Rockville, MD, 20850, USA
- Christoph Steinbeck
  - Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University, Lessingstr. 8, 07743, Jena, Germany
11
Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017; 117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 111] [Impact Index Per Article: 15.9] [Indexed: 01/03/2023]
Abstract
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
Affiliation(s)
- Martin Krallinger
  - Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, C/Melchor Fernández Almagro 3, Madrid E-28029, Spain
- Obdulia Rabal
  - Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Anália Lourenço
  - ESEI - Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, Ourense E-32004, Spain; Centro de Investigaciones Biomédicas (Centro Singular de Investigación de Galicia), Campus Universitario Lagoas-Marcosende, Vigo E-36310, Spain; CEB-Centre of Biological Engineering, University of Minho, Campus de Gualtar, Braga 4710-057, Portugal
- Julen Oyarzabal
  - Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Alfonso Valencia
  - Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), C/Jordi Girona, 29-31, Barcelona E-08034, Spain; Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona, C/ Baldiri Reixac 10, Barcelona E-08028, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig de Lluís Companys 23, Barcelona E-08010, Spain
|
12
|
Goldmann D, Zdrazil B, Digles D, Ecker GF. Empowering pharmacoinformatics by linked life science data. J Comput Aided Mol Des 2017; 31:319-328. [PMID: 27830428 PMCID: PMC5385323 DOI: 10.1007/s10822-016-9990-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2016] [Accepted: 10/24/2016] [Indexed: 11/11/2022]
Abstract
With the public availability of large data sources such as ChEMBLdb and the Open PHACTS Discovery Platform, retrieval of data sets for certain protein targets of interest with consistent assay conditions is no longer a time-consuming process. In particular, the use of workflow engines such as KNIME or Pipeline Pilot allows complex queries and makes it possible to search for several targets simultaneously. Data can then be used directly as input to various ligand- and structure-based studies. In this contribution, using in-house projects on P-gp inhibition, transporter selectivity, and TRPV1 modulation, we outline how the incorporation of linked life science data in the daily execution of projects allowed us to expand our approaches from conventional Hansch analysis to complex, integrated multilayer models.
Affiliation(s)
- Daria Goldmann
  - Department of Pharmaceutical Chemistry, University of Vienna, Althanstraße 14, 1090, Vienna, Austria
- Barbara Zdrazil
  - Department of Pharmaceutical Chemistry, University of Vienna, Althanstraße 14, 1090, Vienna, Austria
- Daniela Digles
  - Department of Pharmaceutical Chemistry, University of Vienna, Althanstraße 14, 1090, Vienna, Austria
- Gerhard F Ecker
  - Department of Pharmaceutical Chemistry, University of Vienna, Althanstraße 14, 1090, Vienna, Austria
|
13
|
Sommer K, Friedrich NO, Bietz S, Hilbig M, Inhester T, Rarey M. UNICON: A Powerful and Easy-to-Use Compound Library Converter. J Chem Inf Model 2016; 56:1105-11. [PMID: 27227368 DOI: 10.1021/acs.jcim.6b00069] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Indexed: 11/29/2022]
Abstract
The accurate handling of different chemical file formats and the consistent conversion between them play important roles for calculations in complex cheminformatics workflows. Working with different cheminformatic tools often makes the conversion between file formats a mandatory step. Such a conversion might become a difficult task in cases where the information content substantially differs. This paper describes UNICON, an easy-to-use software tool for this task. The functionality of UNICON ranges from file conversion between standard formats SDF, MOL2, SMILES, PDB, and PDBx/mmCIF via the generation of 2D structure coordinates and 3D structures to the enumeration of tautomeric forms, protonation states, and conformer ensembles. For this purpose, UNICON bundles the key elements of the previously described NAOMI library in a single, easy-to-use command line tool.
Affiliation(s)
- Kai Sommer
  - Center for Bioinformatics, Research Group for Computational Molecular Design, University of Hamburg, Bundesstraße 43, 20146 Hamburg, Germany
- Nils-Ole Friedrich
  - Center for Bioinformatics, Research Group for Computational Molecular Design, University of Hamburg, Bundesstraße 43, 20146 Hamburg, Germany
- Stefan Bietz
  - Center for Bioinformatics, Research Group for Computational Molecular Design, University of Hamburg, Bundesstraße 43, 20146 Hamburg, Germany
- Matthias Hilbig
  - Center for Bioinformatics, Research Group for Computational Molecular Design, University of Hamburg, Bundesstraße 43, 20146 Hamburg, Germany
- Therese Inhester
  - Center for Bioinformatics, Research Group for Computational Molecular Design, University of Hamburg, Bundesstraße 43, 20146 Hamburg, Germany
- Matthias Rarey
  - Center for Bioinformatics, Research Group for Computational Molecular Design, University of Hamburg, Bundesstraße 43, 20146 Hamburg, Germany
|
14
|
Ball N, Cronin MTD, Shen J, Blackburn K, Booth ED, Bouhifd M, Donley E, Egnash L, Hastings C, Juberg DR, Kleensang A, Kleinstreuer N, Kroese ED, Lee AC, Luechtefeld T, Maertens A, Marty S, Naciff JM, Palmer J, Pamies D, Penman M, Richarz AN, Russo DP, Stuard SB, Patlewicz G, van Ravenzwaay B, Wu S, Zhu H, Hartung T. Toward Good Read-Across Practice (GRAP) guidance. ALTEX 2016; 33:149-66. [PMID: 26863606 PMCID: PMC5581000 DOI: 10.14573/altex.1601251] [Citation(s) in RCA: 116] [Impact Index Per Article: 14.5] [Received: 01/21/2016] [Accepted: 02/11/2016] [Indexed: 12/04/2022]
Abstract
Grouping of substances and utilizing read-across of data within those groups represents an important data gap filling technique for chemical safety assessments. Categories/analogue groups are typically developed based on structural similarity and, increasingly often, also on mechanistic (biological) similarity. While read-across can play a key role in complying with legislation such as the European REACH regulation, the lack of consensus regarding the extent and type of evidence necessary to support it often hampers its successful application and acceptance by regulatory authorities. Despite a potentially broad user community, expertise is still concentrated across a handful of organizations and individuals. In order to facilitate the effective use of read-across, this document presents the state of the art, summarizes insights learned from reviewing ECHA published decisions regarding the relative successes/pitfalls surrounding read-across under REACH, and compiles the relevant activities and guidance documents. Special emphasis is given to the available existing tools and approaches, an analysis of ECHA's published final decisions associated with all levels of compliance checks and testing proposals, the consideration and expression of uncertainty, the use of biological support data, and the impact of the ECHA Read-Across Assessment Framework (RAAF) published in 2015.
Affiliation(s)
- Mark T D Cronin
  - School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, Liverpool, UK
- Jie Shen
  - Research Institute for Fragrance Materials, Inc., Woodcliff Lake, NJ, USA
- Ewan D Booth
  - Syngenta Ltd, Jealott's Hill International Research Centre, Bracknell, Berkshire, UK
- Mounir Bouhifd
  - Johns Hopkins Bloomberg School of Public Health, Center for Alternatives to Animal Testing (CAAT), Baltimore, MD, USA
- Laura Egnash
  - Stemina Biomarker Discovery Inc., Madison, WI, USA
- Charles Hastings
  - BASF SE, Ludwigshafen am Rhein, Germany, and Research Triangle Park, NC, USA
- Andre Kleensang
  - Johns Hopkins Bloomberg School of Public Health, Center for Alternatives to Animal Testing (CAAT), Baltimore, MD, USA
- Nicole Kleinstreuer
  - National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, National Institute of Environmental Health Sciences, Research Triangle Park, NC, USA
- E Dinant Kroese
  - Risk Analysis for Products in Development, TNO Zeist, The Netherlands
- Adam C Lee
  - DuPont Haskell Global Centers for Health and Environmental Sciences, Newark, DE, USA
- Thomas Luechtefeld
  - Johns Hopkins Bloomberg School of Public Health, Center for Alternatives to Animal Testing (CAAT), Baltimore, MD, USA
- Alexandra Maertens
  - Johns Hopkins Bloomberg School of Public Health, Center for Alternatives to Animal Testing (CAAT), Baltimore, MD, USA
- Sue Marty
  - The Dow Chemical Company, Midland, MI, USA
- David Pamies
  - Johns Hopkins Bloomberg School of Public Health, Center for Alternatives to Animal Testing (CAAT), Baltimore, MD, USA
- Andrea-Nicole Richarz
  - School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, Liverpool, UK
- Daniel P Russo
  - Department of Chemistry and Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, USA
- Grace Patlewicz
  - US EPA/ORD, National Center for Computational Toxicology, Research Triangle Park, NC, USA
- Shengde Wu
  - The Procter and Gamble Co., Cincinnati, OH, USA
- Hao Zhu
  - Department of Chemistry and Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, USA
- Thomas Hartung
  - Johns Hopkins Bloomberg School of Public Health, Center for Alternatives to Animal Testing (CAAT), Baltimore, MD, USA; University of Konstanz, CAAT-Europe, Konstanz, Germany
|
15
|
Hersey A, Chambers J, Bellis L, Patrícia Bento A, Gaulton A, Overington JP. Chemical databases: curation or integration by user-defined equivalence? Drug Discov Today Technol 2015; 14:17-24. [PMID: 26194583 PMCID: PMC6294287 DOI: 10.1016/j.ddtec.2015.01.005] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Received: 08/20/2014] [Revised: 01/15/2015] [Accepted: 01/16/2015] [Indexed: 11/30/2022]
Abstract
There is a wealth of valuable chemical information in publicly available databases for use by scientists undertaking drug discovery. However, finite curation resources, limitations of chemical structure software, and differences in individual database applications mean that exact chemical structure equivalence between databases is unlikely to ever be a reality. The ability to identify compound equivalence has been made significantly easier by the use of the International Chemical Identifier (InChI), a non-proprietary line notation for describing a chemical structure. More importantly, advances in methods to identify compounds that are the same at various levels of similarity, such as those containing the same parent component or having the same connectivity, are now enabling related compounds to be linked between databases where the structure matches are not exact.
Affiliation(s)
- Anne Hersey
  - European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- Jon Chambers
  - European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- Louisa Bellis
  - European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- A Patrícia Bento
  - European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- Anna Gaulton
  - European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- John P Overington
  - European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
|