1
Blakey M, Pearman-Kanza S, Frey JG. Zombie cheminformatics: extraction and conversion of Wiswesser Line Notation (WLN) from chemical documents. J Cheminform 2024; 16:42. [PMID: 38622746 PMCID: PMC11017645 DOI: 10.1186/s13321-024-00831-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Accepted: 03/23/2024] [Indexed: 04/17/2024] Open
Abstract
PURPOSE Wiswesser Line Notation (WLN) is an old line notation for encoding chemical compounds for storage and processing by computers. Whilst the notation itself has long since been surpassed by SMILES and InChI, distribution of WLN during its active years was extensive. In the context of modernising chemical data, we present a comprehensive WLN parser developed using the OpenBabel toolkit, capable of translating WLN strings into various formats supported by the library. Furthermore, we have devised a specialised Finite State Machine, constructed from the rules of WLN, enabling the recognition and extraction of chemical strings out of large bodies of text. Available open-access WLN data with corresponding SMILES or InChI notation is rare; however, ChEMBL, ChemSpider and PubChem all contain WLN records, which were used for conversion scoring. Our investigation revealed a notable proportion of inaccuracies within the database entries, and we have taken steps to rectify these errors whenever feasible. SCIENTIFIC CONTRIBUTION Tools for both the extraction and conversion of WLN from chemical documents have been successfully developed. Both the Deterministic Finite Automaton (DFA) and parser handle the majority of WLN rules officially endorsed in the three major WLN manuals, with the parser showing a clear jump in accuracy and chemical coverage over previous submissions. The GitHub repository can be found here: https://github.com/Mblakey/wiswesser .
Affiliation(s)
- Michael Blakey
  - Department of Chemistry, University of Southampton, University Road, Southampton, Hampshire, SO17 1BJ, UK
- Samantha Pearman-Kanza
  - Department of Chemistry, University of Southampton, University Road, Southampton, Hampshire, SO17 1BJ, UK
- Jeremy G Frey
  - Department of Chemistry, University of Southampton, University Road, Southampton, Hampshire, SO17 1BJ, UK
2
Manelfi C, Tazzari V, Lunghini F, Cerchia C, Fava A, Pedretti A, Stouten PFW, Vistoli G, Beccari AR. "DompeKeys": a set of novel substructure-based descriptors for efficient chemical space mapping, development and structural interpretation of machine learning models, and indexing of large databases. J Cheminform 2024; 16:21. [PMID: 38395961 PMCID: PMC10893756 DOI: 10.1186/s13321-024-00813-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2023] [Accepted: 02/10/2024] [Indexed: 02/25/2024] Open
Abstract
The conversion of chemical structures into computer-readable descriptors, able to capture key structural aspects, is of pivotal importance in the field of cheminformatics and computer-aided drug design. Molecular fingerprints represent a widely employed class of descriptors; however, their generation process is time-consuming for large databases and is susceptible to bias. Therefore, descriptors able to accurately detect predefined structural fragments and devoid of lengthy generation procedures would be highly desirable. To meet additional needs, such descriptors should also be interpretable by medicinal chemists, and suitable for indexing databases with trillions of compounds. To this end, we developed, as an integral part of EXSCALATE, Dompé's end-to-end drug discovery platform, the DompeKeys (DK), a new substructure-based descriptor set, which encodes the chemical features that characterize compounds of pharmaceutical interest. DK represent an exhaustive collection of curated SMARTS strings, defining chemical features at different levels of complexity, from specific functional groups and structural patterns to simpler pharmacophoric points, corresponding to a network of hierarchically interconnected substructures. Because of their extended and hierarchical structure, DK can be used, with good performance, in different kinds of applications. In particular, we demonstrate how they are very well suited for effective mapping of chemical space, as well as substructure search and virtual screening. Notably, the incorporation of DK yields highly performing machine learning models for the prediction of both compounds' activity and metabolic reaction occurrence.
The protocol to generate the DK is freely available at https://dompekeys.exscalate.eu and is fully integrated with the Molecular Anatomy protocol for the generation and analysis of hierarchically interconnected molecular scaffolds and frameworks, thus providing a comprehensive and flexible tool for drug design applications.
Affiliation(s)
- Candida Manelfi
  - EXSCALATE, Dompé Farmaceutici SpA, Via Tommaso de Amicis 95, 80123, Napoli, Italy
- Valerio Tazzari
  - EXSCALATE, Dompé Farmaceutici SpA, Via Tommaso de Amicis 95, 80123, Napoli, Italy
- Filippo Lunghini
  - EXSCALATE, Dompé Farmaceutici SpA, Via Tommaso de Amicis 95, 80123, Napoli, Italy
- Carmen Cerchia
  - Department of Pharmacy, University of Naples "Federico II", Via D. Montesano 49, 80131, Napoli, Italy
- Anna Fava
  - EXSCALATE, Dompé Farmaceutici SpA, Via Tommaso de Amicis 95, 80123, Napoli, Italy
- Alessandro Pedretti
  - Dipartimento di Scienze Farmaceutiche, Università degli Studi di Milano, Via Mangiagalli, 25, 20133, Milano, Italy
- Pieter F W Stouten
  - EXSCALATE, Dompé Farmaceutici SpA, Via Tommaso de Amicis 95, 80123, Napoli, Italy
  - Stouten Pharma Consultancy BV, Kempenarestraat 47, 2860, Sint-Katelijne-Waver, Belgium
- Giulio Vistoli
  - Dipartimento di Scienze Farmaceutiche, Università degli Studi di Milano, Via Mangiagalli, 25, 20133, Milano, Italy
3
Liu Z, Zubatiuk T, Roitberg A, Isayev O. Auto3D: Automatic Generation of the Low-Energy 3D Structures with ANI Neural Network Potentials. J Chem Inf Model 2022; 62:5373-5382. [PMID: 36112860 DOI: 10.1021/acs.jcim.2c00817] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Indexed: 12/13/2022]
Abstract
Computational programs accelerate the chemical discovery process but often need proper three-dimensional molecular information as part of the input. Getting optimal molecular structures is challenging because it requires enumerating and optimizing a huge space of stereoisomers and conformers. We developed the Python-based Auto3D package for generating low-energy 3D structures using SMILES as the input. Auto3D is based on state-of-the-art algorithms and can automate the isomer enumeration and duplicate filtering process, 3D building process, geometry optimization, and ranking process. Tested on 50 molecules with multiple unspecified stereocenters, Auto3D is guaranteed to find the stereoconfiguration that yields the lowest-energy conformer. With Auto3D, we provide an extension of the ANI model. The new model, dubbed ANI-2xt, is trained on a tautomer-rich data set. ANI-2xt is benchmarked with DFT methods on geometry optimization and electronic and Gibbs free energy calculations. Compared with ANI-2x, ANI-2xt provides a 42% error reduction for tautomeric reaction energy calculations when using the gold-standard coupled-cluster calculation as the reference. ANI-2xt can accurately predict the energies and is several orders of magnitude faster than DFT methods.
Affiliation(s)
- Zhen Liu
  - Department of Chemistry, Mellon College of Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
- Tetiana Zubatiuk
  - Department of Chemistry, Mellon College of Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
- Adrian Roitberg
  - Department of Chemistry, University of Florida, Gainesville, Florida 32611, United States
- Olexandr Isayev
  - Department of Chemistry, Mellon College of Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
4
Krenn M, Ai Q, Barthel S, Carson N, Frei A, Frey NC, Friederich P, Gaudin T, Gayle AA, Jablonka KM, Lameiro RF, Lemm D, Lo A, Moosavi SM, Nápoles-Duarte JM, Nigam A, Pollice R, Rajan K, Schatzschneider U, Schwaller P, Skreta M, Smit B, Strieth-Kalthoff F, Sun C, Tom G, Falk von Rudorff G, Wang A, White AD, Young A, Yu R, Aspuru-Guzik A. SELFIES and the future of molecular string representations. PATTERNS (NEW YORK, N.Y.) 2022; 3:100588. [PMID: 36277819 PMCID: PMC9583042 DOI: 10.1016/j.patter.2022.100588] [Citation(s) in RCA: 42] [Impact Index Per Article: 21.0] [Indexed: 11/06/2022]
Abstract
Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings; most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELF-referencing embedded strings (SELFIES). SELFIES has since simplified and enabled numerous new applications in chemistry. In this perspective, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete future projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages, and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.
Affiliation(s)
- Mario Krenn
  - Max Planck Institute for the Science of Light (MPL), Erlangen, Germany
- Qianxiang Ai
  - Department of Chemistry, Fordham University, The Bronx, NY, USA
- Senja Barthel
  - Department of Mathematics, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands
- Nessa Carson
  - Syngenta Jealott’s Hill International Research Centre, Bracknell, Berkshire, UK
- Angelo Frei
  - Department of Chemistry, Imperial College London, Molecular Sciences Research Hub, White City Campus, Wood Lane, London, UK
- Nathan C. Frey
  - Massachusetts Institute of Technology, Cambridge, MA, USA
- Pascal Friederich
  - Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
  - Institute of Nanotechnology, Karlsruhe Institute of Technology, Eggenstein-Leopoldshafen, Germany
- Théophile Gaudin
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
  - IBM Research Europe, Zürich, Switzerland
- Kevin Maik Jablonka
  - Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL), Sion, Valais, Switzerland
- Rafael F. Lameiro
  - Medicinal and Biological Chemistry Group, São Carlos Institute of Chemistry, University of São Paulo, São Paulo, Brazil
- Dominik Lemm
  - Faculty of Physics, University of Vienna, Vienna, Austria
- Alston Lo
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Seyed Mohamad Moosavi
  - Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
- AkshatKumar Nigam
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Robert Pollice
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
  - Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, ON, Canada
- Kohulan Rajan
  - Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller Universität Jena, Jena, Germany
- Ulrich Schatzschneider
  - Institut für Anorganische Chemie, Julius-Maximilians-Universität Würzburg, Würzburg, Germany
- Philippe Schwaller
  - IBM Research Europe, Zürich, Switzerland
  - Laboratory of Artificial Chemical Intelligence (LIAC), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - National Centre of Competence in Research (NCCR) Catalysis, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- Marta Skreta
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
  - Vector Institute for Artificial Intelligence, Toronto, ON, Canada
- Berend Smit
  - Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL), Sion, Valais, Switzerland
- Felix Strieth-Kalthoff
  - Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, ON, Canada
- Chong Sun
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Gary Tom
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
  - Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, ON, Canada
- Andrew Wang
  - Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, ON, Canada
  - Solar Fuels Group, Department of Chemistry, University of Toronto, Toronto, ON, Canada
- Andrew D. White
  - Department of Chemical Engineering, University of Rochester, Rochester, NY, USA
- Adamo Young
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
  - Vector Institute for Artificial Intelligence, Toronto, ON, Canada
- Rose Yu
  - Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
- Alán Aspuru-Guzik
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
  - Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, ON, Canada
  - Vector Institute for Artificial Intelligence, Toronto, ON, Canada
  - Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, ON, Canada
  - Department of Materials Science, University of Toronto, Toronto, ON, Canada
  - Canadian Institute for Advanced Research (CIFAR) Lebovic Fellow, Toronto, ON, Canada
5
Tomczak J, Herzog E, Fischer M, Swienty-Busch J, van den Broek F, Whittick G, Kappler M, Jones B, Blanke G. UDM (Unified Data Model) for chemical reactions – past, present and future. PURE APPL CHEM 2022. [DOI: 10.1515/pac-2021-3013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
The UDM (Unified Data Model) is an open, extendable and freely available data format for the exchange of experimental information about compound synthesis and testing. The UDM was initially developed in a collaborative project between Elsevier and Roche, in which chemical reaction data from a variety of disparate data sources at Roche was consolidated and integrated into the Roche in-house version of the Reaxys database. Elsevier adapted the UDM model to its needs and finally donated its pre-4.0 release to the Pistoia Alliance for further development together with the five project founders (Elsevier, Roche, BIOVIA, GSK and Novartis, joined later by BMS), who contributed funding and expertise to the Pistoia Alliance UDM project between 2017 and 2020. The latest UDM version, 6.0, was made freely available to the community under the MIT license in January 2021. This article discusses the past, present, and future of the UDM exchange format, along with the factors that contribute to its successful adoption.
Affiliation(s)
- Elena Herzog
  - Elsevier Information Systems GmbH, Reaxys, Frankfurt am Main, Germany
- Markus Fischer
  - Elsevier Information Systems GmbH, Reaxys, Frankfurt am Main, Germany
- Michael Kappler
  - Moonlight Informatics & Computing Knowledge LLC, Foster City, CA, USA
- Brian Jones
  - F. Hoffmann-La Roche AG, Basel, Switzerland
- Gerd Blanke
  - StructurePendium Technologies GmbH, Essen, Germany
6
Dolciami D, Villasclaras-Fernandez E, Kannas C, Meniconi M, Al-Lazikani B, Antolin AA. canSAR chemistry registration and standardization pipeline. J Cheminform 2022; 14:28. [PMID: 35643512 PMCID: PMC9148294 DOI: 10.1186/s13321-022-00606-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 02/04/2022] [Accepted: 04/04/2022] [Indexed: 11/10/2022] Open
Abstract
Background
Integration of medicinal chemistry data from numerous public resources is an increasingly important part of academic drug discovery and translational research because it can bring a wealth of important knowledge related to compounds into one place. However, different data sources can report the same or related compounds in various forms (e.g., tautomers, racemates, etc.), highlighting the need to organise related compounds in hierarchies that alert the user to important bioactivity data that may be relevant. To generate these compound hierarchies, we have developed and implemented canSARchem, a new compound registration and standardization pipeline, as part of the canSAR public knowledgebase. canSARchem builds on previously developed ChEMBL and PubChem pipelines and is developed using KNIME. We describe the pipeline, which we make publicly available, and we provide examples of the strengths and limitations of the use of hierarchies for bioactivity data exploration. Finally, we identify canonicalization enrichment in FDA-approved drugs, illustrating the benefits of our approach.
Results
We created a chemical registration and standardization pipeline in KNIME and made it freely available to the research community. The pipeline consists of five steps to register the compounds and create the compounds’ hierarchy: 1. Structure checker, 2. Standardization, 3. Generation of canonical tautomers and representative structures, 4. Salt strip, and 5. Generation of abstract structure to generate the compound hierarchy. Unlike ChEMBL’s RDKit pipeline, we carry out compound canonicalization ahead of getting the parent structure, similar to PubChem’s OpenEye pipeline. canSARchem has a lower rejection rate compared to both PubChem and ChEMBL. We use our pipeline to assess the impact of grouping the compounds in hierarchies for bioactivity data exploration. We find that FDA-approved drugs show statistically significant sensitivity to canonicalization compared to the majority of bioactive compounds which demonstrates the importance of this step.
Conclusions
We use canSARchem to standardize all the compounds uploaded in canSAR (> 3 million), enabling efficient data integration and the rapid identification of alternative compound forms with useful bioactivity data. A comparison with the PubChem and ChEMBL pipelines showed comparable performance in compound standardization, but only PubChem and canSAR canonicalize tautomers, and canSAR has a slightly lower rejection rate. Our results highlight the importance of compound hierarchies for bioactivity data exploration. We make canSARchem available under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0) at https://gitlab.icr.ac.uk/cansar-public/compound-registration-pipeline.
7
Yang Q, Liu Y, Cheng J, Li Y, Liu S, Duan Y, Zhang L, Luo S. An Ensemble Structure and Physiochemical (SPOC) Descriptor for Machine-Learning Prediction of Chemical Reaction and Molecular Properties. Chemphyschem 2022; 23:e202200255. [PMID: 35478429 DOI: 10.1002/cphc.202200255] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 04/14/2022] [Indexed: 11/08/2022]
Abstract
Feature representations, or descriptors, are the machines' chemical language and largely shape the prediction capability, generalizability and interpretability of machine learning models. A generally applicable descriptor is highly desirable for chemists dealing with conventional prediction tasks on sparsely distributed and small datasets. Inspired by the chemist's view of molecules, we present herein an ensemble descriptor, SPOC, curated on the principles of physical organic chemistry, that integrates the Structure and Physicochemical property (SPOC) of a molecule. SPOC can be readily constructed by combining molecular fingerprints, representing the structure of a given molecule, with molecular physicochemical properties extracted from RDKit or Mordred molecular descriptors. The applicability of SPOC was surveyed across a range of well-structured chemical databases with machine learning tasks varying from regression to classification.
Affiliation(s)
- Qi Yang
  - Tsinghua University, CBMS, Department of Chemistry, China
- Yidi Liu
  - Tsinghua University, CBMS, Department of Chemistry, China
- Junjie Cheng
  - Tsinghua University, CBMS, Department of Chemistry, China
- Yao Li
  - Tsinghua University, CBMS, Department of Chemistry, China
- Siyuan Liu
  - Tsinghua University, CBMS, Department of Chemistry, China
- Yingdong Duan
  - Tsinghua University, CBMS, Department of Chemistry, China
- Long Zhang
  - Tsinghua University, CBMS, Department of Chemistry, China
- Sanzhong Luo
  - Tsinghua University, Department of Chemistry, Tsinghua University, 100084, Beijing, China
8
Costanzi S, Slavick CK, Abides JM, Koblentz GD, Vecellio M, Cupitt RT. Supporting the fight against the proliferation of chemical weapons through cheminformatics. PURE APPL CHEM 2022. [DOI: 10.1515/pac-2021-1107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
International frameworks have been put in place to foster chemical weapons nonproliferation and disarmament. These frameworks feature lists of chemicals that can be used as chemical weapons or precursors for their synthesis (CW-control lists). In these lists, chemicals of concern are described through chemical names and CAS Registry Numbers®. Importantly, in some CW-control lists, some entries, rather than specifying individual chemicals, describe families of related chemicals. Working with CW-control lists poses challenges for frontline customs and export control officers implementing these frameworks. Entries that describe families of chemicals are not easy to interpret, especially for non-chemists. Moreover, synonyms and chemical variants complicate the issue of checking CW-control lists through names and registry numbers. To ameliorate these problems, we have developed a functioning prototype of a cheminformatics tool that automates the task of assessing whether a chemical is part of a CW-control list. The tool, dubbed the Nonproliferation Cheminformatics Compliance Tool (NCCT), is a database management system (based on ChemAxon’s Instant JChem) with an embedded database of chemical structures. The key feature of the database is that it contains not only the structures of the individually listed chemicals, but also the generic structures that describe the entries relative to families of chemicals (Markush structures).
Affiliation(s)
- Stefano Costanzi
  - Department of Chemistry, American University, 4400 Massachusetts Avenue, NW, Washington, DC 20016, USA
- Charlotte K. Slavick
  - Department of Chemistry, American University, 4400 Massachusetts Avenue, NW, Washington, DC 20016, USA
- Joyce M. Abides
  - Department of Chemistry, American University, 4400 Massachusetts Avenue, NW, Washington, DC 20016, USA
- Gregory D. Koblentz
  - Schar School of Policy and Government, George Mason University, 3351 Fairfax Drive, Arlington, VA 22201, USA
- Mary Vecellio
  - The Henry L. Stimson Center, 1211 Connecticut Ave, NW, Washington, DC 20036, USA
- Richard T. Cupitt
  - The Henry L. Stimson Center, 1211 Connecticut Ave, NW, Washington, DC 20036, USA
9
Wigh DS, Goodman JM, Lapkin AA. A review of molecular representation in the age of machine learning. WIRES COMPUTATIONAL MOLECULAR SCIENCE 2022. [DOI: 10.1002/wcms.1603] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 01/08/2023]
Affiliation(s)
- Daniel S. Wigh
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge, UK
- Alexei A. Lapkin
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge, UK
10
Bisht N, Sah AN, Bisht S, Joshi H. Emerging Need of Today: Significant Utilization of Various Databases and Softwares in Drug Design and Development. Mini Rev Med Chem 2021; 21:1025-1032. [PMID: 33319657 DOI: 10.2174/1389557520666201214101329] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/29/2020] [Revised: 10/05/2020] [Accepted: 10/09/2020] [Indexed: 11/22/2022]
Abstract
In drug discovery, in silico methods have become a very important part of the process. These approaches impact the entire development process by discovering and identifying new target proteins as well as designing potential ligands with a significant reduction of time and cost. Furthermore, in silico approaches are also preferred because they reduce the experimental use of animals in in vivo testing and support safer drug design and the repositioning of known drugs. Novel software-based discovery and development methods such as direct/indirect drug design, molecular modelling, docking, screening, drug-receptor interaction, and molecular simulation studies are very important tools for predicting ligand-target interaction patterns as well as the pharmacodynamic and pharmacokinetic properties of ligands. On the other hand, the computational approaches can be numerous, requiring interdisciplinary studies and the application of advanced computer technology to design effective and commercially feasible drugs. This review mainly focuses on the various databases and software used in drug design and development to speed up the process.
Affiliation(s)
- Neema Bisht
  - Assistant Professor, College of Pharmacy, Graphic Era Hill University, Bhimtal Campus, Sattal Road, Bhimtal, Uttarakhand 263136, India
- Archana N Sah
  - Head and Dean, Department of Pharmaceutical Sciences, Faculty of Technology, Sir J.C. Bose Technical Campus, Bhimtal, Kumaun University Nainital, Uttarakhand 263136, India
- Sandeep Bisht
  - Assistant Professor, School of Management, Graphic Era Hill University, Bhimtal Campus, Sattal Road, Bhimtal, Uttarakhand 263136, India
- Himanshu Joshi
  - Professor, College of Pharmacy, Graphic Era Hill University, Bhimtal Campus, Sattal Road, Bhimtal, Uttarakhand 263136, India
11
Nayarisseri A, Khandelwal R, Madhavi M, Selvaraj C, Panwar U, Sharma K, Hussain T, Singh SK. Shape-based Machine Learning Models for the Potential Novel COVID-19 Protease Inhibitors Assisted by Molecular Dynamics Simulation. Curr Top Med Chem 2020; 20:2146-2167. [PMID: 32621718 DOI: 10.2174/1568026620666200704135327] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 02/25/2020] [Revised: 03/20/2020] [Accepted: 04/25/2020] [Indexed: 12/17/2022]
Abstract
BACKGROUND The vast geographical expansion of novel coronavirus and an increasing number of COVID-19 affected cases have overwhelmed health and public health services. Artificial Intelligence (AI) and Machine Learning (ML) algorithms have extended their major role in tracking disease patterns, and in identifying possible treatments. OBJECTIVE This study aims to identify potential COVID-19 protease inhibitors through shape-based Machine Learning assisted by Molecular Docking and Molecular Dynamics simulations. METHODS Thirty-one repurposed compounds were selected targeting the main coronavirus protease (6LU7), and a machine learning approach was employed to generate shape-based molecules starting from the 3D shape to the pharmacophoric features of their seed compound. Ligand-Receptor Docking was performed with Optimized Potential for Liquid Simulations (OPLS) algorithms to identify high-affinity compounds from the list of selected candidates for 6LU7, which were subjected to Molecular Dynamics simulations followed by ADMET studies and other analyses. RESULTS Shape-based Machine Learning reported remdesivir, valrubicin, aprepitant, and fulvestrant as the best therapeutic agents with the highest affinity for the target protein. Among the best shape-based compounds, a novel compound identified was not indexed in any chemical databases (PubChem, Zinc, or ChEMBL). Hence, the novel compound was named 'nCorv-EMBS'. Further, toxicity analysis showed nCorv-EMBS to be suitable for further consideration as the main protease inhibitor in COVID-19. CONCLUSION Effective ACE-II, GAK, AAK1, and protease 3C blockers can serve as a novel therapeutic approach to block the binding and attachment of the main COVID-19 protease (PDB ID: 6LU7) to the host cell and thus inhibit the infection at AT2 receptors in the lung. The novel compound nCorv-EMBS herein proposed stands as a promising inhibitor to be evaluated further for COVID-19 treatment.
Affiliation(s)
- Anuraj Nayarisseri
  - In silico Research Laboratory, Eminent Biosciences, Mahalakshmi Nagar, Indore-452010, Madhya Pradesh, India
  - Bioinformatics Research Laboratory, LeGene Biosciences Pvt Ltd., Mahalakshmi Nagar, Indore-452010, Madhya Pradesh, India
  - Research Chair for Biomedical Applications of Nanomaterials, Biochemistry Department, College of Science, King Saud University, Riyadh, Saudi Arabia
  - Computer Aided Drug Designing and Molecular Modeling Lab, Department of Bioinformatics, Alagappa University, Karaikudi-630 003, Tamil Nadu, India
- Ravina Khandelwal
  - In silico Research Laboratory, Eminent Biosciences, Mahalakshmi Nagar, Indore-452010, Madhya Pradesh, India
- Maddala Madhavi
  - Department of Zoology, Nizam College, Osmania University, Hyderabad-500001, Telangana State, India
- Chandrabose Selvaraj
  - Computer Aided Drug Designing and Molecular Modeling Lab, Department of Bioinformatics, Alagappa University, Karaikudi-630 003, Tamil Nadu, India
- Umesh Panwar
  - Computer Aided Drug Designing and Molecular Modeling Lab, Department of Bioinformatics, Alagappa University, Karaikudi-630 003, Tamil Nadu, India
- Khushboo Sharma
  - In silico Research Laboratory, Eminent Biosciences, Mahalakshmi Nagar, Indore-452010, Madhya Pradesh, India
- Tajamul Hussain
  - Center of Excellence in Biotechnology Research, College of Science, King Saud University, Riyadh, Saudi Arabia
  - Research Chair for Biomedical Applications of Nanomaterials, Biochemistry Department, College of Science, King Saud University, Riyadh, Saudi Arabia
- Sanjeev Kumar Singh
  - Computer Aided Drug Designing and Molecular Modeling Lab, Department of Bioinformatics, Alagappa University, Karaikudi-630 003, Tamil Nadu, India
Collapse
|
12
|
David L, Thakkar A, Mercado R, Engkvist O. Molecular representations in AI-driven drug discovery: a review and practical guide. J Cheminform 2020; 12:56. [PMID: 33431035 PMCID: PMC7495975 DOI: 10.1186/s13321-020-00460-5] [Citation(s) in RCA: 165] [Impact Index Per Article: 41.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2020] [Accepted: 09/05/2020] [Indexed: 02/08/2023] Open
Abstract
The technological advances of the past century, marked by the computer revolution and the advent of high-throughput screening technologies in drug discovery, opened the path to the computational analysis and visualization of bioactive molecules. For this purpose, it became necessary to represent molecules in a syntax that would be readable by computers and understandable by scientists of various fields. A large number of chemical representations have been developed over the years, their numerosity being due to the fast development of computers and the complexity of producing a representation that encompasses all structural and chemical characteristics. We present here some of the most popular electronic molecular and macromolecular representations used in drug discovery, many of which are based on graph representations. Furthermore, we describe applications of these representations in AI-driven drug discovery. Our aim is to provide a brief guide on structural representations that are essential to the practice of AI in drug discovery. This review serves as a guide for researchers who have little experience with the handling of chemical representations and plan to work on applications at the interface of these fields.
Affiliation(s)
- Laurianne David
  - Hit Discovery, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca Gothenburg, Sweden
- Amol Thakkar
  - Hit Discovery, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca Gothenburg, Sweden
  - Department of Chemistry and Biochemistry, University of Bern, Bern, Switzerland
- Rocío Mercado
  - Hit Discovery, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca Gothenburg, Sweden
- Ola Engkvist
  - Hit Discovery, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca Gothenburg, Sweden
13
Jablonka K, Ongari D, Moosavi SM, Smit B. Big-Data Science in Porous Materials: Materials Genomics and Machine Learning. Chem Rev 2020; 120:8066-8129. [PMID: 32520531 PMCID: PMC7453404 DOI: 10.1021/acs.chemrev.0c00004] [Citation(s) in RCA: 155] [Impact Index Per Article: 38.8] [Received: 01/03/2020] [Indexed: 12/16/2022]
Abstract
By combining metal nodes with organic linkers we can potentially synthesize millions of possible metal-organic frameworks (MOFs). The fact that we have so many materials opens many exciting avenues but also creates new challenges. We simply have too many materials to be processed using conventional, brute-force methods. In this review, we show that having so many materials allows us to use big-data methods as a powerful technique to study these materials and to discover complex correlations. The first part of the review gives an introduction to the principles of big-data science. We show how to select appropriate training sets, survey approaches that are used to represent these materials in feature space, and review different learning architectures, as well as evaluation and interpretation strategies. In the second part, we review how the different approaches of machine learning have been applied to porous materials. In particular, we discuss applications in the field of gas storage and separation, the stability of these materials, their electronic properties, and their synthesis. Given the increasing interest of the scientific community in machine learning, we expect this list to rapidly expand in the coming years.
Affiliation(s)
- Kevin Maik Jablonka
  - Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques (ISIC), École Polytechnique Fédérale de Lausanne (EPFL), Sion, Switzerland
- Daniele Ongari
  - Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques (ISIC), École Polytechnique Fédérale de Lausanne (EPFL), Sion, Switzerland
- Seyed Mohamad Moosavi
  - Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques (ISIC), École Polytechnique Fédérale de Lausanne (EPFL), Sion, Switzerland
- Berend Smit
  - Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques (ISIC), École Polytechnique Fédérale de Lausanne (EPFL), Sion, Switzerland
14
The Literature of Chemoinformatics: 1978-2018. Int J Mol Sci 2020; 21:ijms21155576. [PMID: 32759729 PMCID: PMC7432360 DOI: 10.3390/ijms21155576] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 07/17/2020] [Revised: 07/30/2020] [Accepted: 07/31/2020] [Indexed: 11/16/2022] Open
Abstract
This article presents a study of the literature of chemoinformatics, updating and building upon an analogous bibliometric investigation that was published in 2008. Data on outputs in the field, and citations to those outputs, were obtained by means of topic searches of the Web of Science Core Collection. The searches demonstrate that chemoinformatics is by now a well-defined sub-discipline of chemistry, and one that forms an essential part of the chemical educational curriculum. There are three core journals for the subject: The Journal of Chemical Information and Modeling, the Journal of Cheminformatics, and Molecular Informatics, and, having established itself, chemoinformatics is now starting to export knowledge to disciplines outside of chemistry.
15
Voicu A, Duteanu N, Voicu M, Vlad D, Dumitrascu V. The rcdk and cluster R packages applied to drug candidate selection. J Cheminform 2020; 12:3. [PMID: 33430987 PMCID: PMC6970292 DOI: 10.1186/s13321-019-0405-0] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 01/08/2019] [Accepted: 12/20/2019] [Indexed: 11/10/2022] Open
Abstract
The aim of this article is to show how the power of statistics and cheminformatics can be combined, in R, using two packages: rcdk and cluster. We describe the role of clustering methods for identifying similar structures in a group of 23 molecules according to their fingerprints. The most commonly used method is to group the molecules using a "score" obtained by measuring the average distance between them. This score reflects the similarity/non-similarity between compounds and helps us identify active or potentially toxic substances through predictive studies. Clustering is the process by which the common characteristics of a particular class of compounds are identified. For clustering applications, we generally measure the molecular fingerprint similarity with the Tanimoto coefficient. Based on the molecular fingerprints, we calculated the molecular distances between the methotrexate molecule and the other 23 molecules in the group, and organized them into a matrix. According to the molecular distances and Ward's method, the molecules were grouped into 3 clusters. We can presume structural similarity between the compounds and their locations in the cluster map. Because only 5 molecules were included in the methotrexate cluster, we considered that they might have similar properties and might be further tested as potential drug candidates.
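The fingerprint comparison this abstract describes can be illustrated with a minimal sketch. It is written in Python rather than the R packages (rcdk, cluster) the article uses, and it assumes each fingerprint is represented as a set of "on" bit indices. The function names and the toy fingerprints are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient |A & B| / |A | B| for two sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        # Convention assumed here: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / union


def distance_matrix(fingerprints: list) -> list:
    """Symmetric matrix of Tanimoto distances (1 - similarity)."""
    n = len(fingerprints)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = 1.0 - tanimoto(fingerprints[i], fingerprints[j])
        dist[i][j] = dist[j][i] = d
    return dist


# Hypothetical toy fingerprints (made-up bit positions, not real molecules).
fingerprints = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}]
dist = distance_matrix(fingerprints)
```

A matrix of this kind is what a hierarchical clustering routine would consume; the clustering step itself (Ward's method in the article) is left out of this sketch.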
Affiliation(s)
- Adrian Voicu
  - Department of Medical Informatics and Biostatistics, Victor Babes University of Medicine and Pharmacy, E. Murgu 2, 300041, Timisoara, Romania
- Narcis Duteanu
  - Dep. CAICAM, Politehnica University of Timisoara, Pirvan Boulevard 6, Timisoara, Romania
- Mirela Voicu
  - Department of Pharmacology-Clinical Pharmacy, Victor Babes University of Medicine and Pharmacy, E. Murgu 2, 300041, Timisoara, Romania
- Daliborca Vlad
  - Department of Pharmacology, Victor Babes University of Medicine and Pharmacy, E. Murgu 2, 300041, Timisoara, Romania
- Victor Dumitrascu
  - Department of Pharmacology, Victor Babes University of Medicine and Pharmacy, E. Murgu 2, 300041, Timisoara, Romania
16
Hähnke VD, Kim S, Bolton EE. PubChem chemical structure standardization. J Cheminform 2018; 10:36. [PMID: 30097821 PMCID: PMC6086778 DOI: 10.1186/s13321-018-0293-8] [Citation(s) in RCA: 62] [Impact Index Per Article: 10.3] [Received: 04/18/2018] [Accepted: 08/01/2018] [Indexed: 11/15/2022] Open
Abstract
BACKGROUND PubChem is a chemical information repository, consisting of three primary databases: Substance, Compound, and BioAssay. When individual data contributors submit chemical substance descriptions to Substance, the unique chemical structures are extracted and stored into Compound through an automated process called structure standardization. The present study describes the PubChem standardization approaches and analyzes them for their success rates, reasons that cause structures to be rejected, and modifications applied to structures during the standardization process. Furthermore, the PubChem standardization is compared to the structure normalization of the IUPAC International Chemical Identifier (InChI) software, as manifested by conversion of the InChI back into a chemical structure. RESULTS The observed rejection rate for substances processed by PubChem standardization was 0.36%, which is predominantly attributed to structures with invalid atom valences that cannot be readily corrected without additional information from contributors. Of all structures that pass standardization, 44% are modified in the process, reducing the count of unique structures from 53,574,724 in substance to 45,808,881 in compound as identified by de-aromatized canonical isomeric SMILES. Even though the processing time is very low on average (only 0.4% of structures have individual standardization time above 0.1 s), total standardization time is completely dominated by edge cases: 90% of the time to standardize all structures in PubChem substance is spent on the 2.05% of structures with the highest individual standardization time. It is worth noting that 60% of the structures obtained from PubChem structure standardization are not identical to the chemical structure resulting from the InChI (primarily due to preferences for a different tautomeric form). 
CONCLUSIONS Standardization of chemical structures is complicated by the diversity of chemical information and of the approaches used to represent it. The PubChem standardization is an effective and efficient tool to account for molecular diversity and to eliminate invalid/incomplete structures. Further development will concentrate on improved tautomer consideration and an expanded stereocenter definition. Modifications are difficult to thoroughly validate, with slight changes often affecting many thousands of structures and various edge cases. The PubChem structure standardization service is accessible as a public resource ( https://pubchem.ncbi.nlm.nih.gov/standardize ), and via programmatic interfaces.
Affiliation(s)
- Volker D. Hähnke
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, 8600 Rockville Pike, Bethesda, MD 20894 USA
  - Present Address: European Patent Office, Patentlaan 2, 2288 EE Rijswijk, The Netherlands
- Sunghwan Kim
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, 8600 Rockville Pike, Bethesda, MD 20894 USA
- Evan E. Bolton
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, 8600 Rockville Pike, Bethesda, MD 20894 USA
17
Taraji M, Haddad PR, Amos RIJ, Talebi M, Szucs R, Dolan JW, Pohl CA. Chemometric-assisted method development in hydrophilic interaction liquid chromatography: A review. Anal Chim Acta 2017; 1000:20-40. [PMID: 29289311 DOI: 10.1016/j.aca.2017.09.041] [Citation(s) in RCA: 62] [Impact Index Per Article: 8.9] [Received: 06/07/2017] [Revised: 09/22/2017] [Accepted: 09/24/2017] [Indexed: 02/09/2023]
Abstract
With an enormous growth in the application of hydrophilic interaction liquid chromatography (HILIC), there has also been significant progress in HILIC method development. HILIC is a chromatographic method that utilises hydro-organic mobile phases with a high organic content, and a hydrophilic stationary phase. It has been applied predominantly in the determination of small polar compounds. Theoretical studies in computer-aided modelling tools, most importantly the predictive, quantitative structure retention relationship (QSRR) modelling methods, have attracted the attention of researchers and these approaches greatly assist the method development process. This review focuses on the application of computer-aided modelling tools in understanding the retention mechanism, the classification of HILIC stationary phases, prediction of retention times in HILIC systems, optimisation of chromatographic conditions, and description of the interaction effects of the chromatographic factors in HILIC separations. Additionally, what has been achieved in the potential application of QSRR methodology in combination with experimental design philosophy in the optimisation of chromatographic separation conditions in the HILIC method development process is communicated. Developing robust predictive QSRR models will undoubtedly facilitate more application of this chromatographic mode in a broader variety of research areas, significantly minimising cost and time of the experimental work.
Affiliation(s)
- Maryam Taraji
  - Australian Centre for Research on Separation Science (ACROSS), School of Physical Sciences-Chemistry, University of Tasmania, Private Bag 75, Hobart 7001, Australia
- Paul R Haddad
  - Australian Centre for Research on Separation Science (ACROSS), School of Physical Sciences-Chemistry, University of Tasmania, Private Bag 75, Hobart 7001, Australia
- Ruth I J Amos
  - Australian Centre for Research on Separation Science (ACROSS), School of Physical Sciences-Chemistry, University of Tasmania, Private Bag 75, Hobart 7001, Australia
- Mohammad Talebi
  - Australian Centre for Research on Separation Science (ACROSS), School of Physical Sciences-Chemistry, University of Tasmania, Private Bag 75, Hobart 7001, Australia
- Roman Szucs
  - Pfizer Global Research and Development, CT13 9NJ, Sandwich, UK
- John W Dolan
  - LC Resources, 1795 NW Wallace Rd., McMinnville, OR 97128, USA
18
Use of dual-filtering to create training sets leading to improved accuracy in quantitative structure-retention relationships modelling for hydrophilic interaction liquid chromatographic systems. J Chromatogr A 2017; 1507:53-62. [DOI: 10.1016/j.chroma.2017.05.044] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 03/28/2017] [Revised: 05/17/2017] [Accepted: 05/18/2017] [Indexed: 01/31/2023]
19
Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017; 117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 111] [Impact Index Per Article: 15.9] [Indexed: 01/03/2023]
Abstract
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
Affiliation(s)
- Martin Krallinger
  - Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, C/Melchor Fernández Almagro 3, Madrid E-28029, Spain
- Obdulia Rabal
  - Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Anália Lourenço
  - ESEI - Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, Ourense E-32004, Spain; Centro de Investigaciones Biomédicas (Centro Singular de Investigación de Galicia), Campus Universitario Lagoas-Marcosende, Vigo E-36310, Spain; CEB-Centre of Biological Engineering, University of Minho, Campus de Gualtar, Braga 4710-057, Portugal
- Julen Oyarzabal
  - Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Alfonso Valencia
  - Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), C/Jordi Girona, 29-31, Barcelona E-08034, Spain; Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona, C/ Baldiri Reixac 10, Barcelona E-08028, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig de Lluís Companys 23, Barcelona E-08010, Spain
20
Przybylak K, Madden J, Covey-Crump E, Gibson L, Barber C, Patel M, Cronin M. Characterisation of data resources for in silico modelling: benchmark datasets for ADME properties. Expert Opin Drug Metab Toxicol 2017; 14:169-181. [DOI: 10.1080/17425255.2017.1316449] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 12/29/2022]
Affiliation(s)
- K.R. Przybylak
  - School of Pharmacy and Chemistry, Liverpool John Moores University, Liverpool, UK
- J.C. Madden
  - School of Pharmacy and Chemistry, Liverpool John Moores University, Liverpool, UK
- L. Gibson
  - Lhasa Limited, Granary Wharf House, Leeds, UK
- C. Barber
  - Lhasa Limited, Granary Wharf House, Leeds, UK
- M. Patel
  - Lhasa Limited, Granary Wharf House, Leeds, UK
- M.T.D. Cronin
  - School of Pharmacy and Chemistry, Liverpool John Moores University, Liverpool, UK
21
Jacob PM, Lan T, Goodman JM, Lapkin AA. A possible extension to the RInChI as a means of providing machine readable process data. J Cheminform 2017; 9:23. [PMID: 29086180 PMCID: PMC5388667 DOI: 10.1186/s13321-017-0210-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 11/24/2016] [Accepted: 04/01/2017] [Indexed: 12/21/2022] Open
Abstract
The algorithmic, large-scale use and analysis of reaction databases such as Reaxys is currently hindered by the absence of widely adopted standards for publishing reaction data in machine readable formats. Crucial data such as yields of all products or stoichiometry are frequently not explicitly stated in the published papers and, hence, not reported in the database entry for those reactions, limiting their usefulness for algorithmic analysis. This paper presents a possible extension to the IUPAC RInChI standard via an auxiliary layer, termed ProcAuxInfo, which is a standardised, extensible form in which to report certain key reaction parameters such as declaration of all products and reactants as well as auxiliaries known in the reaction, reaction stoichiometry, amounts of substances used, conversion, yield and operating conditions. The standard is demonstrated via creation of the RInChI including the ProcAuxInfo layer based on three published reactions and demonstrates accurate data recoverability via reverse translation of the created strings. Implementation of this or another method of reporting process data by the publishing community would ensure that databases, such as Reaxys, would be able to abstract crucial data for big data analysis of their contents.
Affiliation(s)
- Philipp-Maximilian Jacob
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS UK
- Tian Lan
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS UK
- Alexei A. Lapkin
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS UK
22
Barth A, Stengel T, Litterst E, Kraut H, Matuszczyk H, Ailer F, Hajkowski S. A Novel Concept for the Search and Retrieval of the Derwent Markush Resource Database. J Chem Inf Model 2016; 56:821-9. [PMID: 27123583 DOI: 10.1021/acs.jcim.6b00082] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 11/28/2022]
Abstract
The representation of and search for generic chemical structures (Markush) remains a continuing challenge. Several research groups have addressed this problem, and over time a limited number of practical solutions have been proposed. Today there are two large commercial providers of Markush databases: Chemical Abstracts Service (CAS) and Thomson Reuters. The Thomson Reuters "Derwent" Markush database is currently offered via the online services Questel and STN and as a data feed for in-house use. The aim of this paper is to briefly review the existing Markush systems (databases plus search engines) and to describe our new approach for the implementation of the Derwent Markush Resource on STN. Our new approach demonstrates the integration of the Derwent Markush Resource database into the existing chemistry-focused STN platform without loss of detail. This provides compatibility with other structure and Markush databases on STN and at the same time makes it possible to deploy the specific features and functions of the Derwent approach. It is shown that the different Markush languages developed by CAS and Derwent can be combined into a single general Markush description. In this concept the generic nodes are grouped together in a unique hierarchy where all chemical elements and fragments can be integrated. As a consequence, both systems are searchable using a single structure query. Moreover, the presented concept could serve as a promising starting point for a common generalized description of Markush structures.
Affiliation(s)
- Andreas Barth
  - FIZ Karlsruhe - Leibniz Institute for Information Infrastructure, D-76344 Eggenstein-Leopoldshafen, Germany
- Thomas Stengel
  - FIZ Karlsruhe - Leibniz Institute for Information Infrastructure, D-76344 Eggenstein-Leopoldshafen, Germany
- Edwin Litterst
  - FIZ Karlsruhe - Leibniz Institute for Information Infrastructure, D-76344 Eggenstein-Leopoldshafen, Germany
23
Dönertaş HM, Martínez Cuesta S, Rahman SA, Thornton JM. Characterising Complex Enzyme Reaction Data. PLoS One 2016; 11:e0147952. [PMID: 26840640 PMCID: PMC4740462 DOI: 10.1371/journal.pone.0147952] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 10/01/2015] [Accepted: 01/11/2016] [Indexed: 01/05/2023] Open
Abstract
The relationship between enzyme-catalysed reactions and the Enzyme Commission (EC) number, the widely accepted classification scheme used to characterise enzyme activity, is complex and with the rapid increase in our knowledge of the reactions catalysed by enzymes needs revisiting. We present a manual and computational analysis to investigate this complexity and found that almost one-third of all known EC numbers are linked to more than one reaction in the secondary reaction databases (e.g., KEGG). Although this complexity is often resolved by defining generic, alternative and partial reactions, we have also found individual EC numbers with more than one reaction catalysing different types of bond changes. This analysis adds a new dimension to our understanding of enzyme function and might be useful for the accurate annotation of the function of enzymes and to study the changes in enzyme function during evolution.
Affiliation(s)
- Handan Melike Dönertaş
  - European Molecular Biology Laboratory, European Bioinformatics Institute EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
  - Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Sergio Martínez Cuesta
  - European Molecular Biology Laboratory, European Bioinformatics Institute EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Syed Asad Rahman
  - European Molecular Biology Laboratory, European Bioinformatics Institute EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Janet M. Thornton
  - European Molecular Biology Laboratory, European Bioinformatics Institute EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
24
Chemoinformatics: Achievements and Challenges, a Personal View. Molecules 2016; 21:151. [PMID: 26828468 PMCID: PMC6273366 DOI: 10.3390/molecules21020151] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.8] [Received: 11/20/2015] [Revised: 01/14/2016] [Accepted: 01/20/2016] [Indexed: 11/16/2022] Open
Abstract
Chemoinformatics provides computer methods for learning from chemical data and for modeling tasks a chemist is facing. The field has evolved in the past 50 years and has substantially shaped how chemical research is performed by providing access to chemical information on a scale unattainable by traditional methods. Many physical, chemical and biological data have been predicted from structural data. For the early phases of drug design, methods have been developed that are used in all major pharmaceutical companies. However, all domains of chemistry can benefit from chemoinformatics methods; many areas are not yet well developed but could substantially gain from the use of chemoinformatics methods. The quality of data is of crucial importance for successful results. Computer-assisted structure elucidation and computer-assisted synthesis design have been attempted in the early years of chemoinformatics. Because of the importance of these fields to the chemist, new approaches should be made with better hardware and software techniques. Society's concern about the impact of chemicals on human health and the environment could be met by the development of methods for toxicity prediction and risk assessment. In conjunction with bioinformatics, our understanding of the events in living organisms could be deepened and, thus, novel strategies for curing diseases developed. With so many challenging tasks awaiting solutions, the future is bright for chemoinformatics.
25
Abstract
BACKGROUND Atom environments and fragments find wide-spread use in chemical information and cheminformatics. They are the basis of prediction models, an integral part in similarity searching, and employed in structure search techniques. Most of these methods were developed and evaluated on the relatively small sets of chemical structures available at the time. An analysis of fragment distributions representative of most known chemical structures was published in the 1970s using the Chemical Abstracts Service data system. More recently, advances in automated synthesis of chemicals allow millions of chemicals to be synthesized by a single organization. In addition, open chemical databases are readily available containing tens of millions of chemical structures from a multitude of data sources, including chemical vendors, patents, and the scientific literature, making it possible for scientists to readily access most known chemical structures. With this availability of information, one can now address interesting questions, such as: what chemical fragments are known today? How do these fragments compare to earlier studies? How unique are chemical fragments found in chemical structures? RESULTS For our analysis, after hydrogen suppression, atoms were characterized by atomic number, formal charge, implicit hydrogen count, explicit degree (number of neighbors), valence (bond order sum), and aromaticity. Bonds were differentiated as single, double, triple or aromatic bonds. Atom environments were created in a circular manner focused on a central atom with radii from 0 (atom types) up to 3 (representative of ECFP_6 fragments). In total, combining atom types and atom environments that include up to three spheres of nearest neighbors, our investigation identified 28,462,319 unique fragments in the 46 million structures found in the PubChem Compound database as of January 2013. 
We could identify several factors inflating the number of environments involving transition metals, with many seemingly due to erroneous interpretation of structures from patent data. Compared to fragmentation statistics published 40 years ago, the exponential growth in chemistry is mirrored in a nearly eightfold increase in the number of unique chemical fragments; however, this result is clearly an upper bound estimate as earlier studies employed structure sampling approaches and this study shows that a relatively high rate of atom fragments are found in only a single chemical structure (singletons). In addition, the percentage of singletons grows as the size of the chemical fragment is increased. CONCLUSIONS The observed growth of the numbers of unique fragments over time suggests that many chemically possible connections of atom types to larger fragments have yet to be explored by chemists. A dramatic drop in the relative rate of increase of atom environments from smaller to larger fragments shows that larger fragments mainly consist of diverse combinations of a limited subset of smaller fragments. This is further supported by the observed concomitant increase of singleton atom environments. Combined, these findings suggest that there is considerable opportunity for chemists to combine known fragments to novel chemical compounds. The comparison of PubChem to an older study of known chemical structures shows noticeable differences. The changes suggest advances in synthetic capabilities of chemists to combine atoms in new patterns. Log-log plots of fragment incidence show small numbers of fragments are found in many structures and that large numbers of fragments are found in very few structures, with nearly half being novel using the methods in this work. The relative decrease in the count of new fragments as a function of size further suggests considerable opportunity for more novel chemicals exists. 
Lastly, the differences in atom environment diversity between PubChem Substance and Compound showcase the effect of PubChem standardization protocols, but also indicate that a normalization procedure for atom types, functional groups, and tautomeric/resonance forms based on atom environments is possible. The complete sets of atom types and atom environments are supplied as supporting information.
Affiliation(s)
- Volker D Hähnke
- Department of Health and Human Services, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894 USA
- Evan E Bolton
- Department of Health and Human Services, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894 USA
- Stephen H Bryant
- Department of Health and Human Services, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894 USA
|
26
|
Warr WA. Many InChIs and quite some feat. J Comput Aided Mol Des 2015; 29:681-94. [PMID: 26081259 DOI: 10.1007/s10822-015-9854-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 04/28/2015] [Accepted: 06/10/2015] [Indexed: 12/14/2022]
Affiliation(s)
- Wendy A Warr
- Wendy Warr & Associates, Holmes Chapel, Crewe, Cheshire, CW4 7HZ, UK
|
27
|
Heller SR, McNaught A, Pletnev I, Stein S, Tchekhovskoi D. InChI, the IUPAC International Chemical Identifier. J Cheminform 2015; 7:23. [PMID: 26136848 PMCID: PMC4486400 DOI: 10.1186/s13321-015-0068-4] [Citation(s) in RCA: 415] [Impact Index Per Article: 46.1] [Received: 01/30/2015] [Accepted: 04/15/2015] [Indexed: 11/16/2022] Open
Abstract
This paper documents the design, layout and algorithms of the IUPAC International Chemical Identifier, InChI.
Affiliation(s)
- Stephen R Heller
- Biomolecular Measurement Division, National Institute of Standards and Technology, Gaithersburg, MD 20899-8362 USA
- Igor Pletnev
- Department of Chemistry, Lomonosov Moscow State University, 119991 Moscow, Russia
- Stephen Stein
- Biomolecular Measurement Division, National Institute of Standards and Technology, Gaithersburg, MD 20899-8362 USA
- Dmitrii Tchekhovskoi
- Biomolecular Measurement Division, National Institute of Standards and Technology, Gaithersburg, MD 20899-8362 USA
|
28
|
Willett P. The Calculation of Molecular Structural Similarity: Principles and Practice. Mol Inform 2014; 33:403-13. [DOI: 10.1002/minf.201400024] [Citation(s) in RCA: 71] [Impact Index Per Article: 7.1] [Received: 02/18/2014] [Accepted: 03/14/2014] [Indexed: 01/28/2023]
|
29
|
Schomburg KT, Wetzer L, Rarey M. Interactive design of generic chemical patterns. Drug Discov Today 2013; 18:651-8. [DOI: 10.1016/j.drudis.2013.02.001] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 09/07/2012] [Revised: 12/19/2012] [Accepted: 02/01/2013] [Indexed: 11/17/2022]
|
30
|
Karthikeyan M, Vyas R. Chemical structure representations and applications in computational toxicity. Methods Mol Biol 2013; 929:167-92. [PMID: 23007430 DOI: 10.1007/978-1-62703-050-2_8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 02/11/2023]
Abstract
Efficient storage and retrieval of chemical structures is one of the most important prerequisites for solving any computational-based problem in life sciences. Several resources including research publications, text books, and articles are available on chemical structure representation. Chemical substances that have the same molecular formula but several structural formulae, conformations, and skeleton framework/scaffold/functional groups of the molecule convey various characteristics of the molecule. Today with the aid of sophisticated mathematical models and informatics tools, it is possible to design a molecule of interest with specified characteristics based on their applications in pharmaceuticals, agrochemicals, biotechnology, nanomaterials, petrochemicals, and polymers. This chapter discusses both traditional and current state-of-the-art representation of chemical structures and their applications in chemical information management, bioactivity- and toxicity-based predictive studies.
Affiliation(s)
- Muthukumarasamy Karthikeyan
- National Chemical Laboratory, Digital Information Resource Centre & Centre of Excellence in Scientific Computing, Pune, India.
|
31
|
Gurulingappa H, Mudi A, Toldo L, Hofmann-Apitius M, Bhate J. Challenges in mining the literature for chemical information. RSC Adv 2013. [DOI: 10.1039/c3ra40787j] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Indexed: 11/21/2022] Open
|
32
|
Blurock E, Battin-Leclerc F, Faravelli T, Green WH. Automatic Generation of Detailed Mechanisms. Cleaner Combustion 2013. [DOI: 10.1007/978-1-4471-5307-8_3] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Indexed: 01/07/2023]
|
33
|
O'Boyle NM. Towards a Universal SMILES representation - A standard method to generate canonical SMILES based on the InChI. J Cheminform 2012; 4:22. [PMID: 22989151 PMCID: PMC3495655 DOI: 10.1186/1758-2946-4-22] [Citation(s) in RCA: 132] [Impact Index Per Article: 11.0] [Received: 06/12/2012] [Accepted: 09/03/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND There are two line notations of chemical structures that have established themselves in the field: the SMILES string and the InChI string. The InChI aims to provide a unique, or canonical, identifier for chemical structures, while SMILES strings are widely used for storage and interchange of chemical structures, but no standard exists to generate a canonical SMILES string. RESULTS I describe how to use the InChI canonicalisation to derive a canonical SMILES string in a straightforward way, either incorporating the InChI normalisations (Inchified SMILES) or not (Universal SMILES). This is the first description of a method to generate canonical SMILES that takes stereochemistry into account. When tested on the 1.1 m compounds in the ChEMBL database, and a 1 m compound subset of the PubChem Substance database, no canonicalisation failures were found with Inchified SMILES. Using Universal SMILES, 99.79% of the ChEMBL database was canonicalised successfully and 99.77% of the PubChem subset. CONCLUSIONS The InChI canonicalisation algorithm can successfully be used as the basis for a common standard for canonical SMILES. While challenges remain - such as the development of a standard aromatic model for SMILES - the ability to create the same SMILES using different toolkits will mean that for the first time it will be possible to easily compare the chemical models used by different toolkits.
Affiliation(s)
- Noel M O'Boyle
- Analytical and Biological Chemistry Research Facility, Cavanagh Pharmacy Building, University College Cork, Cork, Co. Cork, Ireland.
|
34
|
Guha R, Nguyen DT, Southall N, Jadhav A. Dealing with the Data Deluge: Handling the Multitude Of Chemical Biology Data Sources. Current Protocols in Chemical Biology 2012; 4:193-209. [PMID: 26609498 PMCID: PMC4655879 DOI: 10.1002/9780470559277.ch110262] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/09/2023]
Abstract
Over the last 20 years, there has been an explosion in the amount and type of biological and chemical data that has been made publicly available in a variety of online databases. While this means that vast amounts of information can be found online, there is no guarantee that it can be found easily (or at all). A scientist searching for a specific piece of information is faced with a daunting task - many databases have overlapping content, use their own identifiers and, in some cases, have arcane and unintuitive user interfaces. In this overview, a variety of well known data sources for chemical and biological information are highlighted, focusing on those most useful for chemical biology research. The issue of using multiple data sources together and the associated problems such as identifier disambiguation are highlighted. A brief discussion is then provided on Tripod, a recently developed platform that supports the integration of arbitrary data sources, providing users a simple interface to search across a federated collection of resources.
Affiliation(s)
- Rajarshi Guha
- NIH Center for Advancing Translational Science, 9800 Medical Center Drive Rockville, MD 20850
- Dac-Trung Nguyen
- NIH Center for Advancing Translational Science, 9800 Medical Center Drive Rockville, MD 20850
- Noel Southall
- NIH Center for Advancing Translational Science, 9800 Medical Center Drive Rockville, MD 20850
- Ajit Jadhav
- NIH Center for Advancing Translational Science, 9800 Medical Center Drive Rockville, MD 20850
|
35
|
Warr WA. Silver threads. J Comput Aided Mol Des 2011; 26:151-2. [PMID: 22160657 DOI: 10.1007/s10822-011-9502-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2011] [Accepted: 11/29/2011] [Indexed: 11/27/2022]
|