1
|
Liang T, Liu W, Tan K, Wu A, Lu X. Advancing Ionic Liquid Research with pSCNN: A Novel Approach for Accurate Normal Melting Temperature Predictions. ACS Omega 2024; 9:31694-31702. [PMID: 39072063 PMCID: PMC11270577 DOI: 10.1021/acsomega.4c02393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2024] [Revised: 04/12/2024] [Accepted: 06/25/2024] [Indexed: 07/30/2024]
Abstract
Ionic liquids (ILs), known for their distinct and tunable properties, offer a broad spectrum of potential applications across various fields, including chemistry, materials science, and energy storage. However, practical applications of ILs are often limited by their unfavorable physicochemical properties. Experimental screening becomes impractical due to the vast number of potential IL combinations. Therefore, the development of a robust and efficient model for predicting IL properties is imperative. Because the melting point is a defining feature of ILs, it is of practical significance to establish an accurate yet efficient model to predict the normal melting point of ILs (Tm), which may facilitate the discovery and design of novel ILs for specific applications. In this study, we present a pseudo-Siamese convolutional neural network (pSCNN), inspired by SCNN, that focuses on Tm. Utilizing a data set of 3098 ILs, we systematically assess various deep learning models (ANN, pSCNN, and Transformer-CNF), along with molecular descriptors (ECFP fingerprint and Mordred properties), for their performance in predicting the Tm of ILs. Remarkably, among the investigated modeling schemes, the pSCNN, coupled with filtered Mordred descriptors, demonstrates superior performance, yielding mean absolute error (MAE) and root-mean-square error (RMSE) values of 24.36 and 31.56 °C, respectively. Feature analysis further highlights the effectiveness of the pSCNN model. Moreover, the pSCNN method, with a pair of inputs, can be extended beyond ionic liquid melting point prediction.
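For readers comparing the error figures quoted in this and the following abstracts, the two metrics can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names `mae` and `rmse` and the melting-point values are hypothetical and are not taken from the paper.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root-mean-square error: penalizes large errors more strongly than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical melting points in °C (illustrative values only, not data from the paper).
observed = [10.0, 50.0, 90.0, 130.0]
predicted = [20.0, 40.0, 120.0, 130.0]

print("MAE:", mae(observed, predicted))
print("RMSE:", rmse(observed, predicted))
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same data, which is consistent with the reported pair of 24.36 and 31.56 °C.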
Affiliation(s)
- Tao Liang
- State Key Laboratory of Physical Chemistry of Solid Surface, Fujian Provincial Key Laboratory for Theoretical and Computational Chemistry, Department of Chemistry, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
- Wei Liu
- State Key Laboratory of Physical Chemistry of Solid Surface, Fujian Provincial Key Laboratory for Theoretical and Computational Chemistry, Department of Chemistry, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
- Kai Tan
- State Key Laboratory of Physical Chemistry of Solid Surface, Fujian Provincial Key Laboratory for Theoretical and Computational Chemistry, Department of Chemistry, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
- Anan Wu
- State Key Laboratory of Physical Chemistry of Solid Surface, Fujian Provincial Key Laboratory for Theoretical and Computational Chemistry, Department of Chemistry, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
- Xin Lu
- State Key Laboratory of Physical Chemistry of Solid Surface, Fujian Provincial Key Laboratory for Theoretical and Computational Chemistry, Department of Chemistry, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
|
2
|
|
3
|
Ksenofontov AA, Lukanov MM, Bocharov PS. Can machine learning methods accurately predict the molar absorption coefficient of different classes of dyes? Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy 2022; 279:121442. [PMID: 35660154 DOI: 10.1016/j.saa.2022.121442] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 04/07/2022] [Revised: 05/25/2022] [Accepted: 05/26/2022] [Indexed: 06/15/2023]
Abstract
In this article, we provide a convenient tool that allows researchers to predict the molar absorption coefficient for a wide range of dyes at no computational cost. The new model is based on the RFR method (ALogPS, OEstate + Fragmentor + QNPR) and is able to predict the molar absorption coefficient with an accuracy (5-fold cross-validation RMSE) of 0.26 log units. This accuracy was achieved because the model was trained on data for more than 20,000 unique dye molecules. To our knowledge, this is the first model for predicting the molar absorption coefficient trained on such a large and diverse set of dyes. The model is available at https://ochem.eu/article/145413. We hope that the new model will allow researchers to predict dyes with practically significant spectral characteristics and verify existing experimental data.
Affiliation(s)
- Alexander A Ksenofontov
- G.A. Krestov Institute of Solution Chemistry of the Russian Academy of Sciences, Akademicheskaya Street, 153045 Ivanovo, Russia
- Michail M Lukanov
- G.A. Krestov Institute of Solution Chemistry of the Russian Academy of Sciences, Akademicheskaya Street, 153045 Ivanovo, Russia; Ivanovo State University of Chemistry and Technology, 7, Sheremetevskiy Avenue, Ivanovo 153000, Russia
- Pavel S Bocharov
- G.A. Krestov Institute of Solution Chemistry of the Russian Academy of Sciences, Akademicheskaya Street, 153045 Ivanovo, Russia
|
4
|
Parastar H, Tauler R. Big (Bio)Chemical Data Mining Using Chemometric Methods: A Need for Chemists. Angew Chem Int Ed Engl 2022. [DOI: 10.1002/ange.201801134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Affiliation(s)
- Hadi Parastar
- Department of Chemistry, Sharif University of Technology, Tehran, Iran
- Roma Tauler
- Department of Environmental Chemistry, IDAEA-CSIC, 08034 Barcelona, Spain
|
5
|
Tinkov OV, Grigorev VY, Grigoreva LD, Osipov VN, Kolotaev AV, Khachatryan DS. QSAR analysis and experimental evaluation of new quinazoline-containing hydroxamic acids as histone deacetylase 6 inhibitors. SAR and QSAR in Environmental Research 2022; 33:513-532. [PMID: 35786151 DOI: 10.1080/1062936x.2022.2092210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2022] [Accepted: 06/14/2022] [Indexed: 06/15/2023]
Abstract
Histone deacetylase inhibitors represent the most important class of drugs for the treatment of human cancer and other diseases due to their influence on cell growth, differentiation, and apoptosis. Among the well-known eighteen histone deacetylases, histone deacetylase 6 (HDAC6), which is involved in oncogenesis, cell survival, and cancer cell metastasis, is of great importance. Using the CDK and alvaDesc molecular descriptors and the Random Forest and EXtreme Gradient Boosting methods, we propose a number of adequate QSAR classification models, which are integrated into a consensus model and are freely available on the OCHEM web platform (https://ochem.eu). The consensus QSAR model is used for virtual screening of a series of seven new compounds, the derivatives of N-((hydroxyamino)-oxoalkyl)-2-(quinazoline-4-ilamino)-benzamides, the synthesis schemes of which are also presented in this work. In vitro evaluation of the inhibitory activity (IC50) of this series of compounds against HDAC6 allowed us to confirm the results of virtual screening and to reveal promising compounds V-2 and V-4, IC50 of which is 3.25 nM and 0.04 nM, respectively. The subsequent in silico evaluation of the main ADMET properties of active compounds V-2 and V-4 allowed us to find that they have acceptable pharmacokinetic parameters and level of acute toxicity.
Affiliation(s)
- O V Tinkov
- Department of Pharmacology and Pharmaceutical Chemistry, Medical Faculty, Shevchenko Transnistria State University, Tiraspol, Moldova
- V Y Grigorev
- Department of Computer-aided Molecular Design, Institute of Physiologically Active Compounds of the Russian Academy of Sciences, Chernogolovka, Russia
- L D Grigoreva
- Department of Fundamental Physicochemical Engineering, Moscow State University, Moscow, Russia
- V N Osipov
- Department of Chemical Synthesis, Blokhin National Medical Research Center of Oncology, Ministry of Health of the Russian Federation, Moscow, Russia
- A V Kolotaev
- Laboratory of Natural Compounds, National Research Centre "Kurchatov Institute", Moscow, Russia
- Laboratory of Natural Compounds, Institute of Chemical Reagents and High Purity Chemical Substances of the National Research Centre "Kurchatov Institute", Moscow, Russia
- D S Khachatryan
- Laboratory of Natural Compounds, National Research Centre "Kurchatov Institute", Moscow, Russia
- Laboratory of Natural Compounds, Institute of Chemical Reagents and High Purity Chemical Substances of the National Research Centre "Kurchatov Institute", Moscow, Russia
|
6
|
Bujak M, Podsiadło M, Katrusiak A. Response to comment on Properties and interactions - melting point of tribromobenzene isomers. Acta Crystallographica Section B, Structural Science, Crystal Engineering and Materials 2022; 78:276-278. [PMID: 35411867 DOI: 10.1107/s2052520622003067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Affiliation(s)
- Maciej Bujak
- Faculty of Chemistry, University of Opole, Oleska 48, Opole, 45-052, Poland
- Marcin Podsiadło
- Faculty of Chemistry, Adam Mickiewicz University, Uniwersytetu Poznańskiego 8, Poznań, 61-614, Poland
- Andrzej Katrusiak
- Faculty of Chemistry, Adam Mickiewicz University, Uniwersytetu Poznańskiego 8, Poznań, 61-614, Poland
|
7
|
What Features of Ligands Are Relevant to the Opening of Cryptic Pockets in Drug Targets? Informatics 2022. [DOI: 10.3390/informatics9010008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
Small-molecule drug design aims to identify inhibitors that can specifically bind to a functionally important region on the target, i.e., an active site of an enzyme. Identification of potential binding pockets is typically based on static three-dimensional structures. However, small molecules may induce and select a dynamic binding pocket that is not visible in the apo protein, which presents a well-recognized challenge for structure-based drug discovery. Here, we assessed whether it is possible to identify features in molecules, which we refer to as inducers, that can induce the opening of cryptic pockets. The volume change between apo and bound protein conformations was used as a metric to differentiate chemical features in inducers vs. non-inducers. Based on the dataset of holo–apo pairs, classification models were built to determine an optimum threshold. The model analysis suggested that inducers preferred to be more hydrophobic and aromatic. The impact of sulfur was ambiguous, while phosphorus and halogen atoms were overrepresented in inducers. The fragment analysis showed that small changes in the structures of molecules can strongly affect the potential to induce a cryptic pocket. This analysis and developed model can be used to design inducers that can potentially open cryptic pockets for undruggable proteins.
|
8
|
Shin HK. Topological Distance-Based Electron Interaction Tensor to Apply a Convolutional Neural Network on Drug-like Compounds. ACS Omega 2021; 6:35757-35768. [PMID: 34984306 PMCID: PMC8717557 DOI: 10.1021/acsomega.1c05693] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/12/2021] [Accepted: 12/08/2021] [Indexed: 05/15/2023]
Abstract
Deep learning (DL) models in quantitative structure-activity relationship studies feed the molecular structure directly to the network, without using human-designed descriptors, by representing the molecule as a graph or string (e.g., SMILES code). However, these two representations are too simplified to fully reflect the chemical properties of molecular structures. Given that the choice of molecular representation determines the architecture of the DL model to apply, a novel molecular representation can open a way to apply diverse DL networks developed and used in other fields. A topological distance-based electron interaction (TDEi) tensor has been developed in this study, inspired by the quantum mechanical model of the molecule, which defines a molecule with electrons and protons. In the TDEi tensor, the atomic orbital (AO) of each atom is represented by an electron configuration (EC) vector, which is a bit string based on the presence and absence of electrons in each AO according to spin, indicated by positive and negative signs. Interactions between EC vectors were calculated based on the topological distance between atoms in a molecule. Because the molecular structure was translated into a 3D array, CNN models (modified VGGNet) were applied using a TDEi tensor to predict four physicochemical properties of drug-like compound datasets: MP (275,131), Lipop (4193), Esol (1127), and Freesolv (639). Models achieved good prediction accuracy. PCA showed that the correlation between the extracted features and the target endpoint became stronger as features were extracted from deeper layers.
Affiliation(s)
- Hyun Kil Shin
- Department of Predictive Toxicology, Korea Institute of Toxicology, Daejeon 34114, Republic of Korea
- Human and Environmental Toxicology, University of Science and Technology, Daejeon 34113, Republic of Korea
|
9
|
Makarov D, Fadeeva Y, Shmukler L, Tetko I. Beware of proper validation of models for ionic Liquids! J Mol Liq 2021. [DOI: 10.1016/j.molliq.2021.117722] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 01/03/2023]
|
10
|
On prediction of melting points without computer simulation: a focus on energetic molecular crystals. FirePhysChem 2021. [DOI: 10.1016/j.fpc.2021.11.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
|
11
|
Ghosh D, Koch U, Hadian K, Sattler M, Tetko IV. Highly Accurate Filters to Flag Frequent Hitters in AlphaScreen Assays by Suggesting their Mechanism. Mol Inform 2021; 41:e2100151. [PMID: 34676998 DOI: 10.1002/minf.202100151] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/05/2021] [Accepted: 09/29/2021] [Indexed: 11/06/2022]
Abstract
AlphaScreen is one of the most widely used assay technologies in drug discovery due to its versatility, dynamic range and sensitivity. However, a presence of false positives and frequent hitters contributes to difficulties with an interpretation of measured HTS data. Although filters do exist to identify frequent hitters for AlphaScreen, they are frequently based on privileged scaffolds. The development of such filters is time consuming and requires deep domain knowledge. Recently, machine learning and artificial intelligence methods are emerging as important tools to advance drug discovery and chemoinformatics, including their application to identification of frequent hitters in screening assays. However, the relative performance and complementarity of the Machine Learning and scaffold-based techniques has not yet been comprehensively compared. In this study, we analysed filters based on the privileged scaffolds with filters built using machine learning. Our results demonstrate that machine-learning methods provide more accurate filters for identification of frequent hitters in AlphaScreen assays than scaffold-based methods and can be easily redeveloped once new data are measured. We present highly accurate models to identify frequent hitters in AlphaScreen assays.
Affiliation(s)
- Dipan Ghosh
- Lead Discovery Center GmbH, Otto-Hahn-Straße 15, 44227, Dortmund, Germany
- Uwe Koch
- Lead Discovery Center GmbH, Otto-Hahn-Straße 15, 44227, Dortmund, Germany
- Kamyar Hadian
- Assay Development and Screening Platform, Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH), Ingolstädter Landstraße 1, D-85764, Neuherberg, Germany
- Michael Sattler
- Bavarian NMR Center, Department Chemie, Technische Universität München, Ernst-Otto-Fischerstraße 2, D-85747, Garching, Germany; Institute of Structural Biology, Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH), Ingolstädter Landstraße 1, D-85764, Neuherberg, Germany
- Igor V Tetko
- Institute of Structural Biology, Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH), Ingolstädter Landstraße 1, D-85764, Neuherberg, Germany; G.A. Krestov Institute of Solution Chemistry of the Russian Academy of Sciences, Akademicheskaya Street 1, 153045, Ivanovo, Russia; BIGCHEM GmbH, Valerystr. 49, D-85716, Unterschleißheim, Germany
|
12
|
Tinkov OV, Grigorev VY, Grigoreva LD. Prediction of an Organic Compound’s Biotransformation Time: A Study Using Avermectins. Moscow University Chemistry Bulletin 2021. [PMCID: PMC8382113 DOI: 10.3103/s0027131421040088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Abstract
The current spread of the SARS-CoV-2 coronavirus is a challenge for the entire world. Ivermectin is a promising agent, which could be used to combat the SARS-CoV-2 coronavirus. It represents a complex of semisynthetic derivatives of natural avermectins that have long been used in medicine and agriculture as antiparasitic drugs. However, the experimental ecotoxicology assessment data for individual avermectins are still scarce. In relation to this, the aim of this study is to develop a mathematical model that would allow reliably predicting the biotransformation ability of natural and semisynthetic avermectins and identifying the structural fragments of avermectin molecules that have the largest impact on this biological activity. The base for the model construction was a structurally heterogeneous set including organic compounds with experimentally determined biotransformation half-life periods (KmHL). Using the OCHEM web platform (https://ochem.eu) with the implemented PyDescriptor plugin for the descriptor calculation and Random Forest and Transformer-CNN algorithms, a satisfactory ($R^{2}_{\text{test}}$ = 0.81) Quantitative Structure-Activity Relationship (QSAR) model was developed. The subsequent calculations have shown that natural avermectins undergo on average faster biotransformation in fish than the semisynthetic ones. In addition, structural fragments that increase and decrease the biotransformation rate are identified.
|
13
|
Tinkov OV, Grigorev VY, Grigoreva LD. QSAR analysis of the acute toxicity of avermectins towards Tetrahymena pyriformis. SAR and QSAR in Environmental Research 2021; 32:541-571. [PMID: 34157880 DOI: 10.1080/1062936x.2021.1932583] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/24/2021] [Accepted: 05/17/2021] [Indexed: 06/13/2023]
Abstract
Avermectins have been effectively used in medicine, veterinary medicine, and agriculture as antiparasitic agents for many years. However, there are still no reliable data on the main ecotoxicological characteristics of most individual avermectins. Although many QSAR models have been proposed to describe the acute toxicity of organic compounds towards Tetrahymena pyriformis (T. pyriformis), avermectins are outside the applicability domain of these models. The influence of the molecular structures of various organic compounds on the acute toxicity towards T. pyriformis was studied using the OCHEM web platform (https://ochem.eu). A data set of 1792 toxicants was used to create models. The QSAR (Quantitative Structure-Activity Relationship) models were developed using the molecular descriptors Dragon, ISIDA, CDK, PyDescriptor, alvaDesc, and SIRMS and machine learning methods, such as Least Squares Support Vector Machine and Transformer Convolutional Neural Network. The HYBOT descriptors and Random Forest were used for a comparative QSAR investigation. Since the best predictive ability was demonstrated by the Transformer Convolutional Neural Network model, it was used to predict the toxicity of individual avermectins towards T. pyriformis. During a structural interpretation of the developed QSAR model, we determined the significant molecular transformations that increase and decrease the acute toxicity of organic compounds.
Affiliation(s)
- O V Tinkov
- Department of Pharmacology and Pharmaceutical Chemistry, Medical Faculty, Shevchenko Transnistria State University, Tiraspol, Moldova
- Department of Computer Science, Military Institute of the Ministry of Defense, Tiraspol, Moldova
- V Y Grigorev
- Department of Computer-aided Molecular Design, Institute of Physiologically Active Compounds of the Russian Academy of Science, Chernogolovka, Russia
- L D Grigoreva
- Department of Fundamental Physicochemical Engineering, Moscow State University, Moscow, Russia
|
14
|
Xiang Y, Tang YH, Liu H, Lin G, Sun H. Predicting Single-Substance Phase Diagrams: A Kernel Approach on Graph Representations of Molecules. J Phys Chem A 2021; 125:4488-4497. [PMID: 33999627 DOI: 10.1021/acs.jpca.1c02391] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 01/28/2023]
Abstract
This work presents a Gaussian process regression (GPR) model on top of a novel graph representation of chemical molecules that predicts thermodynamic properties of pure substances in single, double, and triple phases. A transferable molecular graph representation is proposed as the input for a marginalized graph kernel, which is the major component of the covariance function in our GPR models. Radial basis function kernels of temperature and pressure are also incorporated into the covariance function when necessary. We predicted three types of representative properties of pure substances in single, double, and triple phases, i.e., critical temperature, vapor-liquid equilibrium (VLE) density, and pressure-temperature density. The accuracy of the models is nearly identical to the precision of the experimental measurements. Moreover, the reliability of our predictions can be quantified on a per-sample basis using the posterior uncertainty of the GPR model. We compare our model against Morgan fingerprints and a graph neural network to further demonstrate the advantage of the proposed method.
Affiliation(s)
- Yan Xiang
- School of Chemistry and Chemical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
- Yu-Hang Tang
- Lawrence Berkeley National Laboratory, Berkeley, California 94720, United States
- Hongyi Liu
- School of Chemistry and Chemical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
- Guang Lin
- Department of Mathematics & School of Mechanical Engineering, Purdue University, West Lafayette, Indiana 47907, United States
- Huai Sun
- School of Chemistry and Chemical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
|
15
|
Mi W, Chen H, Zhu DA, Zhang T, Qian F. Melting point prediction of organic molecules by deciphering the chemical structure into a natural language. Chem Commun (Camb) 2021; 57:2633-2636. [PMID: 33587048 DOI: 10.1039/d0cc07384a] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/21/2022]
Abstract
Establishing quantitative structure-property relationships for the rational design of small molecule drugs at the early discovery stage is highly desirable. Using natural language processing (NLP), we proposed a machine learning model to process the line notation of small organic molecules, allowing the prediction of their melting points. The model prediction accuracy benefits from training upon different canonicalized SMILES forms of the same molecules and does not decrease with increasing size, complexity, and structural flexibility. When a combination of two different canonicalized SMILES forms is used to train the model, the prediction accuracy improves. Largely distinguished from the previous fragment-based or descriptor-based models, the prediction accuracy of this NLP-based model does not decrease with increasing size, complexity, and structural flexibility of molecules. By representing the chemical structure as a natural language, this NLP-based model offers a potential tool for quantitative structure-property prediction for drug discovery and development.
Affiliation(s)
- Weiming Mi
- Department of Automation, Tsinghua University, Beijing National Research Center for Information Science and Technology, Beijing 100084, P. R. China
|
16
|
Lai J, Li X, Wang Y, Yin S, Zhou J, Liu Z. AIScaffold: A Web-Based Tool for Scaffold Diversification Using Deep Learning. J Chem Inf Model 2020; 61:1-6. [PMID: 33356237 DOI: 10.1021/acs.jcim.0c00867] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/30/2022]
Abstract
Molecular scaffolds are widely used in drug design. Many methods and tools have been developed to utilize the information in scaffolds. Scaffold diversification is frequently used by medicinal chemists in tasks such as lead compound optimization, but tools for scaffold diversification are still lacking. Here, we propose AIScaffold (https://iaidrug.stonewise.cn), a web-based tool for scaffold diversification using the deep generative model. This tool can perform large-scale (up to 500,000 molecules) diversification in several minutes and recommend the top 500 (top 0.1%) molecules. Features such as site-specific diversification are also supported. This tool can facilitate the scaffold diversification process for medicinal chemists, thereby accelerating drug design.
Affiliation(s)
- Junyong Lai
- State Key Laboratory of Natural and Biomimetic Drugs, School of Pharmaceutical Sciences, Peking University, 100191 Beijing, P. R. China
- Xiangbin Li
- Stonewise, No. 19 Zhongguancun Street, Haidian District, 100080 Beijing, P. R. China
- Yanxing Wang
- State Key Laboratory of Natural and Biomimetic Drugs, School of Pharmaceutical Sciences, Peking University, 100191 Beijing, P. R. China
- Shiqiu Yin
- Stonewise, No. 19 Zhongguancun Street, Haidian District, 100080 Beijing, P. R. China
- Jielong Zhou
- Stonewise, No. 19 Zhongguancun Street, Haidian District, 100080 Beijing, P. R. China
- Zhenming Liu
- State Key Laboratory of Natural and Biomimetic Drugs, School of Pharmaceutical Sciences, Peking University, 100191 Beijing, P. R. China
|
17
|
Tinkov O, Polishchuk P, Matveieva M, Grigorev V, Grigoreva L, Porozov Y. The Influence of Structural Patterns on Acute Aquatic Toxicity of Organic Compounds. Mol Inform 2020; 40:e2000209. [PMID: 33029954 DOI: 10.1002/minf.202000209] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 08/14/2020] [Accepted: 10/01/2020] [Indexed: 12/28/2022]
Abstract
The influence of the molecular structure of different organic compounds on acute toxicity towards Fathead minnow, Daphnia magna, and Tetrahymena pyriformis has been investigated using a 2D simplex representation of molecular structure and two modelling methods: Random Forest (RF) and Gradient Boosting Machine (GBM). Suitable QSAR (Quantitative Structure - Activity Relationships) models were obtained. The study focused on the interpretation of the QSAR models. The aim of the study was to develop a set of structural fragments that simultaneously and consistently increase toxicity toward Fathead minnow, Daphnia magna, and Tetrahymena pyriformis. The interpretation provided more details about known toxicophores and suggested new fragments. The results obtained made it possible to rank the contributions of molecular fragments to various types of toxicity to aquatic organisms. This information can be used for molecular optimization of chemicals. According to the results of structural interpretation, the most significant common mechanisms of the toxic effect of organic compounds on Fathead minnow, Daphnia magna and Tetrahymena pyriformis are reactions of nucleophilic substitution and inhibition of oxidative phosphorylation in mitochondria. In addition, acetylcholinesterase and the voltage-gated ion channel of Fathead minnow and Daphnia magna are important targets for toxicants. The on-line version of the OCHEM expert system (https://ochem.eu) was used for a comparative QSAR investigation. The proposed QSAR models comply with the OECD principles and can be used to reliably predict acute toxicity of organic compounds towards Fathead minnow, Daphnia magna and Tetrahymena pyriformis with allowance for applicability domain estimation.
Affiliation(s)
- Oleg Tinkov
- Department of Computer Science, Military Institute of the Ministry of Defense, 3300, Gogol str. 2"B", Tiraspol, Transdniestria, Moldova; Department of Pharmacology and Pharmaceutical Chemistry, Medical Faculty, Transnistrian State University, 3300, October 25 str. 128, Tiraspol, Transdniestria, Moldova
- Pavel Polishchuk
- Institute of Molecular and Translational Medicine, Faculty of Medicine and Dentistry, Palacký University and University Hospital in Olomouc, Hnevotinska 5, 77900, Olomouc, Czech Republic
- Mariia Matveieva
- Institute of Molecular and Translational Medicine, Faculty of Medicine and Dentistry, Palacký University and University Hospital in Olomouc, Hnevotinska 5, 77900, Olomouc, Czech Republic
- Veniamin Grigorev
- Institute of Physiologically Active Compounds, Russian Academy of Sciences, 142432, Severniy proezd 1, Chernogolovka, Moscow region, Russia
- Ludmila Grigoreva
- Department of Fundamental Physical and Chemical Engineering, Moscow State University, 119991, Leninskiye Gory 1/51, Moscow, Russia
- Yuri Porozov
- World-Class Research Center "Digital biodesign and personalized healthcare", I.M. Sechenov First Moscow State Medical University, Moscow, Russia; Department of Computational Biology, Sirius University of Science and Technology, 354340, Olympic Ave 1, Sochi, Russia
|
18
|
Mohammed AI, Ahmed AM, Bhadbhade MM, Ho J, Read RW. Sugar-substituted fluorous 1,2,3-triazoles: Helical twists in fluoroalkyl chains and their molecular association in the solid state and correlations with physicochemical properties. J Fluor Chem 2020. [DOI: 10.1016/j.jfluchem.2020.109536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
|
19
|
Sivaraman G, Jackson NE, Sanchez-Lengeling B, Vázquez-Mayagoitia Á, Aspuru-Guzik A, Vishwanath V, de Pablo JJ. A machine learning workflow for molecular analysis: application to melting points. MACHINE LEARNING: SCIENCE AND TECHNOLOGY 2020. [DOI: 10.1088/2632-2153/ab8aa3] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 12/17/2022]
Abstract
Computational tools encompassing integrated molecular prediction, analysis, and generation are key for molecular design in a variety of critical applications. In this work, we develop a workflow for molecular analysis (MOLAN) that integrates an ensemble of supervised and unsupervised machine learning techniques to analyze molecular data sets. The MOLAN workflow combines molecular featurization, clustering algorithms, uncertainty analysis, low-bias dataset construction, high-performance regression models, graph-based molecular embeddings and attribution, and a semi-supervised variational autoencoder based on the novel SELFIES representation to enable molecular design. We demonstrate the utility of the MOLAN workflow in the context of a challenging multi-molecule property prediction problem: the determination of melting points solely from single molecule structure. This application serves as a case study for how to employ the MOLAN workflow in the context of molecular property prediction.
|
20
|
Cui X, Yang R, Li S, Liu J, Wu Q, Li X. Modeling and insights into molecular basis of low molecular weight respiratory sensitizers. Mol Divers 2020; 25:847-859. [PMID: 32166484 DOI: 10.1007/s11030-020-10069-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/25/2019] [Accepted: 03/03/2020] [Indexed: 01/10/2023]
Abstract
Respiratory sensitization is considered an important toxicological endpoint because of the severe risk it poses to human health. In past decades, a large share of sensitization events was caused by low molecular weight (< 1000) respiratory sensitizers. However, there is currently no widely accepted test method that can identify prospective low molecular weight respiratory sensitizers. Herein, we modeled and examined the molecular basis of low molecular weight respiratory sensitizers using a high-quality data set containing 136 respiratory sensitizers and 518 nonsensitizers. We built a number of classification models using OCHEM tools, and a consensus model was developed based on the ten best individual models. The consensus model showed good predictive ability, with balanced accuracies of 0.78 and 0.85 on fivefold cross-validation and external validation, respectively. Readers can predict the respiratory sensitization of organic compounds via https://ochem.eu/article/114857. The effect of several molecular properties on respiratory sensitization was also evaluated. The results indicated that these properties differ significantly between respiratory sensitizers and nonsensitizers. Furthermore, 14 privileged substructures responsible for respiratory sensitization were identified. We hope the models and findings will be useful for environmental risk assessment.
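The balanced accuracy quoted in this abstract is a standard metric for imbalanced classes such as 136 sensitizers versus 518 nonsensitizers. The sketch below shows how it is commonly computed for binary labels; the function name and the toy labels are illustrative assumptions and are not taken from the cited study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical toy labels: 2 positives, 4 negatives.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
score = balanced_accuracy(y_true, y_pred)
```

Because each class contributes equally, a model that labels everything as the majority class scores only 0.5, which is why the metric is often preferred over plain accuracy for data sets like this one.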
Affiliation(s)
- Xueyan Cui
- Department of Clinical pharmacy, Shandong Provincial Qianfoshan Hospital, Shandong University, Jinan, 250014, China
- Rui Yang
- Department of Clinical pharmacy, Shandong Provincial Qianfoshan Hospital, Shandong University, Jinan, 250014, China
- Siwen Li
- Department of Clinical pharmacy, Shandong Provincial Qianfoshan Hospital, Shandong University, Jinan, 250014, China
- Juan Liu
- Department of Clinical pharmacy, Shandong Provincial Qianfoshan Hospital, Shandong University, Jinan, 250014, China
- Qiuyun Wu
- Department of Clinical pharmacy, Shandong Provincial Qianfoshan Hospital, Shandong University, Jinan, 250014, China
- Xiao Li
- Department of Clinical pharmacy, Shandong Provincial Qianfoshan Hospital, Shandong University, Jinan, 250014, China; Department of Clinical pharmacy, The First Affiliated Hospital of Shandong First Medical University, Shandong First Medical University, Jinan, 250014, China
|
21
|
Shin HK. Electron configuration-based neural network model to predict physicochemical properties of inorganic compounds. RSC Adv 2020; 10:33268-33278. [PMID: 35515036 PMCID: PMC9056678 DOI: 10.1039/d0ra05873d] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 07/06/2020] [Accepted: 09/01/2020] [Indexed: 11/21/2022]
Abstract
Registration, evaluation, and authorization of chemicals (REACH), the regulation of chemicals in use, imposes the characterization and report of the physicochemical properties of compounds. To cope with the financial burden of the experiments, the use of computational models is permitted for prediction of properties. Although a number of physicochemical property prediction models have been developed, their applicability domain is limited to organic molecules since most available data are concerned with organic molecules, and most of the molecular descriptors are restricted to organic molecule calculations. Prediction models developed for inorganic compounds were intended to predict endpoints relevant to novel material design. Therefore, no models were available for predicting endpoints of inorganic compounds that are significant to regulatory perspectives. In this study, boiling point, water solubility, melting point, and pyrolysis point prediction models were developed for inorganic compounds based on their composition. The electron configuration of each element in the molecule was used as a descriptor in this study. The dataset covered a wide range of endpoints and diverse elements in their structure. The performance of the models was measured using R², mean absolute error, and Spearman's correlation coefficient, and indicated good prediction accuracy of continuous endpoints and prioritization of inorganic compounds.
Affiliation(s)
- Hyun Kil Shin
- Toxicoinformatics Group, Department of Predictive Toxicology, Korea Institute of Toxicology, Daejeon, Republic of Korea
|
22
|
Modeling Physico-Chemical ADMET Endpoints with Multitask Graph Convolutional Networks. Molecules 2019; 25:molecules25010044. [PMID: 31877719 PMCID: PMC6982787 DOI: 10.3390/molecules25010044] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Received: 12/01/2019] [Revised: 12/19/2019] [Accepted: 12/20/2019] [Indexed: 11/19/2022]
Abstract
Simple physico-chemical properties, like logD, solubility, or melting point, can reveal a great deal about how a compound under development might later behave. These data are typically measured for most compounds in drug discovery projects in a medium throughput fashion. Collecting and assembling all the Bayer in-house data related to these properties allowed us to apply powerful machine learning techniques to predict the outcome of those assays for new compounds. In this paper, we report our finding that, especially for predicting physicochemical ADMET endpoints, a multitask graph convolutional approach appears a highly competitive choice. For seven endpoints of interest, we compared the performance of that approach to fully connected neural networks and different single task models. The new model shows increased predictive performance compared to previous modeling methods and will allow early prioritization of compounds even before they are synthesized. In addition, our model follows the generalized solubility equation without being explicitly trained under this constraint.
|
23
|
Tarasova OA, Biziukova NY, Filimonov DA, Poroikov VV, Nicklaus MC. Data Mining Approach for Extraction of Useful Information About Biologically Active Compounds from Publications. J Chem Inf Model 2019; 59:3635-3644. [PMID: 31453694 DOI: 10.1021/acs.jcim.9b00164] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 02/07/2023]
Abstract
A lot of high quality data on the biological activity of chemical compounds are required throughout the whole drug discovery process: from development of computational models of the structure-activity relationship to experimental testing of lead compounds and their validation in clinics. Currently, a large amount of such data is available from databases, scientific publications, and patents. Biological data are characterized by incompleteness, uncertainty, and low reproducibility. Despite the existence of free and commercially available databases of biological activities of compounds, they usually lack unambiguous information about peculiarities of biological assays. On the other hand, scientific papers are the primary source of new data disclosed to the scientific community for the first time. In this study, we have developed and validated a data-mining approach for extraction of text fragments containing description of bioassays. We have used this approach to evaluate compounds and their biological activity reported in scientific publications. We have found that categorization of papers into relevant and irrelevant may be performed based on the machine-learning analysis of the abstracts. Text fragments extracted from the full texts of publications allow their further partitioning into several classes according to the peculiarities of bioassays. We demonstrate the applicability of our approach to the comparison of the endpoint values of biological activity and cytotoxicity of reference compounds.
Affiliation(s)
- Olga A Tarasova
- Department of Bioinformatics, Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, Moscow 119121, Russia
- Nadezhda Yu Biziukova
- Department of Bioinformatics, Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, Moscow 119121, Russia
- Dmitry A Filimonov
- Department of Bioinformatics, Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, Moscow 119121, Russia
- Vladimir V Poroikov
- Department of Bioinformatics, Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, Moscow 119121, Russia
- Marc C Nicklaus
- Computer-Aided Drug Design Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, Frederick, Maryland 21702, United States
|
24
|
Dalavitsou A, Vasiliadis A, Mordos MD, Kouskoura MG, Markopoulou CK. Analytes’ Structure and Signal Response in Evaporating Light Scattering Detector (ELSD). CURR ANAL CHEM 2019. [DOI: 10.2174/1573411014666180330161557] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Background: Working with an Evaporative Light Scattering Detector (ELSD), the target components are converted to a suspension of particles in a gas phase by a nebulizer and heated while the mobile phase is evaporated. Incident light is then directed at the remaining particles, and the scattered light is detected.
Methods: The signal response of an ELS detector is studied through the correlation of the signal intensity of 65 compounds (at 30, 45 and 80°C) with their structural and physicochemical characteristics. Therefore, 67 physicochemical properties as well as structural features of the analytes were inserted as X variables and they were studied in correlation with their signal intensity (Y variable).
Results: The collected data were statistically processed with the use of the partial least squares method. The results showed that several properties mainly affected the signal intensity, either increasing or decreasing the response.
Conclusion: The derived results showed that properties related to vapor pressure, size, density, and melting and boiling point of the analytes were responsible for changes in the signal intensity. The light detected was also affected by properties relevant to the ability of a molecule to form hydrogen bonds (HBA and HBD) and its polarizability or refractivity, but to a lesser extent.
Affiliation(s)
- Antonia Dalavitsou
- Laboratory of Pharmaceutical Analysis, Department of Pharmaceutical Technology, School of Pharmacy, Faculty of Health Sciences, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
- Alexandros Vasiliadis
- Laboratory of Pharmaceutical Analysis, Department of Pharmaceutical Technology, School of Pharmacy, Faculty of Health Sciences, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
- Michail D. Mordos
- Laboratory of Pharmaceutical Analysis, Department of Pharmaceutical Technology, School of Pharmacy, Faculty of Health Sciences, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
- Maria G. Kouskoura
- Laboratory of Pharmaceutical Analysis, Department of Pharmaceutical Technology, School of Pharmacy, Faculty of Health Sciences, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
- Catherine K. Markopoulou
- Laboratory of Pharmaceutical Analysis, Department of Pharmaceutical Technology, School of Pharmacy, Faculty of Health Sciences, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
|
25
|
Marcou G, Flamme B, Beck G, Chagnes A, Mokshyna O, Horvath D, Varnek A. In silico Design, Virtual Screening and Synthesis of Novel Electrolytic Solvents. Mol Inform 2019; 38:e1900014. [DOI: 10.1002/minf.201900014] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 01/28/2019] [Accepted: 05/07/2019] [Indexed: 11/11/2022]
Affiliation(s)
- G. Marcou
- Faculty of Chemistry – UMR7140, University of Strasbourg, 4 rue Blaise Pascal, 67000 Strasbourg, France
- B. Flamme
- Ecole Nationale Supérieure de Chimie de Paris, 11 Rue Pierre et Marie Curie, 75005 Paris, France
- G. Beck
- Faculty of Chemistry – UMR7140, University of Strasbourg, 4 rue Blaise Pascal, 67000 Strasbourg, France
- A. Chagnes
- Université de Lorraine, CNRS, GeoRessources, F-54000 Nancy, France
- O. Mokshyna
- Faculty of Chemistry – UMR7140, University of Strasbourg, 4 rue Blaise Pascal, 67000 Strasbourg, France
- D. Horvath
- Faculty of Chemistry – UMR7140, University of Strasbourg, 4 rue Blaise Pascal, 67000 Strasbourg, France
- A. Varnek
- Faculty of Chemistry – UMR7140, University of Strasbourg, 4 rue Blaise Pascal, 67000 Strasbourg, France
|
26
|
Cardoso‐Silva J, Papadatos G, Papageorgiou LG, Tsoka S. Optimal Piecewise Linear Regression Algorithm for QSAR Modelling. Mol Inform 2018; 38:e1800028. [DOI: 10.1002/minf.201800028] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 03/06/2018] [Accepted: 08/02/2018] [Indexed: 12/20/2022]
Affiliation(s)
- Jonathan Cardoso‐Silva
- Department of Informatics, Faculty of Natural and Mathematical Sciences, King's College London, Bush House, London WC2B 4BG, UK
- George Papadatos
- European Molecular Biology Laboratory – European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- GlaxoSmithKline, Gunnels Wood Road, Stevenage, Hertfordshire SG1 2NY, UK
- Lazaros G. Papageorgiou
- Centre for Process Systems Engineering, Department of Chemical Engineering, University College London, Torrington Place, London WC1E 7JE, UK
- Sophia Tsoka
- Department of Informatics, Faculty of Natural and Mathematical Sciences, King's College London, Bush House, London WC2B 4BG, UK
|
27
|
Ghosh D, Koch U, Hadian K, Sattler M, Tetko IV. Luciferase Advisor: High-Accuracy Model To Flag False Positive Hits in Luciferase HTS Assays. J Chem Inf Model 2018; 58:933-942. [PMID: 29667823 DOI: 10.1021/acs.jcim.7b00574] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Indexed: 01/26/2023]
Abstract
Firefly luciferase is an enzyme that has found ubiquitous use in biological assays in high-throughput screening (HTS) campaigns. The inhibition of luciferase in such assays could lead to a false positive result. This issue has been known for a long time, and there have been significant efforts to identify luciferase inhibitors in order to enhance recognition of false positives in screening assays. However, although a large amount of publicly accessible luciferase counterscreen data is available, to date little effort has been devoted to building a chemoinformatic model that can identify such molecules in a given data set. In this study we developed models to identify these molecules using various methods, such as molecular docking, SMARTS screening, pharmacophores, and machine learning methods. Among the structure-based methods, the pharmacophore-based method showed promising results, with a balanced accuracy of 74.2%. However, machine-learning approaches using associative neural networks outperformed all of the other methods explored, producing a final model with a balanced accuracy of 89.7%. The high predictive accuracy of this model is expected to be useful for advising which compounds are potential luciferase inhibitors present in luciferase HTS assays. The models developed in this work are freely available at the OCHEM platform at http://ochem.eu.
Affiliation(s)
- Dipan Ghosh
- Institute of Structural Biology, Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH), Ingolstaedter Landstrasse 1, 85764 Neuherberg, Germany
- Uwe Koch
- Lead Discovery Center GmbH, Otto-Hahn-Straße 15, 44227 Dortmund, Germany
- Kamyar Hadian
- Assay Development and Screening Platform, Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH), Ingolstaedter Landstrasse 1, 85764 Neuherberg, Germany
- Michael Sattler
- Bayerisches NMR-Zentrum, Department of Chemistry, Technical University of Munich, Ernst-Otto-Fischer-Straße 2, 85747 Garching, Germany
- Igor V Tetko
- Institute of Structural Biology, Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH), Ingolstaedter Landstrasse 1, 85764 Neuherberg, Germany; BIGCHEM GmbH, Ingolstaedter Landstrasse 1 b. 60w, 85764 Neuherberg, Germany
|
28
|
Bergström CAS, Larsson P. Computational prediction of drug solubility in water-based systems: Qualitative and quantitative approaches used in the current drug discovery and development setting. Int J Pharm 2018; 540:185-193. [PMID: 29421301 PMCID: PMC5861307 DOI: 10.1016/j.ijpharm.2018.01.044] [Citation(s) in RCA: 108] [Impact Index Per Article: 18.0] [Received: 11/10/2017] [Revised: 01/20/2018] [Accepted: 01/22/2018] [Indexed: 01/18/2023]
Abstract
In this review we will discuss recent advances in computational prediction of solubility in water-based solvents. Our focus is set on recent advances in predictions of biorelevant solubility in media mimicking the human intestinal fluids and on new methods to predict the thermodynamic cycle rather than prediction of solubility in pure water through quantitative structure property relationships (QSPR). While the literature is rich in QSPR models for both solubility and melting point, a physicochemical property strongly linked to the solubility, recent advances in the modelling of these properties make use of theory and computational simulations to better predict these properties or processes involved therein (e.g. solid state crystal lattice packing, dissociation of molecules from the lattice and solvation). This review serves to provide an update on these new approaches and how they can be used to more accurately predict solubility, and also importantly, inform us on molecular interactions and processes occurring during drug dissolution and solubilisation.
Affiliation(s)
- Christel A S Bergström
- Department of Pharmacy, Uppsala University, Biomedical Centre P.O. Box 580, SE-751 23 Uppsala, Sweden
- Per Larsson
- Department of Pharmacy, Uppsala University, Biomedical Centre P.O. Box 580, SE-751 23 Uppsala, Sweden
|
29
|
Tauler R, Parastar H. Big (Bio)Chemical Data Mining Using Chemometric Methods: A Need for Chemists. Angew Chem Int Ed Engl 2018; 61:e201801134. [DOI: 10.1002/anie.201801134] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 01/27/2018] [Indexed: 11/08/2022]
Affiliation(s)
- Roma Tauler
- IDAEA-CSIC, Environmental Chemistry, Jordi Girona 18-26, 08034 Barcelona, Spain
|
30
|
Withnall M, Chen H, Tetko IV. Matched Molecular Pair Analysis on Large Melting Point Datasets: A Big Data Perspective. ChemMedChem 2018; 13:599-606. [PMID: 28650584 PMCID: PMC5900986 DOI: 10.1002/cmdc.201700303] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 05/18/2017] [Revised: 06/26/2017] [Indexed: 11/11/2022]
Abstract
A matched molecular pair (MMP) analysis was used to examine the change in melting point (MP) between pairs of similar molecules in a set of ∼275k compounds. We found many cases in which the change in MP (ΔMP) of compounds correlates with changes in functional groups. In line with the results of a previous study, correlations between ΔMP and simple molecular descriptors, such as the number of hydrogen bond donors, were identified. In using a larger dataset, covering a wider chemical space and range of melting points, we observed that this method remains stable and scales well with larger datasets. This MMP-based method could find use as a simple privacy-preserving technique to analyze large proprietary databases and share findings between participating research groups.
Affiliation(s)
- Michael Withnall
- Helmholtz Zentrum München - German Research Center for Environmental Health, GmbH, Institute of Structural Biology, Neuherberg, Germany
- Hongming Chen
- External Sciences, Discovery Sciences, Innovative Medicines and Early Development Biotech Unit, AstraZeneca R&D Gothenburg, Mölndal 43183, Sweden
- Igor V. Tetko
- Helmholtz Zentrum München - German Research Center for Environmental Health, GmbH, Institute of Structural Biology, Neuherberg, Germany
- BIGCHEM GmbH, Ingolstädter Landstraße 1, b. 60w, 85764 Neuherberg, Germany
- Institute of Structural Biology, Helmholtz Zentrum München - German Research Center for Environmental Health, GmbH, Ingolstädter Landstraße 1, 85764 Neuherberg, Germany
|
31
|
Mansouri K, Grulke CM, Judson RS, Williams AJ. OPERA models for predicting physicochemical properties and environmental fate endpoints. J Cheminform 2018. [PMID: 29520515 PMCID: PMC5843579 DOI: 10.1186/s13321-018-0263-1] [Citation(s) in RCA: 267] [Impact Index Per Article: 44.5] [Indexed: 12/24/2022]
Abstract
The collection of chemical structure information and associated experimental data for quantitative structure–activity/property relationship (QSAR/QSPR) modeling is facilitated by an increasing number of public databases containing large amounts of useful data. However, the performance of QSAR models highly depends on the quality of the data and modeling methodology used. This study aims to develop robust QSAR/QSPR models for chemical properties of environmental interest that can be used for regulatory purposes. This study primarily uses data from the publicly available PHYSPROP database consisting of a set of 13 common physicochemical and environmental fate properties. These datasets have undergone extensive curation using an automated workflow to select only high-quality data, and the chemical structures were standardized prior to calculation of the molecular descriptors. The modeling procedure was developed based on the five Organization for Economic Cooperation and Development (OECD) principles for QSAR models. A weighted k-nearest neighbor approach was adopted using a minimum number of required descriptors calculated using PaDEL, an open-source software. The genetic algorithms selected only the most pertinent and mechanistically interpretable descriptors (2–15, with an average of 11 descriptors). The sizes of the modeled datasets varied from 150 chemicals for biodegradability half-life to 14,050 chemicals for logP, with an average of 3222 chemicals across all endpoints. The optimal models were built on randomly selected training sets (75%) and validated using fivefold cross-validation (CV) and test sets (25%). The CV Q2 of the models varied from 0.72 to 0.95, with an average of 0.86 and an R2 test value from 0.71 to 0.96, with an average of 0.82. Modeling and performance details are described in QSAR model reporting format and were validated by the European Commission’s Joint Research Center to be OECD compliant. 
All models are freely available as an open-source, command-line application called OPEn structure–activity/property Relationship App (OPERA). OPERA models were applied to more than 750,000 chemicals to produce freely available predicted data on the U.S. Environmental Protection Agency’s CompTox Chemistry Dashboard.
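The abstract names a weighted k-nearest neighbor approach but does not state the weighting scheme. The sketch below assumes inverse-distance weighting over Euclidean distances, which is one common choice; the function name, the `eps` guard, and the toy data are hypothetical and do not reproduce the OPERA implementation.

```python
import math


def weighted_knn_predict(train_X, train_y, query, k=3, eps=1e-9):
    """Predict a property as the inverse-distance-weighted mean of the k nearest neighbors.

    Assumption: Euclidean distance in descriptor space and 1/(d + eps) weights.
    """
    if k < 1 or k > len(train_X):
        raise ValueError("k must be between 1 and the number of training points")
    # Pair each training point's distance to the query with its property value.
    neighbors = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]
    weights = [1.0 / (d + eps) for d, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)


# Hypothetical one-descriptor data set: two close neighbors and one distant outlier.
train_X = [(0.0,), (2.0,), (10.0,)]
train_y = [1.0, 3.0, 100.0]
prediction = weighted_knn_predict(train_X, train_y, query=(1.0,), k=2)
```

With `k=2` the distant point is ignored and the two equidistant neighbors contribute equally, so closer training chemicals dominate the estimate.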
Affiliation(s)
- Kamel Mansouri
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA; Oak Ridge Institute for Science and Education, 1299 Bethel Valley Road, Oak Ridge, TN, 37830, USA; ScitoVation LLC, 6 Davis Drive, Research Triangle Park, NC, 27709, USA
- Chris M Grulke
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
- Richard S Judson
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
- Antony J Williams
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
|
32
|
Tebes-Stevens C, Patel JM, Koopmans M, Olmstead J, Hilal SH, Pope N, Weber EJ, Wolfe K. Demonstration of a consensus approach for the calculation of physicochemical properties required for environmental fate assessments. CHEMOSPHERE 2018; 194:94-106. [PMID: 29197820 PMCID: PMC6146973 DOI: 10.1016/j.chemosphere.2017.11.137] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 10/05/2017] [Revised: 11/21/2017] [Accepted: 11/22/2017] [Indexed: 05/21/2023]
Abstract
Eight software applications are compared for their performance in estimating the octanol-water partition coefficient (Kow), melting point, vapor pressure and water solubility for a dataset of polychlorinated biphenyls, polybrominated diphenyl ethers, polychlorinated dibenzodioxins, and polycyclic aromatic hydrocarbons. The predicted property values are compared against a curated dataset of measured property values compiled from the scientific literature with careful consideration given to the analytical methods used for property measurements of these hydrophobic chemicals. The variability in the predicted values from different calculators generally increases for higher values of Kow and melting point and for lower values of water solubility and vapor pressure. For each property, no individual calculator outperforms the others for all four of the chemical classes included in the analysis. Because calculator performance varies based on chemical class and property value, the geometric mean and the median of the calculated values from multiple calculators that use different estimation algorithms are recommended as more reliable estimates of the property value than the value from any single calculator.
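The consensus recommendation in this abstract (geometric mean and median across calculators) can be sketched in a few lines. The function name and the example values are hypothetical; the sketch also assumes strictly positive inputs, as for a solubility or vapor pressure on a linear scale, since the geometric mean is undefined otherwise.

```python
import math
import statistics


def consensus_estimates(predictions):
    """Return (geometric mean, median) of property values predicted by several calculators.

    Assumption: all values are strictly positive (linear-scale quantities).
    """
    values = list(predictions)
    if not values:
        raise ValueError("at least one prediction is required")
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive values")
    # Geometric mean computed in log space for numerical stability.
    geo_mean = math.exp(sum(math.log(v) for v in values) / len(values))
    return geo_mean, statistics.median(values)


# Hypothetical water-solubility estimates (mg/L) from three different calculators.
geo_mean, median = consensus_estimates([1.0, 10.0, 100.0])
```

Both statistics damp the influence of a single outlying calculator, which matters when estimates span orders of magnitude.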
Affiliation(s)
- Caroline Tebes-Stevens
- U.S. Environmental Protection Agency, National Exposure Research Laboratory, Athens, GA 30605, United States
- Jay M Patel
- ORISE Fellow, U.S. Environmental Protection Agency, National Exposure Research Laboratory, Athens, GA 30605, United States
- Michaela Koopmans
- ORAU, U.S. Environmental Protection Agency, National Exposure Research Laboratory, Athens, GA 30605, United States
- John Olmstead
- ORAU, U.S. Environmental Protection Agency, National Exposure Research Laboratory, Athens, GA 30605, United States
- Said H Hilal
- U.S. Environmental Protection Agency, National Exposure Research Laboratory, Athens, GA 30605, United States
- Nick Pope
- Independent Contractor, Hildebran, NC, United States
- Eric J Weber
- U.S. Environmental Protection Agency, National Exposure Research Laboratory, Athens, GA 30605, United States
- Kurt Wolfe
- U.S. Environmental Protection Agency, National Exposure Research Laboratory, Athens, GA 30605, United States
|
33
|
Coley CW, Barzilay R, Green WH, Jaakkola TS, Jensen KF. Convolutional Embedding of Attributed Molecular Graphs for Physical Property Prediction. J Chem Inf Model 2017; 57:1757-1772. [PMID: 28696688 DOI: 10.1021/acs.jcim.6b00601] [Citation(s) in RCA: 220] [Impact Index Per Article: 31.4] [Indexed: 01/29/2023]
Abstract
The task of learning an expressive molecular representation is central to developing quantitative structure-activity and property relationships. Traditional approaches rely on group additivity rules, empirical measurements or parameters, or generation of thousands of descriptors. In this paper, we employ a convolutional neural network for this embedding task by treating molecules as undirected graphs with attributed nodes and edges. Simple atom and bond attributes are used to construct atom-specific feature vectors that take into account the local chemical environment using different neighborhood radii. By working directly with the full molecular graph, there is a greater opportunity for models to identify important features relevant to a prediction task. Unlike other graph-based approaches, our atom featurization preserves molecule-level spatial information that significantly enhances model performance. Our models learn to identify important features of atom clusters for the prediction of aqueous solubility, octanol solubility, melting point, and toxicity. Extensions and limitations of this strategy are discussed.
Collapse
Affiliation(s)
- Connor W Coley
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
| | - Regina Barzilay
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
| | - William H Green
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
| | - Tommi S Jaakkola
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
| | - Klavs F Jensen
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
| |
|
34
|
Zang Q, Mansouri K, Williams AJ, Judson RS, Allen DG, Casey WM, Kleinstreuer NC. In Silico Prediction of Physicochemical Properties of Environmental Chemicals Using Molecular Fingerprints and Machine Learning. J Chem Inf Model 2017; 57:36-49. [PMID: 28006899 PMCID: PMC6131700 DOI: 10.1021/acs.jcim.6b00625] [Citation(s) in RCA: 77] [Impact Index Per Article: 11.0] [Indexed: 12/20/2022]
Abstract
There are little available toxicity data on the vast majority of chemicals in commerce. High-throughput screening (HTS) studies, such as those being carried out by the U.S. Environmental Protection Agency (EPA) ToxCast program in partnership with the federal Tox21 research program, can generate biological data to inform models for predicting potential toxicity. However, physicochemical properties are also needed to model environmental fate and transport, as well as exposure potential. The purpose of the present study was to generate an open-source quantitative structure-property relationship (QSPR) workflow to predict a variety of physicochemical properties that would have cross-platform compatibility to integrate into existing cheminformatics workflows. In this effort, decades-old experimental property data sets available within the EPA EPI Suite were reanalyzed using modern cheminformatics workflows to develop updated QSPR models capable of supplying computationally efficient, open, and transparent HTS property predictions in support of environmental modeling efforts. Models were built using updated EPI Suite data sets for the prediction of six physicochemical properties: octanol-water partition coefficient (logP), water solubility (logS), boiling point (BP), melting point (MP), vapor pressure (logVP), and bioconcentration factor (logBCF). The coefficient of determination (R2) between the estimated values and experimental data for the six predicted properties ranged from 0.826 (MP) to 0.965 (BP), with model performance for five of the six properties exceeding those from the original EPI Suite models. The newly derived models can be employed for rapid estimation of physicochemical properties within an open-source HTS workflow to inform fate and toxicity prediction models of environmental chemicals.
Affiliation(s)
- Qingda Zang
- Integrated Laboratory Systems, Inc., Research Triangle Park, NC 27709, USA
| | - Kamel Mansouri
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC 27711, USA
| | - Antony J. Williams
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC 27711, USA
| | - Richard S. Judson
- National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC 27711, USA
| | - David G. Allen
- Integrated Laboratory Systems, Inc., Research Triangle Park, NC 27709, USA
| | - Warren M. Casey
- National Toxicology Program, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709, USA
| | - Nicole C. Kleinstreuer
- National Toxicology Program, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709, USA
| |
|
35
|
Mansouri K, Grulke CM, Richard AM, Judson RS, Williams AJ. An automated curation procedure for addressing chemical errors and inconsistencies in public datasets used in QSAR modelling. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2016; 27:939-965. [PMID: 27885862 DOI: 10.1080/1062936x.2016.1253611] [Citation(s) in RCA: 81] [Impact Index Per Article: 10.1] [Received: 09/03/2016] [Accepted: 10/24/2016] [Indexed: 05/18/2023]
Abstract
The increasing availability of large collections of chemical structures and associated experimental data provides an opportunity to build robust QSAR models for applications in different fields. One common concern is the quality of both the chemical structure information and associated experimental data. Here we describe the development of an automated KNIME workflow to curate and correct errors in the structure and identity of chemicals using the publicly available PHYSPROP physicochemical properties and environmental fate datasets. The workflow first assembles structure-identity pairs using up to four provided chemical identifiers, including chemical name, CASRNs, SMILES, and MolBlock. Problems detected included errors and mismatches in chemical structure formats, identifiers and various structure validation issues, including hypervalency and stereochemistry descriptions. Subsequently, a machine learning procedure was applied to evaluate the impact of this curation process. The performance of QSAR models built on only the highest-quality subset of the original dataset was compared with the larger curated and corrected dataset. The latter showed statistically improved predictive performance. The final workflow was used to curate the full list of PHYSPROP datasets, and is being made publicly available for further usage and integration by the scientific community.
Affiliation(s)
- K Mansouri
- Oak Ridge Institute for Science and Education (ORISE), Oak Ridge, TN, USA
- US Environmental Protection Agency, Office of Research and Development, National Center for Computational Toxicology, Research Triangle Park, NC, USA
| | - C M Grulke
- US Environmental Protection Agency, Office of Research and Development, National Center for Computational Toxicology, Research Triangle Park, NC, USA
| | - A M Richard
- US Environmental Protection Agency, Office of Research and Development, National Center for Computational Toxicology, Research Triangle Park, NC, USA
| | - R S Judson
- US Environmental Protection Agency, Office of Research and Development, National Center for Computational Toxicology, Research Triangle Park, NC, USA
| | - A J Williams
- US Environmental Protection Agency, Office of Research and Development, National Center for Computational Toxicology, Research Triangle Park, NC, USA
| |
|
36
|
Tetko IV, Maran U, Tropsha A. Public (Q)SAR Services, Integrated Modeling Environments, and Model Repositories on the Web: State of the Art and Perspectives for Future Development. Mol Inform 2016; 36. [PMID: 27778468 DOI: 10.1002/minf.201600082] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 06/08/2016] [Accepted: 10/03/2016] [Indexed: 01/08/2023]
Abstract
Thousands of (Quantitative) Structure-Activity Relationships (Q)SAR models have been described in peer-reviewed publications; however, this way of sharing seldom makes models available for the use by the research community outside of the developer's laboratory. Conversely, on-line models allow broad dissemination and application representing the most effective way of sharing the scientific knowledge. Approaches for sharing and providing on-line access to models range from web services created by individual users and laboratories to integrated modeling environments and model repositories. This emerging transition from the descriptive and informative, but "static", and for the most part, non-executable print format to interactive, transparent and functional delivery of "living" models is expected to have a transformative effect on modern experimental research in areas of scientific and regulatory use of (Q)SAR models.
Affiliation(s)
- Igor V Tetko
- Institute of Structural Biology, Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH), Ingolstädter Landstraße 1, D-85764 Neuherberg, Germany; BigChem GmbH, Ingolstädter Landstraße 1, b. 60w, D-85764 Neuherberg, Germany
| | - Uko Maran
- Institute of Chemistry, University of Tartu, Ravila 14A, Tartu, 50411, Estonia
| | - Alexander Tropsha
- Laboratory for Molecular Modeling, Division of Chemical Biology and Medicinal Chemistry, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC, 27599, USA; Butlerov Institute of Chemistry, Kazan Federal University, Kremlyovskaya St. 18, 420008, Kazan, Russia
| |
|
37
|
Swain MC, Cole JM. ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature. J Chem Inf Model 2016; 56:1894-1904. [PMID: 27669338 DOI: 10.1021/acs.jcim.6b00207] [Citation(s) in RCA: 158] [Impact Index Per Article: 19.8] [Indexed: 11/29/2022]
Abstract
The emergence of "big data" initiatives has led to the need for tools that can automatically extract valuable chemical information from large volumes of unstructured data, such as the scientific literature. Since chemical information can be present in figures, tables, and textual paragraphs, successful information extraction often depends on the ability to interpret all of these domains simultaneously. We present a complete toolkit for the automated extraction of chemical entities and their associated properties, measurements, and relationships from scientific documents that can be used to populate structured chemical databases. Our system provides an extensible, chemistry-aware, natural language processing pipeline for tokenization, part-of-speech tagging, named entity recognition, and phrase parsing. Within this scope, we report improved performance for chemical named entity recognition through the use of unsupervised word clustering based on a massive corpus of chemistry articles. For phrase parsing and information extraction, we present the novel use of multiple rule-based grammars that are tailored for interpreting specific document domains such as textual paragraphs, captions, and tables. We also describe document-level processing to resolve data interdependencies and show that this is particularly necessary for the autogeneration of chemical databases since captions and tables commonly contain chemical identifiers and references that are defined elsewhere in the text. The performance of the toolkit to correctly extract various types of data was evaluated, affording an F-score of 93.4%, 86.8%, and 91.5% for extracting chemical identifiers, spectroscopic attributes, and chemical property attributes, respectively; set against the CHEMDNER chemical name extraction challenge, ChemDataExtractor yields a competitive F-score of 87.8%. All tools have been released under the MIT license and are available to download from http://www.chemdataextractor.org .
Affiliation(s)
- Matthew C Swain
- Cavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, U.K.
| | - Jacqueline M Cole
- Cavendish Laboratory, University of Cambridge, J. J. Thomson Avenue, Cambridge, CB3 0HE, U.K.
| |
|
38
|
Does ‘Big Data’ exist in medicinal chemistry, and if so, how can it be harnessed? Future Med Chem 2016; 8:1801-1806. [DOI: 10.4155/fmc-2016-0163] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Indexed: 01/18/2023]
|
39
|
Ekins S. The Next Era: Deep Learning in Pharmaceutical Research. Pharm Res 2016; 33:2594-603. [PMID: 27599991 DOI: 10.1007/s11095-016-2029-7] [Citation(s) in RCA: 99] [Impact Index Per Article: 12.4] [Received: 07/10/2016] [Accepted: 08/23/2016] [Indexed: 01/22/2023]
Abstract
Over the past decade we have witnessed the increasing sophistication of machine learning algorithms applied in daily use from internet searches, voice recognition, social network software to machine vision software in cameras, phones, robots and self-driving cars. Pharmaceutical research has also seen its fair share of machine learning developments. For example, applying such methods to mine the growing datasets that are created in drug discovery not only enables us to learn from the past but to predict a molecule's properties and behavior in future. The latest machine learning algorithm garnering significant attention is deep learning, which is an artificial neural network with multiple hidden layers. Publications over the last 3 years suggest that this algorithm may have advantages over previous machine learning methods and offer a slight but discernable edge in predictive performance. The time has come for a balanced review of this technique but also to apply machine learning methods such as deep learning across a wider array of endpoints relevant to pharmaceutical research for which the datasets are growing such as physicochemical property prediction, formulation prediction, absorption, distribution, metabolism, excretion and toxicity (ADME/Tox), target prediction and skin permeation, etc. We also show that there are many potential applications of deep learning beyond cheminformatics. It will be important to perform prospective testing (which has been carried out rarely to date) in order to convince skeptics that there will be benefits from investing in this technique.
Affiliation(s)
- Sean Ekins
- Collaborations Pharmaceuticals, Inc, 5616 Hilltop Needmore Road, Fuquay-Varina, North Carolina, 27526, USA; Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, California, 94010, USA.
| |
|
40
|
Tetko IV, Engkvist O, Koch U, Reymond JL, Chen H. BIGCHEM: Challenges and Opportunities for Big Data Analysis in Chemistry. Mol Inform 2016; 35:615-621. [PMID: 27464907 PMCID: PMC5129546 DOI: 10.1002/minf.201600073] [Citation(s) in RCA: 68] [Impact Index Per Article: 8.5] [Received: 05/19/2016] [Accepted: 07/06/2016] [Indexed: 01/19/2023]
Abstract
The increasing volume of biomedical data in chemistry and life sciences requires the development of new methods and approaches for their handling. Here, we briefly discuss some challenges and opportunities of this fast growing area of research with a focus on those to be addressed within the BIGCHEM project. The article starts with a brief description of some available resources for “Big Data” in chemistry and a discussion of the importance of data quality. We then discuss challenges with visualization of millions of compounds by combining chemical and biological data, the expectations from mining the “Big Data” using advanced machine‐learning methods, and their applications in polypharmacology prediction and target de‐convolution in phenotypic screening. We show that the efficient exploration of billions of molecules requires the development of smart strategies. We also address the issue of secure information sharing without disclosing chemical structures, which is critical to enable bi‐party or multi‐party data sharing. Data sharing is important in the context of the recent trend of “open innovation” in pharmaceutical industry, which has led to not only more information sharing among academics and pharma industries but also the so‐called “precompetitive” collaboration between pharma companies. At the end we highlight the importance of education in “Big Data” for further progress of this area.
Affiliation(s)
- Igor V Tetko
- Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH), Institute of Structural Biology, Ingolstädter Landstraße 1, b. 60w, D-85764, Neuherberg, Germany; BIGCHEM GmbH, Ingolstädter Landstraße 1, b. 60w, D-85764, Neuherberg, Germany
| | - Ola Engkvist
- Discovery Sciences, AstraZeneca R&D Gothenburg, Pepparedsleden 1, Mölndal, SE-43183, Sweden
| | - Uwe Koch
- Lead Discovery Center GmbH, Otto-Hahn Strasse 15, Dortmund, 44227, Germany
| | - Jean-Louis Reymond
- Department of Chemistry and Biochemistry, University of Bern, Freiestrasse 3, 3012, Bern, Switzerland
| | - Hongming Chen
- Discovery Sciences, AstraZeneca R&D Gothenburg, Pepparedsleden 1, Mölndal, SE-43183, Sweden
| |
|
41
|
Salmina ES, Haider N, Tetko IV. Extended Functional Groups (EFG): An Efficient Set for Chemical Characterization and Structure-Activity Relationship Studies of Chemical Compounds. Molecules 2015; 21:E1. [PMID: 26703557 PMCID: PMC6273096 DOI: 10.3390/molecules21010001] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 10/29/2015] [Revised: 12/09/2015] [Accepted: 12/15/2015] [Indexed: 11/16/2022]
Abstract
The article describes a classification system termed “extended functional groups” (EFG), which are an extension of a set previously used by the CheckMol software, that covers in addition heterocyclic compound classes and periodic table groups. The functional groups are defined as SMARTS patterns and are available as part of the ToxAlerts tool (http://ochem.eu/alerts) of the On-line CHEmical database and Modeling (OCHEM) environment platform. The article describes the motivation and the main ideas behind this extension and demonstrates that EFG can be efficiently used to develop and interpret structure-activity relationship models.
Affiliation(s)
- Elena S Salmina
- Institute for Organic Chemistry, Technical University Bergakademie Freiberg, Leipziger Str. 29, Freiberg D-09596, Germany.
| | - Norbert Haider
- Department of Pharmaceutical Chemistry, University of Vienna, Althanstraße 14, Vienna A-1090, Austria.
| | - Igor V Tetko
- Institute of Structural Biology, Helmholtz Zentrum München-Research Center for Environmental Health (GmbH), Ingolstädter Landstraße 1, b. 60w, Neuherberg D-85764, Germany.
- BigChem GmbH, Ingolstädter Landstraße 1, b. 60w, Neuherberg D-85764, Germany.
| |
|