1
Isaev VV, Minenkov Y. Comparative study of various molecular feature representations for solvation free energy predictions of neutral species. J Mol Graph Model 2025; 134:108901. [PMID: 39515275 DOI: 10.1016/j.jmgm.2024.108901] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2023] [Revised: 10/13/2024] [Accepted: 10/31/2024] [Indexed: 11/16/2024]
Abstract
Predicting molecular properties with the help of neural networks is a common way to substitute or enhance comprehensive quantum-chemical calculations. One of the problems facing researchers is the choice of vectorization approach for representing the solvent and the solute for the estimator model. In this work, 10 different approaches have been investigated for both organic solute and solvent, including vectorizers that rely on macroscopic parameters, functional group classification, molecular graphs, or atomic coordinates. A variation of the Bag of Bonds approach called JustBonds, trained on the MNSol database, showed the best overall performance, resulting in RMSD <2 kcal/mol for the blind dataset, which contains solutes not present in the training subset, and <1 kcal/mol on records from the Solv@TUM database, which is close to contemporary continuum models. We have also demonstrated that the most important bags usually contain heteroatoms and play a key role in the solvation process. Furthermore, solvent vectorization was shown to play only a small role: approaches based on functional groups or macroscopic solvent parameters are often enough to represent the solvent medium efficiently.
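For orientation, the classic Bag of Bonds idea that JustBonds is described as a variation of can be sketched in a few lines. This is a minimal illustration of the general technique only, not the authors' JustBonds vectorizer, whose details the abstract does not give. The function name `bag_of_bonds`, the `bag_sizes` argument, and the approximate water geometry are all assumptions made for the example.

```python
import math
from collections import defaultdict
from itertools import combinations


def bag_of_bonds(symbols, charges, coords, bag_sizes):
    """Classic Bag of Bonds sketch (hypothetical helper, not JustBonds).

    Each atom pair contributes a Coulomb-like term Z_i * Z_j / r_ij.
    Terms are grouped ("bagged") by element pair, sorted in descending
    order, and zero-padded to a fixed bag size so that every molecule
    maps to a vector of the same length.
    """
    bags = defaultdict(list)
    for i, j in combinations(range(len(symbols)), 2):
        key = "-".join(sorted((symbols[i], symbols[j])))
        r = math.dist(coords[i], coords[j])
        bags[key].append(charges[i] * charges[j] / r)

    vector = []
    for key in sorted(bag_sizes):
        values = sorted(bags.get(key, []), reverse=True)
        if len(values) > bag_sizes[key]:
            raise ValueError(f"bag {key!r} is too small for this molecule")
        values += [0.0] * (bag_sizes[key] - len(values))
        vector.extend(values)
    return vector


# Illustrative water-like geometry in angstroms (approximate, for the example only).
symbols = ["O", "H", "H"]
charges = [8, 1, 1]
coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
vec = bag_of_bonds(symbols, charges, coords, {"H-H": 1, "H-O": 2})
```

The fixed bag sizes are what make molecules of different sizes comparable; in practice they would be chosen from the largest molecule in the training set.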
Affiliation(s)
- Valerii V Isaev
  - Lomonosov Moscow State University, Leninskie gory 1 bld. 3, 119991, Moscow, Russia; N.N. Semenov Federal Research Center for Chemical Physics, Kosygina Street 4, 119991, Moscow, Russia
- Yury Minenkov
  - N.N. Semenov Federal Research Center for Chemical Physics, Kosygina Street 4, 119991, Moscow, Russia

2
Xie J, Liu S, Su L, Zhao X, Wang Y, Tan F. Elucidating per- and polyfluoroalkyl substances (PFASs) soil-water partitioning behavior through explainable machine learning models. Sci Total Environ 2024; 954:176575. [PMID: 39343411 DOI: 10.1016/j.scitotenv.2024.176575] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2024] [Revised: 09/15/2024] [Accepted: 09/26/2024] [Indexed: 10/01/2024]
Abstract
In this study, an optimized random forest (RF) model was employed to better understand the soil-water partitioning behavior of per- and polyfluoroalkyl substances (PFASs). The model demonstrated strong predictive performance, achieving an R² of 0.93 and an RMSE of 0.86. Moreover, it required only 11 easily obtainable features, with molecular weight and soil pH being the predominant factors. Three-dimensional interaction analyses identified specific conditions associated with varying soil-water partitioning coefficients (Kd). Results showed that soils with high organic carbon (OC) content, high cation exchange capacity (CEC), and lower soil pH, especially when combined with PFASs of higher molecular weight, were linked to higher Kd values, indicating stronger adsorption. Conversely, low Kd values (< 2.8 L/kg) were typically observed in soils with higher pH (8.0) but lower CEC (8 cmol+/kg) and lower OC content (1%), combined with PFASs of lower molecular weight (380 g/mol), suggesting weaker adsorption and a heightened potential for environmental migration. Furthermore, the model was used to predict Kd values for 142 novel PFASs in diverse soil conditions. Our research provides essential insights into the factors governing PFAS partitioning in soil and highlights the significant role of machine learning models in enhancing the understanding of the environmental distribution and migration of PFASs.
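The two metrics quoted in the abstract above, R² and RMSE, follow standard definitions, which a short sketch can make concrete. The data values below are invented purely for illustration and are unrelated to the study; the function names are likewise hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error: sqrt(mean((y - y_hat)^2))."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up observed and predicted values, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(rmse(y_true, y_pred), r_squared(y_true, y_pred))
```

RMSE carries the units of the predicted quantity, whereas R² is unitless, so the two are usually reported together.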
Affiliation(s)
- Jiaxing Xie
  - Key Laboratory of Industrial Ecology and Environmental Engineering (MOE), School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
- Shun Liu
  - Key Laboratory of Industrial Ecology and Environmental Engineering (MOE), School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
- Lihao Su
  - Key Laboratory of Industrial Ecology and Environmental Engineering (MOE), School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
- Xinting Zhao
  - Key Laboratory of Industrial Ecology and Environmental Engineering (MOE), School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
- Yan Wang
  - Key Laboratory of Industrial Ecology and Environmental Engineering (MOE), School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China
- Feng Tan
  - Key Laboratory of Industrial Ecology and Environmental Engineering (MOE), School of Environmental Science and Technology, Dalian University of Technology, Dalian 116024, China

3
Welsch M, Hirte S, Kirchmair J. Deciphering Molecular Embeddings with Centered Kernel Alignment. J Chem Inf Model 2024; 64:7303-7312. [PMID: 39321215 PMCID: PMC11482110 DOI: 10.1021/acs.jcim.4c00837] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2024] [Revised: 07/24/2024] [Accepted: 09/06/2024] [Indexed: 09/27/2024]
Abstract
Analyzing machine learning models, especially nonlinear ones, poses significant challenges. In this context, centered kernel alignment (CKA) has emerged as a promising model analysis tool that assesses the similarity between two embeddings. CKA's efficacy depends on selecting a kernel that adequately captures the underlying properties of the compared models. The model analysis tool was designed for neural networks (NNs) with their invariance to data rotation in mind and has been successfully employed in various scientific domains. However, CKA has rarely been adopted in cheminformatics, partly because of the popularity of the random forest (RF) machine learning algorithm, which is not rotationally invariant. In this work, we present the adaptation of CKA that builds on the RF kernel to match the properties of RF. As part of the method validation, we show that the model analysis method is well-correlated with the prediction similarity of RF models. Furthermore, we demonstrate how CKA with the RF kernel can be utilized to analyze and explain the behavior of RF models derived from molecular and rooted fingerprints.
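As a rough illustration of what CKA computes, the sketch below evaluates kernel CKA on small Gram matrices: both matrices are centered and their normalized Frobenius inner product is taken. A linear kernel is used as a stand-in here; the RF kernel proposed in the paper is not reproduced, and all function names and data points are assumptions made for the example.

```python
import math


def linear_kernel(X):
    """Gram matrix of dot products between all rows of X."""
    return [[sum(a * b for a, b in zip(x, y)) for y in X] for x in X]


def center(K):
    """Double-center a Gram matrix, i.e. compute H K H with H = I - 11^T / n."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)]
            for i in range(n)]


def frobenius_inner(A, B):
    """Sum of element-wise products of two equally sized matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def cka(K, L):
    """Centered kernel alignment between two Gram matrices over the same samples."""
    Kc, Lc = center(K), center(L)
    return frobenius_inner(Kc, Lc) / math.sqrt(
        frobenius_inner(Kc, Kc) * frobenius_inner(Lc, Lc)
    )


# Toy 2-D embedding, a 90-degree rotation of it, and a different 1-D embedding.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]]
X_rot = [[-y, x] for x, y in X]
X_other = [[x + y] for x, y in X]

K = linear_kernel(X)
K_rot = linear_kernel(X_rot)
K_other = linear_kernel(X_other)
```

With the linear kernel, the rotated embedding yields the same Gram matrix, so its CKA with the original is 1. This is the rotation invariance that the abstract notes suits neural networks but not random forests.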
Affiliation(s)
- Matthias Welsch
  - Department of Pharmaceutical Sciences, Division of Pharmaceutical Chemistry, Faculty of Life Sciences, University of Vienna, Josef-Holaubek-Platz 2, Vienna 1090, Austria
  - Christian Doppler Laboratory for Molecular Informatics in the Biosciences, Department for Pharmaceutical Sciences, University of Vienna, Vienna 1090, Austria
  - Vienna Doctoral School of Pharmaceutical, Nutritional and Sport Sciences (PhaNuSpo), University of Vienna, Vienna 1090, Austria
- Steffen Hirte
  - Department of Pharmaceutical Sciences, Division of Pharmaceutical Chemistry, Faculty of Life Sciences, University of Vienna, Josef-Holaubek-Platz 2, Vienna 1090, Austria
  - Vienna Doctoral School of Pharmaceutical, Nutritional and Sport Sciences (PhaNuSpo), University of Vienna, Vienna 1090, Austria
- Johannes Kirchmair
  - Department of Pharmaceutical Sciences, Division of Pharmaceutical Chemistry, Faculty of Life Sciences, University of Vienna, Josef-Holaubek-Platz 2, Vienna 1090, Austria
  - Christian Doppler Laboratory for Molecular Informatics in the Biosciences, Department for Pharmaceutical Sciences, University of Vienna, Vienna 1090, Austria

4
Srisongkram T. DeepRA: A novel deep learning-read-across framework and its application in non-sugar sweeteners mutagenicity prediction. Comput Biol Med 2024; 178:108731. [PMID: 38870727 DOI: 10.1016/j.compbiomed.2024.108731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2024] [Revised: 05/07/2024] [Accepted: 06/08/2024] [Indexed: 06/15/2024]
Abstract
Non-sugar sweeteners (NSSs), or artificial sweeteners, have been used as food chemicals since World War II. NSSs, however, also raise concerns about their mutagenicity. Evaluating the mutagenic potential of NSSs is crucial for food safety; this step is needed for every new chemical registration in the food and pharmaceutical industries. A computational assessment requires less time and money and fewer animals than in vivo experiments; thus, this study developed a novel computational method, called DeepRA, that combines an ensemble convolutional deep neural network with read-across algorithms to classify the mutagenicity of chemicals. The mutagenicity data were obtained from the curated Ames test data set. The DeepRA model was developed using both molecular descriptors and molecular fingerprints. The obtained DeepRA model provided accurate and reliable mutagenicity classification on an independent test set. This model was then used to examine NSS-related chemicals, enabling the evaluation of the mutagenicity of NSS-like substances. Finally, the model is publicly available at https://github.com/taraponglab/deepra for further use in chemical regulation and risk assessment.
Affiliation(s)
- Tarapong Srisongkram
  - Division of Pharmaceutical Chemistry, Faculty of Pharmaceutical Sciences, Khon Kaen University, 40002, Thailand

5
Cankara F, Senyuz S, Sayin AZ, Gursoy A, Keskin O. DiPPI: A Curated Data Set for Drug-like Molecules in Protein-Protein Interfaces. J Chem Inf Model 2024; 64:5041-5051. [PMID: 38907989 DOI: 10.1021/acs.jcim.3c01905] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/24/2024]
Abstract
Proteins interact through their interfaces, and dysfunction of protein-protein interactions (PPIs) has been associated with various diseases. Therefore, investigating the properties of drug-modulated PPIs and interface-targeting drugs is critical. Here, we present a large curated data set of drug-like molecules in protein interfaces. We further introduce DiPPI (Drugs in Protein-Protein Interfaces), a two-module web site that facilitates the search for such molecules and their properties by exploiting our data set in drug repurposing studies. In the interface module of the web site, we present several properties of interfaces, such as amino acid properties, hotspots, evolutionary conservation of drug-binding amino acids, and post-translational modifications of these residues. On the drug-like molecule side, we list drug-like small molecules and FDA-approved drugs from various databases and highlight those that bind to the interfaces. We further clustered the drugs based on their molecular fingerprints to confine the search for an alternative drug to a smaller space. Drug properties, including Lipinski's rules and various molecular descriptors, are also calculated and made available on the web site to guide the selection of drug molecules. Our data set contains 534,203 interfaces for 98,632 protein structures, of which 55,135 are detected to bind to a drug-like molecule. A total of 2214 drug-like molecules are deposited on our web site, among which 335 are FDA-approved. DiPPI provides users with an easy-to-follow scheme for drug repurposing studies through its well-curated and clustered interface and drug data and is freely available at http://interactome.ku.edu.tr:8501.
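Clustering molecules by fingerprint similarity, as mentioned in the abstract above, is commonly built on the Tanimoto coefficient. The sketch below shows that coefficient together with a simple greedy "leader" clustering. Both the clustering scheme and the threshold are assumptions chosen for illustration, since the abstract does not state which algorithm DiPPI uses, and the fingerprints are toy sets of on-bit indices.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def leader_cluster(fingerprints, threshold=0.6):
    """Greedy leader clustering (illustrative choice, not the DiPPI method).

    Each fingerprint joins the first cluster whose leader is at least
    `threshold` similar; otherwise it starts a new cluster.
    Returns lists of indices into `fingerprints`.
    """
    leaders, clusters = [], []
    for idx, fp in enumerate(fingerprints):
        for c, leader in enumerate(leaders):
            if tanimoto(fp, leader) >= threshold:
                clusters[c].append(idx)
                break
        else:
            leaders.append(fp)
            clusters.append([idx])
    return clusters


# Toy fingerprints given as sets of on-bit positions (made up for the example).
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}]
clusters = leader_cluster(fps, threshold=0.5)
```

Leader clustering depends on input order and on the threshold; it is shown only because it is short, and other schemes such as hierarchical clustering would serve equally well for narrowing a search space.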
Affiliation(s)
- Fatma Cankara
  - Graduate School of Sciences and Engineering, Koç University, İstanbul 34450, Turkey
- Simge Senyuz
  - Graduate School of Sciences and Engineering, Koç University, İstanbul 34450, Turkey
- Ahenk Zeynep Sayin
  - Department of Chemical and Biological Engineering, Koç University, İstanbul 34450, Turkey
- Attila Gursoy
  - Department of Computer Engineering, Koç University, İstanbul 34450, Turkey
- Ozlem Keskin
  - Department of Chemical and Biological Engineering, Koç University, İstanbul 34450, Turkey

6
Chen B, Pan Z, Mou M, Zhou Y, Fu W. Is fragment-based graph a better graph-based molecular representation for drug design? A comparison study of graph-based models. Comput Biol Med 2024; 169:107811. [PMID: 38168647 DOI: 10.1016/j.compbiomed.2023.107811] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2023] [Revised: 11/23/2023] [Accepted: 12/03/2023] [Indexed: 01/05/2024]
Abstract
Graph Neural Networks (GNNs) have gained significant traction in various sectors of AI-driven drug design. Over recent years, the integration of fragmentation concepts into GNNs has emerged as a potent strategy to augment the efficacy of molecular generative models. Nonetheless, challenges such as symmetry breaking and potential misrepresentation of intricate cycles and undefined functional groups raise questions about the superiority of fragment-based graph representation over traditional methods. In our research, we undertook a rigorous evaluation, contrasting the predictive performance of eight models, developed using deep learning algorithms, across 12 benchmark datasets that span a range of properties. These models encompass established methods like GCN, AttentiveFP, and D-MPNN, as well as innovative fragment-based representation techniques. Our results indicate that fragment-based methodologies, notably PharmHGT, significantly improve model performance and interpretability, particularly in scenarios characterized by limited data availability. However, when extensive training data are available, fragment-based molecular graph representations may not necessarily outperform traditional methods. In summary, we posit that the integration of fragmentation, as an emerging technique in drug design, holds considerable promise for the future of AI-enhanced drug design.
Affiliation(s)
- Baiyu Chen
  - Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 202103, China
- Ziqi Pan
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
- Minjie Mou
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
- Yuan Zhou
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
- Wei Fu
  - Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 202103, China

7
Song Y, Chang S, Tian J, Pan W, Feng L, Ji H. A Comprehensive Comparative Analysis of Deep Learning Based Feature Representations for Molecular Taste Prediction. Foods 2023; 12:3386. [PMID: 37761095 PMCID: PMC10529232 DOI: 10.3390/foods12183386] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Revised: 08/30/2023] [Accepted: 09/01/2023] [Indexed: 09/29/2023] [Open Access]
Abstract
Taste determination in small molecules is critical in food chemistry but traditional experimental methods can be time-consuming. Consequently, computational techniques have emerged as valuable tools for this task. In this study, we explore taste prediction using various molecular feature representations and assess the performance of different machine learning algorithms on a dataset comprising 2601 molecules. The results reveal that GNN-based models outperform other approaches in taste prediction. Moreover, consensus models that combine diverse molecular representations demonstrate improved performance. Among these, the molecular fingerprints + GNN consensus model emerges as the top performer, highlighting the complementary strengths of GNNs and molecular fingerprints. These findings have significant implications for food chemistry research and related fields. By leveraging these computational approaches, taste prediction can be expedited, leading to advancements in understanding the relationship between molecular structure and taste perception in various food components and related compounds.
Affiliation(s)
- Yu Song
  - Zhengzhou Research Base, State Key Laboratory of Cotton Biology, School of Agricultural Sciences, Zhengzhou University, Zhengzhou 450001, China
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Shenzhen 518120, China
  - Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Sihao Chang
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Shenzhen 518120, China
  - Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Jing Tian
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Shenzhen 518120, China
  - Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Weihua Pan
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Shenzhen 518120, China
  - Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Lu Feng
  - Zhengzhou Research Base, State Key Laboratory of Cotton Biology, School of Agricultural Sciences, Zhengzhou University, Zhengzhou 450001, China
- Hongchao Ji
  - Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Shenzhen 518120, China
  - Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China

8
Dost K, Pullar-Strecker Z, Brydon L, Zhang K, Hafner J, Riddle PJ, Wicker JS. Combatting over-specialization bias in growing chemical databases. J Cheminform 2023; 15:53. [PMID: 37208694 DOI: 10.1186/s13321-023-00716-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/04/2022] [Accepted: 03/25/2023] [Indexed: 05/21/2023] [Open Access]
Abstract
BACKGROUND: Predicting in advance the behavior of new chemical compounds can support the design process of new products by directing the research toward the most promising candidates and ruling out others. Such predictive models can be data-driven using Machine Learning or based on researchers' experience and depend on the collection of past results. In either case, models (or researchers) can only make reliable assumptions about compounds that are similar to what they have seen before. Therefore, consistent use of these predictive models shapes the dataset and causes a continuous specialization, shrinking the applicability domain of all models trained on this dataset in the future and increasingly harming model-based exploration of the space. PROPOSED SOLUTION: In this paper, we propose CANCELS (CounterActiNg Compound spEciaLization biaS), a technique that helps to break the dataset specialization spiral. Aiming for a smooth distribution of the compounds in the dataset, we identify areas in the space that fall short and suggest additional experiments that help bridge the gap. Thereby, we generally improve the dataset quality in an entirely unsupervised manner and create awareness of potential flaws in the data. CANCELS does not aim to cover the entire compound space and hence retains a desirable degree of specialization to a specified research domain. RESULTS: An extensive set of experiments on the use case of biodegradation pathway prediction reveals not only that the bias spiral can indeed be observed but also that CANCELS produces meaningful results. Additionally, we demonstrate that mitigating the observed bias is crucial, as it can not only interrupt the continuous specialization process but also significantly improve a predictor's performance while reducing the number of required experiments.
Overall, we believe that CANCELS can support researchers in their experimentation process, helping them not only to better understand their data and potential flaws but also to grow the dataset in a sustainable way. All code is available at github.com/KatDost/Cancels.
Affiliation(s)
- Katharina Dost
  - School of Computer Science, University of Auckland, 38 Princes Street, 1010, Auckland, New Zealand
  - enviPath UG & Co. KG, In den Graswiesen 13, 55437, Ockenheim, Germany
- Zac Pullar-Strecker
  - School of Computer Science, University of Auckland, 38 Princes Street, 1010, Auckland, New Zealand
- Liam Brydon
  - School of Computer Science, University of Auckland, 38 Princes Street, 1010, Auckland, New Zealand
- Kunyang Zhang
  - Eawag-Swiss Federal Institute of Aquatic Science and Technology, Überlandstrasse 133, 8600, Dübendorf, Switzerland
- Jasmin Hafner
  - Eawag-Swiss Federal Institute of Aquatic Science and Technology, Überlandstrasse 133, 8600, Dübendorf, Switzerland
- Patricia J Riddle
  - School of Computer Science, University of Auckland, 38 Princes Street, 1010, Auckland, New Zealand
- Jörg S Wicker
  - School of Computer Science, University of Auckland, 38 Princes Street, 1010, Auckland, New Zealand
  - enviPath UG & Co. KG, In den Graswiesen 13, 55437, Ockenheim, Germany

9
Cai H, Zhang H, Zhao D, Wu J, Wang L. FP-GNN: a versatile deep learning architecture for enhanced molecular property prediction. Brief Bioinform 2022; 23:6702671. [PMID: 36124766 DOI: 10.1093/bib/bbac408] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 05/05/2022] [Revised: 07/28/2022] [Accepted: 08/22/2022] [Indexed: 12/14/2022] [Open Access]
Abstract
Accurate prediction of molecular properties, such as physicochemical and bioactive properties, as well as ADME/T (absorption, distribution, metabolism, excretion and toxicity) properties, remains a fundamental challenge for molecular design, especially for drug design and discovery. In this study, we propose a novel deep learning architecture, termed FP-GNN (fingerprints and graph neural networks), which combines and simultaneously learns information from molecular graphs and fingerprints for molecular property prediction. To evaluate the FP-GNN model, we conducted experiments on 13 public datasets, an unbiased LIT-PCBA dataset and 14 phenotypic screening datasets for breast cell lines. Extensive evaluation results showed that, compared to advanced deep learning and conventional machine learning algorithms, the FP-GNN algorithm achieved state-of-the-art performance on these datasets. In addition, we analyzed the influence of different molecular fingerprints, and the effects of molecular graphs and molecular fingerprints on the performance of the FP-GNN model. Analysis of the anti-noise ability and interpretation ability also indicated that FP-GNN was competitive in real-world situations. Collectively, the FP-GNN algorithm can assist chemists, biologists and pharmacists in predicting and discovering better molecules with desired functions or properties.
Affiliation(s)
- Hanxuan Cai
  - Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
- Huimin Zhang
  - Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
- Duancheng Zhao
  - Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
- Jingxing Wu
  - Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
- Ling Wang
  - Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China

10
Woodward DJ, Bradley AR, van Hoorn WP. Coverage Score: A Model Agnostic Method to Efficiently Explore Chemical Space. J Chem Inf Model 2022; 62:4391-4402. [PMID: 35867814 DOI: 10.1021/acs.jcim.2c00258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Abstract
Selecting the most appropriate compounds to synthesize and test is a vital aspect of drug discovery. Methods like clustering and diversity selection have weaknesses when selecting optimal sets for information gain. Active learning techniques often rely on an initial model and computationally expensive semi-supervised batch selection. Herein, we describe a new subset-based selection method, Coverage Score, that combines Bayesian statistics and information entropy to balance representation and diversity and thereby select a maximally informative subset. Coverage Score can be influenced by prior selections and desirable properties. In this paper, subsets selected through Coverage Score are compared against subsets selected through model-independent and model-dependent techniques for several datasets. In drug-like chemical space, Coverage Score consistently selects subsets that lead to more accurate predictions compared to other selection methods. Subsets selected through Coverage Score produced Random Forest models that have a root-mean-square error up to 12.8% lower than subsets selected at random and can retain up to 99% of the structural dissimilarity of a diversity selection.
Affiliation(s)
- Daniel J Woodward
  - Exscientia plc, The Schrödinger Building, Oxford Science Park, Oxford OX4 4GE, U.K.
- Anthony R Bradley
  - Exscientia plc, The Schrödinger Building, Oxford Science Park, Oxford OX4 4GE, U.K.
- Willem P van Hoorn
  - Exscientia plc, The Schrödinger Building, Oxford Science Park, Oxford OX4 4GE, U.K.

11
Schmidt S, Schindler M, Eriksson L. Block-wise Exploration of Molecular Descriptors with Multi-block Orthogonal Component Analysis (MOCA). Mol Inform 2022; 41:e2100165. [PMID: 34878230 PMCID: PMC9285065 DOI: 10.1002/minf.202100165] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/17/2021] [Accepted: 11/24/2021] [Indexed: 11/13/2022]
Abstract
Data tables for machine learning and quantitative structure-activity relationship (QSAR) modelling are often naturally organized in blocks of data, where multiple molecular representations or sets of descriptors form the blocks. Multi-block Orthogonal Component Analysis (MOCA), a new analytical tool, can be used to explore such data structures in a single model, identifying principal components that are unique to a single block or joint over multiple blocks. We applied MOCA to two sets of 550 and 300 molecules and up to 9213 molecular descriptors organized in 11 blocks. The MOCA models reveal relationships between the blocks and overarching trends across the whole dataset. Based on the MOCA joint components, we propose a quantitative metric for the redundancy of blocks, useful for a priori block-wise feature selection or evaluation of new molecular representations. The second data set includes 7 ecotoxicological study endpoints for crop protection chemicals, for which we (re-)discovered some general trends and linked them to molecular properties. Using a single MOCA model, we estimated the predictive potential of each block and the model-ability of the target block.
Affiliation(s)
- Sebastian Schmidt
  - Bayer AG, Crop Science Division, Environmental Safety, Alfred-Nobel-Str. 50, 40789 Monheim, Germany
- Michael Schindler
  - Bayer AG, Crop Science Division, Environmental Safety, Alfred-Nobel-Str. 50, 40789 Monheim, Germany
- Lennart Eriksson
  - Sartorius Stedim Data Analytics AB, Östra Strandgatan 24, SE-903 33 Umeå, Sweden

12
Zhang K, Zhang H. Predicting Solute Descriptors for Organic Chemicals by a Deep Neural Network (DNN) Using Basic Chemical Structures and a Surrogate Metric. Environ Sci Technol 2022; 56:2054-2064. [PMID: 34995441 DOI: 10.1021/acs.est.1c05398] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 06/14/2023]
Abstract
Solute descriptors have been widely used to model chemical transfer processes through poly-parameter linear free energy relationships (pp-LFERs); however, there are still substantial difficulties in obtaining these descriptors accurately and quickly for new organic chemicals. In this research, models (PaDEL-DNN) that require only SMILES of chemicals were built to satisfactorily estimate pp-LFER descriptors using deep neural networks (DNN) and the PaDEL chemical representation. The PaDEL-DNN-estimated pp-LFER descriptors demonstrated good performance in modeling storage-lipid/water partitioning coefficient (log Kstorage-lipid/water), bioconcentration factor (BCF), aqueous solubility (ESOL), and hydration free energy (freesolve). Then, assuming that the accuracy in the estimated values of widely available properties, e.g., logP (octanol-water partition coefficient), can calibrate estimates for less available but related properties, we proposed logP as a surrogate metric for evaluating the overall accuracy of the estimated pp-LFER descriptors. When using the pp-LFER descriptors to model log Kstorage-lipid/water, BCF, ESOL, and freesolve, we achieved around 0.1 log unit lower errors for chemicals whose estimated pp-LFER descriptors were deemed "accurate" by the surrogate metric. The interpretation of the PaDEL-DNN models revealed that, for a given test chemical, having several (around 5) "similar" chemicals in the training data set was crucial for accurate estimation while the remaining less similar training chemicals provided reasonable baseline estimates. Lastly, pp-LFER descriptors for over 2800 persistent, bioaccumulative, and toxic chemicals were reasonably estimated by combining PaDEL-DNN with the surrogate metric. Overall, the PaDEL-DNN/surrogate metric and newly estimated descriptors will greatly benefit chemical transfer modeling.
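For readers unfamiliar with pp-LFERs, the widely used Abraham form expresses a partitioning property as a linear combination of solute descriptors. The equation below is that general textbook form, given as background only; the abstract does not state which exact variant or system coefficients the authors fitted.

```latex
\log SP = c + eE + sS + aA + bB + vV
```

Here $SP$ is the solute property (for example a partition coefficient), the capital letters are solute descriptors ($E$ excess molar refraction, $S$ dipolarity/polarizability, $A$ hydrogen-bond acidity, $B$ hydrogen-bond basicity, $V$ McGowan characteristic volume), and the lower-case letters are system-specific coefficients obtained by regression. For gas-to-condensed-phase transfer, the $vV$ term is commonly replaced by $lL$, where $L$ is the logarithm of the hexadecane-air partition coefficient.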
Affiliation(s)
- Kai Zhang
  - Department of Civil and Environmental Engineering, Case Western Reserve University, Cleveland, Ohio 44106, United States
- Huichun Zhang
  - Department of Civil and Environmental Engineering, Case Western Reserve University, Cleveland, Ohio 44106, United States

13
MacKinnon SS, Madani Tonekaboni SA, Windemuth A. Proteome-Scale Drug-Target Interaction Predictions: Approaches and Applications. Curr Protoc 2021; 1:e302. [PMID: 34794211 DOI: 10.1002/cpz1.302] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/16/2022]
Abstract
Drug-Target interaction predictions are an important cornerstone of computer-aided drug discovery. While predictive methods around individual targets have a long history, the application of proteome-scale models is relatively recent. In this overview, we will provide the context required to understand advances in this emerging field within computational drug discovery, evaluate emerging technologies for suitability to given tasks, and provide guidelines for the design and implementation of new drug-target interaction prediction models. We will discuss the validation approaches used, and propose a set of key criteria that should be applied to evaluate their validity. We note that we find widespread deficiencies in the existing literature, making it difficult to judge the practical effectiveness of some of the techniques proposed from their publications alone. We hope that this review may help remedy this situation and increase awareness of several sources of bias that may enter into commonly used cross-validation methods. © 2021 Cyclica Inc. Current Protocols published by Wiley Periodicals LLC.