1
|
Palazzotti D, Fiorelli M, Sabatini S, Massari S, Barreca ML, Astolfi A. Q-raKtion: A Semiautomated KNIME Workflow for Bioactivity Data Points Curation. J Chem Inf Model 2022; 62:6309-6315. [PMID: 36442071 PMCID: PMC9795488 DOI: 10.1021/acs.jcim.2c01199] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022]
Abstract
The recent increase in bioactivity data freely available to the scientific community and stored as activity data points in chemogenomic repositories provides a huge amount of ready-to-use information to support the development of predictive models. However, the benefits provided by the availability of such a vast amount of accessible information are strongly counteracted by the lack of uniformity and consistency of data from multiple sources, requiring a process of integration and harmonization. While different automated pipelines for processing and assessing chemical data have emerged in recent years, the curation of bioactivity data points is a less investigated topic, with useful concepts provided but no tangible tools available. In this context, the present work represents a first step toward filling this gap, by providing a tool to meet the needs of end users in building proprietary high-quality data sets for further studies. Specifically, we herein describe Q-raKtion, a systematic, semiautomated, flexible, and, above all, customizable KNIME workflow that effectively aggregates information on biological activities of compounds retrieved from two of the most comprehensive and widely used repositories, PubChem and ChEMBL.
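As a rough illustration of what aggregating activity data points from several sources can involve, the sketch below groups measurements by compound and target, converts them to a negative log10 molar scale, and flags groups whose replicate values disagree. It is not the Q-raKtion workflow (which is a KNIME workflow); the record layout, the function names, and the 1.0 log-unit consistency threshold are assumptions made only for this example.

```python
import math
from collections import defaultdict
from statistics import median


def to_p_value(value_nm):
    """Convert a concentration in nM to the negative log10 molar scale."""
    return 9.0 - math.log10(value_nm)


def aggregate(records, max_spread=1.0):
    """Summarise replicate measurements per (compound, target) pair.

    records: iterable of (compound_id, target_id, value_nm, source) tuples
             (hypothetical layout, chosen for this sketch).
    Returns {(compound_id, target_id): (median_p, n_values, consistent)},
    where `consistent` is True when max - min <= max_spread log units.
    """
    groups = defaultdict(list)
    for compound_id, target_id, value_nm, _source in records:
        if value_nm is None or value_nm <= 0:
            continue  # skip measurements that cannot be log-transformed
        groups[(compound_id, target_id)].append(to_p_value(value_nm))

    summary = {}
    for key, values in groups.items():
        spread = max(values) - min(values)
        summary[key] = (median(values), len(values), spread <= max_spread)
    return summary


if __name__ == "__main__":
    demo = [
        ("C1", "T1", 100.0, "ChEMBL"),
        ("C1", "T1", 200.0, "PubChem"),
        ("C2", "T1", 10.0, "ChEMBL"),
        ("C2", "T1", 10000.0, "PubChem"),
    ]
    for key, row in aggregate(demo).items():
        print(key, row)
```

Taking the median and keeping a consistency flag, rather than silently averaging, leaves the decision about discordant data points to the user.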
|
2
|
Moshawih S, Goh HP, Kifli N, Idris AC, Yassin H, Kotra V, Goh KW, Liew KB, Ming LC. Synergy between machine learning and natural products cheminformatics: Application to the lead discovery of anthraquinone derivatives. Chem Biol Drug Des 2022; 100:185-217. [PMID: 35490393 DOI: 10.1111/cbdd.14062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2022] [Revised: 04/15/2022] [Accepted: 04/23/2022] [Indexed: 11/28/2022]
Abstract
Cheminformatics utilizing machine learning (ML) techniques have opened up a new horizon in drug discovery. This is owing to vast chemical space expansion with rocketing numbers of expected hits and lead compounds that match druggable macromolecular targets, in particular from natural compounds. Due to the natural products' (NP) structural complexity, uniqueness, and diversity, they could occupy a bigger space in pharmaceuticals, allowing the industry to pursue more selective leads in the nanomolar range of binding affinity. ML is an essential part of each step of the drug design pipeline, such as target prediction, compound library preparation, and lead optimization. Notably, molecular mechanic and dynamic simulations, induced docking, and free energy perturbations are essential in predicting best binding poses, binding free energy values, and molecular mechanics force fields. Those applications have leveraged from artificial intelligence (AI), which decreases the computational costs required for such costly simulations. This review aimed to describe chemical space and compound libraries related to NPs. High-throughput screening utilized for fractionating NPs and high-throughput virtual screening and their strategies, and significance, are reviewed. Particular emphasis was given to AI approaches, ML tools, algorithms, and techniques, especially in drug discovery of macrocyclic compounds and approaches in computer-aided and ML-based drug discovery. Anthraquinone derivatives were discussed as a source of new lead compounds that can be developed using ML tools for diverse medicinal uses such as cancer, infectious diseases, and metabolic disorders. Furthermore, the power of principal component analysis in understanding relevant protein conformations, and molecular modeling of protein-ligand interaction were also presented. 
Apart from being a concise reference for cheminformatics, this review is a useful text to understand the application of ML-based algorithms to molecular dynamics simulation and in silico absorption, distribution, metabolism, excretion, and toxicity prediction.
Affiliation(s)
- Said Moshawih
- PAP Rashidah Sa'adatul Bolkiah Institute of Health Sciences, Universiti Brunei Darussalam, Gadong, Brunei Darussalam
| | - Hui Poh Goh
- PAP Rashidah Sa'adatul Bolkiah Institute of Health Sciences, Universiti Brunei Darussalam, Gadong, Brunei Darussalam
| | - Nurolaini Kifli
- PAP Rashidah Sa'adatul Bolkiah Institute of Health Sciences, Universiti Brunei Darussalam, Gadong, Brunei Darussalam
| | - Azam Che Idris
- Faculty of Integrated Technologies, Universiti Brunei Darussalam, Gadong, Brunei Darussalam
| | - Hayati Yassin
- Faculty of Integrated Technologies, Universiti Brunei Darussalam, Gadong, Brunei Darussalam
| | - Vijay Kotra
- Faculty of Pharmacy, Quest International University, Perak, Malaysia
| | - Khang Wen Goh
- Faculty of Data Science and Information Technology, INTI International University, Nilai, Malaysia
| | - Kai Bin Liew
- Faculty of Pharmacy, University of Cyberjaya, Cyberjaya, Malaysia
| | - Long Chiau Ming
- PAP Rashidah Sa'adatul Bolkiah Institute of Health Sciences, Universiti Brunei Darussalam, Gadong, Brunei Darussalam
| |
|
3
|
Watson OP, Cortes-Ciriano I, Taylor AR, Watson JA. A decision-theoretic approach to the evaluation of machine learning algorithms in computational drug discovery. Bioinformatics 2020; 35:4656-4663. [PMID: 31070704 PMCID: PMC6853675 DOI: 10.1093/bioinformatics/btz293] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 12/10/2018] [Revised: 03/22/2019] [Accepted: 04/17/2019] [Indexed: 02/07/2023] Open
Abstract
Motivation: Artificial intelligence, trained via machine learning (e.g. neural nets, random forests) or computational statistical algorithms (e.g. support vector machines, ridge regression), holds much promise for the improvement of small-molecule drug discovery. However, small-molecule structure-activity data are high dimensional with low signal-to-noise ratios and proper validation of predictive methods is difficult. It is poorly understood which, if any, of the currently available machine learning algorithms will best predict new candidate drugs. Results: The quantile-activity bootstrap is proposed as a new model validation framework using quantile splits on the activity distribution function to construct training and testing sets. In addition, we propose two novel rank-based loss functions which penalize only the out-of-sample predicted ranks of high-activity molecules. The combination of these methods was used to assess the performance of neural nets, random forests, support vector machines (regression) and ridge regression applied to 25 diverse high-quality structure-activity datasets publicly available on ChEMBL. Model validation based on random partitioning of available data favours models that overfit and ‘memorize’ the training set, namely random forests and deep neural nets. Partitioning based on quantiles of the activity distribution correctly penalizes extrapolation of models onto structurally different molecules outside of the training data. Simpler, traditional statistical methods such as ridge regression can outperform state-of-the-art machine learning methods in this setting. In addition, our new rank-based loss functions give considerably different results from mean squared error highlighting the necessity to define model optimality with respect to the decision task at hand. Availability and implementation: All software and data are available as Jupyter notebooks found at https://github.com/owatson/QuantileBootstrap.
Supplementary information: Supplementary data are available at Bioinformatics online.
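The idea of building training and testing sets from quantiles of the activity distribution, rather than from a random partition, can be sketched as follows. This is only a simplified illustration: it assumes the test set is the upper tail above a single quantile `q` and that the training set is resampled with replacement. The function names and defaults are hypothetical and are not taken from the authors' notebooks.

```python
import random


def quantile_split(activities, q=0.75):
    """Split indices by activity rank.

    The lowest q fraction of molecules (by activity) forms the training
    set; the remaining high-activity molecules form the test set.
    """
    order = sorted(range(len(activities)), key=lambda i: activities[i])
    cut = int(round(q * len(order)))
    return order[:cut], order[cut:]


def bootstrap(indices, rng):
    """Resample the given indices with replacement (same size)."""
    return [rng.choice(indices) for _ in indices]


if __name__ == "__main__":
    activities = [5.0, 9.0, 6.0, 7.5, 4.0, 8.0, 6.5, 7.0]
    train, test = quantile_split(activities, q=0.75)
    rng = random.Random(0)
    print("train:", train)
    print("test:", test)
    print("bootstrap sample of train:", bootstrap(train, rng))
```

With such a split, every test molecule is more active than every training molecule, so a model is scored on extrapolation rather than on interpolation within the training range.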
Affiliation(s)
| | - Isidro Cortes-Ciriano
- Goring on Thames, Evariste Technologies Ltd., RG8 9AL UK; Department of Chemistry, Centre for Molecular Science Informatics, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK
| | - Aimee R Taylor
- Department of Epidemiology, Center for Communicable Disease Dynamics, Harvard T.H. Chan School of Public Health, Boston, MA 02115 USA; Infectious Disease Microbiome Program, Broad Institute, Cambridge, MA 02142 USA
| | - James A Watson
- Nuffield Department of Medicine, Centre for Tropical Medicine and Global Health, University of Oxford, Oxford OX3 7LF, UK; Mahidol-Oxford Tropical Medicine Research Unit, Faculty of Tropical Medicine, Mahidol University, Bangkok 10400, Thailand
| |
|
4
|
Škuta C, Cortés-Ciriano I, Dehaen W, Kříž P, van Westen GJP, Tetko IV, Bender A, Svozil D. QSAR-derived affinity fingerprints (part 1): fingerprint construction and modeling performance for similarity searching, bioactivity classification and scaffold hopping. J Cheminform 2020; 12:39. [PMID: 33431038 PMCID: PMC7260783 DOI: 10.1186/s13321-020-00443-6] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 08/07/2019] [Accepted: 05/16/2020] [Indexed: 02/11/2023] Open
Abstract
An affinity fingerprint is a vector consisting of a compound’s affinity or potency against a reference panel of protein targets. Here, we present the QAFFP fingerprint, a 440-element in silico QSAR-based affinity fingerprint, the components of which are predicted by Random Forest regression models trained on bioactivity data from the ChEMBL database. Both real-valued (rv-QAFFP) and binary (b-QAFFP) versions of the QAFFP fingerprint were implemented and their performance in similarity searching, biological activity classification and scaffold hopping was assessed and compared to that of the 1024-bit Morgan2 fingerprint (the RDKit implementation of the ECFP4 fingerprint). In both similarity searching and biological activity classification, the QAFFP fingerprint yields retrieval rates, measured by AUC (~ 0.65 and ~ 0.70 for similarity searching depending on data sets, and ~ 0.85 for classification) and EF5 (~ 4.67 and ~ 5.82 for similarity searching depending on data sets, and ~ 2.10 for classification), comparable to those of the Morgan2 fingerprint (similarity searching AUC of ~ 0.57 and ~ 0.66, and EF5 of ~ 4.09 and ~ 6.41, depending on data sets, classification AUC of ~ 0.87, and EF5 of ~ 2.16). However, the QAFFP fingerprint outperforms the Morgan2 fingerprint in scaffold hopping as it is able to retrieve 1146 of the 1749 existing scaffolds, while the Morgan2 fingerprint retrieves only 864 scaffolds.
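To make the quantities in this abstract concrete, the sketch below binarizes a vector of predicted affinities, compares two binary fingerprints with the Tanimoto coefficient, and computes an enrichment factor for the top 5% of a ranked list (EF5). These are generic textbook definitions. The binarization threshold of 5.0 and all function names are assumptions for illustration, and the abstract does not state which similarity measure the authors used.

```python
def binarize(affinities, threshold=5.0):
    """Turn predicted affinities into bits (1 if value >= threshold).

    The threshold is a hypothetical choice for this sketch.
    """
    return [1 if a >= threshold else 0 for a in affinities]


def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length binary vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def enrichment_factor(scores, labels, fraction=0.05):
    """Ratio of the hit rate in the top-ranked fraction to the overall hit rate.

    scores: higher means ranked earlier; labels: 1 for active, 0 for inactive.
    """
    n = len(scores)
    n_top = max(1, int(round(fraction * n)))
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    total_hits = sum(labels)
    if total_hits == 0:
        return 0.0
    return (hits_top / n_top) / (total_hits / n)


if __name__ == "__main__":
    query = binarize([6.2, 4.9, 5.0, 7.1])
    candidate = binarize([6.0, 5.5, 4.0, 7.3])
    print("Tanimoto:", tanimoto(query, candidate))
```

An enrichment factor of 1 corresponds to random ranking, so the EF5 values of roughly 2 to 6 quoted above mean the top 5% of each ranked list contained two to six times more actives than expected by chance.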
Affiliation(s)
- C Škuta
- CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Institute of Molecular Genetics of the ASCR, v. v. i., Vídeňská 1083, 142 20, Prague 4, Czech Republic
| | - I Cortés-Ciriano
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
| | - W Dehaen
- CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Institute of Molecular Genetics of the ASCR, v. v. i., Vídeňská 1083, 142 20, Prague 4, Czech Republic; CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Technická 5, 166 28, Prague, Czech Republic
| | - P Kříž
- Department of Mathematics, Faculty of Chemical Engineering, University of Chemistry and Technology Prague, Technická 5, 166 28, Prague, Czech Republic
| | - G J P van Westen
- Computational Drug Discovery, Drug Discovery and Safety, LACDR, Leiden University, Einsteinweg 55, 2333 CC, Leiden, The Netherlands
| | - I V Tetko
- Helmholtz Zentrum Muenchen - German Research Center for Environmental Health (GmbH) and BIGCHEM GmbH, Ingolstaedter Landstrasse 1, 85764, Neuherberg, Germany
| | - A Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
| | - D Svozil
- CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Institute of Molecular Genetics of the ASCR, v. v. i., Vídeňská 1083, 142 20, Prague 4, Czech Republic; CZ-OPENSCREEN: National Infrastructure for Chemical Biology, Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, Technická 5, 166 28, Prague, Czech Republic
| |
|
5
|
Baldo F. Prediction of modes of action of components of traditional medicinal preparations. PHYSICAL SCIENCES REVIEWS 2020. [DOI: 10.1515/psr-2018-0115] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/29/2022]
Abstract
Traditional medicine preparations are used to treat many ailments in multiple regions across the world. Despite their widespread use, the modes of action of these preparations and their constituents are not fully understood. Traditional methods of elucidating the modes of action of these natural products (NPs), e.g. biochemical methods and bioactivity-guided fractionation, can be expensive and time consuming. In this review, we discuss some methods for the prediction of the modes of action of traditional medicine preparations, both in mixtures and as isolated NPs. These methods are useful to predict targets of NPs before they are experimentally validated. Case studies of the applications of these methods are also provided herein.
|
6
|
Pogodin PV, Lagunin AA, Rudik AV, Filimonov DA, Druzhilovskiy DS, Nicklaus MC, Poroikov VV. How to Achieve Better Results Using PASS-Based Virtual Screening: Case Study for Kinase Inhibitors. Front Chem 2018; 6:133. [PMID: 29755970 PMCID: PMC5935003 DOI: 10.3389/fchem.2018.00133] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 01/29/2018] [Accepted: 04/09/2018] [Indexed: 12/16/2022] Open
Abstract
Discovery of new pharmaceutical substances is currently boosted by the possibility of utilization of the Synthetically Accessible Virtual Inventory (SAVI) library, which includes about 283 million molecules, each annotated with a proposed synthetic one-step route from commercially available starting materials. The SAVI database is well-suited for ligand-based methods of virtual screening to select molecules for experimental testing. In this study, we compare the performance of three approaches for the analysis of structure-activity relationships that differ in their criteria for selecting of "active" and "inactive" compounds included in the training sets. PASS (Prediction of Activity Spectra for Substances), which is based on a modified Naïve Bayes algorithm, was applied since it had been shown to be robust and to provide good predictions of many biological activities based on just the structural formula of a compound even if the information in the training set is incomplete. We used different subsets of kinase inhibitors for this case study because many data are currently available on this important class of drug-like molecules. Based on the subsets of kinase inhibitors extracted from the ChEMBL 20 database we performed the PASS training, and then applied the model to ChEMBL 23 compounds not yet present in ChEMBL 20 to identify novel kinase inhibitors. As one may expect, the best prediction accuracy was obtained if only the experimentally confirmed active and inactive compounds for distinct kinases in the training procedure were used. However, for some kinases, reasonable results were obtained even if we used merged training sets, in which we designated as inactives the compounds not tested against the particular kinase. Thus, depending on the availability of data for a particular biological activity, one may choose the first or the second approach for creating ligand-based computational tools to achieve the best possible results in virtual screening.
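The two ways of labelling training compounds contrasted in this abstract (using only experimentally confirmed actives and inactives for a given kinase, versus a merged set in which compounds not tested against that kinase are designated inactive) can be sketched as below. The nested-dictionary data layout and the function names are hypothetical. This is not PASS code, only an illustration of how the two label sets differ.

```python
def confirmed_labels(measurements, kinase):
    """Labels from compounds actually tested against `kinase`.

    measurements: {compound_id: {kinase_name: True/False}} where the
    boolean records an experimentally observed active/inactive outcome.
    Compounds without a result for `kinase` are left out entirely.
    """
    return {c: res[kinase] for c, res in measurements.items() if kinase in res}


def merged_labels(measurements, kinase):
    """Labels for every compound; untested compounds count as inactive."""
    return {c: res.get(kinase, False) for c, res in measurements.items()}


if __name__ == "__main__":
    data = {
        "cpd1": {"EGFR": True},
        "cpd2": {"EGFR": False, "ABL1": True},
        "cpd3": {"ABL1": False},
    }
    print(confirmed_labels(data, "EGFR"))  # cpd3 is absent
    print(merged_labels(data, "EGFR"))     # cpd3 is treated as inactive
```

The merged set is larger but contains presumed rather than measured inactives, which matches the abstract's finding that the confirmed-only approach gave the best accuracy where enough data existed.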
Affiliation(s)
- Pavel V. Pogodin
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
| | - Alexey A. Lagunin
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
- Department of Bioinformatics, Medical-Biological Department, Pirogov Russian National Research Medical University, Moscow, Russia
| | - Anastasia V. Rudik
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
| | - Dmitry A. Filimonov
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
| | | | - Mark C. Nicklaus
- Computer-Aided Drug Design Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, NIH, NCI-Frederick, Frederick, MD, United States
| | | |
|
7
|
Cortes Cabrera A, Petrone PM. Optimal HTS Fingerprint Definitions by Using a Desirability Function and a Genetic Algorithm. J Chem Inf Model 2018; 58:641-646. [DOI: 10.1021/acs.jcim.7b00447] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 12/29/2022]
Affiliation(s)
- Alvaro Cortes Cabrera
- GSK Medicines Research Centre, Gunnels Wood Road, Stevenage, Hertfordshire SG1 2NY, U.K
| | - Paula M. Petrone
- BarcelonaBeta Brain Research Center, Carrer de Wellington, 30, 08005 Barcelona, Spain
| |
|
8
|
Liu Z, Su M, Han L, Liu J, Yang Q, Li Y, Wang R. Forging the Basis for Developing Protein-Ligand Interaction Scoring Functions. Acc Chem Res 2017; 50:302-309. [PMID: 28182403 DOI: 10.1021/acs.accounts.6b00491] [Citation(s) in RCA: 207] [Impact Index Per Article: 29.6] [Indexed: 01/13/2023]
Abstract
In structure-based drug design, scoring functions are widely used for fast evaluation of protein-ligand interactions. They are often applied in combination with molecular docking and de novo design methods. Since the early 1990s, a whole spectrum of protein-ligand interaction scoring functions have been developed. Regardless of their technical difference, scoring functions all need data sets combining protein-ligand complex structures and binding affinity data for parametrization and validation. However, data sets of this kind used to be rather limited in terms of size and quality. On the other hand, standard metrics for evaluating scoring function used to be ambiguous. Scoring functions are often tested in molecular docking or even virtual screening trials, which do not directly reflect the genuine quality of scoring functions. Collectively, these underlying obstacles have impeded the invention of more advanced scoring functions. In this Account, we describe our long-lasting efforts to overcome these obstacles, which involve two related projects. On the first project, we have created the PDBbind database. It is the first database that systematically annotates the protein-ligand complexes in the Protein Data Bank (PDB) with experimental binding data. This database has been updated annually since its first public release in 2004. The latest release (version 2016) provides binding data for 16 179 biomolecular complexes in PDB. Data sets provided by PDBbind have been applied to many computational and statistical studies on protein-ligand interaction and various subjects. In particular, it has become a major data resource for scoring function development. On the second project, we have established the Comparative Assessment of Scoring Functions (CASF) benchmark for scoring function evaluation. Our key idea is to decouple the "scoring" process from the "sampling" process, so scoring functions can be tested in a relatively pure context to reflect their quality. 
In our latest work on this track, i.e. CASF-2013, the performance of a scoring function was quantified in four aspects, including "scoring power", "ranking power", "docking power", and "screening power". All four performance tests were conducted on a test set containing 195 high-quality protein-ligand complexes selected from PDBbind. A panel of 20 standard scoring functions were tested as demonstration. Importantly, CASF is designed to be an open-access benchmark, with which scoring functions developed by different researchers can be compared on the same grounds. Indeed, it has become a popular choice for scoring function validation in recent years. Despite the considerable progress that has been made so far, the performance of today's scoring functions still does not meet people's expectations in many aspects. There is a constant demand for more advanced scoring functions. Our efforts have helped to overcome some obstacles underlying scoring function development so that the researchers in this field can move forward faster. We will continue to improve the PDBbind database and the CASF benchmark in the future to keep them as useful community resources.
Affiliation(s)
- Zhihai Liu
- State Key Laboratory of Bioorganic and Natural Products Chemistry, Collaborative Innovation Center of Chemistry for Life Sciences, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, 345 Lingling Road, Shanghai 200032, People’s Republic of China
| | - Minyi Su
- State Key Laboratory of Bioorganic and Natural Products Chemistry, Collaborative Innovation Center of Chemistry for Life Sciences, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, 345 Lingling Road, Shanghai 200032, People’s Republic of China
| | - Li Han
- State Key Laboratory of Bioorganic and Natural Products Chemistry, Collaborative Innovation Center of Chemistry for Life Sciences, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, 345 Lingling Road, Shanghai 200032, People’s Republic of China
| | - Jie Liu
- State Key Laboratory of Bioorganic and Natural Products Chemistry, Collaborative Innovation Center of Chemistry for Life Sciences, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, 345 Lingling Road, Shanghai 200032, People’s Republic of China
| | - Qifan Yang
- State Key Laboratory of Bioorganic and Natural Products Chemistry, Collaborative Innovation Center of Chemistry for Life Sciences, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, 345 Lingling Road, Shanghai 200032, People’s Republic of China
| | - Yan Li
- State Key Laboratory of Bioorganic and Natural Products Chemistry, Collaborative Innovation Center of Chemistry for Life Sciences, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, 345 Lingling Road, Shanghai 200032, People’s Republic of China
| | - Renxiao Wang
- State Key Laboratory of Bioorganic and Natural Products Chemistry, Collaborative Innovation Center of Chemistry for Life Sciences, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, 345 Lingling Road, Shanghai 200032, People’s Republic of China
- State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macau, People’s Republic of China
| |
|
9
|
Thermodynamics of protein–ligand interactions as a reference for computational analysis: how to assess accuracy, reliability and relevance of experimental data. J Comput Aided Mol Des 2015; 29:867-83. [DOI: 10.1007/s10822-015-9867-y] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Received: 03/30/2015] [Accepted: 09/05/2015] [Indexed: 12/11/2022]
|
10
|
Pogodin PV, Lagunin AA, Filimonov DA, Poroikov VV. PASS Targets: Ligand-based multi-target computational system based on a public data and naïve Bayes approach. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2015; 26:783-793. [PMID: 26305108 DOI: 10.1080/1062936x.2015.1078407] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.2] [Indexed: 06/04/2023]
Abstract
Estimation of interactions between drug-like compounds and drug targets is very important for drug discovery and toxicity assessment. Using data extracted from the 19th version of the ChEMBL database ( https://www.ebi.ac.uk/chembl ) as a training set and a Bayesian-like method realized in PASS software ( http://www.way2drug.com/PASSOnline ), we developed a computational tool for the prediction of interactions between protein targets and drug-like compounds. After training, PASS Targets became able to predict interactions of drug-like compounds with 2507 protein targets from different organisms based on analysis of structure-activity relationships for 589,107 different chemical compounds. The prediction accuracy, estimated as AUC ROC calculated by the leave-one-out cross-validation and 20-fold cross-validation procedures, was about 96%. Average AUC ROC value was about 90% for the external test set from approximately 700 known drugs interacting with 206 protein targets.
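The AUC ROC figures quoted in this abstract can be read as the probability that a randomly chosen interacting compound receives a higher prediction score than a randomly chosen non-interacting one. A minimal sketch of that pairwise definition follows. It is a generic implementation with a hypothetical function name, not the procedure used inside PASS, and it omits the cross-validation loop.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts, over all (positive, negative) pairs, how often the positive
    scores higher; ties contribute one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))
```

The quadratic loop is adequate for a small illustration; for data sets the size of the one described above, a rank-based formulation would be the usual choice.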
Affiliation(s)
- P V Pogodin
- a Department for Bioinformatics; Institute of Biomedical Chemistry, Pirogov Russian National Research Medical University, Moscow, Russia
- b Medico-Biological Faculty, Pirogov Russian National Research Medical University, Moscow, Russia
| | - A A Lagunin
- a Department for Bioinformatics; Institute of Biomedical Chemistry, Pirogov Russian National Research Medical University, Moscow, Russia
- b Medico-Biological Faculty, Pirogov Russian National Research Medical University, Moscow, Russia
| | - D A Filimonov
- a Department for Bioinformatics; Institute of Biomedical Chemistry, Pirogov Russian National Research Medical University, Moscow, Russia
| | - V V Poroikov
- a Department for Bioinformatics; Institute of Biomedical Chemistry, Pirogov Russian National Research Medical University, Moscow, Russia
- b Medico-Biological Faculty, Pirogov Russian National Research Medical University, Moscow, Russia
| |
|
11
|
Abstract
The emergence of a number of publicly available bioactivity databases, such as ChEMBL, PubChem BioAssay and BindingDB, has raised awareness about the topics of data curation, quality and integrity. Here we provide an overview and discussion of the current and future approaches to activity, assay and target data curation of the ChEMBL database. This curation process involves several manual and automated steps and aims to: (1) maximise data accessibility and comparability; (2) improve data integrity and flag outliers, ambiguities and potential errors; and (3) add further curated annotations and mappings thus increasing the usefulness and accuracy of the ChEMBL data for all users and modellers in particular. Issues related to activity, assay and target data curation and integrity along with their potential impact for users of the data are discussed, alongside robust selection and filter strategies in order to avoid or minimise these, depending on the desired application.
|
12
|
Kramer C, Fuchs JE, Liedl KR. Strong nonadditivity as a key structure-activity relationship feature: distinguishing structural changes from assay artifacts. J Chem Inf Model 2015; 55:483-94. [PMID: 25760829 PMCID: PMC4372821 DOI: 10.1021/acs.jcim.5b00018] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.8] [Indexed: 11/29/2022]
Abstract
Nonadditivity in protein–ligand affinity data represents highly instructive structure–activity relationship (SAR) features that indicate structural changes and have the potential to guide rational drug design. At the same time, nonadditivity is a challenge for both basic SAR analysis as well as many ligand-based data analysis techniques such as Free-Wilson Analysis and Matched Molecular Pair analysis, since linear substituent contribution models inherently assume additivity and thus do not work in such cases. While structural causes for nonadditivity have been analyzed anecdotally, no systematic approaches to interpret and use nonadditivity prospectively have been developed yet. In this contribution, we lay the statistical framework for systematic analysis of nonadditivity in a SAR series. First, we develop a general metric to quantify nonadditivity. Then, we demonstrate the non-negligible impact of experimental uncertainty that creates apparent nonadditivity, and we introduce techniques to handle experimental uncertainty. Finally, we analyze public SAR data sets for strong nonadditivity and use recourse to the original publications and available X-ray structures to find structural explanations for the nonadditivity observed. We find that all cases of strong nonadditivity (ΔΔpKi and ΔΔpIC50 > 2.0 log units) with sufficient structural information to generate reasonable hypothesis involve changes in binding mode. With the appropriate statistical basis, nonadditivity analysis offers a variety of new attempts for various areas in computer-aided drug design, including the validation of scoring functions and free energy perturbation approaches, binding pocket classification, and novel features in SAR analysis tools.
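One common way to quantify nonadditivity is a double-transformation cycle of four compounds: a parent, two singly modified analogues, and the doubly modified analogue. The sketch below computes the deviation from additivity on a p-scale (pKi or pIC50) and the uncertainty that follows if each of the four measurements carries an independent error with standard deviation sigma. The function names are hypothetical, the independence and equal-variance conditions are assumptions, and the abstract does not spell out the exact form of the authors' metric. The 2.0 log-unit cutoff is the "strong nonadditivity" threshold quoted above.

```python
import math


def nonadditivity(p_parent, p_a, p_b, p_ab):
    """Deviation from additivity in a four-compound cycle (log units).

    Effect of modification A measured in the presence of B, minus its
    effect in the absence of B. Zero means perfectly additive.
    """
    return (p_ab - p_b) - (p_a - p_parent)


def nonadditivity_sd(sigma):
    """Standard deviation of the cycle value for four independent
    measurements, each with standard deviation sigma: sqrt(4) * sigma."""
    return math.sqrt(4.0) * sigma


def is_strong(value, threshold=2.0):
    """True when |value| exceeds the threshold in log units."""
    return abs(value) > threshold


if __name__ == "__main__":
    dd = nonadditivity(p_parent=6.0, p_a=7.0, p_b=6.5, p_ab=9.8)
    print("nonadditivity:", dd)
    print("sd for sigma = 0.3:", nonadditivity_sd(0.3))
    print("strong:", is_strong(dd))
```

Because four measurements enter the cycle, experimental noise is amplified in the result, which is why apparent nonadditivity has to be judged against measurement uncertainty before any structural interpretation is attempted.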
Affiliation(s)
- Christian Kramer
- Department of Theoretical Chemistry, Faculty for Chemistry and Pharmacy, Center for Molecular Biosciences Innsbruck (CMBI), Leopold-Franzens University Innsbruck, Innrain 80/82, A-6020 Innsbruck, Austria
| | - Julian E Fuchs
- Department of Theoretical Chemistry, Faculty for Chemistry and Pharmacy, Center for Molecular Biosciences Innsbruck (CMBI), Leopold-Franzens University Innsbruck, Innrain 80/82, A-6020 Innsbruck, Austria; Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
| | - Klaus R Liedl
- Department of Theoretical Chemistry, Faculty for Chemistry and Pharmacy, Center for Molecular Biosciences Innsbruck (CMBI), Leopold-Franzens University Innsbruck, Innrain 80/82, A-6020 Innsbruck, Austria
| |
|
13
|
Inhester T, Rarey M. Protein-ligand interaction databases: advanced tools to mine activity data and interactions on a structural level. WILEY INTERDISCIPLINARY REVIEWS-COMPUTATIONAL MOLECULAR SCIENCE 2014. [DOI: 10.1002/wcms.1192] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Indexed: 12/29/2022]
Affiliation(s)
- Therese Inhester
- Center for Bioinformatics; University of Hamburg; Hamburg Germany
| | - Matthias Rarey
- Center for Bioinformatics; University of Hamburg; Hamburg Germany
| |
|
14
|
Kramer C, Fuchs JE, Whitebread S, Gedeck P, Liedl KR. Matched Molecular Pair Analysis: Significance and the Impact of Experimental Uncertainty. J Med Chem 2014; 57:3786-802. [DOI: 10.1021/jm500317a] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.6] [Indexed: 01/02/2023]
Affiliation(s)
- Christian Kramer
- Department of Theoretical Chemistry, Faculty for Chemistry and Pharmacy, Center for Molecular Biosciences Innsbruck (CMBI), Leopold-Franzens University Innsbruck, Innrain 80/82, A-6020 Innsbruck, Austria
| | - Julian E. Fuchs
- Department of Theoretical Chemistry, Faculty for Chemistry and Pharmacy, Center for Molecular Biosciences Innsbruck (CMBI), Leopold-Franzens University Innsbruck, Innrain 80/82, A-6020 Innsbruck, Austria
| | - Steven Whitebread
- Preclinical Safety Profiling, Center for Proteomic Chemistry, Novartis Institutes for BioMedical Research, 250 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
| | - Peter Gedeck
- Novartis Institute for Tropical Diseases, 10 Biopolis Road, No. 05-01 Chromos, Singapore 138670, Singapore
| | - Klaus R. Liedl
- Department of Theoretical Chemistry, Faculty for Chemistry and Pharmacy, Center for Molecular Biosciences Innsbruck (CMBI), Leopold-Franzens University Innsbruck, Innrain 80/82, A-6020 Innsbruck, Austria
| |
|