1
Yosipof A, Guedes RC, García-Sosa AT. Data Mining and Machine Learning Models for Predicting Drug Likeness and Their Disease or Organ Category. Front Chem 2018; 6:162. [PMID: 29868564 PMCID: PMC5954128 DOI: 10.3389/fchem.2018.00162] [Citation(s) in RCA: 34] [Impact Index Per Article: 5.7] [Received: 02/28/2018] [Accepted: 04/20/2018] [Indexed: 12/11/2022]
Abstract
Data mining approaches can uncover underlying patterns in chemical and pharmacological property space decisive for drug discovery and development. Two of the most common approaches are visualization and machine learning methods. Visualization methods use dimensionality reduction techniques in order to reduce multi-dimension data into 2D or 3D representations with a minimal loss of information. Machine learning attempts to find correlations between specific activities or classifications for a set of compounds and their features by means of recurring mathematical models. Both models take advantage of the different and deep relationships that can exist between features of compounds, and helpfully provide classification of compounds based on such features or in case of visualization methods uncover underlying patterns in the feature space. Drug-likeness has been studied from several viewpoints, but here we provide the first implementation in chemoinformatics of the t-Distributed Stochastic Neighbor Embedding (t-SNE) method for the visualization and the representation of chemical space, and the use of different machine learning methods separately and together to form a new ensemble learning method called AL Boost. The models obtained from AL Boost synergistically combine decision tree, random forests (RF), support vector machine (SVM), artificial neural network (ANN), k nearest neighbors (kNN), and logistic regression models. In this work, we show that together they form a predictive model that not only improves the predictive force but also decreases bias. This resulted in a corrected classification rate of over 0.81, as well as higher sensitivity and specificity rates for the models. In addition, separation and good models were also achieved for disease categories such as antineoplastic compounds and nervous system diseases, among others. 
Such models can be used to guide decision on the feature landscape of compounds and their likeness to either drugs or other characteristics, such as specific or multiple disease-category(ies) or organ(s) of action of a molecule.
Affiliation(s)
- Abraham Yosipof: Department of Information Systems and Department of Business Administration, College of Law & Business, Ramat-Gan, Israel
- Rita C Guedes: Department of Medicinal Chemistry, Faculty of Pharmacy, Research Institute for Medicines (iMed.ULisboa), Universidade de Lisboa, Lisbon, Portugal
- Alfonso T García-Sosa: Department of Molecular Technology, Institute of Chemistry, University of Tartu, Tartu, Estonia
2
Lagarde N, Zagury JF, Montes M. Benchmarking Data Sets for the Evaluation of Virtual Ligand Screening Methods: Review and Perspectives. J Chem Inf Model 2015; 55:1297-307. [PMID: 26038804 DOI: 10.1021/acs.jcim.5b00090] [Citation(s) in RCA: 59] [Impact Index Per Article: 6.6] [Indexed: 12/11/2022]
Abstract
Virtual screening methods are commonly used nowadays in drug discovery processes. However, to ensure their reliability, they have to be carefully evaluated. The evaluation of these methods is often realized in a retrospective way, notably by studying the enrichment of benchmarking data sets. To this purpose, numerous benchmarking data sets were developed over the years, and the resulting improvements led to the availability of high quality benchmarking data sets. However, some points still have to be considered in the selection of the active compounds, decoys, and protein structures to obtain optimal benchmarking data sets.
Affiliation(s)
- Nathalie Lagarde: Laboratoire Génomique, Bioinformatique et Applications, EA 4627, Conservatoire National des Arts et Métiers, 292 rue Saint Martin, 75003 Paris, France
- Jean-François Zagury: Laboratoire Génomique, Bioinformatique et Applications, EA 4627, Conservatoire National des Arts et Métiers, 292 rue Saint Martin, 75003 Paris, France
- Matthieu Montes: Laboratoire Génomique, Bioinformatique et Applications, EA 4627, Conservatoire National des Arts et Métiers, 292 rue Saint Martin, 75003 Paris, France
3
Meyer T, Knapp EW. Database of protein complexes with multivalent binding ability: Bival-bind. Proteins 2013; 82:744-51. [DOI: 10.1002/prot.24453] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 08/28/2013] [Revised: 10/15/2013] [Accepted: 10/21/2013] [Indexed: 01/13/2023]
Affiliation(s)
- Tim Meyer: Fachbereich Biologie, Chemie, Pharmazie/Institute of Chemistry and Biochemistry, Freie Universität Berlin, 14195 Berlin, Germany
- Ernst-Walter Knapp: Fachbereich Biologie, Chemie, Pharmazie/Institute of Chemistry and Biochemistry, Freie Universität Berlin, 14195 Berlin, Germany
4
Mavridis L, Mitchell JB. Predicting the protein targets for athletic performance-enhancing substances. J Cheminform 2013; 5:31. [PMID: 23800040 PMCID: PMC3701582 DOI: 10.1186/1758-2946-5-31] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Received: 04/16/2013] [Accepted: 06/17/2013] [Indexed: 12/02/2022]
Abstract
Background The World Anti-Doping Agency (WADA) publishes the Prohibited List, a manually compiled international standard of substances and methods prohibited in-competition, out-of-competition and in particular sports. It would be ideal to be able to identify all substances that have one or more performance-enhancing pharmacological actions in an automated, fast and cost effective way. Here, we use experimental data derived from the ChEMBL database (~7,000,000 activity records for 1,300,000 compounds) to build a database model that takes into account both structure and experimental information, and use this database to predict both on-target and off-target interactions between these molecules and targets relevant to doping in sport. Results The ChEMBL database was screened and eight well populated categories of activities (Ki, Kd, EC50, ED50, activity, potency, inhibition and IC50) were used for a rule-based filtering process to define the labels “active” or “inactive”. The “active” compounds for each of the ChEMBL families were thereby defined and these populated our bioactivity-based filtered families. A structure-based clustering step was subsequently performed in order to split families with more than one distinct chemical scaffold. This produced refined families, whose members share both a common chemical scaffold and bioactivity against a common target in ChEMBL. Conclusions We have used the Parzen-Rosenblatt machine learning approach to test whether compounds in ChEMBL can be correctly predicted to belong to their appropriate refined families. Validation tests using the refined families gave a significant increase in predictivity compared with the filtered or with the original families. Out of 61,660 queries in our Monte Carlo cross-validation, belonging to 19,639 refined families, 41,300 (66.98%) had the parent family as the top prediction and 53,797 (87.25%) had the parent family in the top four hits. 
Having thus validated our approach, we used it to identify the protein targets associated with the WADA prohibited classes. For compounds where we do not have experimental data, we use their computed patterns of interaction with protein targets to make predictions of bioactivity. We hope that other groups will test these predictions experimentally in the future.
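The validation figures quoted in this abstract are top-k hit rates: the share of queries whose parent family appears among the first k predictions. The following is a minimal sketch of that metric, assuming each query comes with a ranked list of predicted family labels; the function name and toy labels are hypothetical and are not taken from the paper. The last two lines simply recompute the reported percentages from the counts stated in the abstract.

```python
def top_k_hit_rate(ranked_predictions, true_labels, k):
    """Fraction of queries whose true family is among the first k predictions."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Toy data with hypothetical family labels (not from the paper).
ranked = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]]
truth = ["A", "A", "B"]
top1 = top_k_hit_rate(ranked, truth, 1)
top2 = top_k_hit_rate(ranked, truth, 2)

# Percentages recomputed from the counts given in the abstract.
reported_top1 = round(100 * 41300 / 61660, 2)
reported_top4 = round(100 * 53797 / 61660, 2)
```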
Affiliation(s)
- Lazaros Mavridis: Biomedical Sciences Research Complex and EaStCHEM School of Chemistry, Purdie Building, University of St Andrews, North Haugh, St Andrews, Scotland KY16 9ST, UK.
5
García-Sosa AT, Maran U. Drugs, non-drugs, and disease category specificity: organ effects by ligand pharmacology. SAR QSAR Environ Res 2013; 24:319-331. [PMID: 23534612 DOI: 10.1080/1062936x.2013.773373] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 06/02/2023]
Abstract
Important understanding can be gained from using molecular biology-based and chemistry-based techniques together. Bayesian classifiers have thus been developed in the present work using several statistically significant molecular properties of compiled datasets of drugs and non-drugs, including their disease category or organ. The results show they provide a useful classification and simplicity of several different ligand efficiencies and molecular properties. Early recall of drugs among non-drugs using the classifiers as a ranking tool is also provided. As the chemical space of compounds is addressed together with their anatomical characterization, chemical libraries can be improved to select for specific organ or disease. Eventually, by including even finer detail, the method may help in designing libraries with specific pharmacological or toxicological target chemical space. Alternatively, a lack of statistically significant differences in property density distributions may help in further describing compounds with possibility of activity on several organs or disease groups, and given their very similar or considerably overlapping chemical space, therefore wanted or unwanted side-effects. The overlaps between densities for several properties of organs or disease categories were calculated by integrating the area under the curves where they intersect. The naïve Bayesian classifiers are readily built, fast to score, and easily interpretable.
Affiliation(s)
- A T García-Sosa: Institute of Chemistry, University of Tartu, Tartu, Estonia.
6
García-Sosa AT, Oja M, Hetényi C, Maran U. DrugLogit: logistic discrimination between drugs and nondrugs including disease-specificity by assigning probabilities based on molecular properties. J Chem Inf Model 2012; 52:2165-80. [PMID: 22830445 DOI: 10.1021/ci200587h] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Indexed: 11/28/2022]
Abstract
The increasing knowledge of both structure and activity of compounds provides a good basis for enhancing the pharmacological characterization of chemical libraries. In addition, pharmacology can be seen as incorporating both advances from molecular biology as well as chemical sciences, with innovative insight provided from studying target-ligand data from a ligand molecular point of view. Predictions and profiling of libraries of drug candidates have previously focused mainly on certain cases of oral bioavailability. Inclusion of other administration routes and disease-specificity would improve the precision of drug profiling. In this work, recent data are extended, and a probability-based approach is introduced for quantitative and gradual classification of compounds into categories of drugs/nondrugs, as well as for disease- or organ-specificity. Using experimental data of over 1067 compounds and multivariate logistic regressions, the classification shows good performance in training and independent test cases. The regressions have high statistical significance in terms of the robustness of coefficients and 95% confidence intervals provided by a 1000-fold bootstrapping resampling. Besides their good predictive power, the classification functions remain chemically interpretable, containing only one to five variables in total, and the physicochemical terms involved can be easily calculated. The present approach is useful for an improved description and filtering of compound libraries. It can also be applied sequentially or in combinations of filters, as well as adapted to particular use cases. The scores and equations may be able to suggest possible routes for compound or library modification. The data is made available for reuse by others, and the equations are freely accessible at http://hermes.chem.ut.ee/~alfx/druglogit.html.
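A logistic classification function of the kind this abstract describes maps a weighted sum of a few molecular descriptors to a probability between 0 and 1. The sketch below only illustrates that general form; the descriptor names, coefficients, and intercept are hypothetical placeholders and are not the published DrugLogit equations.

```python
import math


def drug_probability(descriptors, coefficients, intercept):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum_i b_i * x_i)))."""
    z = intercept + sum(
        coefficients[name] * descriptors[name] for name in coefficients
    )
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients and descriptor values, for illustration only.
coefficients = {"logP": 0.5, "mol_weight": -0.01}
compound = {"logP": 2.0, "mol_weight": 100.0}
p = drug_probability(compound, coefficients, intercept=0.0)
```

With only one to five terms, as the abstract notes, each coefficient can be read directly as the direction and strength of a descriptor's influence on the predicted probability.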
7
García-Sosa AT, Oja M, Hetényi C, Maran U. Disease-Specific Differentiation Between Drugs and Non-Drugs Using Principal Component Analysis of Their Molecular Descriptor Space. Mol Inform 2012; 31:369-83. [DOI: 10.1002/minf.201100094] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 06/07/2011] [Accepted: 01/25/2012] [Indexed: 01/04/2023]
8
Xue M, Zheng M, Xiong B, Li Y, Jiang H, Shen J. Knowledge-based scoring functions in drug design. 1. Developing a target-specific method for kinase-ligand interactions. J Chem Inf Model 2010; 50:1378-86. [PMID: 20681607 DOI: 10.1021/ci100182c] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.0] [Indexed: 11/29/2022]
Abstract
Protein kinases are attractive targets for therapeutic interventions in many diseases. Due to their importance in drug discovery, a kinase family-specific potential of mean force (PMF) scoring function, kinase-PMF, was developed to assess the binding of ATP-competitive kinase inhibitors. It is hypothesized that target-specific PMF scoring functions may achieve increased performance in scoring along with the growth of the PDB database. The kinase-PMF inherits the functions and atom types in PMF04 and uses a kinase data set of 872 complexes to derive the potentials. The performance of kinase-PMF was evaluated with an external test set containing 128 kinase crystal structures. We compared it with eight scoring functions commonly used in computer-aided drug design, either in terms of the retrieval rate of retrieving "right" conformations or a virtual screening study. The evaluation results clearly demonstrate that a target-specific scoring function is a promising way to improve prediction power in structure-based drug design compared with other general scoring functions. To provide this rescoring service for researchers, a publicly accessible Web site was established at http://202.127.30.184:8080/scoring/index.jsp .
Affiliation(s)
- Mengzhu Xue: State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Zhangjiang Hi-Tech Park, Pudong, Shanghai, China 201203
9
Ogawa T, Nakano T. The Extended Universal Force Field (XUFF): Theory and Applications. Chem-Bio Informatics Journal 2010. [DOI: 10.1273/cbij.10.111] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/22/2022]
Affiliation(s)
- Tatsuya Nakano: Division of Medicinal Safety Science, National Institute of Health Sciences
10
Novikov FN, Stroylov VS, Stroganov OV, Chilov GG. Improving performance of docking-based virtual screening by structural filtration. J Mol Model 2009; 16:1223-30. [PMID: 20041273 DOI: 10.1007/s00894-009-0633-8] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 08/06/2009] [Accepted: 11/16/2009] [Indexed: 10/20/2022]
Abstract
In the current study an innovative method of structural filtration of docked ligand poses is introduced and applied to improve the virtual screening results. The structural filter is defined by a protein-specific set of interactions that are a) structurally conserved in available structures of a particular protein with its bound ligands, and b) that can be viewed as playing the crucial role in protein-ligand binding. The concept was evaluated on a set of 10 diverse proteins, for which the corresponding structural filters were developed and applied to the results of virtual screening obtained with the Lead Finder software. The application of structural filtration resulted in a considerable improvement of the enrichment factor ranging from several folds to hundreds folds depending on the protein target. It appeared that the structural filtration had effectively repaired the deficiencies of the scoring functions that used to overestimate decoy binding, resulting into a considerably lower false positive rate. In addition, the structural filters were also effective in dealing with some deficiencies of the protein structure models that would lead to false negative predictions otherwise. The ability of structural filtration to recover relatively small but specifically bound molecules creates promises for the application of this technology in the fragment-based drug discovery.
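The enrichment factor mentioned in this abstract is conventionally the active rate in the top-scored fraction of a library divided by the active rate in the whole library. The sketch below shows how demoting poses that fail an interaction filter can raise it. It assumes a higher score is better and uses hypothetical pass/fail flags; it is not the Lead Finder implementation, and the scores are invented toy values.

```python
def enrichment_factor(scores, is_active, top_fraction):
    """EF = (active rate in the top-scored fraction) / (active rate overall)."""
    n_total = len(scores)
    n_top = max(1, int(round(n_total * top_fraction)))
    # Rank compounds from best to worst; a higher score is assumed to be better.
    order = sorted(range(n_total), key=lambda i: scores[i], reverse=True)
    actives_top = sum(is_active[i] for i in order[:n_top])
    actives_total = sum(is_active)
    return (actives_top / n_top) / (actives_total / n_total)


def apply_structural_filter(scores, passes_filter):
    """Push poses lacking the required interactions to the bottom of the ranking."""
    return [s if ok else float("-inf") for s, ok in zip(scores, passes_filter)]


# Toy library: two actives (indices 2 and 5) outscored by two decoys.
scores = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
is_active = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
# Hypothetical filter outcome: only the two actives keep the conserved contacts.
passes_filter = [False, False, True, False, False, True, False, False, False, False]

ef_before = enrichment_factor(scores, is_active, top_fraction=0.2)
ef_after = enrichment_factor(
    apply_structural_filter(scores, passes_filter), is_active, top_fraction=0.2
)
```

In this toy case the top 20% contains no actives before filtering and both actives afterwards, which is the kind of false-positive reduction the abstract attributes to structural filtration.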
11
Saravanan SE, Karthi R, Sathish K, Kokila K, Sabarinathan R, Sekar K. MLDB: macromolecule ligand database. J Appl Crystallogr 2009. [DOI: 10.1107/s0021889809048626] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022]
Abstract
MLDB (macromolecule ligand database) is a knowledgebase containing ligands co-crystallized with the three-dimensional structures available in the Protein Data Bank. The proposed knowledgebase serves as an open resource for the analysis and visualization of all ligands and their interactions with macromolecular structures. MLDB can be used to search ligands, and their interactions can be visualized both in text and graphical formats. MLDB will be updated at regular intervals (weekly) with automated Perl scripts. The knowledgebase is intended to serve the scientific community working in the areas of molecular and structural biology. It is available free to users around the clock and can be accessed at http://dicsoft2.physics.iisc.ernet.in/mldb/.
12
Søndergaard CR, Garrett AE, Carstensen T, Pollastri G, Nielsen JE. Structural artifacts in protein-ligand X-ray structures: implications for the development of docking scoring functions. J Med Chem 2009; 52:5673-84. [PMID: 19711919 DOI: 10.1021/jm8016464] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.0] [Indexed: 11/28/2022]
Abstract
The development of docking scoring functions requires high-resolution 3D structures of protein-ligand complexes for which the binding affinity of the ligand has been measured experimentally. Protein-ligand binding affinities are measured in solution experiments, and high resolution protein-ligand structures can be determined only by X-ray crystallography. Protein-ligand scoring functions must therefore reproduce solution binding energies using analyses of proteins in a crystal environment. We present an analysis of the prevalence of crystal-induced artifacts and water-mediated contacts in protein-ligand complexes and demonstrate the effect that these can have on the performance of protein-ligand scoring functions. We find 36% of ligands in the PDBBind 2007 refined data set to be influenced by crystal contacts and find the performance of a scoring function to be affected by these. A Web server for detecting crystal contacts in protein-ligand complexes is available at http://enzyme.ucd.ie/LIGCRYST .
Affiliation(s)
- Chresten R Søndergaard: School of Biomolecular and Biomedical Science, Centre for Synthesis and Chemical Biology, UCD Conway Institute, University College Dublin, Belfield, Dublin 4, Ireland
13
Doddareddy MR, van Westen GJP, van der Horst E, Peironcely JE, Corthals F, Ijzerman AP, Emmerich M, Jenkins JL, Bender A. Chemogenomics: Looking at biology through the lens of chemistry. Stat Anal Data Min 2009. [DOI: 10.1002/sam.10046] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Indexed: 11/09/2022]
14
Kirchmair J, Markt P, Distinto S, Schuster D, Spitzer GM, Liedl KR, Langer T, Wolber G. The Protein Data Bank (PDB), its related services and software tools as key components for in silico guided drug discovery. J Med Chem 2009; 51:7021-40. [PMID: 18975926 DOI: 10.1021/jm8005977] [Citation(s) in RCA: 70] [Impact Index Per Article: 4.7] [Indexed: 11/30/2022]
Affiliation(s)
- Johannes Kirchmair: Department of Pharmaceutical Chemistry, Faculty of Chemistry and Pharmacy and Center for Molecular Biosciences, University of Innsbruck, Innrain 52, A-6020 Innsbruck, Austria
15
Stroganov OV, Novikov FN, Stroylov VS, Kulkov V, Chilov GG. Lead finder: an approach to improve accuracy of protein-ligand docking, binding energy estimation, and virtual screening. J Chem Inf Model 2009; 48:2371-85. [PMID: 19007114 DOI: 10.1021/ci800166p] [Citation(s) in RCA: 141] [Impact Index Per Article: 9.4] [Indexed: 11/30/2022]
Abstract
An innovative molecular docking algorithm and three specialized high accuracy scoring functions are introduced in the Lead Finder docking software. Lead Finder's algorithm for ligand docking combines the classical genetic algorithm with various local optimization procedures and resourceful exploitation of the knowledge generated during docking process. Lead Finder's scoring functions are based on a molecular mechanics functional which explicitly accounts for different types of energy contributions scaled with empiric coefficients to produce three scoring functions tailored for (a) accurate binding energy predictions; (b) correct energy-ranking of docked ligand poses; and (c) correct rank-ordering of active and inactive compounds in virtual screening experiments. The predicted values of the free energy of protein-ligand binding were benchmarked against a set of experimentally measured binding energies for 330 diverse protein-ligand complexes yielding rmsd of 1.50 kcal/mol. The accuracy of ligand docking was assessed on a set of 407 structures, which included almost all published test sets of the following programs: FlexX, Glide SP, Glide XP, Gold, LigandFit, MolDock, and Surflex. rmsd of 2 Å or less was observed for 80-96% of the structures in the test sets (80.0% on the Glide XP and FlexX test sets, 96.0% on the Surflex and MolDock test sets). The ability of Lead Finder to distinguish between active and inactive compounds during virtual screening experiments was benchmarked against 34 therapeutically relevant protein targets. Impressive enrichment factors were obtained for almost all of the targets with the average area under receiver operator curve being equal to 0.92.
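The area under the receiver operating characteristic curve reported in this abstract can be read as the probability that a randomly chosen active is ranked above a randomly chosen inactive. A minimal sketch of that pairwise calculation follows, assuming a higher score means a better-ranked compound (docking energies, where lower is better, would need their sign flipped first). The function name and toy scores are hypothetical and are not taken from the paper.

```python
def roc_auc(scores, is_active):
    """AUC as the fraction of (active, inactive) pairs ranked correctly; ties count half."""
    actives = [s for s, a in zip(scores, is_active) if a]
    inactives = [s for s, a in zip(scores, is_active) if not a]
    wins = 0.0
    for s_act in actives:
        for s_inact in inactives:
            if s_act > s_inact:
                wins += 1.0
            elif s_act == s_inact:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


# Toy screen: two actives and three inactives with invented scores.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
is_active = [1, 0, 1, 0, 0]
auc = roc_auc(scores, is_active)
```

The double loop is quadratic in library size, which is acceptable for a sketch; a rank-based formulation would be preferable for large screens.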
Affiliation(s)
- Oleg V Stroganov: MolTech Ltd., Leninskie gory, 1/75A, Moscow 119992, Russian Federation, and BioMolTech Corp., 226 York Mills Road, Toronto, Ontario M2L 1L1, Canada
16
Abstract
Computational biology/chemistry tools are used in most areas of life/health science research. These methods are continually being developed and their use can present difficulties for both experienced and novice investigators. To facilitate the use of these applications, many packages have been implemented online during these last 5 years. This unit focuses on online computational methods with a special emphasis on structural refinement/atomic simulations, protein electrostatic calculations, searches for functional sites, searches for druggable pockets, protein docking and small molecule docking, and prediction of potential impact of amino acid variations on the structure and function of the protein molecules.
17
Irwin JJ. Community benchmarks for virtual screening. J Comput Aided Mol Des 2008; 22:193-9. [DOI: 10.1007/s10822-008-9189-4] [Citation(s) in RCA: 133] [Impact Index Per Article: 8.3] [Received: 11/28/2007] [Accepted: 01/30/2008] [Indexed: 11/24/2022]
18
Senger S, Leach AR. SAR Knowledge Bases in Drug Discovery. Annual Reports in Computational Chemistry 2008. [DOI: 10.1016/s1574-1400(08)00011-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/07/2023]
19
Benson ML, Smith RD, Khazanov NA, Dimcheff B, Beaver J, Dresslar P, Nerothin J, Carlson HA. Binding MOAD, a high-quality protein-ligand database. Nucleic Acids Res 2007; 36:D674-8. [PMID: 18055497 PMCID: PMC2238910 DOI: 10.1093/nar/gkm911] [Citation(s) in RCA: 119] [Impact Index Per Article: 7.0] [Indexed: 11/13/2022]
Abstract
Binding MOAD (Mother of All Databases) is a database of 9836 protein-ligand crystal structures. All biologically relevant ligands are annotated, and experimental binding-affinity data is reported when available. Binding MOAD has almost doubled in size since it was originally introduced in 2004, demonstrating steady growth with each annual update. Several technologies, such as natural language processing, help drive this constant expansion. Along with increasing data, Binding MOAD has improved usability. The website now showcases a faster, more featured viewer to examine the protein-ligand structures. Ligands have additional chemical data, allowing for cheminformatics mining. Lastly, logins are no longer necessary, and Binding MOAD is freely available to all at http://www.BindingMOAD.org.
Affiliation(s)
- Mark L Benson: Bioinformatics Graduate Program, Biophysics Research Division, University of Michigan, Ann Arbor, MI 48109, Torrey Path LLC, Ann Arbor, MI 48104, USA
20
Li H, Yap CW, Ung CY, Xue Y, Li ZR, Han LY, Lin HH, Chen YZ. Machine learning approaches for predicting compounds that interact with therapeutic and ADMET related proteins. J Pharm Sci 2007; 96:2838-60. [PMID: 17786989 DOI: 10.1002/jps.20985] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.7] [Indexed: 02/06/2023]
Abstract
Computational methods for predicting compounds of specific pharmacodynamic and ADMET (absorption, distribution, metabolism, excretion and toxicity) property are useful for facilitating drug discovery and evaluation. Recently, machine learning methods such as neural networks and support vector machines have been explored for predicting inhibitors, antagonists, blockers, agonists, activators and substrates of proteins related to specific therapeutic and ADMET property. These methods are particularly useful for compounds of diverse structures to complement QSAR methods, and for cases of unavailable receptor 3D structure to complement structure-based methods. A number of studies have demonstrated the potential of these methods for predicting such compounds as substrates of P-glycoprotein and cytochrome P450 CYP isoenzymes, inhibitors of protein kinases and CYP isoenzymes, and agonists of serotonin receptor and estrogen receptor. This article is intended to review the strategies, current progresses and underlying difficulties in using machine learning methods for predicting these protein binders and as potential virtual screening tools. Algorithms for proper representation of the structural and physicochemical properties of compounds are also evaluated.
Affiliation(s)
- H Li: Bioinformatics and Drug Design Group, Department of Pharmacy and Department of Computational Science, National University of Singapore, Blk S16, Level 8, 3 Science Drive 2, Singapore 117543, Singapore
21
Nakata K, Tanaka Y, Nakano T, Adachi T, Tanaka H, Kaminuma T, Ishikawa T. Nuclear receptor-mediated transcriptional regulation in Phase I, II, and III xenobiotic metabolizing systems. Drug Metab Pharmacokinet 2007; 21:437-57. [PMID: 17220560 DOI: 10.2133/dmpk.21.437] [Citation(s) in RCA: 146] [Impact Index Per Article: 8.6] [Indexed: 12/19/2022]
Abstract
Studies of the genetic regulation involved in drug metabolizing enzymes and drug transporters are of great interest to understand the molecular mechanisms of drug response and toxic events. Recent reports have revealed that hydrophobic ligands and several nuclear receptors are involved in the induction or down-regulation of various enzymes and transporters involved in Phase I, II, and III xenobiotic metabolizing systems. Nuclear receptors (NRs) form a family of ligand-activated transcription factors (TFs). These proteins modulate the regulation of target genes by contacting their promoter or enhancer sequences at specific recognition sites. These target genes include metabolizing enzymes such as cytochrome P450s (CYPs), transporters, and NRs. Thus it was now recognized that these NRs play essential role in sensing processing xenobiotic substances including drugs, environmental chemical pollutants and nutritional ingredients. From literature, we picked up target genes of each NR in xenobiotic response systems. Possible cross-talk, by which xenobiotics may exert undesirable effects, was listed. For example, the role of NRs was comprehensively drawn up in cholesterol and bile acid homeostasis in human hepatocyte. Summarizing current states of related research, especially for in silico response element search, we tried to elucidate nuclear receptor mediated xenobiotic processing loops and direct future research.
22
Abstract
Ligand enrichment among top-ranking hits is a key metric of molecular docking. To avoid bias, decoys should resemble ligands physically, so that enrichment is not simply a separation of gross features, yet be chemically distinct from them, so that they are unlikely to be binders. We have assembled a directory of useful decoys (DUD), with 2950 ligands for 40 different targets. Every ligand has 36 decoy molecules that are physically similar but topologically distinct, leading to a database of 98,266 compounds. For most targets, enrichment was at least half a log better with uncorrected databases such as the MDDR than with DUD, evidence of bias in the former. These calculations also allowed 40x40 cross-docking, where the enrichments of each ligand set could be compared for all 40 targets, enabling a specificity metric for the docking screens. DUD is freely available online as a benchmarking set for docking at http://blaster.docking.org/dud/.
Affiliation(s)
- Niu Huang: Department of Pharmaceutical Chemistry, University of California San Francisco, QB3 Building, 1700 4th Street, Box 2550, San Francisco, California 94143-2550, USA
23
Liu T, Lin Y, Wen X, Jorissen RN, Gilson MK. BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Res 2006; 35:D198-201. [PMID: 17145705 PMCID: PMC1751547 DOI: 10.1093/nar/gkl999] [Citation(s) in RCA: 1203] [Impact Index Per Article: 66.8] [Indexed: 11/21/2022]
Abstract
BindingDB is a publicly accessible database currently containing ∼20 000 experimentally determined binding affinities of protein–ligand complexes, for 110 protein targets including isoforms and mutational variants, and ∼11 000 small molecule ligands. The data are extracted from the scientific literature, data collection focusing on proteins that are drug-targets or candidate drug-targets and for which structural data are present in the Protein Data Bank. The BindingDB website supports a range of query types, including searches by chemical structure, substructure and similarity; protein sequence; ligand and protein names; affinity ranges and molecular weight. Data sets generated by BindingDB queries can be downloaded in the form of annotated SDfiles for further analysis, or used as the basis for virtual screening of a compound database uploaded by the user. The data in BindingDB are linked both to structural data in the PDB via PDB IDs and chemical and sequence searches, and to the literature in PubMed via PubMed IDs.
Affiliation(s)
- Michael K. Gilson
- To whom correspondence should be addressed. Tel: +1 240 314 6217; Fax: +1 240 314 6255
|
24
|
Strömbergsson H, Kryshtafovych A, Prusis P, Fidelis K, Wikberg JES, Komorowski J, Hvidsten TR. Generalized modeling of enzyme-ligand interactions using proteochemometrics and local protein substructures. Proteins 2006; 65:568-79. [PMID: 16948162 DOI: 10.1002/prot.21163] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.9] [Indexed: 11/12/2022]
Abstract
Modeling and understanding protein-ligand interactions is one of the most important goals in computational drug discovery. To this end, proteochemometrics uses structural and chemical descriptors from several proteins and several ligands to induce interaction models. Here, we present a new and generalized approach in which proteins varying greatly in terms of sequence and structure are represented by a library of local substructures. Using linear regression and rule-based learning, we combine such local substructures with chemical descriptors from the ligands to model binding affinity for a training set of hydrolase and lyase enzymes. We evaluate the predictive performance of these models using cross-validation and sets of unseen ligands with unknown three-dimensional structure. The models are shown to generalize by outperforming models using descriptors from only proteins or only ligands, or models using global structure similarities rather than local similarities. Thus, we demonstrate that this approach is capable of describing dependencies between local structural properties and ligands in otherwise dissimilar protein structures. These dependencies are often, but not always, associated with local substructures that are in contact with the ligands. Finally, we show that strongly bound enzyme-ligand complexes require the presence of particular local substructures, while weakly bound complexes may be described by the absence of certain properties. The results demonstrate that the alignment-independent approach using local substructures is capable of describing protein-ligand interaction for largely different proteins and hence opens the way for proteochemometric analysis of the interaction space of entire proteomes. Current approaches are limited to families of closely related proteins.
Affiliation(s)
- Helena Strömbergsson
- The Linnaeus Centre for Bioinformatics, Uppsala University, SE-751 24, Uppsala, Sweden
|
25
|
Strachan RT, Ferrara G, Roth BL. Screening the receptorome: an efficient approach for drug discovery and target validation. Drug Discov Today 2006; 11:708-16. [PMID: 16846798 DOI: 10.1016/j.drudis.2006.06.012] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.7] [Received: 01/11/2006] [Revised: 06/02/2006] [Accepted: 06/16/2006] [Indexed: 11/18/2022]
Abstract
The receptorome, comprising at least 5% of the human genome, encodes receptors that mediate the physiological, pathological and therapeutic responses to a vast number of exogenous and endogenous ligands. Not surprisingly, the majority of approved medications target members of the receptorome. Several in silico and physical screening approaches have been devised to mine the receptorome efficiently for the discovery and validation of molecular targets for therapeutic drug discovery. Receptorome screening has also been used to discover, and thereby avoid, the molecular targets responsible for serious and unforeseen drug side effects.
Affiliation(s)
- Ryan T Strachan
- Department of Biochemistry, Comprehensive Cancer Center and NIMH Psychoactive Drug Screening Program, Case Western Reserve University Medical School, Cleveland, OH 44106, USA
|
26
|
Miteva MA, Violas S, Montes M, Gomez D, Tuffery P, Villoutreix BO. FAF-Drugs: free ADME/tox filtering of compound collections. Nucleic Acids Res 2006; 34:W738-44. [PMID: 16845110 PMCID: PMC1538885 DOI: 10.1093/nar/gkl065] [Citation(s) in RCA: 96] [Impact Index Per Article: 5.3] [Received: 02/08/2006] [Revised: 02/22/2006] [Accepted: 03/01/2006] [Indexed: 12/21/2022] Open
Abstract
In silico screening based on the structures of the ligands or of the receptors has become an essential tool to facilitate the drug discovery process, but compound collections are needed to carry out such in silico experiments. It has been recognized that absorption, distribution, metabolism, excretion and toxicity (ADME/tox) are key properties that need to be considered early on, even during the database preparation stage. FAF-Drugs is an online service based on Frowns (a chemoinformatics toolkit) that allows users to process their own compound collections via simple ADME/tox filtering rules such as molecular weight, polar surface area, logP or number of rotatable bonds. SMILES (Simplified Molecular Input Line Entry System), CANSMILES (canonical SMILES) or SDF (structure data file) files are required as input, and molecules that pass or do not pass the filters are sent back in CANSMILES format. This service should thus help scientists engaging in drug discovery campaigns. Other utilities and several compound collections suitable for in silico screening are available at our site. FAF-Drugs can be accessed at http://bioserv.rpbs.jussieu.fr/FAFDrugs.html.
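The kind of rule-based filtering described in this abstract can be sketched in a few lines of Python. This is not FAF-Drugs or the Frowns toolkit: the `Compound` class, the function names, and the threshold values are hypothetical and chosen only for illustration. The descriptors are assumed to be precomputed, and the two example entries use rough, illustrative descriptor values.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Compound:
    """A compound with precomputed descriptors (hypothetical container)."""
    smiles: str
    mol_weight: float          # g/mol
    logp: float                # octanol-water partition coefficient
    polar_surface_area: float  # square angstroms
    rotatable_bonds: int


# Illustrative upper limits only (assumption, not the service's defaults).
LIMITS: Dict[str, float] = {
    "mol_weight": 500.0,
    "logp": 5.0,
    "polar_surface_area": 140.0,
    "rotatable_bonds": 10,
}


def passes_filters(compound: Compound, limits: Dict[str, float] = LIMITS) -> bool:
    """True if every filtered descriptor is at or below its upper limit."""
    return all(getattr(compound, name) <= maximum for name, maximum in limits.items())


def split_collection(
    compounds: Iterable[Compound], limits: Dict[str, float] = LIMITS
) -> Tuple[List[Compound], List[Compound]]:
    """Split a collection into compounds that pass and compounds that fail."""
    passed: List[Compound] = []
    rejected: List[Compound] = []
    for compound in compounds:
        (passed if passes_filters(compound, limits) else rejected).append(compound)
    return passed, rejected


# Approximate descriptor values, for illustration only.
aspirin = Compound("CC(=O)OC1=CC=CC=C1C(=O)O", 180.16, 1.2, 63.6, 3)
tetracontane = Compound("C" * 40, 563.1, 20.0, 0.0, 37)

passed, rejected = split_collection([aspirin, tetracontane])
print([c.smiles for c in passed])
```

Keeping the limits in a dictionary makes it easy to pass a different rule set per screening campaign without changing the filtering code.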
Affiliation(s)
- Maria A. Miteva
- Inserm U648, Paris 5 University, 45 rue des Sts Peres, 75006 Paris, France
- INSERM U726, EBGM, University Paris 7, France
- Stephanie Violas
- Inserm U648, Paris 5 University, 45 rue des Sts Peres, 75006 Paris, France
- INSERM U726, EBGM, University Paris 7, France
- Matthieu Montes
- Inserm U648, Paris 5 University, 45 rue des Sts Peres, 75006 Paris, France
- INSERM U726, EBGM, University Paris 7, France
- Bruno O. Villoutreix
- To whom correspondence should be addressed. Tel: +33 (0)1 42 86 20 67; Fax: +33 (0)1 42 86 20 65
|
27
|
Han L, Cui J, Lin H, Ji Z, Cao Z, Li Y, Chen Y. Recent progresses in the application of machine learning approach for predicting protein functional class independent of sequence similarity. Proteomics 2006; 6:4023-37. [PMID: 16791826 DOI: 10.1002/pmic.200500938] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.8] [Indexed: 01/22/2023]
Abstract
A protein's sequence contains clues to its function. Functional prediction from sequence presents a challenge, particularly for proteins that have low or no sequence similarity to proteins of known function. Recently, machine learning methods have been explored for predicting the functional class of proteins from sequence-derived properties independent of sequence similarity, and they have shown promising potential for low- and non-homologous proteins. These methods can thus be explored as potential tools to complement alignment- and clustering-based methods for predicting protein function. This article reviews the strategies, current progress, and underlying difficulties in using machine learning methods for predicting the functional class of proteins. The relevant software and web servers are described. The reported prediction performances of these methods are also presented; they need to be interpreted with caution, as they depend on such factors as the datasets used and the choice of parameters.
Affiliation(s)
- Lianyi Han
- Department of Computational Science, National University of Singapore, Singapore, Singapore
|
28
|
Block P, Sotriffer CA, Dramburg I, Klebe G. AffinDB: a freely accessible database of affinities for protein-ligand complexes from the PDB. Nucleic Acids Res 2006; 34:D522-6. [PMID: 16381925 PMCID: PMC1347402 DOI: 10.1093/nar/gkj039] [Citation(s) in RCA: 76] [Impact Index Per Article: 4.2] [Indexed: 11/24/2022] Open
Abstract
AffinDB is a database of affinity data for structurally resolved protein–ligand complexes from the Protein Data Bank (PDB). It is freely accessible online. Affinity data are collected from the scientific literature, both from primary sources describing the original experimental work of affinity determination and from secondary references which report affinity values determined by others. AffinDB currently contains over 730 affinity entries covering more than 450 different protein–ligand complexes. Besides the affinity value, PDB summary information and additional data are provided, including the experimental conditions of the affinity measurement (if available in the corresponding reference); 2D drawing, SMILES code and molecular weight of the ligand; links to other databases; and bibliographic information. AffinDB can be queried by PDB code or by any combination of affinity range, temperature and pH value of the measurement, ligand molecular weight, and publication data (author, journal and year). Search results can be saved as tabular reports in text files. The database is intended to be a valuable resource for researchers interested in biomolecular recognition and in the development of tools for correlating structural data with affinities, as needed, for example, in structure-based drug design.
Affiliation(s)
- Gerhard Klebe
- To whom correspondence should be addressed. Tel: +49 6421 2821313; Fax: +49 6421 2828994
|
29
|
Tobita M, Horiuchi K, Araki K, Nemoto M, Shimada H, Nishikawa T. BirdsAnts: A protein-small molecule interaction viewer. CHEM-BIO INFORMATICS JOURNAL 2006. [DOI: 10.1273/cbij.6.17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Affiliation(s)
- Motoi Tobita
- Informatics Department, Reverse Proteomics Research Institute Co., Ltd.
- Hitachi, Ltd., Advanced Research Laboratory
- Ken Horiuchi
- Informatics Department, Reverse Proteomics Research Institute Co., Ltd.
- Kenji Araki
- Informatics Department, Reverse Proteomics Research Institute Co., Ltd.
- Masashi Nemoto
- Informatics Department, Reverse Proteomics Research Institute Co., Ltd.
- Hiroyasu Shimada
- Informatics Department, Reverse Proteomics Research Institute Co., Ltd.
- Tetsuo Nishikawa
- Informatics Department, Reverse Proteomics Research Institute Co., Ltd.
|
30
|
Affiliation(s)
- Kotoko Nakata
- AdvanceSoft Corporation/IIS, University of Tokyo, 4-6-1 Komaba, Meguro-ku 770-8505, Japan
- Shinji Amari
- Institute of Industrial Science, University of Tokyo, 4-6-1 Komaba, Meguro-ku 770-8505, Japan
- Tatsuya Nakano
- National Institute of Health Sciences, 1-18-1, Setagaya-ku, Tokyo 158-8501, Japan
|
31
|
Li H, Yap CW, Xue Y, Li ZR, Ung CY, Han LY, Chen YZ. Statistical learning approach for predicting specific pharmacodynamic, pharmacokinetic, or toxicological properties of pharmaceutical agents. Drug Dev Res 2005. [DOI: 10.1002/ddr.20044] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Indexed: 11/08/2022]
|
32
|
In Brief. Nat Rev Drug Discov 2005. [DOI: 10.1038/nrd1673] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]
|