1
Rodríguez-Pérez R, Miljković F, Bajorath J. Machine Learning in Chemoinformatics and Medicinal Chemistry. Annu Rev Biomed Data Sci 2022; 5:43-65. [PMID: 35440144 DOI: 10.1146/annurev-biodatasci-122120-124216] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 11/09/2022]
Abstract
In chemoinformatics and medicinal chemistry, machine learning has evolved into an important approach. In recent years, increasing computational resources and new deep learning algorithms have raised machine learning to a new level, addressing previously unmet challenges in pharmaceutical research. In silico approaches for compound activity predictions, de novo design, and reaction modeling have been further advanced by new algorithmic developments and the emergence of big data in the field. Herein, novel applications of machine learning and deep learning in chemoinformatics and medicinal chemistry are reviewed. Opportunities and challenges for new methods and applications are discussed, placing emphasis on proper baseline comparisons, robust validation methodologies, and new applicability domains. Expected final online publication date for the Annual Review of Biomedical Data Science, Volume 5 is August 2022. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
Affiliation(s)
- Raquel Rodríguez-Pérez
- Department of Life Science Informatics, B-IT (Bonn-Aachen International Center for Information Technology), Chemical Biology and Medicinal Chemistry Program Unit, LIMES (Life and Medical Sciences Institute), Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany; current affiliation: Novartis Institutes for Biomedical Research, Novartis Campus, Basel, Switzerland
- Filip Miljković
- Department of Life Science Informatics, B-IT (Bonn-Aachen International Center for Information Technology), Chemical Biology and Medicinal Chemistry Program Unit, LIMES (Life and Medical Sciences Institute), Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany; current affiliation: Data Science and AI, Imaging and Data Analytics, Clinical Pharmacology and Safety Sciences, R&D AstraZeneca, Gothenburg, Sweden
- Jürgen Bajorath
- Department of Life Science Informatics, B-IT (Bonn-Aachen International Center for Information Technology), Chemical Biology and Medicinal Chemistry Program Unit, LIMES (Life and Medical Sciences Institute), Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany
2
de Oliveira ECL, da Costa KS, Taube PS, Lima AH, Junior CDSDS. Biological Membrane-Penetrating Peptides: Computational Prediction and Applications. Front Cell Infect Microbiol 2022; 12:838259. [PMID: 35402305 PMCID: PMC8992797 DOI: 10.3389/fcimb.2022.838259] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 12/17/2021] [Accepted: 02/21/2022] [Indexed: 12/14/2022] Open
Abstract
Peptides comprise a versatile class of biomolecules that present a unique chemical space with diverse physicochemical and structural properties. Some classes of peptides are able to naturally cross biological membranes, such as the cell membrane and the blood-brain barrier (BBB). Cell-penetrating peptides (CPPs) and blood-brain barrier-penetrating peptides (B3PPs) have been explored by the biotechnological and pharmaceutical industries to develop new therapeutic molecules and carrier systems. The computational prediction of peptides' penetration into biological membranes has emerged as an attractive strategy because it enables high-throughput, low-cost screening of large chemical libraries. Structure- and sequence-based information on peptides, as well as atomistic biophysical models, has been explored in computer-assisted discovery strategies to classify and identify new structures with pharmacokinetic properties related to translocation through biomembranes. Computational strategies to predict permeability into biomembranes include cheminformatic filters, molecular dynamics simulations, artificial intelligence algorithms, and statistical models, and the choice of the most adequate method depends on the purposes of the computational investigation. Here, we present and discuss some principles and applications of the computational methods widely used to predict the permeability of peptides into biomembranes, and highlight some of their pharmaceutical and biotechnological applications.
Affiliation(s)
- Ewerton Cristhian Lima de Oliveira
- Institute of Technology, Federal University of Pará, Belém, Brazil
- *Correspondence: Kauê Santana da Costa; Ewerton Cristhian Lima de Oliveira
- Kauê Santana da Costa
- Laboratory of Computational Simulation, Institute of Biodiversity, Federal University of Western Pará, Santarém, Brazil
- Paulo Sérgio Taube
- Laboratory of Computational Simulation, Institute of Biodiversity, Federal University of Western Pará, Santarém, Brazil
- Anderson H. Lima
- Laboratório de Planejamento e Desenvolvimento de Fármacos, Instituto de Ciências Exatas e Naturais, Universidade Federal do Pará, Belém, Brazil
3
Santana K, do Nascimento LD, Lima e Lima A, Damasceno V, Nahum C, Braga RC, Lameira J. Applications of Virtual Screening in Bioprospecting: Facts, Shifts, and Perspectives to Explore the Chemo-Structural Diversity of Natural Products. Front Chem 2021; 9:662688. [PMID: 33996755 PMCID: PMC8117418 DOI: 10.3389/fchem.2021.662688] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 02/01/2021] [Accepted: 02/25/2021] [Indexed: 12/22/2022] Open
Abstract
Natural products are continually explored in the development of new bioactive compounds with industrial applications, attracting the attention of scientific research efforts due to their pharmacophore-like structures, pharmacokinetic properties, and unique chemical space. The systematic search for natural sources of valuable molecules from which to develop products with commercial value and industrial purposes remains the most challenging task in bioprospecting. Virtual screening strategies have transformed the discovery of novel bioactive molecules by assessing large compound libraries in silico, favoring the analysis of their chemical space, pharmacodynamics, and pharmacokinetic properties, and thus reducing the cost, infrastructure, and time involved in discovering new chemical entities. Herein, we discuss the computational approaches and methods developed to explore the chemo-structural diversity of natural products, focusing on the main paradigms involved in the discovery and screening of bioactive compounds from natural sources, placing particular emphasis on artificial intelligence, cheminformatics methods, and big data analyses.
Affiliation(s)
- Kauê Santana
- Instituto de Biodiversidade, Universidade Federal do Oeste do Pará, Santarém, Brazil
- Anderson Lima e Lima
- Instituto de Ciências Exatas e Naturais, Universidade Federal do Pará, Belém, Brazil
- Vinícius Damasceno
- Instituto de Ciências Exatas e Naturais, Universidade Federal do Pará, Belém, Brazil
- Claudio Nahum
- Instituto de Ciências Exatas e Naturais, Universidade Federal do Pará, Belém, Brazil
- Jerônimo Lameira
- Instituto de Ciências Biológicas, Universidade Federal do Pará, Belém, Brazil
4
Leśniak D, Podlewska S, Jastrzębski S, Sieradzki I, Bojarski AJ, Tabor J. Development of New Methods Needs Proper Evaluation-Benchmarking Sets for Machine Learning Experiments for Class A GPCRs. J Chem Inf Model 2019; 59:4974-4992. [PMID: 31604014 DOI: 10.1021/acs.jcim.9b00689] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 01/19/2023]
Abstract
New computational approaches for virtual screening applications are constantly being developed. However, before a particular tool is used to search for new active compounds, its effectiveness in that type of task must be examined. In this study, we conducted a detailed analysis of various aspects of preparing data sets for such an evaluation. We propose a protocol for fetching data from the ChEMBL database, examine various compound representations in terms of the possible bias resulting from the way they are generated, and define a new metric for comparing the structural similarity of compounds, which is in line with chemical intuition. The newly developed method is also used to evaluate various approaches for dividing the data set into training and test sets, which are also examined in detail as a possible source of biased results. Finally, machine learning methods are applied in cross-validation studies of the data sets constructed within the paper, constituting benchmarks for the assessment of computational methods developed for virtual screening tasks. Additionally, analogous data sets for class A G protein-coupled receptors (100 targets with the highest number of records) were prepared. They are available at http://gmum.net/benchmarks/, together with a script enabling reproduction of all results, available at https://github.com/lesniak43/ananas.
Affiliation(s)
- Damian Leśniak
- Faculty of Mathematics and Computer Science, Jagiellonian University, 6 Łojasiewicza Street, 30-348 Kraków, Poland
- Sabina Podlewska
- Department of Technology and Biotechnology of Drugs, Jagiellonian University Medical College, 9 Medyczna Street, 30-688 Kraków, Poland; Maj Institute of Pharmacology, Polish Academy of Sciences, 12 Smętna Street, 31-343 Kraków, Poland
- Stanisław Jastrzębski
- Faculty of Mathematics and Computer Science, Jagiellonian University, 6 Łojasiewicza Street, 30-348 Kraków, Poland
- Igor Sieradzki
- Faculty of Mathematics and Computer Science, Jagiellonian University, 6 Łojasiewicza Street, 30-348 Kraków, Poland
- Andrzej J Bojarski
- Maj Institute of Pharmacology, Polish Academy of Sciences, 12 Smętna Street, 31-343 Kraków, Poland
- Jacek Tabor
- Faculty of Mathematics and Computer Science, Jagiellonian University, 6 Łojasiewicza Street, 30-348 Kraków, Poland
5
Sivakumar TV, Bhaduri A, Duvvuru Muni RR, Park JH, Kim TY. SimCAL: a flexible tool to compute biochemical reaction similarity. BMC Bioinformatics 2018; 19:254. [PMID: 29969981 PMCID: PMC6029250 DOI: 10.1186/s12859-018-2248-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 08/18/2017] [Accepted: 06/14/2018] [Indexed: 11/29/2022] Open
Abstract
Background: Computation of reaction similarity is a prerequisite for several bioinformatics applications, including enzyme identification for specific biochemical reactions, enzyme classification, and mining for specific inhibitors. Reaction similarity is often assessed at one of two levels: (i) comparison across all the constituent substrates and products of a reaction (reaction-level similarity), or (ii) comparison at the transformation center with various degrees of neighborhood (transformation-level similarity). Existing reaction similarity computation tools are designed for specific applications and use different features and similarity measures. A single system integrating these diverse features enables comparison of the impact of different molecular properties on similarity score computation. Results: To address these requirements, we present SimCAL, an integrated system to calculate reaction similarity with novel features and the capability to perform comparative assessment. SimCAL provides reaction similarity computation at both the whole-reaction level and the transformation level. Novel physicochemical features such as stereochemistry, mass, volume, and charge are included in computing the reaction fingerprint. Users can choose from four different fingerprint types and nine molecular similarity measures. Further, a comparative assessment of these features is also enabled. The performance of SimCAL was assessed on 3,688,122 reaction pairs with Enzyme Commission (EC) numbers from MetaCyc and achieved an area under the curve (AUC) of > 0.9. In addition, SimCAL results showed strong correlation with the state-of-the-art EC-BLAST and molecular-signature-based reaction similarity methods. Conclusions: SimCAL is developed in Java and is available as a standalone tool with an intuitive, user-friendly graphical interface and also as a console application. With its customizable feature selection and similarity calculations, it is expected to cater to a wide audience interested in studying and analyzing biochemical reactions and metabolic networks. Electronic supplementary material: The online version of this article (10.1186/s12859-018-2248-5) contains supplementary material, which is available to authorized users.
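The two comparison levels and the choice between similarity measures described in this abstract can be illustrated with a small Python sketch. It is not SimCAL's implementation (SimCAL is a Java tool with its own fingerprint types): the feature strings and the helpers `reaction_level_features` and `transformation_level_features` are hypothetical, and treating the transformation as the set of features that appear or disappear is a simplifying assumption.

```python
from typing import Iterable, Set


def tanimoto(a: Set[str], b: Set[str]) -> float:
    """Tanimoto coefficient: shared features divided by all features."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(a: Set[str], b: Set[str]) -> float:
    """Dice coefficient: twice the shared features over the summed set sizes."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def reaction_level_features(substrates: Iterable[Set[str]],
                            products: Iterable[Set[str]]) -> Set[str]:
    """Pool the features of every substrate and product (hypothetical scheme)."""
    return set().union(*substrates, *products)


def transformation_level_features(substrates: Iterable[Set[str]],
                                  products: Iterable[Set[str]]) -> Set[str]:
    """Keep only features created or destroyed by the reaction (an assumption)."""
    return set().union(*substrates) ^ set().union(*products)


# Toy reactions described by made-up bond-type features (illustrative only).
ester_hydrolysis = ([{"C(=O)-O-C", "C-H"}, {"O-H"}],
                    [{"C(=O)-O-H", "C-H"}, {"C-O-H"}])
amide_hydrolysis = ([{"C(=O)-N-C", "C-H"}, {"O-H"}],
                    [{"C(=O)-O-H", "C-H"}, {"C-N-H"}])

whole_a = reaction_level_features(*ester_hydrolysis)
whole_b = reaction_level_features(*amide_hydrolysis)
center_a = transformation_level_features(*ester_hydrolysis)
center_b = transformation_level_features(*amide_hydrolysis)

print("reaction level, Tanimoto:", tanimoto(whole_a, whole_b))
print("transformation level, Tanimoto:", tanimoto(center_a, center_b))
print("transformation level, Dice:", dice(center_a, center_b))
```

Swapping `tanimoto` for `dice` on the same fingerprints changes the score but not the ranking logic, which is the kind of comparison the abstract says a single integrated system makes possible.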
Affiliation(s)
- Anirban Bhaduri
- Bioinformatics Lab, Samsung Advanced Institute of Technology, Bangalore, 560037, India
- Jin Hwan Park
- Biomaterials Lab, Materials Center, Samsung Advanced Institute of Technology, Gyeonggi-do, 443803, South Korea
- Tae Yong Kim
- Biomaterials Lab, Materials Center, Samsung Advanced Institute of Technology, Gyeonggi-do, 443803, South Korea
6
Wallach I, Heifets A. Most Ligand-Based Classification Benchmarks Reward Memorization Rather than Generalization. J Chem Inf Model 2018; 58:916-932. [PMID: 29698607 DOI: 10.1021/acs.jcim.7b00403] [Citation(s) in RCA: 98] [Impact Index Per Article: 16.3] [Indexed: 11/28/2022]
Abstract
Undetected overfitting can occur when there are significant redundancies between training and validation data. We describe AVE, a new measure of training-validation redundancy for ligand-based classification problems, that accounts for the similarity among inactive molecules as well as active ones. We investigated seven widely used benchmarks for virtual screening and classification, and we show that the amount of AVE bias strongly correlates with the performance of ligand-based predictive methods irrespective of the predicted property, chemical fingerprint, similarity measure, or previously applied unbiasing techniques. Therefore, it may be the case that the previously reported performance of most ligand-based methods can be explained by overfitting to benchmarks rather than good prospective accuracy.
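The idea of training-validation redundancy can be made concrete with a short Python sketch. This is a single-threshold simplification written for illustration, not the published AVE definition: the set-based fingerprints, the 0.7 similarity threshold, and the function names are all assumptions. It asks how often a validation molecule has a close training neighbor of the same class versus the opposite class, counting inactives as well as actives.

```python
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def nearest_neighbor_similarity(query: Set[int], references: List[Set[int]]) -> float:
    """Highest similarity between one molecule and any molecule in a reference set."""
    return max((tanimoto(query, ref) for ref in references), default=0.0)


def fraction_with_close_neighbor(validation: List[Set[int]],
                                 training: List[Set[int]],
                                 threshold: float) -> float:
    """Share of validation molecules with a training neighbor at or above the threshold."""
    if not validation:
        return 0.0
    hits = sum(nearest_neighbor_similarity(v, training) >= threshold for v in validation)
    return hits / len(validation)


def redundancy_gap(val_actives: List[Set[int]], val_inactives: List[Set[int]],
                   train_actives: List[Set[int]], train_inactives: List[Set[int]],
                   threshold: float = 0.7) -> float:
    """Same-class minus opposite-class neighbor rates, summed over both classes.

    Values well above zero suggest that a nearest-neighbor lookup alone could
    label the validation set (hypothetical, simplified measure).
    """
    active_term = (fraction_with_close_neighbor(val_actives, train_actives, threshold)
                   - fraction_with_close_neighbor(val_actives, train_inactives, threshold))
    inactive_term = (fraction_with_close_neighbor(val_inactives, train_inactives, threshold)
                     - fraction_with_close_neighbor(val_inactives, train_actives, threshold))
    return active_term + inactive_term


# Toy data: the validation molecules duplicate training molecules of the same class.
train_act, train_inact = [{1, 2, 3, 4}], [{10, 11, 12, 13}]
leaky_gap = redundancy_gap([{1, 2, 3, 4}], [{10, 11, 12, 13}], train_act, train_inact)
novel_gap = redundancy_gap([{20, 21}], [{30, 31}], train_act, train_inact)
print(leaky_gap, novel_gap)
```

A split in which validation molecules are near-duplicates of same-class training molecules produces a large gap, while a split of unrelated chemotypes produces a gap near zero.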
Affiliation(s)
- Izhar Wallach
- Atomwise Inc., 221 Main Street, Suite 1350, San Francisco, California 94105, United States
- Abraham Heifets
- Atomwise Inc., 221 Main Street, Suite 1350, San Francisco, California 94105, United States
7
Xia J, Reid TE, Wu S, Zhang L, Wang XS. Maximal Unbiased Benchmarking Data Sets for Human Chemokine Receptors and Comparative Analysis. J Chem Inf Model 2018; 58:1104-1120. [PMID: 29698608 DOI: 10.1021/acs.jcim.8b00004] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 01/13/2023]
Abstract
Chemokine receptors (CRs) have long been druggable targets for the treatment of inflammatory diseases and HIV-1 infection. As a powerful technique, virtual screening (VS) has been widely applied to identify small-molecule leads for modern drug targets, including CRs. For rational selection among a wide variety of VS approaches, ligand enrichment assessment based on a benchmarking data set has become an indispensable practice. However, the lack of versatile benchmarking sets for the whole CR family that can evaluate every approach without bias, including both structure- and ligand-based VS, somewhat hinders modern drug discovery efforts. To address this issue, we constructed Maximal Unbiased Benchmarking Data sets for human Chemokine Receptors (MUBD-hCRs) using our recently developed tool, MUBD-DecoyMaker. MUBD-hCRs encompasses 13 of the 20 chemokine receptor subtypes, comprising 404 ligands and 15756 decoys so far, and is readily expandable in the future. It has been thoroughly validated that MUBD-hCRs ligands are chemically diverse, while its decoys are maximally unbiased in terms of "artificial enrichment" and "analogue bias". In addition, we studied the performance of MUBD-hCRs, in particular the CXCR4 and CCR5 data sets, in ligand enrichment assessments of both structure- and ligand-based VS approaches in comparison with other benchmarking data sets available in the public domain, and demonstrated that MUBD-hCRs is very capable of designating the optimal VS approach. MUBD-hCRs is a unique and maximally unbiased benchmarking set that covers the major CR subtypes so far.
Affiliation(s)
- Jie Xia
- State Key Laboratory of Bioactive Substance and Function of Natural Medicines, Department of New Drug Research and Development, Institute of Materia Medica, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100050, China; State Key Laboratory of Natural and Biomimetic Drugs, School of Pharmaceutical Sciences, Peking University, Beijing 100191, China
- Terry-Elinor Reid
- Molecular Modeling and Drug Discovery Core Laboratory for District of Columbia Center for AIDS Research (DC CFAR), Department of Pharmaceutical Sciences, College of Pharmacy, Howard University, Washington, D.C. 20059, United States
- Song Wu
- State Key Laboratory of Bioactive Substance and Function of Natural Medicines, Department of New Drug Research and Development, Institute of Materia Medica, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100050, China
- Liangren Zhang
- State Key Laboratory of Natural and Biomimetic Drugs, School of Pharmaceutical Sciences, Peking University, Beijing 100191, China
- Xiang Simon Wang
- Molecular Modeling and Drug Discovery Core Laboratory for District of Columbia Center for AIDS Research (DC CFAR), Department of Pharmaceutical Sciences, College of Pharmacy, Howard University, Washington, D.C. 20059, United States
8
Colwell LJ. Statistical and machine learning approaches to predicting protein-ligand interactions. Curr Opin Struct Biol 2018; 49:123-128. [PMID: 29452923 DOI: 10.1016/j.sbi.2018.01.006] [Citation(s) in RCA: 34] [Impact Index Per Article: 5.7] [Received: 10/09/2017] [Accepted: 01/02/2018] [Indexed: 12/29/2022]
Abstract
Data driven computational approaches to predicting protein-ligand binding are currently achieving unprecedented levels of accuracy on held-out test datasets. Up until now, however, this has not led to corresponding breakthroughs in our ability to design novel ligands for protein targets of interest. This review summarizes the current state of the art in this field, emphasizing the recent development of deep neural networks for predicting protein-ligand binding. We explain the major technical challenges that have caused difficulty with predicting novel ligands, including the problems of sampling noise and the challenge of using benchmark datasets that are sufficiently unbiased that they allow the model to extrapolate to new regimes.
Affiliation(s)
- Lucy J Colwell
- Department of Chemistry, Cambridge University, Cambridge, UK
9
Pei F, Jin H, Zhou X, Xia J, Sun L, Liu Z, Zhang L. Enrichment assessment of multiple virtual screening strategies for Toll-like receptor 8 agonists based on a maximal unbiased benchmarking data set. Chem Biol Drug Des 2015; 86:1226-41. [PMID: 26017460 DOI: 10.1111/cbdd.12590] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 02/13/2015] [Revised: 04/29/2015] [Accepted: 05/14/2015] [Indexed: 12/16/2022]
Abstract
Toll-like receptor 8 agonists, which activate adaptive immune responses by inducing robust production of T-helper 1-polarizing cytokines, are promising candidates for vaccine adjuvants. As the binding site of toll-like receptor 8 is large and highly flexible, virtual screening by any individual method has inevitable limitations; thus, a comprehensive comparison of different methods may provide insights into effective strategies for the discovery of novel toll-like receptor 8 agonists. In this study, the performance of knowledge-based pharmacophore, shape-based 3D screening, and combined strategies was assessed against a maximum unbiased benchmarking data set containing 13 actives and 1302 decoys specialized for toll-like receptor 8 agonists. Prior structure-activity relationship knowledge was incorporated into knowledge-based pharmacophore generation, and a set of antagonists was used to verify the selectivity of the selected knowledge-based pharmacophore. The benchmarking data set was generated with our recently developed 'mubd-decoymaker' protocol. The enrichment assessment demonstrated considerable performance for our selected three-layer virtual screening strategy: knowledge-based pharmacophore (Phar1) screening, shape-based 3D similarity search (Q4_combo), and then a Gold docking screening. This virtual screening strategy could be further employed to perform large-scale database screening and to discover novel toll-like receptor 8 agonists.
Affiliation(s)
- Fen Pei
- State Key Laboratory of Natural and Biomimetic Drugs, School of Pharmaceutical Sciences, Peking University, 38 Xueyuan Rd, Beijing, 100191, China
- Hongwei Jin
- State Key Laboratory of Natural and Biomimetic Drugs, School of Pharmaceutical Sciences, Peking University, 38 Xueyuan Rd, Beijing, 100191, China
- Xin Zhou
- State Key Laboratory of Natural and Biomimetic Drugs, School of Pharmaceutical Sciences, Peking University, 38 Xueyuan Rd, Beijing, 100191, China
- Jie Xia
- State Key Laboratory of Natural and Biomimetic Drugs, School of Pharmaceutical Sciences, Peking University, 38 Xueyuan Rd, Beijing, 100191, China; Molecular Modeling and Drug Discovery Core for District of Columbia Developmental Center for AIDS Research (DC D-CFAR), Laboratory of Cheminformatics and Drug Design, Department of Pharmaceutical Sciences, College of Pharmacy, Howard University, Washington, DC, 20059, USA
- Lidan Sun
- The Institute of Molecular Biology, Medical School of China Three Gorges University, 8 Daxue Road, Yichang, 443002, China
- Zhenming Liu
- State Key Laboratory of Natural and Biomimetic Drugs, School of Pharmaceutical Sciences, Peking University, 38 Xueyuan Rd, Beijing, 100191, China
- Liangren Zhang
- State Key Laboratory of Natural and Biomimetic Drugs, School of Pharmaceutical Sciences, Peking University, 38 Xueyuan Rd, Beijing, 100191, China
10
Lagarde N, Zagury JF, Montes M. Benchmarking Data Sets for the Evaluation of Virtual Ligand Screening Methods: Review and Perspectives. J Chem Inf Model 2015; 55:1297-307. [PMID: 26038804 DOI: 10.1021/acs.jcim.5b00090] [Citation(s) in RCA: 59] [Impact Index Per Article: 6.6] [Indexed: 12/11/2022]
Abstract
Virtual screening methods are now commonly used in drug discovery processes. However, to ensure their reliability, they have to be carefully evaluated. These methods are often evaluated retrospectively, notably by studying the enrichment of benchmarking data sets. For this purpose, numerous benchmarking data sets have been developed over the years, and the resulting improvements have led to the availability of high-quality benchmarking data sets. However, some points still have to be considered in the selection of the active compounds, decoys, and protein structures to obtain optimal benchmarking data sets.
Affiliation(s)
- Nathalie Lagarde
- Laboratoire Génomique, Bioinformatique et Applications, EA 4627, Conservatoire National des Arts et Métiers, 292 rue Saint Martin, 75003 Paris, France
- Jean-François Zagury
- Laboratoire Génomique, Bioinformatique et Applications, EA 4627, Conservatoire National des Arts et Métiers, 292 rue Saint Martin, 75003 Paris, France
- Matthieu Montes
- Laboratoire Génomique, Bioinformatique et Applications, EA 4627, Conservatoire National des Arts et Métiers, 292 rue Saint Martin, 75003 Paris, France
11
Xia J, Tilahun EL, Reid TE, Zhang L, Wang XS. Benchmarking methods and data sets for ligand enrichment assessment in virtual screening. Methods 2015; 71:146-57. [PMID: 25481478 PMCID: PMC4278665 DOI: 10.1016/j.ymeth.2014.11.015] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.7] [Received: 06/02/2014] [Revised: 11/22/2014] [Accepted: 11/24/2014] [Indexed: 11/21/2022] Open
Abstract
Retrospective small-scale virtual screening (VS) based on benchmarking data sets has been widely used to estimate the ligand enrichment of VS approaches in prospective (i.e. real-world) efforts. However, the intrinsic differences between benchmarking sets and real screening libraries can cause biased assessment. Herein, we summarize the history of benchmarking methods as well as data sets and highlight three main types of bias found in benchmarking sets, i.e. "analogue bias", "artificial enrichment" and "false negative". In addition, we introduce our recent algorithm to build maximum-unbiased benchmarking sets applicable to both ligand-based and structure-based VS approaches, and its application to three important human histone deacetylase (HDAC) isoforms, i.e. HDAC1, HDAC6 and HDAC8. Leave-one-out cross-validation (LOO CV) demonstrates that the benchmarking sets built by our algorithm are maximum-unbiased as measured by property matching, ROC curves and AUCs.
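Ligand enrichment in a retrospective screen is usually summarized by the area under the ROC curve and by an enrichment factor. The Python sketch below shows one standard way to compute both from scores assigned to actives and decoys; it illustrates the metrics only, not the authors' benchmarking algorithm, and the function names and the toy scores are assumptions.

```python
from typing import Sequence


def roc_auc(active_scores: Sequence[float], decoy_scores: Sequence[float]) -> float:
    """Probability that a random active outscores a random decoy (ties count half)."""
    if not active_scores or not decoy_scores:
        raise ValueError("need at least one active and one decoy")
    wins = 0.0
    for a in active_scores:
        for d in decoy_scores:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(active_scores) * len(decoy_scores))


def enrichment_factor(active_scores: Sequence[float], decoy_scores: Sequence[float],
                      top_fraction: float = 0.1) -> float:
    """Active rate in the top-ranked fraction divided by the active rate overall."""
    if not active_scores or not decoy_scores:
        raise ValueError("need at least one active and one decoy")
    ranked = sorted(
        [(s, True) for s in active_scores] + [(s, False) for s in decoy_scores],
        key=lambda pair: pair[0],
        reverse=True,  # higher score = better rank; tied scores keep input order
    )
    n_top = max(1, round(top_fraction * len(ranked)))
    actives_in_top = sum(is_active for _, is_active in ranked[:n_top])
    overall_rate = len(active_scores) / len(ranked)
    return (actives_in_top / n_top) / overall_rate


# Toy screen: three actives and four decoys with made-up scores.
toy_actives = [0.9, 0.8, 0.4]
toy_decoys = [0.7, 0.3, 0.2, 0.1]
print("AUC:", roc_auc(toy_actives, toy_decoys))
print("EF (top 30%):", enrichment_factor(toy_actives, toy_decoys, top_fraction=0.3))
```

An AUC near 0.5 on a benchmarking set means the method cannot separate ligands from decoys on the scored property, which is the sense in which property matching and AUCs are used above to argue that a set is unbiased.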
Affiliation(s)
- Jie Xia
- State Key Laboratory of Natural and Biomimetic Drugs, School of Pharmaceutical Sciences, Peking University, Beijing 100191, PR China; Molecular Modeling and Drug Discovery Core for District of Columbia Developmental Center for AIDS Research (DC D-CFAR), Laboratory of Cheminformatics and Drug Design, Department of Pharmaceutical Sciences, College of Pharmacy, Howard University, Washington, DC 20059, USA
- Ermias Lemma Tilahun
- Molecular Modeling and Drug Discovery Core for District of Columbia Developmental Center for AIDS Research (DC D-CFAR), Laboratory of Cheminformatics and Drug Design, Department of Pharmaceutical Sciences, College of Pharmacy, Howard University, Washington, DC 20059, USA
- Terry-Elinor Reid
- Molecular Modeling and Drug Discovery Core for District of Columbia Developmental Center for AIDS Research (DC D-CFAR), Laboratory of Cheminformatics and Drug Design, Department of Pharmaceutical Sciences, College of Pharmacy, Howard University, Washington, DC 20059, USA
- Liangren Zhang
- State Key Laboratory of Natural and Biomimetic Drugs, School of Pharmaceutical Sciences, Peking University, Beijing 100191, PR China
- Xiang Simon Wang
- Molecular Modeling and Drug Discovery Core for District of Columbia Developmental Center for AIDS Research (DC D-CFAR), Laboratory of Cheminformatics and Drug Design, Department of Pharmaceutical Sciences, College of Pharmacy, Howard University, Washington, DC 20059, USA
12
Xia J, Jin H, Liu Z, Zhang L, Wang XS. An unbiased method to build benchmarking sets for ligand-based virtual screening and its application to GPCRs. J Chem Inf Model 2014; 54:1433-50. [PMID: 24749745 PMCID: PMC4038372 DOI: 10.1021/ci500062f] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.2] [Indexed: 12/16/2022]
Abstract
Benchmarking data sets have become common in recent years for the purpose of virtual screening, though the main focus had been placed on the structure-based virtual screening (SBVS) approaches. Due to the lack of crystal structures, there is great need for unbiased benchmarking sets to evaluate various ligand-based virtual screening (LBVS) methods for important drug targets such as G protein-coupled receptors (GPCRs). To date these ready-to-apply data sets for LBVS are fairly limited, and the direct usage of benchmarking sets designed for SBVS could bring the biases to the evaluation of LBVS. Herein, we propose an unbiased method to build benchmarking sets for LBVS and validate it on a multitude of GPCRs targets. To be more specific, our methods can (1) ensure chemical diversity of ligands, (2) maintain the physicochemical similarity between ligands and decoys, (3) make the decoys dissimilar in chemical topology to all ligands to avoid false negatives, and (4) maximize spatial random distribution of ligands and decoys. We evaluated the quality of our Unbiased Ligand Set (ULS) and Unbiased Decoy Set (UDS) using three common LBVS approaches, with Leave-One-Out (LOO) Cross-Validation (CV) and a metric of average AUC of the ROC curves. Our method has greatly reduced the “artificial enrichment” and “analogue bias” of a published GPCRs benchmarking set, i.e., GPCR Ligand Library (GLL)/GPCR Decoy Database (GDD). In addition, we addressed an important issue about the ratio of decoys per ligand and found that for a range of 30 to 100 it does not affect the quality of the benchmarking set, so we kept the original ratio of 39 from the GLL/GDD.
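Criteria (2) and (3) above, together with the ratio of 39 decoys per ligand, can be sketched as a simple filter in Python. This is a minimal illustration under stated assumptions, not the authors' method: the `Molecule` record, the two properties used for matching, the tolerances (`mw_tol`, `logp_tol`), and the 0.75 similarity cutoff are all hypothetical, and criteria (1) and (4) are not covered.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Molecule:
    name: str
    mol_weight: float
    logp: float
    fingerprint: FrozenSet[int]  # topological fingerprint as a set of on-bits


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity between two fingerprints."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def is_property_matched(candidate: Molecule, ligand: Molecule,
                        mw_tol: float = 25.0, logp_tol: float = 1.0) -> bool:
    """Criterion (2): the decoy resembles the ligand physicochemically (assumed tolerances)."""
    return (abs(candidate.mol_weight - ligand.mol_weight) <= mw_tol
            and abs(candidate.logp - ligand.logp) <= logp_tol)


def select_decoys(ligands: List[Molecule], candidates: List[Molecule],
                  max_similarity: float = 0.75,
                  decoys_per_ligand: int = 39) -> Dict[str, List[Molecule]]:
    """Pick up to `decoys_per_ligand` property-matched, topologically dissimilar decoys."""
    selected: Dict[str, List[Molecule]] = {}
    used = set()
    for ligand in ligands:
        picks: List[Molecule] = []
        for cand in candidates:
            if cand.name in used or not is_property_matched(cand, ligand):
                continue
            # Criterion (3): dissimilar in topology to ALL ligands, to avoid false negatives.
            if any(tanimoto(cand.fingerprint, other.fingerprint) >= max_similarity
                   for other in ligands):
                continue
            picks.append(cand)
            used.add(cand.name)
            if len(picks) == decoys_per_ligand:
                break
        selected[ligand.name] = picks
    return selected


# Toy example with made-up property values and fingerprints.
toy_ligands = [Molecule("L1", 300.0, 2.0, frozenset({1, 2, 3, 4}))]
toy_candidates = [
    Molecule("D1", 310.0, 2.5, frozenset({7, 8, 9})),     # matched and dissimilar
    Molecule("D2", 310.0, 2.5, frozenset({1, 2, 3, 4})),  # too similar to L1
    Molecule("D3", 500.0, 2.0, frozenset({7, 8})),        # molecular weight mismatch
]
toy_selection = select_decoys(toy_ligands, toy_candidates)
print([m.name for m in toy_selection["L1"]])
```

Checking each candidate against every ligand, rather than only the ligand it is matched to, is what keeps a decoy for one ligand from being a close analogue of another.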
Affiliation(s)
- Jie Xia
- State Key Laboratory of Natural and Biomimetic Drugs, School of Pharmaceutical Sciences, Peking University, Beijing 100191, China
13
Hu Y, Bajorath J. Follow up: Compound data sets and software tools for chemoinformatics and medicinal chemistry applications: update and data transfer. F1000Res 2014; 3:69. [PMID: 25520777 PMCID: PMC4264635 DOI: 10.12688/f1000research.3713.1] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Accepted: 03/07/2014] [Indexed: 12/12/2022] Open
Abstract
In 2012, we reported 30 compound data sets and/or programs developed in our laboratory in a data article and made them freely available to the scientific community to support chemoinformatics and computational medicinal chemistry applications. These data sets and computational tools were provided for download from our website. Since publication of this data article, we have generated 13 new data sets with which we further extend our collection of publicly available data and tools. Due to changes in web servers and website architectures, data accessibility has recently been limited at times. Therefore, we have also transferred our data sets and tools to a public repository to ensure full and stable accessibility. To aid in data selection, we have classified the data sets according to scientific subject areas. Herein, we describe new data sets, introduce the data organization scheme, summarize the database content and provide detailed access information in ZENODO (doi: 10.5281/zenodo.8451 and doi:10.5281/zenodo.8455).
Affiliation(s)
- Ye Hu
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms University, Bonn, D-53113, Germany
- Jürgen Bajorath
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms University, Bonn, D-53113, Germany
|
14
|
Finn PW, Morris GM. Shape-based similarity searching in chemical databases. Wiley Interdisciplinary Reviews: Computational Molecular Science 2012. [DOI: 10.1002/wcms.1128] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 01/08/2023]
|
15
|
Hu Y, Bajorath J. Freely available compound data sets and software tools for chemoinformatics and computational medicinal chemistry applications. F1000Res 2012; 1:11. [PMID: 24358818 PMCID: PMC3782340 DOI: 10.12688/f1000research.1-11.v1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Accepted: 08/07/2012] [Indexed: 01/22/2023] Open
Abstract
We have generated a number of compound data sets and programs for different types of applications in pharmaceutical research. These data sets and programs were originally designed for our research projects and are made publicly available. Without consulting original literature sources, it is difficult to understand specific features of data sets and software tools, basic ideas underlying their design, and applicability domains. Currently, 30 different entries are available for download from our website. In this data article, we provide an overview of the data and tools we make available and designate the areas of research for which they should be useful. For selected data sets and methods/programs, detailed descriptions are given. This article should help interested readers to select data and tools for specific computational investigations.
Affiliation(s)
- Ye Hu
- Department of Life Science Informatics, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstr, Bonn, D-53113, Germany
- Jürgen Bajorath
- Department of Life Science Informatics, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstr, Bonn, D-53113, Germany
|
16
|
Abstract
Virtual screening (VS) methods are applied in both academia and drug discovery, and can be divided into ligand- and target structure-based approaches. The VS field is still evolving and is characterized by scientific heterogeneity. The value of virtual compound screening for drug discovery is often debated, in particular, given the large investments made in experimental high-throughput screening technologies. The current state-of-the-art in the VS field is discussed. Despite its limitations, VS applications have often succeeded in identifying novel hits including first-in-class active compounds and novel chemotypes. VS has its place in pharmaceutical research, but there is still much room for further improvements including method evaluation and drug discovery applications. The potential of VS is currently underutilized because its complementarity to high-throughput screening is not sufficiently exploited. Building close interfaces between computational and experimental screening would further streamline the hit identification process.
|
17
|
Mysinger MM, Carchia M, Irwin JJ, Shoichet BK. Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking. J Med Chem 2012; 55:6582-94. [PMID: 22716043 PMCID: PMC3405771 DOI: 10.1021/jm300687e] [Citation(s) in RCA: 1320] [Impact Index Per Article: 110.0] [Indexed: 01/11/2023]
Abstract
A key metric to assess molecular docking remains ligand enrichment
against challenging decoys. Whereas the directory of useful decoys
(DUD) has been widely used, clear areas for optimization have emerged.
Here we describe an improved benchmarking set that includes more diverse
targets such as GPCRs and ion channels, totaling 102 proteins with
22886 clustered ligands drawn from ChEMBL, each with 50 property-matched
decoys drawn from ZINC. To ensure chemotype diversity, we cluster
each target’s ligands by their Bemis–Murcko atomic frameworks.
We add net charge to the matched physicochemical properties and include
only the most dissimilar decoys, by topology, from the ligands. An
online automated tool (http://decoys.docking.org) generates
these improved matched decoys for user-supplied ligands. We test this
data set by docking all 102 targets, using the results to improve
the balance between ligand desolvation and electrostatics in DOCK
3.6. The complete DUD-E benchmarking set is freely available at http://dude.docking.org.
Affiliation(s)
- Michael M Mysinger
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA 94158-2330, USA
|
18
|
Schuffenhauer A. Computational methods for scaffold hopping. Wiley Interdisciplinary Reviews: Computational Molecular Science 2012. [DOI: 10.1002/wcms.1106] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Indexed: 12/11/2022]
|