1
Bonner S, Barrett IP, Ye C, Swiers R, Engkvist O, Bender A, Hoyt CT, Hamilton WL. A review of biomedical datasets relating to drug discovery: a knowledge graph perspective. Brief Bioinform 2022; 23:6712301. [PMID: 36151740 DOI: 10.1093/bib/bbac404] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 05/12/2022] [Revised: 07/14/2022] [Accepted: 08/20/2022] [Indexed: 12/14/2022] Open
Abstract
Drug discovery and development is a complex and costly process. Machine learning approaches are being investigated to help improve the effectiveness and speed of multiple stages of the drug discovery pipeline. Among these, approaches that use knowledge graphs (KGs) show promise in many tasks, including drug repurposing, drug toxicity prediction and target gene-disease prioritization. In a drug discovery KG, crucial elements including genes, diseases and drugs are represented as entities, while relationships between them indicate an interaction. However, constructing high-quality KGs requires suitable data. In this review, we detail publicly available sources suitable for use in constructing drug discovery focused KGs. We aim to help guide machine learning and KG practitioners who are interested in applying new techniques to the drug discovery field, but who may be unfamiliar with the relevant data sources. The datasets are selected via strict criteria, categorized according to the primary type of information they contain, and considered in terms of what information could be extracted to build a KG. We then present a comparative analysis of existing public drug discovery KGs and an evaluation of selected motivating case studies from the literature. Additionally, we raise numerous and unique challenges and issues associated with the domain and its datasets, while also highlighting key future research directions. We hope this review will motivate the use of KGs in solving key and emerging questions in the drug discovery domain.
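As a rough illustration of the entity and relationship structure described in this abstract, a KG can be held as (head, relation, tail) triples and queried along short paths. The sketch below is not taken from the review: the triples, identifiers, relation names and the `repurposing_candidates` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy triples (head, relation, tail); not drawn from any real dataset.
triples = [
    ("drug:D1", "targets", "gene:G1"),
    ("gene:G1", "associated_with", "disease:X"),
    ("drug:D2", "targets", "gene:G2"),
    ("gene:G2", "associated_with", "disease:X"),
    ("drug:D1", "treats", "disease:Y"),
]


def build_index(triples):
    """Index outgoing edges as head -> relation -> set of tails."""
    index = defaultdict(lambda: defaultdict(set))
    for head, relation, tail in triples:
        index[head][relation].add(tail)
    return index


def repurposing_candidates(index, disease):
    """Return drugs that target a gene associated with the disease (two-hop path)."""
    hits = set()
    for head, relations in index.items():
        for gene in relations.get("targets", ()):
            if disease in index.get(gene, {}).get("associated_with", ()):
                hits.add(head)
    return sorted(hits)


kg = build_index(triples)
candidates = repurposing_candidates(kg, "disease:X")
print(candidates)
```

Real drug discovery KGs are far larger and are usually queried with embedding models or graph databases rather than nested dictionaries; the two-hop rule here only shows the shape of the data.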
Affiliation(s)
- Stephen Bonner
- Data Sciences and Quantitative Biology, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK
- Ian P Barrett
- Data Sciences and Quantitative Biology, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK
- Cheng Ye
- Data Sciences and Quantitative Biology, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK
- Rowan Swiers
- Data Sciences and Quantitative Biology, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK
- Ola Engkvist
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
- Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, UK
- William L Hamilton
- School of Computer Science, McGill University, Canada; Mila-Quebec AI Institute, Montreal, Canada
2
Krasoulis A, Antonopoulos N, Pitsikalis V, Theodorakis S. DENVIS: Scalable and High-Throughput Virtual Screening Using Graph Neural Networks with Atomic and Surface Protein Pocket Features. J Chem Inf Model 2022; 62:4642-4659. [PMID: 36154119 DOI: 10.1021/acs.jcim.2c01057] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/30/2022]
Abstract
Computational methods for virtual screening can dramatically accelerate early-stage drug discovery by identifying potential hits for a specified target. Docking algorithms traditionally use physics-based simulations to address this challenge by estimating the binding orientation of a query protein-ligand pair and a corresponding binding affinity score. In recent years, classical and modern machine learning architectures have shown potential for outperforming traditional docking algorithms. Nevertheless, most learning-based algorithms still rely on the availability of the protein-ligand complex binding pose, typically estimated via docking simulations, which leads to a severe slowdown of the overall virtual screening process. A family of algorithms that process target information at the amino acid sequence level avoids this requirement, but at the cost of processing protein data at a higher representation level. We introduce deep neural virtual screening (DENVIS), an end-to-end pipeline for virtual screening using graph neural networks (GNNs). By performing experiments on two benchmark databases, we show that our method performs competitively with several docking-based, machine learning-based, and hybrid docking/machine learning-based algorithms. By avoiding the intermediate docking step, DENVIS exhibits several orders of magnitude faster screening times (i.e., higher throughput) than both docking-based and hybrid models. When compared to an amino acid sequence-based machine learning model with comparable screening times, DENVIS achieves dramatically better performance. Some key elements of our approach include protein pocket modeling using a combination of atomic and surface features, the use of model ensembles, and data augmentation via artificial negative sampling during model training. In summary, DENVIS achieves virtual screening performance competitive with the state of the art, while offering the potential to scale to billions of molecules using minimal computational resources.
3
Prada Gori DN, Llanos MA, Bellera CL, Talevi A, Alberca LN. iRaPCA and SOMoC: Development and Validation of Web Applications for New Approaches for the Clustering of Small Molecules. J Chem Inf Model 2022; 62:2987-2998. [PMID: 35687523 DOI: 10.1021/acs.jcim.2c00265] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Abstract
The clustering of small molecules implies the organization of a group of chemical structures into smaller subgroups with similar features. Clustering has important applications to sample chemical datasets or libraries in a representative manner (e.g., to choose, from a virtual screening hit list, a chemically diverse subset of compounds to be submitted to experimental confirmation, or to split datasets into representative training and validation sets when implementing machine learning models). Most strategies for clustering molecules are based on molecular fingerprints and hierarchical clustering algorithms. Here, two open-source in-house methodologies for clustering of small molecules are presented: iterative Random subspace Principal Component Analysis clustering (iRaPCA), an iterative approach based on feature bagging, dimensionality reduction, and K-means optimization; and Silhouette Optimized Molecular Clustering (SOMoC), which combines molecular fingerprints with the Uniform Manifold Approximation and Projection (UMAP) and Gaussian Mixture Model algorithm (GMM). In a benchmarking exercise, the performance of both clustering methods has been examined across 29 datasets containing between 100 and 5000 small molecules, comparing these results with those given by two other well-known clustering methods, Ward and Butina. iRaPCA and SOMoC consistently showed the best performance across these 29 datasets, both in terms of within-cluster and between-cluster distances. Both iRaPCA and SOMoC have been implemented as free Web Apps and standalone applications, to allow their use to a wide audience within the scientific community.
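For orientation, fingerprint-based clustering of the kind used as a baseline here can be sketched in a few lines. The example below is a simplified Butina-style procedure written for illustration only; it is not iRaPCA or SOMoC, the fingerprints are hypothetical sets of on-bits, and the function names and the 0.5 threshold are assumptions.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def butina_like(fps, threshold=0.5):
    """Greedy sphere-exclusion clustering in the spirit of Butina (simplified)."""
    n = len(fps)
    # Neighbor sets: molecules at least `threshold` similar to molecule i.
    neighbors = [
        {j for j in range(n) if j != i and tanimoto(fps[i], fps[j]) >= threshold}
        for i in range(n)
    ]
    # Visit molecules with the most neighbors first; ties broken by index.
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned, clusters = set(), []
    for i in order:
        if i in assigned:
            continue
        members = [i] + sorted(neighbors[i] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


# Hypothetical fingerprints: two chemical series and one singleton.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 4, 5}, {10, 11, 12}, {10, 11, 13}, {20}]
clusters = butina_like(fps, threshold=0.5)
print(clusters)
```

Each cluster lists its centroid first. Production work would normally compute fingerprints with a cheminformatics toolkit and use its tested clustering routines.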
Affiliation(s)
- Denis N Prada Gori
- Laboratory of Bioactive Compounds Research and Development (LIDeB), Department of Biological Sciences, Faculty of Exact Sciences, National University of La Plata (UNLP), La Plata B1900ADU, Argentina
- Manuel A Llanos
- Laboratory of Bioactive Compounds Research and Development (LIDeB), Department of Biological Sciences, Faculty of Exact Sciences, National University of La Plata (UNLP), La Plata B1900ADU, Argentina
- Carolina L Bellera
- Laboratory of Bioactive Compounds Research and Development (LIDeB), Department of Biological Sciences, Faculty of Exact Sciences, National University of La Plata (UNLP), La Plata B1900ADU, Argentina
- Alan Talevi
- Laboratory of Bioactive Compounds Research and Development (LIDeB), Department of Biological Sciences, Faculty of Exact Sciences, National University of La Plata (UNLP), La Plata B1900ADU, Argentina
- Lucas N Alberca
- Laboratory of Bioactive Compounds Research and Development (LIDeB), Department of Biological Sciences, Faculty of Exact Sciences, National University of La Plata (UNLP), La Plata B1900ADU, Argentina
4
Application of a Deep Learning Network for Joint Prediction of Associated Fluid Production in Unconventional Hydrocarbon Development. Processes (Basel) 2022. [DOI: 10.3390/pr10040740] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2023] Open
Abstract
Machine learning (ML) approaches have risen in popularity for use in many oil and gas (O&G) applications. Time series-based predictive forecasting of hydrocarbon production using deep learning ML strategies that can generalize temporal or sequence-based information within data is fast gaining traction. The recent emphasis on hydrocarbon production provides opportunities to explore the use of deep learning ML in other facets of O&G development where dynamic, temporal dependencies exist and that also hold implications for production forecasting. This study proposes a combination of supervised and unsupervised ML approaches as part of a framework for the joint prediction of produced water and natural gas volumes associated with oil production from unconventional reservoirs in a time series fashion. The study focuses on the pay zones within the Spraberry and Wolfcamp Formations of the Midland Basin in the U.S. The joint prediction model is based on a deep neural network architecture leveraging long short-term memory (LSTM) layers. Our model can both reproduce and forecast produced water and natural gas volumes for wells at monthly resolution and has demonstrated 91 percent joint prediction accuracy on held-out test data, with little disparity in prediction performance between the training and test datasets. Additionally, model predictions replicate water and gas production profiles for wells in the test dataset, even in circumstances that include irregularities in production trends. We apply the model in tandem with an Arps decline model to generate cumulative first-year and five-year estimates for oil, gas, and water production outlooks at the well and basin levels. Production outlook totals are influenced by well completion, decline curve, and spatial and reservoir attributes. These types of model-derived outlooks can aid operators in formulating management or remedial solutions for the volumes of fluids expected from unconventional O&G development.
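The Arps decline model mentioned in this abstract has a standard closed form, $q(t) = q_i / (1 + b D_i t)^{1/b}$, which reduces to exponential decline as $b \to 0$. The sketch below evaluates it and approximates cumulative volume by a monthly midpoint sum. The parameter values are hypothetical and are not the ones fitted in the study.

```python
import math


def arps_rate(t, qi, di, b):
    """Arps decline rate at time t (months).

    qi: initial rate (volume per month), di: initial decline (per month),
    b: decline exponent (0 = exponential, 1 = harmonic, otherwise hyperbolic).
    """
    if b == 0:
        return qi * math.exp(-di * t)
    return qi / (1.0 + b * di * t) ** (1.0 / b)


def cumulative(months, qi, di, b):
    """Approximate cumulative volume by summing the rate at each month's midpoint."""
    return sum(arps_rate(m + 0.5, qi, di, b) for m in range(months))


# Hypothetical well: 1000 units/month initial rate, 10 %/month initial decline.
first_year = cumulative(12, qi=1000.0, di=0.1, b=0.5)
five_year = cumulative(60, qi=1000.0, di=0.1, b=0.5)
print(round(first_year), round(five_year))
```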
5
Picart-Armada S, Thompson WK, Buil A, Perera-Lluna A. The effect of statistical normalization on network propagation scores. Bioinformatics 2021; 37:845-852. [PMID: 33070187 PMCID: PMC8097756 DOI: 10.1093/bioinformatics/btaa896] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/21/2020] [Revised: 09/18/2020] [Accepted: 10/07/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Network diffusion and label propagation are fundamental tools in computational biology, with applications such as gene-disease association, protein function prediction and module discovery. More recently, several publications have introduced a permutation analysis after the propagation process, due to concerns that network topology can bias diffusion scores. This raises the question of the statistical properties of such diffusion processes, and of whether they are biased, in each application. In this work, we characterized some common null models behind the permutation analysis and the statistical properties of the diffusion scores. We benchmarked seven diffusion scores on three case studies: synthetic signals on a yeast interactome, simulated differential gene expression on a protein-protein interaction network and prospective gene set prediction on another interaction network. For clarity, all the datasets were based on binary labels, but we also present theoretical results for quantitative labels.
RESULTS: Diffusion scores starting from binary labels were affected by the label codification and exhibited a problem-dependent topological bias that could be removed by statistical normalization. Parametric and non-parametric normalization addressed both points by being codification-independent and by equalizing the bias. We identified and quantified two sources of bias (mean value and variance) that yielded performance differences when normalizing the scores. We provided closed formulae for both and showed how the null covariance is related to the spectral properties of the graph. Although none of the proposed scores systematically outperformed the others, normalization was preferred when the sought positive labels were not aligned with the bias. We conclude that the decision on bias removal should be problem- and data-driven, i.e. based on a quantitative analysis of the bias and its relation to the positive entities.
AVAILABILITY: The code is publicly available at https://github.com/b2slab/diffuBench and the data underlying this article are available at https://github.com/b2slab/retroData.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sergio Picart-Armada
- B2SLab, Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, CIBER-BBN, Barcelona, 08028, Spain.,Esplugues de Llobregat, Institut de Recerca Pediàtrica Hospital Sant Joan de Déu, Barcelona, 08950, Spain
- Wesley K Thompson
- Mental Health Center Sct. Hans, 4000 Roskilde, Denmark.,Department of Family Medicine and Public Health, University of California, San Diego, La Jolla, CA, USA
- Alfonso Buil
- Mental Health Center Sct. Hans, 4000 Roskilde, Denmark
- Alexandre Perera-Lluna
- B2SLab, Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, CIBER-BBN, Barcelona, 08028, Spain.,Esplugues de Llobregat, Institut de Recerca Pediàtrica Hospital Sant Joan de Déu, Barcelona, 08950, Spain
6
Lopez-Del Rio A, Picart-Armada S, Perera-Lluna A. Balancing Data on Deep Learning-Based Proteochemometric Activity Classification. J Chem Inf Model 2021; 61:1657-1669. [PMID: 33779173 PMCID: PMC8594867 DOI: 10.1021/acs.jcim.1c00086] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/30/2022]
Abstract
In silico analysis of biological activity data has become an essential technique in pharmaceutical development. Specifically, the so-called proteochemometric models aim to share information between targets in machine learning ligand–target activity prediction models. However, bioactivity data sets used in proteochemometric modeling are usually imbalanced, which could potentially affect the performance of the models. In this work, we explored the effect of different balancing strategies in deep learning proteochemometric target–compound activity classification models while controlling for the compound series bias through clustering. These strategies were (1) no_resampling, (2) resampling_after_clustering, (3) resampling_before_clustering, and (4) semi_resampling. These schemas were evaluated in kinases, GPCRs, nuclear receptors, and proteases from BindingDB. We observed that the predicted proportion of positives was driven by the actual data balance in the test set. Additionally, it was confirmed that data balance had an impact on the performance estimates of the proteochemometric model. We recommend a combination of data augmentation and clustering in the training set (semi_resampling) to mitigate the data imbalance effect in a realistic scenario. The code of this analysis is publicly available at https://github.com/b2slab/imbalance_pcm_benchmark.
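As a generic illustration of balancing by resampling, the sketch below randomly oversamples the minority class until every class has as many samples as the largest one. It does not reproduce any of the four strategies named in the abstract; the function name and the toy data are assumptions, and in practice resampling would be applied to the training split only.

```python
import random


def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate samples of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw extra samples with replacement from the same class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Hypothetical training split: 2 actives (label 1) and 8 inactives (label 0).
samples = list(range(10))
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
balanced_x, balanced_y = oversample_minority(samples, labels)
print(balanced_y.count(1), balanced_y.count(0))
```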
Affiliation(s)
- Angela Lopez-Del Rio
- B2SLab, Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, 08028 Barcelona, Spain.,Department of Biomedical Engineering, Institut de Recerca Pediàtrica Hospital Sant Joan de Déu, 08950 Esplugues de Llobregat, Spain
- Sergio Picart-Armada
- B2SLab, Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, 08028 Barcelona, Spain.,Department of Biomedical Engineering, Institut de Recerca Pediàtrica Hospital Sant Joan de Déu, 08950 Esplugues de Llobregat, Spain
- Alexandre Perera-Lluna
- B2SLab, Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, 08028 Barcelona, Spain.,Department of Biomedical Engineering, Institut de Recerca Pediàtrica Hospital Sant Joan de Déu, 08950 Esplugues de Llobregat, Spain
7
Song J, Gao Y, Yin P, Li Y, Li Y, Zhang J, Su Q, Fu X, Pi H. The Random Forest Model Has the Best Accuracy Among the Four Pressure Ulcer Prediction Models Using Machine Learning Algorithms. Risk Manag Healthc Policy 2021; 14:1175-1187. [PMID: 33776495 PMCID: PMC7987326 DOI: 10.2147/rmhp.s297838] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 12/17/2020] [Accepted: 02/26/2021] [Indexed: 12/11/2022] Open
Abstract
Purpose: To build machine learning models for predicting pressure ulcers as a nursing adverse event, and to find an optimal model that accurately predicts their occurrence.
Patients and Methods: A total of 5814 patients were retrospectively enrolled, of whom 1673 suffered pressure ulcer events. Support vector machine (SVM), decision tree (DT), random forest (RF) and artificial neural network (ANN) models were used to construct the pressure ulcer prediction models. A total of 19 variables were included, and the importance of the screening variables was evaluated. The performance of the prediction models was also evaluated and compared.
Results: The experimental results show that the four pressure ulcer prediction models all achieve good performance, with AUC values all greater than 0.95. The comparison of the four models indicates that the RF model achieves a higher accuracy for the prediction of pressure ulcers.
Conclusion: This research verifies the feasibility of developing a management system for predicting nursing adverse events based on big data and machine learning technology. The random forest and decision tree models are more suitable for constructing a pressure ulcer prediction model. This study provides a reference for future pressure ulcer risk warning based on big data.
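The AUC figure used to compare the models can be computed directly from predicted scores as the probability that a randomly chosen positive case is ranked above a randomly chosen negative one. The sketch below is a minimal pairwise implementation with hypothetical scores; it is not the evaluation code used in the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted risks for four patients (1 = pressure ulcer event).
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
imperfect = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
print(perfect, imperfect)
```

The pairwise form is quadratic in the number of samples, so larger datasets usually rely on a rank-based implementation from a statistics library.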
Affiliation(s)
- Jie Song
- Medical School of Chinese PLA, Beijing, People's Republic of China
- Yuan Gao
- First Medical Center, Chinese PLA General Hospital, Beijing, People's Republic of China
- Pengbin Yin
- Fourth Medical Center, Chinese PLA General Hospital, Beijing, People's Republic of China
- Yi Li
- Medical School of Chinese PLA, Beijing, People's Republic of China
- Yang Li
- First Medical Center, Chinese PLA General Hospital, Beijing, People's Republic of China
- Jie Zhang
- Sixth Medical Center, Chinese PLA General Hospital, Beijing, People's Republic of China
- Qingqing Su
- Medical School of Chinese PLA, Beijing, People's Republic of China
- Xiaojie Fu
- First Medical Center, Chinese PLA General Hospital, Beijing, People's Republic of China
- Hongying Pi
- Medical Service Training Center, Chinese PLA General Hospital, Beijing, People's Republic of China
8
Lopez-Del Rio A, Martin M, Perera-Lluna A, Saidi R. Effect of sequence padding on the performance of deep learning models in archaeal protein functional prediction. Sci Rep 2020; 10:14634. [PMID: 32884053 PMCID: PMC7471694 DOI: 10.1038/s41598-020-71450-8] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 03/09/2020] [Accepted: 08/06/2020] [Indexed: 11/08/2022] Open
Abstract
The use of raw amino acid sequences as input for deep learning models for protein functional prediction has gained popularity in recent years. This scheme requires handling proteins of different lengths, while deep learning models require same-shape input. To accomplish this, zeros are usually added to each sequence up to an established common length in a process called zero-padding. However, the effect of different padding strategies on model performance and data structure is still unknown. We propose and implement four novel types of padding for amino acid sequences. We then analysed the impact of different ways of padding the amino acid sequences in a hierarchical Enzyme Commission number prediction problem. Results show that padding has an effect on model performance even when convolutional layers are involved. In contrast to most deep learning work, which focuses mainly on architectures, this study highlights the relevance of padding, a process often deemed of low importance, and raises awareness of the need to refine it for better performance. The code of this analysis is publicly available at https://github.com/b2slab/padding_benchmark.
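To make the zero-padding step concrete, the sketch below integer-encodes a sequence and pads it to a common length in two ways: zeros at the end, and zeros split across both ends. These two placements are illustrative assumptions only; the abstract does not describe its four novel padding types, and the encoding table and function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Index 0 is reserved for padding, so residues are numbered from 1.
AA_TO_INT = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode(seq):
    """Map each residue letter to its integer code."""
    return [AA_TO_INT[aa] for aa in seq]


def pad_end(encoded, length):
    """Truncate to `length`, then append zeros after the sequence."""
    return encoded[:length] + [0] * max(0, length - len(encoded))


def pad_both(encoded, length):
    """Truncate to `length`, then split the zeros between both ends."""
    encoded = encoded[:length]
    total = length - len(encoded)
    left = total // 2
    return [0] * left + encoded + [0] * (total - left)


codes = encode("ACD")
print(pad_end(codes, 6))
print(pad_both(codes, 6))
```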
Affiliation(s)
- Angela Lopez-Del Rio
- B2SLab, Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, 08028, Barcelona, Spain
- Department of Biomedical Engineering, Institut de Recerca Pediàtrica Hospital Sant Joan de Déu, 08950, Esplugues de Llobregat, Spain
- Maria Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK
- Alexandre Perera-Lluna
- B2SLab, Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, 08028, Barcelona, Spain
- Department of Biomedical Engineering, Institut de Recerca Pediàtrica Hospital Sant Joan de Déu, 08950, Esplugues de Llobregat, Spain
- Rabie Saidi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK
9
Francoeur PG, Masuda T, Sunseri J, Jia A, Iovanisci RB, Snyder I, Koes DR. Three-Dimensional Convolutional Neural Networks and a Cross-Docked Data Set for Structure-Based Drug Design. J Chem Inf Model 2020; 60:4200-4215. [DOI: 10.1021/acs.jcim.0c00411] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Indexed: 12/14/2022]
Affiliation(s)
- Paul G. Francoeur
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
- Tomohide Masuda
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
- Jocelyn Sunseri
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
- Andrew Jia
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
- Richard B. Iovanisci
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
- Ian Snyder
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
- David R. Koes
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
10
Abstract
The drugs we use to cure our diseases can cause damage to the liver as it is the primary organ responsible for metabolism of environmental chemicals and drugs. To identify and eliminate potentially problematic drug candidates in the early stages of drug discovery, in silico techniques provide quick and practical solutions for toxicity determination. Deep learning has emerged as one of the solutions in recent years in the field of pharmaceutical chemistry. Generally, in the case of small data sets as used in toxicology, these data-hungry algorithms are prone to overfitting. We approach the problem from two sides. First, we use images of the three-dimensional conformations and benefit from convolutional neural networks which have fewer parameters than the standard deep neural networks with similar depth. Using images allows connecting various chemical features to the geometry of the compounds. Second, we employ the method COVER to up-sample the data set. It is used not only for increasing the size of the data set, but also for balancing the two classes, i.e., toxic and not toxic. The proof of concept is performed on the p53 end point from the Tox21 data set. The results, which are compatible with the winners of the data challenge, encouraged us to use our methods to predict liver toxicity. We use the most extensive publicly available liver toxicity data set by Mulliner et al. and obtain a sensitivity of 0.79 and a specificity of 0.52. These results demonstrate the applicability of image based toxicity prediction using deep neural networks.
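The sensitivity and specificity reported in this abstract follow from the confusion matrix: sensitivity is the fraction of toxic compounds flagged as toxic, and specificity is the fraction of non-toxic compounds correctly cleared. The sketch below uses hypothetical labels and predictions and is not tied to the data set in the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels where 1 = toxic."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical example: four toxic and two non-toxic compounds.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
```

A model with high sensitivity and moderate specificity, as reported above, misses few toxic compounds at the price of more false alarms.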
Affiliation(s)
- Ece Asilar
- Department of Pharmaceutical Chemistry, University of Vienna, Althanstrasse 14, A-1090 Vienna, Austria
- Jennifer Hemmerich
- Department of Pharmaceutical Chemistry, University of Vienna, Althanstrasse 14, A-1090 Vienna, Austria
- Gerhard F Ecker
- Department of Pharmaceutical Chemistry, University of Vienna, Althanstrasse 14, A-1090 Vienna, Austria
11
Bongers BJ, IJzerman AP, Van Westen GJP. Proteochemometrics - recent developments in bioactivity and selectivity modeling. Drug Discov Today Technol 2019; 32-33:89-98. [PMID: 33386099 DOI: 10.1016/j.ddtec.2020.08.003] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 04/28/2020] [Revised: 08/18/2020] [Accepted: 08/28/2020] [Indexed: 06/12/2023]
Abstract
Proteochemometrics is a machine learning based modeling approach relying on a combination of ligand and protein descriptors. With ongoing developments in machine learning and increases in public data the technique is more frequently applied in early drug discovery, typically in ligand-target binding prediction. Common applications include improvements to single target quantitative structure-activity relationship models, protein selectivity and promiscuity modeling, and large-scale deep learning approaches. The increase in predictive power using proteochemometrics is observed in multi-target bioactivity modeling, opening the door to more extensive studies covering whole protein families. On top of that, with deep learning fueling more complex and larger scale models, proteochemometrics allows faster and higher quality computational models supporting the design, make, test cycle.
Affiliation(s)
- Brandon J Bongers
- Division of Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, P.O. Box 9502, 2300 RA, Leiden, The Netherlands
- Adriaan P IJzerman
- Division of Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, P.O. Box 9502, 2300 RA, Leiden, The Netherlands
- Gerard J P Van Westen
- Division of Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, P.O. Box 9502, 2300 RA, Leiden, The Netherlands.
12
Cockroft NT, Cheng X, Fuchs JR. STarFish: A Stacked Ensemble Target Fishing Approach and its Application to Natural Products. J Chem Inf Model 2019; 59:4906-4920. [PMID: 31589422 DOI: 10.1021/acs.jcim.9b00489] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Indexed: 12/22/2022]
Abstract
Target fishing is the process of identifying the protein target of a bioactive small molecule. To do so experimentally requires a significant investment of time and resources, which can be expedited with a reliable computational target fishing model. The development of computational target fishing models using machine learning has become very popular over the last several years because of the increased availability of large amounts of public bioactivity data. Unfortunately, the applicability and performance of such models for natural products has not yet been comprehensively assessed. This is, in part, due to the relative lack of bioactivity data available for natural products compared to synthetic compounds. Moreover, the databases commonly used to train such models do not annotate which compounds are natural products, which makes the collection of a benchmarking set difficult. To address this knowledge gap, a data set composed of natural product structures and their associated protein targets was generated by cross-referencing 20 publicly available natural product databases with the bioactivity database ChEMBL. This data set contains 5589 compound-target pairs for 1943 unique compounds and 1023 unique targets. A synthetic data set comprising 107 190 compound-target pairs for 88 728 unique compounds and 1907 unique targets was used to train k-nearest neighbors, random forest, and multilayer perceptron models. The predictive performance of each model was assessed by stratified 10-fold cross-validation and benchmarking on the newly collected natural product data set. Strong performance was observed for each model during cross-validation with area under the receiver operating characteristic (AUROC) scores ranging from 0.94 to 0.99 and Boltzmann-enhanced discrimination of receiver operating characteristic (BEDROC) scores from 0.89 to 0.94. 
When tested on the natural product data set, performance dramatically decreased with AUROC scores ranging from 0.70 to 0.85 and BEDROC scores from 0.43 to 0.59. However, the implementation of a model stacking approach, which uses logistic regression as a meta-classifier to combine model predictions, dramatically improved the ability to correctly predict the protein targets of natural products and increased the AUROC score to 0.94 and BEDROC score to 0.73. This stacked model was deployed as a web application, called STarFish, and has been made available for use to aid in target identification for natural products.
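Model stacking with a logistic regression meta-classifier, as described in this abstract, can be sketched as follows. This is an illustration and not the STarFish implementation: the base-model probabilities are hypothetical, the meta-classifier is a plain gradient-descent logistic regression, and the learning rate and epoch count are arbitrary. In a real pipeline the meta-classifier would be trained on out-of-fold predictions rather than on the base models' training outputs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta(features, labels, lr=0.5, epochs=2000):
    """Fit logistic regression weights on base-model probabilities by gradient descent."""
    n, n_feat = len(features), len(features[0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            grad_b += err
            for k in range(n_feat):
                grad_w[k] += err * x[k]
        b -= lr * grad_b / n
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
    return w, b


def predict_meta(w, b, x):
    """Combined probability that a compound-target pair is active."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))


# Hypothetical probabilities from three base models for six compound-target pairs.
features = [
    [0.9, 0.8, 0.7], [0.8, 0.6, 0.9], [0.7, 0.9, 0.6],
    [0.2, 0.3, 0.1], [0.1, 0.4, 0.3], [0.3, 0.2, 0.2],
]
labels = [1, 1, 1, 0, 0, 0]
w, b = fit_meta(features, labels)
stacked = [predict_meta(w, b, x) for x in features]
print([round(p, 2) for p in stacked])
```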
Affiliation(s)
- Nicholas T Cockroft
- Division of Medicinal Chemistry & Pharmacognosy, College of Pharmacy, The Ohio State University, Columbus, Ohio 43210, United States
- Xiaolin Cheng
- Division of Medicinal Chemistry & Pharmacognosy, College of Pharmacy, The Ohio State University, Columbus, Ohio 43210, United States
- James R Fuchs
- Division of Medicinal Chemistry & Pharmacognosy, College of Pharmacy, The Ohio State University, Columbus, Ohio 43210, United States