1
Kumar S, Ali I, Abbas F, Rana A, Pandey S, Garg M, Kumar D. In-silico design, pharmacophore-based screening, and molecular docking studies reveal that benzimidazole-1,2,3-triazole hybrids as novel EGFR inhibitors targeting lung cancer. J Biomol Struct Dyn 2024; 42:9416-9438. [PMID: 37646177 DOI: 10.1080/07391102.2023.2252496] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/02/2023] [Accepted: 08/18/2023] [Indexed: 09/01/2023]
Abstract
Lung cancer is a complex and heterogeneous disease associated with various molecular alterations, including overexpression and mutation of the epidermal growth factor receptor (EGFR). In this study, a library of 1843 benzimidazole-1,2,3-triazole hybrids was designed and subjected to pharmacophore-based screening to identify potential EGFR inhibitors. The 164 selected compounds were further evaluated using molecular docking and molecular dynamics simulations to understand the binding interactions between the compounds and the receptor. In-silico ADME and toxicity studies were also conducted to assess the drug-likeness and safety of the identified compounds. The results indicate that the benzimidazole-1,2,3-triazole hybrids BENZI-0660, BENZI-0125, BENZI-0279, BENZI-0415, BENZI-0437, and BENZI-1110 exhibit dock scores of -9.7, -9.6, -9.6, -9.6, -9.6, and -9.6 kcal/mol, respectively, against EGFR (PDB ID: 4HJO), compared with -7.9 kcal/mol for the reference molecule. The molecular docking and molecular dynamics simulations revealed that the identified compounds formed stable interactions with the active site of EGFR, indicating their potential as inhibitors. The in-silico ADME and toxicity studies showed that the compounds had favorable drug-likeness properties and low toxicity, further supporting their potential as therapeutic agents. Finally, DFT studies were performed on the best-selected ligands to gain further insight into their electronic properties. The findings of this study provide important insights into the potential of benzimidazole-1,2,3-triazole hybrids as promising EGFR inhibitors for the treatment of lung cancer. This research opens up a new avenue for the discovery and development of potent and selective EGFR inhibitors for the treatment of lung cancer. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Sunil Kumar
  - Department of Pharmaceutical Chemistry, School of Pharmaceutical Sciences, Shoolini University, Solan, India
- Iqra Ali
  - Department of Biosciences, COMSATS University Islamabad, Islamabad, Pakistan
- Faheem Abbas
  - Key Lab of Organic Optoelectronics and Molecular Engineering of Ministry of Education, Department of Chemistry, Tsinghua University, Beijing, P. R. China
- Anurag Rana
  - Yogananda School of Artificial Intelligence, Computers, and Data Sciences, Shoolini University, Solan, India
- Sadanand Pandey
  - Department of Chemistry, College of Natural Science, Yeungnam University, Gyeongsan, Korea
- Manoj Garg
  - Amity Institute of Molecular Medicine and Stem Cell Research, Amity University, Noida, India
- Deepak Kumar
  - Department of Pharmaceutical Chemistry, School of Pharmaceutical Sciences, Shoolini University, Solan, India
2
Ronchi D, Tosca EM, Bartolucci R, Magni P. Go beyond the limits of genetic algorithm in daily covariate selection practice. J Pharmacokinet Pharmacodyn 2024; 51:109-121. [PMID: 37493851 PMCID: PMC10982092 DOI: 10.1007/s10928-023-09875-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2023] [Accepted: 07/08/2023] [Indexed: 07/27/2023]
Abstract
Covariate identification is an important step in the development of a population pharmacokinetic/pharmacodynamic model. Among the different available approaches, the stepwise covariate model (SCM) is the most used. However, SCM is based on a local search strategy, in which the model-building process iteratively tests the addition or elimination of a single covariate at a time given all the others. This heuristic limits the search space, and thus the computational complexity, but can also lead to a suboptimal solution. The application of genetic algorithms (GAs) for covariate selection has been proposed as a possible solution to overcome these limitations. However, their actual use during model building is limited by extremely high computational costs and convergence issues, both related to the number of models being tested. In this paper, we propose a new GA for covariate selection to address these challenges. The GA was first developed on a simulated case study, where the heuristics introduced to overcome the limitations of currently available GA approaches were able to limit the selection of redundant covariates, increase the replicability of results, and reduce convergence times. Then, we tested the proposed GA on a real-world problem related to remifentanil. It obtained good results in terms of both selected covariates and fitness optimization, outperforming the SCM.
Affiliation(s)
- D Ronchi
  - Dipartimento di Ingegneria Industriale e dell'Informazione, Università degli Studi di Pavia, 27100, Pavia, Italy
- E M Tosca
  - Dipartimento di Ingegneria Industriale e dell'Informazione, Università degli Studi di Pavia, 27100, Pavia, Italy
- R Bartolucci
  - Dipartimento di Ingegneria Industriale e dell'Informazione, Università degli Studi di Pavia, 27100, Pavia, Italy
  - Clinical Pharmacology & Pharmacometrics, Janssen Research & Development, Beerse, Belgium
- P Magni
  - Dipartimento di Ingegneria Industriale e dell'Informazione, Università degli Studi di Pavia, 27100, Pavia, Italy
3
Shiammala PN, Duraimutharasan NKB, Vaseeharan B, Alothaim AS, Al-Malki ES, Snekaa B, Safi SZ, Singh SK, Velmurugan D, Selvaraj C. Exploring the artificial intelligence and machine learning models in the context of drug design difficulties and future potential for the pharmaceutical sectors. Methods 2023; 219:82-94. [PMID: 37778659 DOI: 10.1016/j.ymeth.2023.09.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Revised: 09/21/2023] [Accepted: 09/25/2023] [Indexed: 10/03/2023] Open
Abstract
Artificial intelligence (AI), particularly deep learning as a subcategory of AI, provides opportunities to accelerate and improve the process of discovering and developing new drugs. The use of AI in drug discovery is still in its early stages, but it has the potential to revolutionize the way new drugs are discovered and developed. As AI technology continues to evolve, it is likely that AI will play an even greater role in the future of drug discovery. AI is used to identify new drug targets, design new molecules, and predict the efficacy and safety of potential drugs. AI-assisted drug discovery can screen millions of compounds in a matter of hours, identifying potential drug candidates that would have taken years to find using traditional methods. AI is also widely used in the pharmaceutical industry to optimize processes, reduce waste, and ensure quality control. This review covers the different types of machine-learning techniques, their applications in drug discovery, and the challenges and limitations of using machine learning in this field. The state of the art of AI-assisted pharmaceutical discovery is described, covering applications in structure- and ligand-based virtual screening, de novo drug creation, prediction of physicochemical and pharmacokinetic properties, drug repurposing, and related topics. Finally, the obstacles and limits of present approaches are outlined, with an eye on potential future avenues for AI-assisted drug discovery and design.
Affiliation(s)
- Baskaralingam Vaseeharan
  - Department of Animal Health and Management, Science Block, Alagappa University, Karaikudi, Tamil Nadu 630 003, India
- Abdulaziz S Alothaim
  - Department of Biology, College of Science in Zulfi, Majmaah University, Al-Majmaah 11952, Saudi Arabia
- Esam S Al-Malki
  - Department of Biology, College of Science in Zulfi, Majmaah University, Al-Majmaah 11952, Saudi Arabia
- Babu Snekaa
  - Laboratory for Artificial Intelligence and Molecular Modelling, Department of Pharmacology, Saveetha Dental College and Hospitals, Saveetha Institute of Medical and Technical Sciences (SIMATS), Saveetha University, Chennai, Tamil Nadu 600077, India
- Sher Zaman Safi
  - Faculty of Medicine, Bioscience and Nursing, MAHSA University, Jenjarom 42610, Selangor, Malaysia
- Sanjeev Kumar Singh
  - Computer Aided Drug Design and Molecular Modelling Lab, Department of Bioinformatics, Science Block, Alagappa University, Karaikudi-630 003, Tamil Nadu, India
- Devadasan Velmurugan
  - Department of Biotechnology, College of Engineering & Technology, SRM Institute of Science & Technology, Kattankulathur, Chennai, Tamil Nadu 603203, India
- Chandrabose Selvaraj
  - Laboratory for Artificial Intelligence and Molecular Modelling, Department of Pharmacology, Saveetha Dental College and Hospitals, Saveetha Institute of Medical and Technical Sciences (SIMATS), Saveetha University, Chennai, Tamil Nadu 600077, India
  - Laboratory for Artificial Intelligence and Molecular Modelling, Center for Global Health Research, Saveetha Medical College, Saveetha Institute of Medical and Technical Sciences, Saveetha Nagar, Thandalam, Chennai, Tamil Nadu 602105, India
4
Dutschmann TM, Kinzel L, Ter Laak A, Baumann K. Large-scale evaluation of k-fold cross-validation ensembles for uncertainty estimation. J Cheminform 2023; 15:49. [PMID: 37118768 PMCID: PMC10142532 DOI: 10.1186/s13321-023-00709-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/03/2022] [Accepted: 03/10/2023] [Indexed: 04/30/2023] Open
Abstract
It is insightful to report, in addition to the prediction itself, an estimator that describes how certain a model is in that prediction. For regression tasks, most approaches implement a variation of the ensemble method, with few exceptions. Instead of a single estimator, a group of estimators yields several predictions for an input. The uncertainty can then be quantified by measuring the disagreement between the predictions, for example by the standard deviation. In theory, ensembles should not only provide uncertainties but also boost predictive performance by reducing errors arising from variance. Despite the development of novel methods, ensembles are still considered the "gold standard" for quantifying the uncertainty of regression models. Subsampling-based methods to obtain ensembles can be applied to all models, regardless of whether they are related to deep learning or traditional machine learning. However, little attention has been given to the question of whether the ensemble method is applicable to virtually all scenarios occurring in the field of cheminformatics. In a broad and diversified study, ensembles are evaluated for 32 datasets of different sizes and modeling difficulty, ranging from physicochemical properties to biological activities. For increasing ensemble sizes with up to 200 members, the predictive performance as well as the applicability as an uncertainty estimator are shown for all combinations of five modeling techniques and four molecular featurizations. Useful recommendations were derived for practitioners regarding the success and minimum size of ensembles, depending on whether predictive performance or uncertainty quantification is of more importance for the task at hand.
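As a rough illustration of the idea in this abstract, the sketch below builds a small k-fold ensemble and uses the standard deviation of the member predictions as the uncertainty. It is only a toy under stated assumptions: the base model is a one-dimensional least-squares line, the data are synthetic, and the names `fit_line`, `kfold_ensemble`, and `predict_with_uncertainty` are hypothetical and not taken from the paper.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x on 1-D data (toy base model)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def kfold_ensemble(xs, ys, k=5, seed=0):
    """Train k models, each on the data with one of k folds held out."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    models = []
    for held_out in folds:
        skip = set(held_out)
        train = [i for i in idx if i not in skip]
        models.append(fit_line([xs[i] for i in train], [ys[i] for i in train]))
    return models


def predict_with_uncertainty(models, x):
    """Ensemble mean and standard deviation (member disagreement) at x."""
    preds = [a + b * x for a, b in models]
    return statistics.fmean(preds), statistics.stdev(preds)


# Synthetic data (assumption): y = 2x + 1 plus Gaussian noise.
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.3) for x in xs]

models = kfold_ensemble(xs, ys, k=5)
mean_in, sd_in = predict_with_uncertainty(models, 2.5)     # inside the training range
mean_out, sd_out = predict_with_uncertainty(models, 50.0)  # far outside the training range
print(f"x=2.5:  {mean_in:.2f} +/- {sd_in:.3f}")
print(f"x=50.0: {mean_out:.2f} +/- {sd_out:.3f}")
```

With a setup like this, one would typically expect the disagreement to grow for inputs far from the training range, which is the behaviour that makes the ensemble standard deviation usable as an uncertainty estimate.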
Affiliation(s)
- Thomas-Martin Dutschmann
  - Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, Beethovenstrasse 55, 38106, Brunswick, Germany
- Lennart Kinzel
  - Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, Beethovenstrasse 55, 38106, Brunswick, Germany
- Antonius Ter Laak
  - Bayer AG, Research & Development, Pharmaceuticals, Muellerstrasse 178, 13353, Berlin, Germany
- Knut Baumann
  - Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, Beethovenstrasse 55, 38106, Brunswick, Germany
5
Liu XQ, Yi YJ, Kong Y, Yu P, Zhao LG, Li DD. Consensus scoring model: A novel approach to the study of EGFR kinase inhibitors. Chem Phys Lett 2022. [DOI: 10.1016/j.cplett.2022.139650] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
6
Prada Gori DN, Llanos MA, Bellera CL, Talevi A, Alberca LN. iRaPCA and SOMoC: Development and Validation of Web Applications for New Approaches for the Clustering of Small Molecules. J Chem Inf Model 2022; 62:2987-2998. [PMID: 35687523 DOI: 10.1021/acs.jcim.2c00265] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Abstract
The clustering of small molecules implies the organization of a group of chemical structures into smaller subgroups with similar features. Clustering has important applications to sample chemical datasets or libraries in a representative manner (e.g., to choose, from a virtual screening hit list, a chemically diverse subset of compounds to be submitted to experimental confirmation, or to split datasets into representative training and validation sets when implementing machine learning models). Most strategies for clustering molecules are based on molecular fingerprints and hierarchical clustering algorithms. Here, two open-source in-house methodologies for clustering of small molecules are presented: iterative Random subspace Principal Component Analysis clustering (iRaPCA), an iterative approach based on feature bagging, dimensionality reduction, and K-means optimization; and Silhouette Optimized Molecular Clustering (SOMoC), which combines molecular fingerprints with the Uniform Manifold Approximation and Projection (UMAP) and Gaussian Mixture Model algorithm (GMM). In a benchmarking exercise, the performance of both clustering methods has been examined across 29 datasets containing between 100 and 5000 small molecules, comparing these results with those given by two other well-known clustering methods, Ward and Butina. iRaPCA and SOMoC consistently showed the best performance across these 29 datasets, both in terms of within-cluster and between-cluster distances. Both iRaPCA and SOMoC have been implemented as free Web Apps and standalone applications, to allow their use to a wide audience within the scientific community.
Affiliation(s)
- Denis N Prada Gori
  - Laboratory of Bioactive Compounds Research and Development (LIDeB), Department of Biological Sciences, Faculty of Exact Sciences, National University of La Plata (UNLP), La Plata B1900ADU, Argentina
- Manuel A Llanos
  - Laboratory of Bioactive Compounds Research and Development (LIDeB), Department of Biological Sciences, Faculty of Exact Sciences, National University of La Plata (UNLP), La Plata B1900ADU, Argentina
- Carolina L Bellera
  - Laboratory of Bioactive Compounds Research and Development (LIDeB), Department of Biological Sciences, Faculty of Exact Sciences, National University of La Plata (UNLP), La Plata B1900ADU, Argentina
- Alan Talevi
  - Laboratory of Bioactive Compounds Research and Development (LIDeB), Department of Biological Sciences, Faculty of Exact Sciences, National University of La Plata (UNLP), La Plata B1900ADU, Argentina
- Lucas N Alberca
  - Laboratory of Bioactive Compounds Research and Development (LIDeB), Department of Biological Sciences, Faculty of Exact Sciences, National University of La Plata (UNLP), La Plata B1900ADU, Argentina
7
Yin Y, Hu H, Yang Z, Jiang F, Huang Y, Wu J. AFSE: towards improving model generalization of deep graph learning of ligand bioactivities targeting GPCR proteins. Brief Bioinform 2022; 23:6554127. [PMID: 35348582 DOI: 10.1093/bib/bbac077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2021] [Revised: 02/12/2022] [Accepted: 02/14/2022] [Indexed: 11/14/2022] Open
Abstract
Ligand molecules naturally constitute a graph structure. Recently, many excellent deep graph learning (DGL) methods have been proposed and used to model ligand bioactivities, which is critical for the virtual screening of drug hits from compound databases of interest. However, pharmacists often find that these well-trained DGL models struggle to achieve satisfactory performance in real virtual screening of drug candidates. The main challenges are that the datasets used for training are small and biased, and that the activity cliff cases they contain worsen model performance. These challenges cause predictors to overfit the training data and generalize poorly in real virtual screening scenarios. Thus, we proposed a novel algorithm named adversarial feature subspace enhancement (AFSE). AFSE dynamically generates abundant representations in a new feature subspace via bi-directional adversarial learning, and then minimizes the maximum loss of molecular divergence and bioactivity to ensure local smoothness of model outputs and significantly enhance the generalization of DGL models in predicting ligand bioactivities. Benchmark tests were carried out on seven state-of-the-art open-source DGL models with the potential to model ligand bioactivities, and evaluated by multiple criteria. The results indicate that, on almost all 33 GPCR datasets and seven DGL models, AFSE greatly improved the enhancement factor (top-10%, 20% and 30%), which is the most important evaluation in virtual screening of hits from compound databases, while ensuring superior performance on RMSE and $r^2$. The web server of AFSE is freely available at http://noveldelta.com/AFSE for academic purposes.
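To make the top-x% metric concrete, the following sketch computes it under the assumption that the "enhancement factor" in this abstract corresponds to the enrichment factor commonly used in virtual screening: the hit rate among the top-ranked fraction divided by the hit rate in the whole set. The function name `enrichment_factor` and the toy scores and labels are illustrative only and do not come from the paper.

```python
def enrichment_factor(scores, labels, fraction):
    """Hit rate in the top `fraction` of ranked compounds / overall hit rate.

    scores: predicted bioactivity, higher means more likely active.
    labels: 1 for a known active, 0 for an inactive.
    """
    n = len(scores)
    n_top = max(1, round(n * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    hits_all = sum(labels)
    if hits_all == 0:
        raise ValueError("enrichment factor is undefined without any actives")
    return (hits_top / n_top) / (hits_all / n)


# Toy example (made-up values): 10 compounds, 3 of them active.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

for fraction in (0.1, 0.2, 0.3):
    print(fraction, round(enrichment_factor(scores, labels, fraction), 3))
```

A value of 1 means the ranking is no better than random selection; larger values mean actives are concentrated at the top of the list.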
Affiliation(s)
- Yueming Yin
  - School of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
- Haifeng Hu
  - School of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
- Zhen Yang
  - School of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
  - National Engineering Research Center of Communications and Networking, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
- Feihu Jiang
  - School of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
- Yihe Huang
  - School of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
- Jiansheng Wu
  - School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  - Smart Health Big Data Analysis and Location Services Engineering Research Center of Jiangsu Province, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
8
Dutschmann TM, Baumann K. Evaluating High-Variance Leaves as Uncertainty Measure for Random Forest Regression. Molecules 2021; 26:6514. [PMID: 34770921 PMCID: PMC8588039 DOI: 10.3390/molecules26216514] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/30/2021] [Revised: 10/19/2021] [Accepted: 10/22/2021] [Indexed: 01/31/2023] Open
Abstract
Uncertainty measures estimate the reliability of a predictive model. Especially in the field of molecular property prediction as part of drug design, model reliability is crucial. Among other techniques, Random Forests have a long tradition in machine learning related to chemoinformatics and are widely used. Random Forests consist of an ensemble of individual regression models, namely decision trees, and therefore provide an uncertainty measure by construction. Regarding the disagreement of single-model predictions, a narrower distribution of predictions is interpreted as higher reliability. The standard deviation of the decision tree ensemble predictions is the default uncertainty measure for Random Forests. Due to the increasing application of machine learning in drug design, there is a constant search for novel uncertainty measures that, ideally, outperform classical uncertainty criteria. When analyzing Random Forests, it appears natural to consider the variance of the dependent variable within each terminal decision tree leaf to obtain predictive uncertainties. Predictions that arise from more leaves of high variance are then considered less reliable. As expected, the number of such high-variance leaves yields a reasonable uncertainty measure. Depending on the dataset, it can also outperform ensemble uncertainties. However, small-scale comparisons, i.e., those considering only a few datasets, are insufficient, since they are more prone to chance correlations. Therefore, large-scale evaluations are required to make general claims about the performance of uncertainty measures. On several chemoinformatic regression datasets, high-variance leaves are compared to the standard deviation of ensemble predictions. It turns out that high-variance leaf uncertainty is meaningful but not superior to the default ensemble standard deviation. A brief possible explanation is offered.
9
Yin Y, Hu H, Yang Z, Xu H, Wu J. RealVS: Toward Enhancing the Precision of Top Hits in Ligand-Based Virtual Screening of Drug Leads from Large Compound Databases. J Chem Inf Model 2021; 61:4924-4939. [PMID: 34619030 DOI: 10.1021/acs.jcim.1c01021] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/19/2022]
Abstract
Accurate modeling of compound bioactivities is essential for the virtual screening of drug leads. In real-world scenarios, pharmacists tend to choose from the top-k hit compounds, ranked by predicted bioactivity across a large database of interest, to continue wet-lab experiments for drug discovery. Significantly improving the precision of the top hits in ligand-based virtual screening of drug leads is therefore more valuable than conventional schemes that aim to accurately predict the bioactivities of all compounds in a large database. Here, we proposed a new method, RealVS, to significantly improve the precision of the top hits and learn interpretable key substructures associated with compound bioactivities. RealVS has the following features. (1) Abundant transferable information from the source domain was introduced to alleviate the shortage of inactive ligands associated with drug targets. (2) Adversarial domain alignment was adopted to match the distribution of generated compound features from the training data set with that from the screening database, for greater model generalization ability. (3) A novel objective function was proposed to simultaneously optimize the classification loss, regression loss, and adversarial loss, so that most inactive ligands tend to be screened out before activity regression prediction. (4) Graph attention networks were adopted to learn key substructures associated with ligand bioactivities for better model interpretability. The results on a large number of benchmark data sets show that our method significantly improves the precision of top hits under various k values in ligand-based virtual screening of drug leads from large compound databases, which is of great value in real-world scenarios. The web server of RealVS is freely available at noveldelta.com/RealVS for academic purposes, where virtual screening of hits from large compound databases is accessible.
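The "precision of top hits under various k values" mentioned here can be read as precision at k: the fraction of true actives among the k best-ranked compounds. The sketch below shows that reading on made-up data; the function name `precision_at_k` and the example values are assumptions for illustration, not part of RealVS.

```python
def precision_at_k(scores, labels, k):
    """Fraction of true actives among the k compounds with the highest scores."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of compounds")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in order[:k]) / k


# Toy screening list (made-up values): predicted bioactivity and true activity labels.
scores = [0.20, 0.95, 0.50, 0.80, 0.10, 0.70]
labels = [0, 1, 0, 1, 0, 0]

for k in (1, 2, 3, 6):
    print(k, round(precision_at_k(scores, labels, k), 3))
```

Unlike an error metric computed over the whole database, this quantity only rewards getting the head of the ranking right, which matches the selection step described in the abstract.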
Affiliation(s)
- Yueming Yin
  - College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
- Haifeng Hu
  - College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
- Zhen Yang
  - National Engineering Research Center of Communications and Networking, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
- Huajian Xu
  - College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
- Jiansheng Wu
  - School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
10
McComb M, Bies R, Ramanathan M. Machine learning in pharmacometrics: Opportunities and challenges. Br J Clin Pharmacol 2021; 88:1482-1499. [PMID: 33634893 DOI: 10.1111/bcp.14801] [Citation(s) in RCA: 41] [Impact Index Per Article: 13.7] [Received: 10/21/2020] [Revised: 02/08/2021] [Accepted: 02/12/2021] [Indexed: 12/13/2022] Open
Abstract
The explosive growth in medical devices, imaging and diagnostics, computing, and communication and information technologies in drug development and healthcare has created an ever-expanding data landscape that the pharmacometrics (PMX) research community must now traverse. The tools of machine learning (ML) have emerged as a powerful computational approach in other data-rich disciplines, but their effective utilization in the pharmaceutical sciences and PMX modelling is in its infancy. ML-based methods can complement PMX modelling by enabling the information in diverse sources of big data, e.g. population-based public databases and disease-specific clinical registries, to be harnessed, because they are capable of efficiently identifying salient variables associated with outcomes and delineating their interdependencies. ML algorithms are computationally efficient, have strong predictive capabilities and can enable learning in the big data setting. ML algorithms can be viewed as providing a computational bridge from big data to complement PMX modelling. This review provides an overview of the strengths and weaknesses of ML approaches vis-à-vis population methods, assesses current research into ML applications in the pharmaceutical sciences and provides perspective on potential opportunities and strategies for the successful integration and utilization of ML in PMX.
Affiliation(s)
- Mason McComb
  - Department of Pharmaceutical Sciences, University at Buffalo, State University of New York, Buffalo, NY, USA
- Robert Bies
  - Department of Pharmaceutical Sciences, University at Buffalo, State University of New York, Buffalo, NY, USA
  - Institute for Computational Data Science, University at Buffalo, NY, USA
- Murali Ramanathan
  - Department of Pharmaceutical Sciences, University at Buffalo, State University of New York, Buffalo, NY, USA
  - Department of Neurology, University at Buffalo, State University of New York, Buffalo, NY, USA
11
Lane TR, Foil DH, Minerali E, Urbina F, Zorn KM, Ekins S. Bioactivity Comparison across Multiple Machine Learning Algorithms Using over 5000 Datasets for Drug Discovery. Mol Pharm 2020; 18:403-415. [PMID: 33325717 DOI: 10.1021/acs.molpharmaceut.0c01013] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Indexed: 12/14/2022]
Abstract
Machine learning methods are attracting considerable attention from the pharmaceutical industry for use in drug discovery and applications beyond. In recent studies, we and others have applied multiple machine learning algorithms and modeling metrics and, in some cases, compared molecular descriptors to build models for individual targets or properties on a relatively small scale. Several research groups have used large numbers of datasets from public databases such as ChEMBL in order to evaluate machine learning methods of interest to them. The largest of these types of studies used on the order of 1400 datasets. We have now extracted well over 5000 datasets from ChEMBL for use with the ECFP6 fingerprint and in comparison of our proprietary software Assay Central with random forest, k-nearest neighbors, support vector classification, naïve Bayesian, AdaBoosted decision trees, and deep neural networks (three layers). Model performance was assessed using an array of fivefold cross-validation metrics including area-under-the-curve, F1 score, Cohen's kappa, and Matthews correlation coefficient. Based on ranked normalized scores for the metrics or datasets, all methods appeared comparable, while the distance from the top indicated that Assay Central and support vector classification were comparable. Unlike prior studies which have placed considerable emphasis on deep neural networks (deep learning), no advantage was seen in this case. If anything, Assay Central may have been at a slight advantage as the activity cutoff for each of the over 5000 datasets representing over 570,000 unique compounds was based on Assay Central performance, although support vector classification seems to be a strong competitor. We also applied Assay Central to perform prospective predictions for the toxicity targets PXR and hERG to further validate these models. This work appears to be the largest scale comparison of these machine learning algorithms to date. Future studies will likely evaluate additional databases, descriptors, and machine learning algorithms and further refine the methods for evaluating and comparing such models.
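Three of the classification metrics named in this abstract (F1 score, Cohen's kappa, and Matthews correlation coefficient) can all be derived from a binary confusion matrix. The sketch below uses the textbook definitions on made-up labels; the function names are hypothetical, and nothing here reproduces the Assay Central pipeline or its cross-validation setup.

```python
import math


def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(tp, fp, fn, tn):
    """Harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohen_kappa(tp, fp, fn, tn):
    """Agreement between truth and prediction, corrected for chance agreement."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


def matthews_corrcoef(tp, fp, fn, tn):
    """Correlation between true and predicted binary labels."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up example: 4 actives and 6 inactives, with one miss and one false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
counts = confusion(y_true, y_pred)

print("F1   ", round(f1_score(*counts), 4))
print("kappa", round(cohen_kappa(*counts), 4))
print("MCC  ", round(matthews_corrcoef(*counts), 4))
```

In a fivefold cross-validation setting, metrics like these would be computed on each held-out fold and then aggregated; that loop is left out here to keep the example short.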
Affiliation(s)
- Thomas R Lane
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
- Daniel H Foil
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
- Eni Minerali
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
- Fabio Urbina
  - Department of Cell Biology and Physiology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599-7545, United States
- Kimberley M Zorn
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
- Sean Ekins
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
12
Sheridan RP, Karnachi P, Tudor M, Xu Y, Liaw A, Shah F, Cheng AC, Joshi E, Glick M, Alvarez J. Experimental Error, Kurtosis, Activity Cliffs, and Methodology: What Limits the Predictivity of Quantitative Structure-Activity Relationship Models? J Chem Inf Model 2020; 60:1969-1982. [PMID: 32207612 DOI: 10.1021/acs.jcim.9b01067] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Indexed: 01/08/2023]
Abstract
Given a particular descriptor/method combination, some quantitative structure-activity relationship (QSAR) datasets are very predictive by random-split cross-validation while others are not. Recent literature in modelability suggests that the limiting issue for predictivity is in the data, not the QSAR methodology, and the limits are due to activity cliffs. Here, we investigate, on in-house data, the relative usefulness of experimental error, distribution of the activities, and activity cliff metrics in determining how predictive a dataset is likely to be. We include unmodified in-house datasets, datasets that should be perfectly predictive based only on the chemical structure, datasets where the distribution of activities is manipulated, and datasets that include a known amount of added noise. We find that activity cliff metrics determine predictivity better than the other metrics we investigated, whatever the type of dataset, consistent with the modelability literature. However, such metrics cannot distinguish real activity cliffs due to large uncertainties in the activities. We also show that a number of modern QSAR methods, and some alternative descriptors, are equally bad at predicting the activities of compounds on activity cliffs, consistent with the assumptions behind "modelability." Finally, we relate time-split predictivity with random-split predictivity and show that different coverages of chemical space are at least as important as uncertainty in activity and/or activity cliffs in limiting predictivity.
Affiliation(s)
- Robert P Sheridan: Computational and Structural Chemistry, Merck & Company Inc., Kenilworth, New Jersey 07033, United States
- Prabha Karnachi: Computational and Structural Chemistry, Merck & Company Inc., Kenilworth, New Jersey 07033, United States
- Matthew Tudor: Computational and Structural Chemistry, Merck & Company Inc., West Point, Pennsylvania 19486, United States
- Yuting Xu: Biometrics Research, Merck & Company Inc., Rahway, New Jersey 07065, United States
- Andy Liaw: Biometrics Research, Merck & Company Inc., Rahway, New Jersey 07065, United States
- Falgun Shah: Computational and Structural Chemistry, Merck & Company Inc., West Point, Pennsylvania 19486, United States
- Alan C Cheng: Computational and Structural Chemistry, Merck & Company Inc., South San Francisco, California 94080, United States
- Elizabeth Joshi: Pharmacokinetics, Pharmacodynamics & Drug Metabolism, Merck & Company Inc., West Point, Pennsylvania 19486, United States
- Meir Glick: Computational and Structural Chemistry, Merck & Company Inc., Boston, Massachusetts 02115, United States
- Juan Alvarez: Computational and Structural Chemistry, Merck & Company Inc., Boston, Massachusetts 02115, United States
|
13
|
Wu J, Sun Y, Chan WKB, Zhu Y, Zhu W, Huang W, Hu H, Yan S, Pang T, Ke X, Li F. Homologous G Protein-Coupled Receptors Boost the Modeling and Interpretation of Bioactivities of Ligand Molecules. J Chem Inf Model 2020; 60:1865-1875. [PMID: 32040913 DOI: 10.1021/acs.jcim.9b01000] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 01/25/2023]
Abstract
G protein-coupled receptors (GPCRs) are one of the most important drug targets, accounting for ∼34% of drugs on the market. For drug discovery, accurate modeling and explanation of bioactivities of ligands is critical for the screening and optimization of hit compounds. Homologous GPCRs are more likely to interact with chemically similar ligands, and they tend to share common binding modes with ligand molecules. The inclusion of homologous GPCRs in learning bioactivities of ligands potentially enhances the accuracy and interpretability of models due to utilizing increased training sample size and the existence of common ligand substructures that control bioactivities. Accurate modeling and interpretation of bioactivities of ligands by combining homologous GPCRs can be formulated as multitask learning with joint feature learning problem and naturally matched with the group lasso learning algorithm. Thus, we proposed a multitask regression learning with group lasso (MTR-GL) implemented by l2,1-norm regularization to model bioactivities of ligand molecules and then tested the algorithm on a series of thirty-five representative GPCRs datasets that cover nine subfamilies of human GPCRs. The results show that MTR-GL is overall superior to single-task learning methods and classic multitask learning with joint feature learning methods. Moreover, MTR-GL achieves better performance than state-of-the-art deep multitask learning based methods of predicting ligand bioactivities on most datasets (31/35), where MTR-GL obtained an average improvement of 38% on correlation coefficient (r2) and 29% on root-mean-square error over the DeepNeuralNet-QSAR predictors.
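The l2,1-norm penalty mentioned in this abstract can be illustrated with a small sketch. It treats the weight matrix as rows of features and columns of tasks (here, homologous receptors), so that a feature is kept or discarded jointly across all tasks. The function names, the toy matrix, and the use of a row-wise soft-thresholding step are illustrative assumptions; the abstract does not state how MTR-GL is optimized.

```python
import math

def l21_norm(weights):
    """Sum over features (rows) of the L2 norm taken across tasks (columns)."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in weights)

def group_soft_threshold(weights, lam):
    """Shrink each feature row toward zero; rows with norm <= lam vanish for all tasks."""
    shrunk = []
    for row in weights:
        norm = math.sqrt(sum(w * w for w in row))
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        shrunk.append([w * scale for w in row])
    return shrunk

# Hypothetical weights: 3 substructure features x 2 receptor tasks.
W = [[3.0, 4.0],
     [0.0, 0.0],
     [0.3, 0.4]]
print(l21_norm(W))                    # penalty value for this matrix
print(group_soft_threshold(W, 1.0))   # the weak third feature is removed for both tasks
```

Because the penalty acts on whole rows, the surviving features are shared by every task, which is what makes the selected substructures interpretable across homologous receptors.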
Affiliation(s)
- Jiansheng Wu: School of Geographic and Biological Information, Nanjing University of Posts and Telecommunications, Nanjing 210023, China; Smart Health Big Data Analysis and Location Services Engineering Lab of Jiangsu Province, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
- Yi Sun: School of Telecommunication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
- Wallace K B Chan: Department of Pharmacology, University of Michigan, Ann Arbor, Michigan 48109, United States
- Yanxiang Zhu: Verimake Research, Nanjing Qujike Info-tech Co., Ltd., Nanjing 210088, China
- Wenyong Zhu: State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
- Wanqing Huang: School of Telecommunication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
- Haifeng Hu: School of Telecommunication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
- Shancheng Yan: School of Geographic and Biological Information, Nanjing University of Posts and Telecommunications, Nanjing 210023, China; Smart Health Big Data Analysis and Location Services Engineering Lab of Jiangsu Province, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
- Tao Pang: Jiangsu Key Laboratory of Drug Screening, China Pharmaceutical University, Nanjing 210009, China
- Xiaoyan Ke: Child Mental Health Research Center, Nanjing Brain Hospital, Nanjing Medical University, Nanjing 210029, China
- Fei Li: State Key Laboratory of Natural Medicines, China Pharmaceutical University, Nanjing 210009, China
|
14
|
Wu J, Zhang Q, Wu W, Pang T, Hu H, Chan WKB, Ke X, Zhang Y. WDL-RF: predicting bioactivities of ligand molecules acting with G protein-coupled receptors by combining weighted deep learning and random forest. Bioinformatics 2019; 34:2271-2282. [PMID: 29432522 DOI: 10.1093/bioinformatics/bty070] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 09/21/2017] [Accepted: 02/07/2018] [Indexed: 12/11/2022]
Abstract
Motivation Precise assessment of ligand bioactivities (including IC50, EC50, Ki, Kd, etc.) is essential for virtual screening and lead compound identification. However, not all ligands have experimentally determined activities. In particular, many G protein-coupled receptors (GPCRs), which are the largest integral membrane protein family and represent targets of nearly 40% of drugs on the market, lack published experimental data about ligand interactions. Computational methods with the ability to accurately predict the bioactivity of ligands can help efficiently address this problem. Results We proposed a new method, WDL-RF, using weighted deep learning and random forest, to model the bioactivity of GPCR-associated ligand molecules. The pipeline of our algorithm consists of two consecutive stages: (i) molecular fingerprint generation through a new weighted deep learning method, and (ii) bioactivity calculations with a random forest model. A distinguishing feature of the approach is that the model allows end-to-end learning of prediction pipelines with input ligands of arbitrary size. The method was tested on a set of twenty-six non-redundant GPCRs that have a high number of active ligands, each with 200-4000 ligand associations. The results from our benchmark show that WDL-RF can generate bioactivity predictions with an average root-mean-square error of 1.33 and a correlation coefficient (r2) of 0.80 compared to the experimental measurements, which are significantly more accurate than the control predictors with different molecular fingerprints and descriptors. In particular, data-driven molecular fingerprint features, as extracted from the weighted deep learning models, can help solve deficiencies stemming from the use of traditional hand-crafted features and significantly increase the efficiency of short molecular fingerprints in virtual screening.
Availability and implementation The WDL-RF web server, as well as source codes and datasets of WDL-RF, is freely available at https://zhanglab.ccmb.med.umich.edu/WDL-RF/ for academic purposes. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jiansheng Wu: School of Geographic and Biological Information, Nanjing University of Posts and Telecommunications, Nanjing, China; Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, USA
- Qiuming Zhang: School of Telecommunication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China
- Weijian Wu: College of Computer and Information, Hohai University, Nanjing, China
- Tao Pang: Jiangsu Key Laboratory of Drug Screening, China Pharmaceutical University, Nanjing, China
- Haifeng Hu: School of Telecommunication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China
- Wallace K B Chan: Department of Biological Chemistry, University of Michigan, Ann Arbor, USA
- Xiaoyan Ke: Child Mental Health Research Center, Nanjing Brain Hospital, Nanjing Medical University, Nanjing, China
- Yang Zhang: Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, USA
|
15
|
Wu J, Liu B, Chan WKB, Wu W, Pang T, Hu H, Yan S, Ke X, Zhang Y. Precise modelling and interpretation of bioactivities of ligands targeting G protein-coupled receptors. Bioinformatics 2019; 35:i324-i332. [PMID: 31510691 PMCID: PMC6612825 DOI: 10.1093/bioinformatics/btz336] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/30/2022]
Abstract
MOTIVATION Accurate prediction and interpretation of ligand bioactivities are essential for virtual screening and drug discovery. Unfortunately, many important drug targets lack experimental data about the ligand bioactivities; this is particularly true for G protein-coupled receptors (GPCRs), which account for the targets of about a third of drugs currently on the market. Computational approaches with the potential of precise assessment of ligand bioactivities and determination of key substructural features which determine ligand bioactivities are needed to address this issue. RESULTS A new method, SED, was proposed to predict ligand bioactivities and to recognize key substructures associated with GPCRs through the coupling of screening for Lasso of long extended-connectivity fingerprints (ECFPs) with deep neural network training. The SED pipeline contains three successive steps: (i) representation of long ECFPs for ligand molecules, (ii) feature selection by screening for Lasso of ECFPs and (iii) bioactivity prediction through a deep neural network regression model. The method was examined on a set of 16 representative GPCRs that cover most subfamilies of human GPCRs, where each has 300-5000 ligand associations. The results show that SED achieves excellent performance in modelling ligand bioactivities, especially for those in the GPCR datasets without sufficient ligand associations, where SED improved the baseline predictors by 12% in correlation coefficient (r2) and 19% in root mean square error. Detailed data analyses suggest that the major advantage of SED lies in its ability to detect substructures from long ECFPs, which significantly improves the predictive performance. AVAILABILITY AND IMPLEMENTATION The source code and datasets of SED are freely available at https://zhanglab.ccmb.med.umich.edu/SED/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jiansheng Wu: School of Geographic and Biological Information, Nanjing University of Posts and Telecommunications, Nanjing, China; Smart Health Big Data Analysis and Location Services Engineering Lab of Jiangsu Province, Nanjing University of Posts and Telecommunications, Nanjing, China
- Ben Liu: School of Telecommunication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China
- Wallace K B Chan: Department of Pharmacology, University of Michigan, Ann Arbor, MI, USA
- Weijian Wu: College of Computer and Information, Hohai University, Nanjing, China
- Tao Pang: Jiangsu Key Laboratory of Drug Screening, China Pharmaceutical University, Nanjing, China
- Haifeng Hu: School of Telecommunication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China
- Shancheng Yan: School of Geographic and Biological Information, Nanjing University of Posts and Telecommunications, Nanjing, China; Smart Health Big Data Analysis and Location Services Engineering Lab of Jiangsu Province, Nanjing University of Posts and Telecommunications, Nanjing, China
- Xiaoyan Ke: Child Mental Health Research Center, Nanjing Brain Hospital, Nanjing Medical University, Nanjing, China
- Yang Zhang: Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
|
16
|
Exploring the Potential of Spherical Harmonics and PCVM for Compounds Activity Prediction. Int J Mol Sci 2019; 20:2175. [PMID: 31052500 PMCID: PMC6539940 DOI: 10.3390/ijms20092175] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 02/21/2019] [Revised: 04/14/2019] [Accepted: 04/29/2019] [Indexed: 01/11/2023]
Abstract
Biologically active chemical compounds may provide remedies for several diseases. Meanwhile, Machine Learning techniques applied to Drug Discovery, which are cheaper and faster than wet-lab experiments, have the capability to more effectively identify molecules with the expected pharmacological activity. Therefore, it is urgent and essential to develop more representative descriptors and reliable classification methods to accurately predict molecular activity. In this paper, we investigate the potential of a novel representation based on Spherical Harmonics fed into a Probabilistic Classification Vector Machines classifier, namely SHPCVM, for the compound activity prediction task. We make use of representation learning to acquire the features which describe the molecules as precisely as possible. To verify the performance of SHPCVM, ten-fold cross-validation tests are performed on twenty-one G protein-coupled receptors (GPCRs). Experimental outcomes (accuracy of 0.86), assessed by classification accuracy, precision, recall, Matthews' Correlation Coefficient and Cohen's kappa, reveal that our relatively short Spherical Harmonics-based representation combined with Probabilistic Classification Vector Machines can achieve very satisfactory performance for GPCRs.
|
17
|
Sheridan RP. Interpretation of QSAR Models by Coloring Atoms According to Changes in Predicted Activity: How Robust Is It? J Chem Inf Model 2019; 59:1324-1337. [DOI: 10.1021/acs.jcim.8b00825] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Indexed: 11/28/2022]
Affiliation(s)
- Robert P. Sheridan: Modeling and Informatics, Merck & Co. Inc., Kenilworth, New Jersey 07065, United States
|
18
|
An Application of Fit Quality to Screen MDM2/p53 Protein-Protein Interaction Inhibitors. Molecules 2018; 23:3174. [PMID: 30513790 PMCID: PMC6321222 DOI: 10.3390/molecules23123174] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 11/10/2018] [Revised: 11/28/2018] [Accepted: 11/30/2018] [Indexed: 12/17/2022]
Abstract
The judicious application of ligand or binding efficiency (LE) metrics, which quantify the molecular properties required to obtain binding affinity for a drug target, is gaining traction in the selection and optimization of fragments, hits and leads. Here we report for the first time the use of an LE-based metric, fit quality (FQ), in virtual screening (VS) of MDM2/p53 protein-protein interaction inhibitors (PPIIs). Firstly, a Receptor-Ligand pharmacophore model was constructed on multiple MDM2/ligand complex structures to screen the library. The enrichment factor (EF) for screening was calculated based on a decoy set to define the screening threshold. Finally, 1% of the library, 335 compounds, were screened and re-filtered with the FQ metric. According to the statistical results of FQ vs. activity of 156 MDM2/p53 PPIIs extracted from the literature, the cut-off was defined as FQ = 0.8. After the second round of VS, six compounds with FQ > 0.8 were picked out for assessing their antitumor activity. At the cellular level, the six hits exhibited good selectivity (larger than 3) against HepG2 (wt-p53) vs. Hep3B (p53 null) cell lines. In further study, the six hits exhibited an acceptable affinity (Ki in the range of 10^2 to 10^3 nM) for MDM2 when compared with Nutlin-3a. Based on our work, the FQ-based VS strategy could be applied to discover other PPIIs.
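The two-stage filter described in this abstract (an enrichment-factor check on a decoy set, then an FQ cut-off) can be sketched as follows. The enrichment factor uses its common definition, the hit rate in the selected subset divided by the hit rate in the whole library. The FQ values are treated as precomputed inputs because the abstract does not give the FQ formula; all names and numbers below are hypothetical.

```python
def enrichment_factor(actives_selected, n_selected, actives_total, n_total):
    """Hit rate in the selected subset divided by the hit rate in the full library."""
    return (actives_selected * n_total) / (n_selected * actives_total)

def select_by_fq(fq_by_compound, cutoff=0.8):
    """Keep compounds whose precomputed fit quality exceeds the cut-off."""
    return sorted(name for name, fq in fq_by_compound.items() if fq > cutoff)

# Hypothetical decoy-set validation: 1000 compounds with 50 known actives,
# of which 4 appear in the top-ranked 1% (10 compounds).
print(enrichment_factor(4, 10, 50, 1000))

# Hypothetical FQ values for three screened compounds.
print(select_by_fq({"cmpd_a": 0.95, "cmpd_b": 0.80, "cmpd_c": 0.60}))
```

The strict inequality mirrors the abstract's wording ("FQ > 0.8"), so a compound sitting exactly at the cut-off is not selected.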
|
19
|
Kensert A, Alvarsson J, Norinder U, Spjuth O. Evaluating parameters for ligand-based modeling with random forest on sparse data sets. J Cheminform 2018; 10:49. [PMID: 30306349 PMCID: PMC6755600 DOI: 10.1186/s13321-018-0304-9] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 05/17/2018] [Accepted: 10/03/2018] [Indexed: 11/10/2022]
Abstract
Ligand-based predictive modeling is widely used to generate predictive models aiding decision making in e.g. drug discovery projects. With growing data sets and requirements on low modeling time comes the necessity to analyze data sets efficiently to support rapid and robust modeling. In this study we analyzed four data sets and studied the efficiency of machine learning methods on sparse data structures, utilizing Morgan fingerprints of different radii and hash sizes, and compared with molecular signatures descriptor of different height. We specifically evaluated the effect these parameters had on modeling time, predictive performance, and memory requirements using two implementations of random forest; Scikit-learn as well as FEST. We also compared with a support vector machine implementation. Our results showed that unhashed fingerprints yield significantly better accuracy than hashed fingerprints ([Formula: see text]), with no pronounced deterioration in modeling time and memory usage. Furthermore, the fast execution and low memory usage of the FEST algorithm suggest that it is a good alternative for large, high dimensional sparse data. Both support vector machines and random forest performed equally well but results indicate that the support vector machine was better at using the extra information from larger values of the Morgan fingerprint's radius.
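The difference between hashed and unhashed fingerprints discussed in this abstract comes down to collisions: folding a large space of substructure identifiers into a fixed number of bits can map distinct substructures to the same bit. The sketch below illustrates this with a plain modulo fold. This is a deliberate simplification and an assumption of this example, not the hashing scheme of any particular toolkit, and the feature identifiers are invented.

```python
def fold(feature_ids, n_bits):
    """Map raw feature identifiers onto a fixed-length bit vector (as a set of on-bits)."""
    return {fid % n_bits for fid in feature_ids}

def collisions(feature_ids, n_bits):
    """Number of distinct features lost because they share a bit with another feature."""
    return len(set(feature_ids)) - len(fold(feature_ids, n_bits))

# Hypothetical substructure identifiers for one molecule.
features = [5, 1029, 2053, 77, 4100]

print(collisions(features, 1024))  # short fingerprint: several features share bit 5
print(collisions(features, 4096))  # longer fingerprint: every feature keeps its own bit
```

An unhashed representation corresponds to keeping the raw identifiers as sparse keys, which avoids collisions entirely at the cost of a much wider (but sparse) feature space.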
Affiliation(s)
- Alexander Kensert: Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden
- Jonathan Alvarsson: Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden
- Ulf Norinder: Unit of Toxicology Sciences, Karolinska Institutet, Swetox, Forskargatan 20, SE-15136, Södertälje, Sweden; Department of Computer and Systems Sciences, Stockholm University, Box 7003, SE-164 07, Kista, Sweden
- Ola Spjuth: Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden
|
20
|
Luque Ruiz I, Gómez-Nieto MÁ. Regression Modelability Index: A New Index for Prediction of the Modelability of Data Sets in the Development of QSAR Regression Models. J Chem Inf Model 2018; 58:2069-2084. [PMID: 30205684 DOI: 10.1021/acs.jcim.8b00313] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 02/07/2023]
Abstract
Prediction of the capability of a data set to be modeled by a statistical algorithm in the development of quantitative structure-activity relationship (QSAR) regression models is an important issue that allows researchers to avoid unnecessary tasks, wasted time, and/or the need to curate the molecular composition of the data set in order to improve the model's accuracy. In this paper, we propose and formulate a new index that correlates with the performance of QSAR models. This index, the regression modelability index, requires very low computational cost and is based on the rivality between the nearest neighbors of the molecules in the data set. This rivality allows measurement of the capability of each molecule of the data set to be correctly predicted by a regression algorithm. In this study, using 40 data sets with very different characteristics regarding the number of molecules and activity values, we demonstrate the high correlation between the proposed regression modelability index and the correlation coefficient in cross-validation (Q2), reaching r2 values of 0.8. In addition, we describe the ability of this index to discover the outliers detected by the regression algorithms, allowing easy data set curation in the first stages of the construction of QSAR regression models.
Affiliation(s)
- Irene Luque Ruiz: Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
- Miguel Ángel Gómez-Nieto: Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
|
21
|
Svensson F, Aniceto N, Norinder U, Cortes-Ciriano I, Spjuth O, Carlsson L, Bender A. Conformal Regression for Quantitative Structure–Activity Relationship Modeling—Quantifying Prediction Uncertainty. J Chem Inf Model 2018; 58:1132-1140. [DOI: 10.1021/acs.jcim.8b00054] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Indexed: 12/21/2022]
Affiliation(s)
- Fredrik Svensson: Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K; IOTA Pharmaceuticals, St Johns Innovation Centre, Cowley Road, Cambridge CB4 0WS, U.K
- Natalia Aniceto: Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K
- Ulf Norinder: Swetox, Unit of Toxicology Sciences, Karolinska Institutet, Forskargatan 20, SE-151 36 Södertälje, Sweden; Department of Computer and Systems Sciences, Stockholm University, Box 7003, SE-164 07 Kista, Sweden
- Isidro Cortes-Ciriano: Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K
- Ola Spjuth: Department of Pharmaceutical Biosciences, Uppsala University, Box 591, SE-75124, Uppsala, Sweden
- Lars Carlsson: Quantitative Biology, Discovery Sciences, IMED Biotech Unit, AstraZeneca, SE-43183, Mölndal, Sweden; Department of Computer Science, Royal Holloway, University of London, Egham Hill, Surrey, U.K
- Andreas Bender: Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K
|
22
|
On the virtues of automated quantitative structure-activity relationship: the new kid on the block. Future Med Chem 2018; 10:335-342. [PMID: 29393678 DOI: 10.4155/fmc-2017-0170] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 01/17/2023]
Abstract
Quantitative structure-activity relationship (QSAR) has proved to be an invaluable tool in medicinal chemistry. Data availability at unprecedented levels through various databases has contributed to a resurgence of interest in QSAR. In this context, rapid generation of quality predictive models is highly desirable for hit identification and lead optimization. We showcase the application of an automated QSAR approach, which randomly selects multiple training/test sets and utilizes machine-learning algorithms to generate predictive models. Results demonstrate that AutoQSAR produces models of improved or similar quality to those generated by practitioners in the field but in just a fraction of the time. Despite the potential benefit of the concept to the community, the AutoQSAR opportunity has been largely undervalued.
|
23
|
Li DD, Meng XF, Wang Q, Yu P, Zhao LG, Zhang ZP, Wang ZZ, Xiao W. Consensus scoring model for the molecular docking study of mTOR kinase inhibitor. J Mol Graph Model 2017; 79:81-87. [PMID: 29154212 DOI: 10.1016/j.jmgm.2017.11.003] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 05/17/2017] [Revised: 10/27/2017] [Accepted: 11/03/2017] [Indexed: 12/22/2022]
Abstract
The discovery of mammalian target of rapamycin (mTOR) kinase inhibitors has long been a research hotspot in antitumor drug development. Consensus scoring used in the docking study of mTOR kinase inhibitors usually improves the hit rate of virtual screening. Herein, we attempt to build a series of consensus scoring models based on a set of common scoring functions. In this paper, twenty-five mTOR inhibitors (16 clinical candidate compounds and 9 promising preclinical compounds) are carefully collected and selected for a molecular docking study using the Glide docking program in standard precision (SP) mode. The predicted poses of these ligands are saved and re-evaluated by twenty-six available scoring functions. Subsequently, consensus scoring models are trained on the rescoring results by the partial least squares (PLS) method and validated by the leave-one-out (LOO) method. In addition, using three ligand efficiency indices (BEI, SEI, and LLE) instead of pIC50 as the activity could greatly improve the statistical quality of the built models. The two best models, 10 and 22, both using the BEI index, have the following statistical parameters: for model 10, training set R2=0.767, Q2=0.647, RMSE=0.024, and test set R2=0.932, RMSE=0.026; for model 22, training set R2=0.790, Q2=0.627, RMSE=0.023, and test set R2=0.955, RMSE=0.020. These two consensus scoring models could be used for docking-based virtual screening of novel mTOR inhibitors.
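The three ligand efficiency indices named in this abstract can be sketched with their commonly used definitions: BEI as pIC50 per kilodalton of molecular weight, SEI as pIC50 per 100 square ångströms of polar surface area, and LLE as pIC50 minus logP. The abstract itself does not spell these formulas out, so the definitions, function names, and example values below should be read as assumptions for illustration.

```python
import math

def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 (-log10 of the molar value)."""
    return 9.0 - math.log10(ic50_nm)

def bei(pic50, mol_weight_da):
    """Binding efficiency index: potency per kDa of molecular weight."""
    return pic50 * 1000.0 / mol_weight_da

def sei(pic50, polar_surface_area_a2):
    """Surface efficiency index: potency per 100 square angstroms of polar surface area."""
    return pic50 * 100.0 / polar_surface_area_a2

def lle(pic50, logp):
    """Lipophilic ligand efficiency: potency corrected for lipophilicity."""
    return pic50 - logp

# Hypothetical inhibitor: IC50 = 10 nM, MW = 400 Da, PSA = 80 A^2, logP = 3.0.
p = pic50_from_nm(10.0)
print(p, bei(p, 400.0), sei(p, 80.0), lle(p, 3.0))
```

Using one of these indices as the regression target, as the abstract describes, rescales potency by molecular size or lipophilicity before the consensus model is fitted.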
Affiliation(s)
- Dong-Dong Li: College of Chemical Engineering, Nanjing Forestry University, 159 Long Pan Road, Nanjing 210037, China
- Xiang-Feng Meng: College of Chemical Engineering, Nanjing Forestry University, 159 Long Pan Road, Nanjing 210037, China
- Qiang Wang: College of Chemical Engineering, Nanjing Forestry University, 159 Long Pan Road, Nanjing 210037, China
- Pan Yu: College of Chemical Engineering, Nanjing Forestry University, 159 Long Pan Road, Nanjing 210037, China
- Lin-Guo Zhao: College of Chemical Engineering, Nanjing Forestry University, 159 Long Pan Road, Nanjing 210037, China
- Zheng-Ping Zhang: Chia Tai Tianqing Pharmaceutical Group Co., Ltd., 369 South Yuzhou Road, Haizhou District, Lianyungang 222062, Jiangsu Province, China
- Zhen-Zhong Wang: Jiangsu Kanion Pharmaceutical Co., Ltd., 58 Haichang South Road, Lianyungang 222001, Jiangsu Province, China
- Wei Xiao: Jiangsu Kanion Pharmaceutical Co., Ltd., 58 Haichang South Road, Lianyungang 222001, Jiangsu Province, China
|
24
|
Polanski J, Tkocz A, Kucia U. Beware of ligand efficiency (LE): understanding LE data in modeling structure-activity and structure-economy relationships. J Cheminform 2017; 9:49. [PMID: 29086197 PMCID: PMC5593805 DOI: 10.1186/s13321-017-0236-9] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 03/13/2017] [Accepted: 09/04/2017] [Indexed: 12/19/2022]
Abstract
Background On the one hand, ligand efficiency (LE) and the binding efficiency index (BEI), which are binding properties (B) averaged versus the heavy atom count (HAC: LE) or molecular weight (MW: BEI), have recently been declared a novel universal tool for drug design. On the other hand, questions have been raised about the mathematical validity of the LE approach. Results In fact, neither the critics nor the advocates are precise enough to provide a generally understandable and accepted chemistry of the LE metrics. In particular, this refers to the puzzle of the LE trends for small and large molecules. In this paper, we explain the chemistry and mathematics of the LE type of data. Because LE is a weight metrics related to binding per gram, its hyperbolic decrease with an increasing number of heavy atoms can be easily understood by its 1/MW dependency. Accordingly, we analyzed how this influences the LE trends for ligand-target binding, economic big data or molecular descriptor data. In particular, we compared the trends for the thermodynamic ∆G data of a series of ligands that interact with 14 different target classes, which were extracted from the BindingDB database with the market prices of a commercial compound library of ca. 2.5 mln synthetic building blocks. Conclusions An interpretation of LE and BEI that clearly explains the observed trends for these parameters are presented here for the first time. Accordingly, we show that the main misunderstanding of the chemical meaning of the BEI and LE parameters is their interpretation as molecular descriptors that are connected with a single molecule, while binding is a statistical effect in which a population of ligands limits the formation of ligand-receptor complexes. Therefore, LE (BEI) should not be interpreted as a molecular (physicochemical) descriptor that is connected with a single molecule but as a property (binding per gram). 
Accordingly, the puzzle of the surprising behavior of LE is explained by the 1/MW dependency. This effect clearly explains the hyperbolic LE trend not as a real increase in binding potency but as a physical limitation due to the different population of ligands with different MWs in a 1 g sample available for the formation of ligand-receptor complexes.
Affiliation(s)
- Jaroslaw Polanski
  - Institute of Chemistry, University of Silesia, 9 Szkolna Street, 40-006, Katowice, Poland
- Aleksandra Tkocz
  - Institute of Chemistry, University of Silesia, 9 Szkolna Street, 40-006, Katowice, Poland
- Urszula Kucia
  - Institute of Chemistry, University of Silesia, 9 Szkolna Street, 40-006, Katowice, Poland
|
25
|
Cavalluzzi MM, Mangiatordi GF, Nicolotti O, Lentini G. Ligand efficiency metrics in drug discovery: the pros and cons from a practical perspective. Expert Opin Drug Discov 2017; 12:1087-1104. [PMID: 28814111 DOI: 10.1080/17460441.2017.1365056] [Citation(s) in RCA: 68] [Impact Index Per Article: 9.7] [Indexed: 10/19/2022]
Abstract
INTRODUCTION Ligand efficiency metrics are almost universally accepted as a valuable indicator of compound quality and an aid to reduce attrition. Areas covered: In this review, the authors describe ligand efficiency metrics giving a balanced overview on their merits and points of weakness in order to enable the readers to gain an informed opinion. Relevant theoretical breakthroughs and drug-like properties are also illustrated. Several recent exemplary case studies are discussed in order to illustrate the main fields of application of ligand efficiency metrics. Expert opinion: As a medicinal chemist guide, ligand efficiency metrics perform in a context- and chemotype-dependent manner; thus, they should not be used as a magic box. Since the 'big bang' of efficiency metrics occurred more or less ten years ago and the average time to develop a new drug is over the same period, the next few years will give a clearer outlook on the increased rate of success, if any, gained by means of these new intriguing tools.
Affiliation(s)
- Orazio Nicolotti
  - Department of Pharmacy - Drug Sciences, University of Bari Aldo Moro, Bari, Italy
- Giovanni Lentini
  - Department of Pharmacy - Drug Sciences, University of Bari Aldo Moro, Bari, Italy
|
26
|
Polanski J, Tkocz A. Between Descriptors and Properties: Understanding the Ligand Efficiency Trends for G Protein-Coupled Receptor and Kinase Structure-Activity Data Sets. J Chem Inf Model 2017; 57:1321-1329. [PMID: 28489365 DOI: 10.1021/acs.jcim.7b00116] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 12/26/2022]
Abstract
The chemical meaning of the ligand efficiency (LE) metrics is explained in this paper using a large G protein-coupled receptor (GPCR) and kinase structure-activity (IC50, Ki) data set. Although there is a controversy in the literature regarding both the mathematical validity and the performance of LE, it is in common use as an early estimator for drug optimization. Apparently, the numerous con arguments are not convincing enough. We show here for the first time that the main misunderstanding of the chemical meaning of LE is its interpretation as a molecular descriptor connected with a single molecule. Instead, LE should be interpreted as a statistical property. We show that the LE, which is designed as a regression of a binding property on the heavy atom count (HAC), is correlated to the reciprocal of the molecular weight because of Avogadro statistics. This indicates that the hyperbolic model of LE is basically a consequence of a nonbinding effect, an increase in the number of ligands that are available to a receptor for smaller molecules, and not a real increase in the binding potency for a single HAC as interpreted in the literature. Accordingly, we need to revisit and carefully reevaluate LE-based molecular comparisons.
Affiliation(s)
- Jaroslaw Polanski
  - Institute of Chemistry, University of Silesia, 9 Szkolna Street, 40-006 Katowice, Poland
- Aleksandra Tkocz
  - Institute of Chemistry, University of Silesia, 9 Szkolna Street, 40-006 Katowice, Poland
|
27
|
Meirson T, Samson AO, Gil-Henn H. An in silico high-throughput screen identifies potential selective inhibitors for the non-receptor tyrosine kinase Pyk2. Drug Des Devel Ther 2017; 11:1535-1557. [PMID: 28572720 PMCID: PMC5441678 DOI: 10.2147/dddt.s136150] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 12/14/2022]
Abstract
The non-receptor tyrosine kinase proline-rich tyrosine kinase 2 (Pyk2) is a critical mediator of signaling from cell surface growth factor and adhesion receptors to cell migration, proliferation, and survival. Emerging evidence indicates that signaling by Pyk2 regulates hematopoietic cell response, bone density, neuronal degeneration, angiogenesis, and cancer. These physiological and pathological roles of Pyk2 warrant it as a valuable therapeutic target for invasive cancers, osteoporosis, Alzheimer’s disease, and inflammatory cellular response. Despite its potential as a therapeutic target, no potent and selective inhibitor of Pyk2 is available at present. As a first step toward discovering specific potential inhibitors of Pyk2, we used an in silico high-throughput screening approach. A virtual library of six million lead-like compounds was docked against four different high-resolution Pyk2 kinase domain crystal structures and further selected for predicted potency and ligand efficiency. Ligand selectivity for Pyk2 over focal adhesion kinase (FAK) was evaluated by comparative docking of ligands and measurement of binding free energy so as to obtain 40 potential candidates. Finally, the structural flexibility of a subset of the docking complexes was evaluated by molecular dynamics simulation, followed by intermolecular interaction analysis. These compounds may be considered as promising leads for further development of highly selective Pyk2 inhibitors.
Affiliation(s)
- Tomer Meirson
  - Faculty of Medicine in the Galilee, Bar-Ilan University, Safed, Israel
- Abraham O Samson
  - Faculty of Medicine in the Galilee, Bar-Ilan University, Safed, Israel
- Hava Gil-Henn
  - Faculty of Medicine in the Galilee, Bar-Ilan University, Safed, Israel
|
28
|
Kumar M, Kaur T, Sharma A. Role of computational efficiency indices and pose clustering in effective decision making: An example of annulated furanones in Pf-DHFR space. Comput Biol Chem 2017; 67:48-61. [PMID: 28049061 DOI: 10.1016/j.compbiolchem.2016.12.007] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 04/20/2016] [Revised: 11/04/2016] [Accepted: 12/22/2016] [Indexed: 10/20/2022]
|
29
|
Sheridan RP. Debunking the Idea that Ligand Efficiency Indices Are Superior to pIC50 as QSAR Activities. J Chem Inf Model 2016; 56:2253-2262. [DOI: 10.1021/acs.jcim.6b00431] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 11/28/2022]
Affiliation(s)
- Robert P. Sheridan
  - Modeling and Informatics Department, Merck & Co. Inc., Rahway, New Jersey 07065, United States
|