1. Abdollahi S, Schaub DP, Barroso M, Laubach NC, Hutwelker W, Panzer U, Gersting SW, Bonn S. A comprehensive comparison of deep learning-based compound-target interaction prediction models to unveil guiding design principles. J Cheminform 2024; 16:118. [PMID: 39468635 PMCID: PMC11520803 DOI: 10.1186/s13321-024-00913-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2024] [Accepted: 10/10/2024] [Indexed: 10/30/2024]
Abstract
The evaluation of compound-target interactions (CTIs) is at the heart of drug discovery efforts. Given the substantial time and monetary costs of classical experimental screening, significant efforts have been dedicated to developing deep learning-based models that can accurately predict CTIs. A comprehensive comparison of these models on a large, curated CTI dataset is, however, still lacking. Here, we perform an in-depth comparison of 12 state-of-the-art deep learning architectures that use different protein and compound representations. The models were selected for their reported performance and architectures. To reliably compare model performance, we curated over 300 thousand binding and non-binding CTIs and established several gold-standard datasets of varying size and information. Based on our findings, DeepConv-DTI consistently outperforms other models in CTI prediction performance across the majority of datasets. It achieves an MCC of 0.6 or higher for most of the datasets and is one of the fastest models in training and inference. These results indicate that utilizing convolutional-based windows as in DeepConv-DTI to traverse trainable embeddings is a highly effective approach for capturing informative protein features. We also observed that physicochemical embeddings of targets increased model performance. We therefore modified DeepConv-DTI to include normalized physicochemical properties, which resulted in the overall best performing model Phys-DeepConv-DTI. This work highlights how the systematic evaluation of input features of compounds and targets, as well as their corresponding neural network architectures, can serve as a roadmap for the future development of improved CTI models.
Scientific contribution: This work features comprehensive CTI datasets to allow for the objective comparison and benchmarking of CTI prediction algorithms. Based on this dataset, we gained insights into which embeddings of compounds and targets and which deep learning-based algorithms perform best, providing a blueprint for the future development of CTI algorithms. Using the insights gained from this screen, we provide a novel CTI algorithm with state-of-the-art performance.
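For readers unfamiliar with the metric quoted in this abstract, the Matthews correlation coefficient (MCC) for binary binding/non-binding labels can be computed from the four confusion-matrix counts. The following is a minimal illustrative sketch in plain Python, not code from the paper; the function name `mcc` and the toy labels are assumptions made for this example.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = binding, 0 = non-binding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: report 0 when any marginal count is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy example (hypothetical labels): 2 TP, 2 TN, 1 FP, 1 FN.
labels = [1, 1, 1, 0, 0, 0]
predictions = [1, 1, 0, 0, 0, 1]
score = mcc(labels, predictions)
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), which is why it is often preferred over accuracy on imbalanced interaction datasets.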
Affiliation(s)
- Sina Abdollahi
  - Institute of Medical Systems Biology, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
- Darius P Schaub
  - Institute of Medical Systems Biology, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
  - III. Department of Medicine, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
- Madalena Barroso
  - University Children's Research, UCR@Kinder-UKE, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
- Nora C Laubach
  - University Children's Research, UCR@Kinder-UKE, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
- Wiebke Hutwelker
  - University Children's Research, UCR@Kinder-UKE, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
- Ulf Panzer
  - III. Department of Medicine, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
  - Hamburg Center for Translational Immunology (HCTI), University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
- Søren W Gersting
  - University Children's Research, UCR@Kinder-UKE, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
- Stefan Bonn
  - Institute of Medical Systems Biology, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
  - Hamburg Center for Translational Immunology (HCTI), University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
  - Center for Biomedical AI, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
2. Durant G, Boyles F, Birchall K, Deane CM. The future of machine learning for small-molecule drug discovery will be driven by data. Nat Comput Sci 2024; 4:735-743. [PMID: 39407003 DOI: 10.1038/s43588-024-00699-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2024] [Accepted: 09/03/2024] [Indexed: 10/25/2024]
Abstract
Many studies have prophesied that the integration of machine learning techniques into small-molecule therapeutics development will help to deliver a true leap forward in drug discovery. However, increasingly advanced algorithms and novel architectures have not always yielded substantial improvements in results. In this Perspective, we propose that a greater focus on the data for training and benchmarking these models is more likely to drive future improvement, and explore avenues for future research and strategies to address these data challenges.
Affiliation(s)
- Guy Durant
  - Department of Statistics, University of Oxford, Oxford, UK
- Fergus Boyles
  - Department of Statistics, University of Oxford, Oxford, UK
3. Guichaoua G, Pinel P, Hoffmann B, Azencott CA, Stoven V. Drug-Target Interactions Prediction at Scale: The Komet Algorithm with the LCIdb Dataset. J Chem Inf Model 2024; 64:6938-6956. [PMID: 39237105 PMCID: PMC11423346 DOI: 10.1021/acs.jcim.4c00422] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/07/2024]
Abstract
Drug-target interactions (DTIs) prediction algorithms are used at various stages of the drug discovery process. In this context, specific problems such as deorphanization of a new therapeutic target or target identification of a drug candidate arising from phenotypic screens require large-scale predictions across the protein and molecule spaces. DTI prediction heavily relies on supervised learning algorithms that use known DTIs to learn associations between molecule and protein features, allowing for the prediction of new interactions based on learned patterns. The algorithms must be broadly applicable to enable reliable predictions, even in regions of the protein or molecule spaces where data may be scarce. In this paper, we address two key challenges to fulfill these goals: building large, high-quality training datasets and designing prediction methods that can scale, in order to be trained on such large datasets. First, we introduce LCIdb, a curated, large-sized dataset of DTIs, offering extensive coverage of both the molecule and druggable protein spaces. Notably, LCIdb contains a much higher number of molecules than publicly available benchmarks, expanding coverage of the molecule space. Second, we propose Komet (Kronecker Optimized METhod), a DTI prediction pipeline designed for scalability without compromising performance. Komet leverages a three-step framework, incorporating efficient computation choices tailored for large datasets and involving the Nyström approximation. Specifically, Komet employs a Kronecker interaction module for (molecule, protein) pairs, which efficiently captures determinants in DTIs, and whose structure allows for reduced computational complexity and quasi-Newton optimization, ensuring that the model can handle large training sets, without compromising on performance. Our method is implemented in open-source software, leveraging GPU parallel computation for efficiency. 
We demonstrate the value of our pipeline on various datasets, showing that Komet displays superior scalability and prediction performance compared to state-of-the-art deep learning approaches. Additionally, we illustrate the generalization properties of Komet by showing its performance on an external dataset, and on the publicly available LH benchmark designed for scaffold hopping problems. Komet is available open source at https://komet.readthedocs.io and all datasets, including LCIdb, can be found at https://zenodo.org/records/10731712.
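The abstract does not spell out how the Kronecker interaction module works, so the following is only a generic sketch of the idea behind Kronecker pair features, not Komet's implementation. It illustrates two standard identities: the inner product of two Kronecker pair features factorizes into a molecule term times a protein term, and a linear model on the Kronecker feature equals a bilinear form that never materializes the large pair vector. All function names and toy vectors are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def kron(u, v):
    """Kronecker product of two vectors: entry i * len(v) + j holds u[i] * v[j]."""
    return [a * b for a in u for b in v]


def bilinear_score(weights, mol, prot):
    """Score m^T W p, computed without building the Kronecker feature vector."""
    return sum(mol[i] * dot(weights[i], prot) for i in range(len(mol)))


# Toy molecule and protein feature vectors (assumed values).
mol_a, prot_a = [1, 2], [3, 0, 1]
mol_b, prot_b = [0, 1], [2, 1, 1]

# Identity 1: the pair kernel factorizes into molecule kernel times protein kernel.
explicit_kernel = dot(kron(mol_a, prot_a), kron(mol_b, prot_b))
factorized_kernel = dot(mol_a, mol_b) * dot(prot_a, prot_b)

# Identity 2: a linear model on the Kronecker feature is a bilinear form in (mol, prot).
weights = [[1, 0, 2], [0, 1, 0]]
flat_weights = [w for row in weights for w in row]  # row-major flattening
explicit_score = dot(flat_weights, kron(mol_a, prot_a))
implicit_score = bilinear_score(weights, mol_a, prot_a)
```

The factorized forms cost on the order of the sum of the two feature dimensions rather than their product, which is the kind of saving that makes pairwise models practical on large datasets.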
Affiliation(s)
- Gwenn Guichaoua
  - Center for Computational Biology (CBIO), Mines Paris-PSL, 75006 Paris, France
  - Institut Curie, Université PSL, 75005 Paris, France
  - INSERM U900, 75005 Paris, France
- Philippe Pinel
  - Center for Computational Biology (CBIO), Mines Paris-PSL, 75006 Paris, France
  - Institut Curie, Université PSL, 75005 Paris, France
  - INSERM U900, 75005 Paris, France
  - Iktos SAS, 75017 Paris, France
- Chloé-Agathe Azencott
  - Center for Computational Biology (CBIO), Mines Paris-PSL, 75006 Paris, France
  - Institut Curie, Université PSL, 75005 Paris, France
  - INSERM U900, 75005 Paris, France
- Véronique Stoven
  - Center for Computational Biology (CBIO), Mines Paris-PSL, 75006 Paris, France
  - Institut Curie, Université PSL, 75005 Paris, France
  - INSERM U900, 75005 Paris, France
4. Lam HYI, Guan JS, Ong XE, Pincket R, Mu Y. Protein language models are performant in structure-free virtual screening. Brief Bioinform 2024; 25:bbae480. [PMID: 39327890 PMCID: PMC11427677 DOI: 10.1093/bib/bbae480] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2024] [Revised: 08/17/2024] [Accepted: 09/12/2024] [Indexed: 09/28/2024]
Abstract
Hitherto, virtual screening (VS) has typically been performed using a structure-based drug design paradigm. Such methods typically require the use of molecular docking on high-resolution three-dimensional structures of a target protein, a computationally intensive and time-consuming exercise. This work demonstrates that by employing protein language models and molecular graphs as inputs to a novel graph-to-transformer cross-attention mechanism, a screening power comparable to state-of-the-art structure-based models can be achieved. The implications thereof include highly expedited VS due to the greatly reduced compute required to run this model, and the ability to perform early stages of computer-aided drug design in the complete absence of 3D protein structures.
Affiliation(s)
- Hilbert Yuen In Lam
  - School of Biological Sciences, Nanyang Technological University, 60 Nanyang Dr, Singapore 637551, Singapore, Republic of Singapore
  - MagMol Pte. Ltd., 68 Circular Road, #02-01, Singapore 049422, Singapore, Republic of Singapore
- Jia Sheng Guan
  - School of Biological Sciences, Nanyang Technological University, 60 Nanyang Dr, Singapore 637551, Singapore, Republic of Singapore
- Xing Er Ong
  - MagMol Pte. Ltd., 68 Circular Road, #02-01, Singapore 049422, Singapore, Republic of Singapore
- Robbe Pincket
  - Heliovision, Asstraat 5, 3000 Leuven, Leuven, Kingdom of Belgium
- Yuguang Mu
  - School of Biological Sciences, Nanyang Technological University, 60 Nanyang Dr, Singapore 637551, Singapore, Republic of Singapore
  - MagMol Pte. Ltd., 68 Circular Road, #02-01, Singapore 049422, Singapore, Republic of Singapore
5. Ji W, She S, Qiao C, Feng Q, Rui M, Xu X, Feng C. A general prediction model for compound-protein interactions based on deep learning. Front Pharmacol 2024; 15:1465890. [PMID: 39295942 PMCID: PMC11408283 DOI: 10.3389/fphar.2024.1465890] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2024] [Accepted: 08/20/2024] [Indexed: 09/21/2024]
Abstract
Background: The identification of compound-protein interactions (CPIs) is crucial for drug discovery and understanding mechanisms of action. Accurate CPI prediction can elucidate drug-target-disease interactions, aiding in the discovery of candidate compounds and effective synergistic drugs, particularly from traditional Chinese medicine (TCM). Existing in silico methods face challenges in prediction accuracy and generalization due to compound and target diversity and the lack of large-scale interaction datasets and negative datasets for model learning.
Methods: To address these issues, we developed a computational model for CPI prediction by integrating the constructed large-scale bioactivity benchmark dataset with a deep learning (DL) algorithm. To verify the accuracy of our CPI model, we applied it to predict the targets of compounds in TCM. An herb pair of Astragalus membranaceus and Hedyotis diffusa was used as a model, and the active compounds in this herb pair were collected from various public databases and the literature. The complete targets of these active compounds were predicted by the CPI model, resulting in an expanded target dataset. This dataset was next used for the prediction of synergistic antitumor compound combinations. The predicted multi-compound combinations were subsequently examined through in vitro cellular experiments.
Results: Our CPI model demonstrated superior performance over other machine learning models, achieving an area under the Receiver Operating Characteristic curve (AUROC) of 0.98, an area under the precision-recall curve (AUPR) of 0.98, and an accuracy (ACC) of 93.31% on the test set. The model's generalization capability and applicability were further confirmed using external databases. Utilizing this model, we predicted the targets of compounds in the herb pair of Astragalus membranaceus and Hedyotis diffusa, yielding an expanded target dataset. Then, we integrated this expanded target dataset to predict effective drug combinations using our drug synergy prediction model DeepMDS. An experimental assay on the breast cancer cell line MDA-MB-231 confirmed the efficacy of the best predicted multi-compound combinations: Combination I (Epicatechin, Ursolic acid, Quercetin, Aesculetin and Astragaloside IV) exhibited a half-maximal inhibitory concentration (IC50) value of 19.41 μM and a combination index (CI) value of 0.682; Combination II (Epicatechin, Ursolic acid, Quercetin, Vanillic acid and Astragaloside IV) displayed an IC50 value of 23.83 μM and a CI value of 0.805. These results validated the ability of our model to make accurate predictions for novel CPI data outside the training dataset and evaluated the reliability of the predictions, showing good applicability potential in drug discovery and in the elucidation of the bioactive compounds in TCM.
Conclusion: Our CPI prediction model can serve as a useful tool for accurately identifying potential CPIs for a wide range of proteins, and is expected to facilitate drug research and repurposing and to support the understanding of TCM.
Affiliation(s)
- Wei Ji
  - School of Pharmacy, Jiangsu University, Zhenjiang, China
  - School of Medicine, Jiangsu University, Zhenjiang, China
- Shengnan She
  - School of Pharmacy, Jiangsu University, Zhenjiang, China
- Chunxue Qiao
  - School of Pharmacy, Jiangsu University, Zhenjiang, China
- Qiuqi Feng
  - School of Pharmacy, Jiangsu University, Zhenjiang, China
- Mengjie Rui
  - School of Pharmacy, Jiangsu University, Zhenjiang, China
- Ximing Xu
  - School of Pharmacy, Jiangsu University, Zhenjiang, China
- Chunlai Feng
  - School of Pharmacy, Jiangsu University, Zhenjiang, China
6. Prat A, Abdel Aty H, Bastas O, Kamuntavičius G, Paquet T, Norvaišas P, Gasparotto P, Tal R. HydraScreen: A Generalizable Structure-Based Deep Learning Approach to Drug Discovery. J Chem Inf Model 2024; 64:5817-5831. [PMID: 39037942 DOI: 10.1021/acs.jcim.4c00481] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/24/2024]
Abstract
We propose HydraScreen, a deep-learning framework for safe and robust accelerated drug discovery. HydraScreen utilizes a state-of-the-art 3D convolutional neural network designed for the effective representation of molecular structures and interactions in protein-ligand binding. We designed an end-to-end pipeline for high-throughput screening and lead optimization, targeting applications in structure-based drug design. We assessed our approach using established public benchmarks based on the CASF-2016 core set, achieving top-tier results in affinity and pose prediction (Pearson's r = 0.86, RMSE = 1.15, Top-1 = 0.95). We introduced a novel approach for interaction profiling, aimed at detecting potential biases within both the model and data sets. This approach not only enhanced interpretability but also reinforced the impartiality of our methodology. Finally, we demonstrated HydraScreen's ability to generalize effectively across novel proteins and ligands through a temporal split. We also provide insights into potential avenues for future development aimed at enhancing the robustness of machine learning scoring functions. HydraScreen (accessible at http://hydrascreen.ro5.ai/paper) provides a user-friendly GUI and a public API, facilitating the easy-access assessment of protein-ligand complexes.
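The affinity-prediction figures quoted in this abstract (Pearson's r and RMSE) compare predicted against experimental affinities. As a reminder of how the two metrics are defined, here is a minimal plain-Python sketch; it is not HydraScreen code, and the function names and toy affinity values are assumptions made for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rmse(xs, ys):
    """Root-mean-square error between measured and predicted values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Toy affinities (hypothetical): each prediction is offset by a constant 0.5.
measured = [4.0, 5.0, 6.0, 7.0]
predicted = [4.5, 5.5, 6.5, 7.5]
r_value = pearson_r(measured, predicted)
error = rmse(measured, predicted)
```

The toy case shows why both numbers are reported together: a constant offset leaves the correlation perfect while the RMSE still registers the systematic error.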
Affiliation(s)
- Alvaro Prat
  - AI Chemistry, Ro5, 2801 Gateway Drive, Irving, 75063 Texas, United States
- Hisham Abdel Aty
  - AI Chemistry, Ro5, 2801 Gateway Drive, Irving, 75063 Texas, United States
- Orestis Bastas
  - AI Chemistry, Ro5, 2801 Gateway Drive, Irving, 75063 Texas, United States
- Tanya Paquet
  - AI Chemistry, Ro5, 2801 Gateway Drive, Irving, 75063 Texas, United States
- Povilas Norvaišas
  - AI Chemistry, Ro5, 2801 Gateway Drive, Irving, 75063 Texas, United States
- Piero Gasparotto
  - AI Chemistry, Ro5, 2801 Gateway Drive, Irving, 75063 Texas, United States
- Roy Tal
  - AI Chemistry, Ro5, 2801 Gateway Drive, Irving, 75063 Texas, United States
7. Bouchouireb Z, Olivier-Jimenez D, Jaunet-Lahary T, Thany SH, Le Questel JY. Navigating the complexities of docking tools with nicotinic receptors and acetylcholine binding proteins in the realm of neonicotinoids. Ecotoxicol Environ Saf 2024; 281:116582. [PMID: 38905934 DOI: 10.1016/j.ecoenv.2024.116582] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Revised: 05/30/2024] [Accepted: 06/09/2024] [Indexed: 06/23/2024]
Abstract
Molecular docking, pivotal in predicting small-molecule ligand binding modes, struggles with accurately identifying binding conformations and affinities. This is particularly true for neonicotinoids, insecticides whose impacts on ecosystems require precise molecular interaction modeling. This study scrutinizes the effectiveness of prominent docking software (LeDock, ADFR, AutoDock Vina, CDOCKER) in simulating interactions of environmental chemicals, especially neonicotinoid-like molecules, with nicotinic acetylcholine receptors (nAChRs) and acetylcholine binding proteins (AChBPs). We aimed to assess the accuracy and reliability of these tools in reproducing crystallographic data, focusing on semi-flexible and flexible docking approaches. Our analysis identified LeDock as the most accurate in semi-flexible docking, while AutoDock Vina with the Vinardo scoring function proved most reliable. However, no software consistently excelled in both accuracy and reliability. Additionally, our evaluation revealed that none of the tools could establish a clear correlation between docking scores and experimental dissociation constants (Kd) for neonicotinoid-like compounds. In contrast, a strong correlation was found with drug-like compounds, bringing to light a bias in the considered software towards pharmaceuticals, thus limiting their applicability to environmental chemicals. The comparison between semi-flexible and flexible docking revealed that the increased computational complexity of the latter did not result in enhanced accuracy. In fact, the higher computational cost of flexible docking, with its lack of enhanced predictive accuracy, rendered this approach useless for this class of compounds. In conclusion, our findings emphasize the need for continued development of docking methodologies, particularly for environmental chemicals.
This study not only illuminates current software capabilities but also underscores the urgency of advancing computational molecular docking, a tool relevant to the environmental sciences.
Affiliation(s)
- Damien Olivier-Jimenez
  - Leiden University Medical Center, Center for Proteomics and Metabolomics, Albinusdreef 2, Leiden 2333ZA, Netherlands
- Steeve H Thany
  - Université d'Orléans, Physiology, Ecology and Environment (P2E) laboratory USC INRAE 1328, Orléans 45067, France; Institut universitaire de France (IUF), 1 rue Descartes 75005 Paris, France
8. Liu H, Hu B, Chen P, Wang X, Wang H, Wang S, Wang J, Lin B, Cheng M. Docking Score ML: Target-Specific Machine Learning Models Improving Docking-Based Virtual Screening in 155 Targets. J Chem Inf Model 2024; 64:5413-5426. [PMID: 38958413 DOI: 10.1021/acs.jcim.4c00072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2024]
Abstract
In drug discovery, molecular docking methods face challenges in accurately predicting energy. Scoring functions used in molecular docking often fail to simulate complex protein-ligand interactions fully and accurately, leading to biases and inaccuracies in virtual screening and target predictions. We introduce "Docking Score ML", developed from an analysis of over 200,000 docked complexes from 155 known targets for cancer treatments. The scoring functions used are founded on bioactivity data sourced from ChEMBL and have been fine-tuned using both supervised machine learning and deep learning techniques. We validated our approach extensively using multiple data sets, including a validation of the selectivity mechanism and the DUD-E, DUD-AD, and LIT-PCBA data sets, and performed a multitarget analysis on drugs such as sunitinib. To enhance prediction accuracy, feature fusion techniques were explored. By merging the capabilities of the Graph Convolutional Network (GCN) with multiple docking functions, our results indicated a clear superiority of our methodologies over conventional approaches. These advantages demonstrate that Docking Score ML is an efficient and accurate tool for virtual screening and reverse docking.
Affiliation(s)
- Haihan Liu
  - Key Laboratory of Structure-Based Drug Design & Discovery of Ministry of Education, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
  - Key Laboratory of Intelligent Drug Design and New Drug Discovery of Liaoning Province, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
  - School of Pharmaceutical Engineering, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
- Baichun Hu
  - Key Laboratory of Structure-Based Drug Design & Discovery of Ministry of Education, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
  - Key Laboratory of Intelligent Drug Design and New Drug Discovery of Liaoning Province, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
  - School of Pharmaceutical Engineering, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
- Peiying Chen
  - Key Laboratory of Structure-Based Drug Design & Discovery of Ministry of Education, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
  - Key Laboratory of Intelligent Drug Design and New Drug Discovery of Liaoning Province, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
  - School of Pharmaceutical Engineering, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
- Xiao Wang
  - Key Laboratory of Structure-Based Drug Design & Discovery of Ministry of Education, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
  - Key Laboratory of Intelligent Drug Design and New Drug Discovery of Liaoning Province, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
  - School of Pharmaceutical Engineering, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
- Hanxun Wang
  - Key Laboratory of Structure-Based Drug Design & Discovery of Ministry of Education, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
  - Key Laboratory of Intelligent Drug Design and New Drug Discovery of Liaoning Province, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
  - School of Pharmaceutical Engineering, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
- Shizun Wang
  - Key Laboratory of Structure-Based Drug Design & Discovery of Ministry of Education, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
  - Key Laboratory of Intelligent Drug Design and New Drug Discovery of Liaoning Province, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
  - School of Pharmaceutical Engineering, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
- Jian Wang
  - Key Laboratory of Structure-Based Drug Design & Discovery of Ministry of Education, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
  - Key Laboratory of Intelligent Drug Design and New Drug Discovery of Liaoning Province, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
  - School of Pharmaceutical Engineering, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
- Bin Lin
  - Key Laboratory of Structure-Based Drug Design & Discovery of Ministry of Education, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
  - Key Laboratory of Intelligent Drug Design and New Drug Discovery of Liaoning Province, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
  - School of Pharmaceutical Engineering, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
- Maosheng Cheng
  - Key Laboratory of Structure-Based Drug Design & Discovery of Ministry of Education, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
  - Key Laboratory of Intelligent Drug Design and New Drug Discovery of Liaoning Province, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
  - School of Pharmaceutical Engineering, Shenyang Pharmaceutical University, Shenyang 110016, People's Republic of China
9. Nguyen ATN, Nguyen DTN, Koh HY, Toskov J, MacLean W, Xu A, Zhang D, Webb GI, May LT, Halls ML. The application of artificial intelligence to accelerate G protein-coupled receptor drug discovery. Br J Pharmacol 2024; 181:2371-2384. [PMID: 37161878 DOI: 10.1111/bph.16140] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 09/26/2022] [Revised: 04/14/2023] [Accepted: 04/27/2023] [Indexed: 05/11/2023]
Abstract
The application of artificial intelligence (AI) approaches to drug discovery for G protein-coupled receptors (GPCRs) is a rapidly expanding area. Artificial intelligence can be used at multiple stages during the drug discovery process, from aiding our understanding of the fundamental actions of GPCRs to the discovery of new ligand-GPCR interactions or the prediction of clinical responses. Here, we provide an overview of the concepts behind artificial intelligence, including the subfields of machine learning and deep learning. We summarise the published applications of artificial intelligence to different stages of the GPCR drug discovery process. Finally, we reflect on the benefits and limitations of artificial intelligence and share our vision for the exciting potential for further development of applications to aid GPCR drug discovery. In addition to making the drug discovery process "faster, smarter and cheaper," we anticipate that the application of artificial intelligence will create exciting new opportunities for GPCR drug discovery. LINKED ARTICLES: This article is part of a themed issue Therapeutic Targeting of G Protein-Coupled Receptors: hot topics from the Australasian Society of Clinical and Experimental Pharmacologists and Toxicologists 2021 Virtual Annual Scientific Meeting. To view the other articles in this section visit http://onlinelibrary.wiley.com/doi/10.1111/bph.v181.14/issuetoc.
Affiliation(s)
- Anh T N Nguyen
  - Drug Discovery Biology Theme, Monash Institute of Pharmaceutical Sciences, Monash University, Parkville, Victoria, Australia
- Diep T N Nguyen
  - Department of Information Technology, Faculty of Engineering and Technology, Vietnam National University, Cau Giay, Hanoi, Vietnam
- Huan Yee Koh
  - Drug Discovery Biology Theme, Monash Institute of Pharmaceutical Sciences, Monash University, Parkville, Victoria, Australia
  - Monash Data Futures Institute and Department of Data Science and Artificial Intelligence, Monash University, Clayton, Victoria, Australia
- Jason Toskov
  - Monash DeepNeuron, Monash University, Clayton, Victoria, Australia
- William MacLean
  - Monash DeepNeuron, Monash University, Clayton, Victoria, Australia
- Andrew Xu
  - Monash DeepNeuron, Monash University, Clayton, Victoria, Australia
- Daokun Zhang
  - Drug Discovery Biology Theme, Monash Institute of Pharmaceutical Sciences, Monash University, Parkville, Victoria, Australia
  - Monash Data Futures Institute and Department of Data Science and Artificial Intelligence, Monash University, Clayton, Victoria, Australia
- Geoffrey I Webb
  - Monash Data Futures Institute and Department of Data Science and Artificial Intelligence, Monash University, Clayton, Victoria, Australia
- Lauren T May
  - Drug Discovery Biology Theme, Monash Institute of Pharmaceutical Sciences, Monash University, Parkville, Victoria, Australia
- Michelle L Halls
  - Drug Discovery Biology Theme, Monash Institute of Pharmaceutical Sciences, Monash University, Parkville, Victoria, Australia
10. Tian T, Li S, Zhang Z, Chen L, Zou Z, Zhao D, Zeng J. Benchmarking compound activity prediction for real-world drug discovery applications. Commun Chem 2024; 7:127. [PMID: 38834746 DOI: 10.1038/s42004-024-01204-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2024] [Accepted: 05/16/2024] [Indexed: 06/06/2024]
Abstract
Identifying active compounds for target proteins is fundamental in early drug discovery. Recently, data-driven computational methods have demonstrated promising potential in predicting compound activities. However, a well-designed benchmark that comprehensively evaluates these methods from a practical perspective is still lacking. To fill this gap, we propose a Compound Activity benchmark for Real-world Applications (CARA). By carefully distinguishing assay types, designing train-test splitting schemes and selecting evaluation metrics, CARA accounts for the biased distribution of current real-world compound activity data and avoids overestimating model performance. We observed that although current models can make successful predictions for certain proportions of assays, their performance varied across different assays. In addition, the evaluation of several few-shot training strategies showed that their performance depended on the task type. Overall, we provide a high-quality dataset for developing and evaluating compound activity prediction models, and the analyses in this work may inspire better applications of data-driven models in drug discovery.
Affiliation(s)
- Tingzhong Tian
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
| | - Shuya Li
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
| | - Ziting Zhang
- Department of Automation, Tsinghua University, Beijing, China
- MOE Key Laboratory of Bioinformatics, Tsinghua University, Beijing, China
| | - Lin Chen
- Silexon AI Technology Co., Ltd., Nanjing, Jiangsu Province, China
| | - Ziheng Zou
- Silexon AI Technology Co., Ltd., Nanjing, Jiangsu Province, China
| | - Dan Zhao
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
| | - Jianyang Zeng
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
- School of Engineering, Westlake University, Hangzhou, Zhejiang Province, China
| |
|
11
|
Cavasotto CN, Di Filippo JI, Scardino V. Lessons learnt from machine learning in early stages of drug discovery. Expert Opin Drug Discov 2024; 19:631-633. [PMID: 38727031 DOI: 10.1080/17460441.2024.2354279] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2024] [Accepted: 05/08/2024] [Indexed: 05/22/2024]
Affiliation(s)
- Claudio N Cavasotto
- Computational Drug Design and Biomedical Informatics Laboratory, Instituto de Investigaciones en Medicina Traslacional (IIMT), CONICET-Universidad Austral, Pilar, Buenos Aires, Argentina
- Facultad de Ciencias Biomédicas, Universidad Austral, Pilar, Buenos Aires, Argentina
- Austral Institute for Applied Artificial Intelligence, Universidad Austral, Pilar, Argentina
| | - Juan I Di Filippo
- Facultad de Ciencias Biomédicas, Universidad Austral, Pilar, Buenos Aires, Argentina
- Austral Institute for Applied Artificial Intelligence, Universidad Austral, Pilar, Argentina
- Meton AI, Inc, Wilmington, DE, USA
| | - Valeria Scardino
- Austral Institute for Applied Artificial Intelligence, Universidad Austral, Pilar, Argentina
- Meton AI, Inc, Wilmington, DE, USA
| |
|
12
|
Gu S, Yang Y, Zhao Y, Qiu J, Wang X, Tong HHY, Liu L, Wan X, Liu H, Hou T, Kang Y. Evaluation of AlphaFold2 Structures for Hit Identification across Multiple Scenarios. J Chem Inf Model 2024; 64:3630-3639. [PMID: 38630855 DOI: 10.1021/acs.jcim.3c01976] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/19/2024]
Abstract
The introduction of AlphaFold2 (AF2) has sparked significant enthusiasm and generated extensive discussion within the scientific community, particularly among drug discovery researchers. Although previous studies have addressed the performance of AF2 structures in virtual screening (VS), a more comprehensive investigation is still necessary considering the paramount importance of structural accuracy in drug design. In this study, we evaluate the performance of AF2 structures in VS across three common drug discovery scenarios: targets with holo, apo, and AF2 structures; targets with only apo and AF2 structures; and targets exclusively with AF2 structures. We utilized both the traditional physics-based Glide and the deep-learning-based scoring function RTMscore to rank the compounds in the DUD-E, DEKOIS 2.0, and DECOY data sets. The results demonstrate that, overall, the performance of VS on AF2 structures is comparable to that on apo structures but notably inferior to that on holo structures across diverse scenarios. Moreover, when a target has only an AF2 structure, selecting the holo structure of the target from different subtypes within the same protein family produces results comparable to those of the AF2 structure for VS on the data set of the AF2 structure, and significantly better results than the AF2 structure on its own data set. This indicates that utilizing AF2 structures for docking-based VS may not yield the most satisfactory outcomes, even when only AF2 structures are available. Moreover, we rule out the possibility that the variations in VS performance between the binding pockets of AF2 and holo structures arise from the differences in their biological assembly composition.
Affiliation(s)
- Shukai Gu
- Faculty of Applied Science, Macao Polytechnic University, Macao 999078, SAR, China
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Yuwei Yang
- Faculty of Applied Science, Macao Polytechnic University, Macao 999078, SAR, China
| | - Yihao Zhao
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Jiayue Qiu
- Faculty of Applied Science, Macao Polytechnic University, Macao 999078, SAR, China
| | - Xiaorui Wang
- State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao 999078, China
| | - Henry Hoi Yee Tong
- Faculty of Applied Science, Macao Polytechnic University, Macao 999078, SAR, China
| | - Liwei Liu
- Advanced Computing and Storage Laboratory, Central Research Institute, 2012 Laboratories, Huawei Technologies Co., Ltd., Nanjing 210000, Jiangsu, China
| | - Xiaozhe Wan
- Advanced Computing and Storage Laboratory, Central Research Institute, 2012 Laboratories, Huawei Technologies Co., Ltd., Nanjing 210000, Jiangsu, China
| | - Huanxiang Liu
- Faculty of Applied Science, Macao Polytechnic University, Macao 999078, SAR, China
| | - Tingjun Hou
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Yu Kang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| |
|
13
|
Łapińska N, Pacławski A, Szlęk J, Mendyk A. SerotoninAI: Serotonergic System Focused, Artificial Intelligence-Based Application for Drug Discovery. J Chem Inf Model 2024; 64:2150-2157. [PMID: 38289046 PMCID: PMC11005036 DOI: 10.1021/acs.jcim.3c01517] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Revised: 01/02/2024] [Accepted: 01/04/2024] [Indexed: 04/09/2024]
Abstract
SerotoninAI is an innovative web application for scientific purposes focused on the serotonergic system. By leveraging SerotoninAI, researchers can assess the affinity (pKi value) of a molecule to all main serotonin receptors and serotonin transporters based on molecule structure introduced as SMILES. Additionally, the application provides essential insights into critical attributes of potential drugs such as blood-brain barrier penetration and human intestinal absorption. The complexity of the serotonergic system demands advanced tools for accurate predictions, which is a fundamental requirement in drug development. SerotoninAI addresses this need by providing an intuitive user interface that generates predictions of pKi values for the main serotonergic targets. The application is freely available on the Internet at https://serotoninai.streamlit.app/, implemented in Streamlit with all major web browsers supported. Currently, to the best of our knowledge, there is no tool that allows users to access affinity predictions for serotonergic targets without registration or financial obligations. SerotoninAI significantly increases the scope of drug development activities worldwide. The source code of the application is available at https://github.com/nczub/SerotoninAI_streamlit.
Affiliation(s)
- Natalia Łapińska
- Department of Pharmaceutical Technology and Biopharmaceutics, Jagiellonian University Medical College, 30-688 Kraków, Poland
- Doctoral School of Medicinal and Health Sciences, Jagiellonian University Medical College, 30-688 Kraków, Poland
| | - Adam Pacławski
- Department of Pharmaceutical Technology and Biopharmaceutics, Jagiellonian University Medical College, 30-688 Kraków, Poland
| | - Jakub Szlęk
- Department of Pharmaceutical Technology and Biopharmaceutics, Jagiellonian University Medical College, 30-688 Kraków, Poland
| | - Aleksander Mendyk
- Department of Pharmaceutical Technology and Biopharmaceutics, Jagiellonian University Medical College, 30-688 Kraków, Poland
| |
|
14
|
Brocidiacono M, Francoeur P, Aggarwal R, Popov KI, Koes DR, Tropsha A. BigBind: Learning from Nonstructural Data for Structure-Based Virtual Screening. J Chem Inf Model 2024; 64:2488-2495. [PMID: 38113513 DOI: 10.1021/acs.jcim.3c01211] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/21/2023]
Abstract
Deep learning methods that predict protein-ligand binding have recently been used for structure-based virtual screening. Many such models have been trained using protein-ligand complexes with known crystal structures and activities from the PDBbind data set. However, because PDBbind only includes 20K complexes, models typically fail to generalize to new targets, and model performance is on par with models trained with only ligand information. Conversely, the ChEMBL database contains a wealth of chemical activity information but includes no information about binding poses. We introduce BigBind, a data set that maps ChEMBL activity data to proteins from the CrossDocked data set. BigBind comprises 583K ligand activities and includes 3D structures of the protein binding pockets. Additionally, we augmented the data by adding an equal number of putative inactives for each target. Using these data, we developed Banana (basic neural network for binding affinity), a neural network-based model to classify active from inactive compounds, defined by a 10 μM cutoff. Our model achieved an AUC of 0.72 on BigBind's test set, while a ligand-only model achieved an AUC of 0.59. Furthermore, Banana achieved competitive performance on the LIT-PCBA benchmark (median EF1% 1.81) while running 16,000 times faster than molecular docking with Gnina. We suggest that Banana, as well as other models trained on this data set, will significantly improve the outcomes of prospective virtual screening tasks.
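The classification setup described in this abstract (actives defined by a 10 μM cutoff, performance reported as AUC) can be illustrated with a minimal Python sketch. The function names, the choice of nanomolar units, the decision to count a value exactly at the cutoff as active, and the toy data are all assumptions made for illustration; none of them is taken from BigBind or Banana.

```python
from typing import Sequence

CUTOFF_NM = 10_000.0  # 10 uM expressed in nM (the unit choice is an assumption)


def label_active(activity_nm: float, cutoff_nm: float = CUTOFF_NM) -> int:
    """Return 1 if the measured activity is at or below the cutoff, else 0."""
    return 1 if activity_nm <= cutoff_nm else 0


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a random active outscores a random inactive.

    Ties count as one half. The pairwise loop is quadratic, which is
    acceptable for a sketch but not for large libraries.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: measured activities in nM and model scores.
activities_nm = [50.0, 8_000.0, 25_000.0, 120_000.0]
scores = [0.9, 0.4, 0.6, 0.1]

labels = [label_active(a) for a in activities_nm]
auc = roc_auc(labels, scores)
print(labels, auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of actives from inactives, which is the scale on which the reported values of 0.72 and 0.59 should be read.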
Affiliation(s)
- Michael Brocidiacono
- Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
| | - Paul Francoeur
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
| | - Rishal Aggarwal
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
| | - Konstantin I Popov
- Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
| | - David Ryan Koes
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
| | - Alexander Tropsha
- Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
| |
|
15
|
Zhang X, Gao H, Wang H, Chen Z, Zhang Z, Chen X, Li Y, Qi Y, Wang R. PLANET: A Multi-objective Graph Neural Network Model for Protein-Ligand Binding Affinity Prediction. J Chem Inf Model 2024; 64:2205-2220. [PMID: 37319418 DOI: 10.1021/acs.jcim.3c00253] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Indexed: 06/17/2023]
Abstract
Predicting protein-ligand binding affinity is a central issue in drug design. Various deep learning models have been published in recent years, many of which rely on 3D protein-ligand complex structures as input and tend to focus on the single task of reproducing binding affinity. In this study, we have developed a graph neural network model called PLANET (Protein-Ligand Affinity prediction NETwork). This model takes the graph-represented 3D structure of the binding pocket on the target protein and the 2D chemical structure of the ligand molecule as input. It was trained through a multi-objective process with three related tasks: deriving the protein-ligand binding affinity, the protein-ligand contact map, and the ligand distance matrix. Besides the protein-ligand complexes with known binding affinity data retrieved from the PDBbind database, a large number of non-binder decoys were also added to the training data for deriving the final model of PLANET. When tested on the CASF-2016 benchmark, PLANET exhibited a scoring power comparable to the best result yielded by other deep learning models as well as a reasonable ranking power and docking power. In virtual screening trials conducted on the DUD-E benchmark, PLANET's performance was notably better than that of several deep learning and machine learning models. On the LIT-PCBA benchmark, PLANET achieved accuracy comparable to that of the conventional docking program Glide, but it required less than 1% of Glide's computation time to finish the same job because PLANET did not need exhaustive conformational sampling. Considering the decent accuracy and efficiency of PLANET in binding affinity prediction, it may become a useful tool for conducting large-scale virtual screening.
Affiliation(s)
- Xiangying Zhang
- Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People's Republic of China
| | - Haotian Gao
- Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People's Republic of China
| | - Haojie Wang
- Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People's Republic of China
| | - Zhihang Chen
- Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People's Republic of China
| | - Zhe Zhang
- Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People's Republic of China
| | - Xinchong Chen
- Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People's Republic of China
| | - Yan Li
- Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People's Republic of China
| | - Yifei Qi
- Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People's Republic of China
| | - Renxiao Wang
- Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People's Republic of China
| |
|
16
|
Wallach I, Bernard D, Nguyen K, Ho G, Morrison A, Stecula A, Rosnik A, O’Sullivan AM, Davtyan A, Samudio B, Thomas B, Worley B, Butler B, Laggner C, Thayer D, Moharreri E, Friedland G, Truong H, van den Bedem H, Ng HL, Stafford K, Sarangapani K, Giesler K, Ngo L, Mysinger M, Ahmed M, Anthis NJ, Henriksen N, Gniewek P, Eckert S, de Oliveira S, Suterwala S, PrasadPrasad SVK, Shek S, Contreras S, Hare S, Palazzo T, O’Brien TE, Van Grack T, Williams T, Chern TR, Kenyon V, Lee AH, Cann AB, Bergman B, Anderson BM, Cox BD, Warrington JM, Sorenson JM, Goldenberg JM, Young MA, DeHaan N, Pemberton RP, Schroedl S, Abramyan TM, Gupta T, Mysore V, Presser AG, Ferrando AA, Andricopulo AD, Ghosh A, Ayachi AG, Mushtaq A, Shaqra AM, Toh AKL, Smrcka AV, Ciccia A, de Oliveira AS, Sverzhinsky A, de Sousa AM, Agoulnik AI, Kushnir A, Freiberg AN, Statsyuk AV, Gingras AR, Degterev A, Tomilov A, Vrielink A, Garaeva AA, Bryant-Friedrich A, Caflisch A, Patel AK, Rangarajan AV, Matheeussen A, Battistoni A, Caporali A, Chini A, Ilari A, Mattevi A, Foote AT, Trabocchi A, Stahl A, Herr AB, Berti A, Freywald A, Reidenbach AG, Lam A, Cuddihy AR, White A, Taglialatela A, Ojha AK, Cathcart AM, Motyl AAL, Borowska A, D’Antuono A, Hirsch AKH, Porcelli AM, Minakova A, Montanaro A, Müller A, Fiorillo A, Virtanen A, O’Donoghue AJ, Del Rio Flores A, Garmendia AE, Pineda-Lucena A, Panganiban AT, Samantha A, Chatterjee AK, Haas AL, Paparella AS, John ALS, Prince A, ElSheikh A, Apfel AM, Colomba A, O’Dea A, Diallo BN, Ribeiro BMRM, Bailey-Elkin BA, Edelman BL, Liou B, Perry B, Chua BSK, Kováts B, Englinger B, Balakrishnan B, Gong B, Agianian B, Pressly B, Salas BPM, Duggan BM, Geisbrecht BV, Dymock BW, Morten BC, Hammock BD, Mota BEF, Dickinson BC, Fraser C, Lempicki C, Novina CD, Torner C, Ballatore C, Bon C, Chapman CJ, Partch CL, Chaton CT, Huang C, Yang CY, Kahler CM, Karan C, Keller C, Dieck CL, Huimei C, Liu C, Peltier C, Mantri CK, Kemet CM, Müller CE, Weber C, Zeina CM, Muli CS, Morisseau C, Alkan 
C, Reglero C, Loy CA, Wilson CM, Myhr C, Arrigoni C, Paulino C, Santiago C, Luo D, Tumes DJ, Keedy DA, Lawrence DA, Chen D, Manor D, Trader DJ, Hildeman DA, Drewry DH, Dowling DJ, Hosfield DJ, Smith DM, Moreira D, Siderovski DP, Shum D, Krist DT, Riches DWH, Ferraris DM, Anderson DH, Coombe DR, Welsbie DS, Hu D, Ortiz D, Alramadhani D, Zhang D, Chaudhuri D, Slotboom DJ, Ronning DR, Lee D, Dirksen D, Shoue DA, Zochodne DW, Krishnamurthy D, Duncan D, Glubb DM, Gelardi ELM, Hsiao EC, Lynn EG, Silva EB, Aguilera E, Lenci E, Abraham ET, Lama E, Mameli E, Leung E, Giles E, Christensen EM, Mason ER, Petretto E, Trakhtenberg EF, Rubin EJ, Strauss E, Thompson EW, Cione E, Lisabeth EM, Fan E, Kroon EG, Jo E, García-Cuesta EM, Glukhov E, Gavathiotis E, Yu F, Xiang F, Leng F, Wang F, Ingoglia F, van den Akker F, Borriello F, Vizeacoumar FJ, Luh F, Buckner FS, Vizeacoumar FS, Bdira FB, Svensson F, Rodriguez GM, Bognár G, Lembo G, Zhang G, Dempsey G, Eitzen G, Mayer G, Greene GL, Garcia GA, Lukacs GL, Prikler G, Parico GCG, Colotti G, De Keulenaer G, Cortopassi G, Roti G, Girolimetti G, Fiermonte G, Gasparre G, Leuzzi G, Dahal G, Michlewski G, Conn GL, Stuchbury GD, Bowman GR, Popowicz GM, Veit G, de Souza GE, Akk G, Caljon G, Alvarez G, Rucinski G, Lee G, Cildir G, Li H, Breton HE, Jafar-Nejad H, Zhou H, Moore HP, Tilford H, Yuan H, Shim H, Wulff H, Hoppe H, Chaytow H, Tam HK, Van Remmen H, Xu H, Debonsi HM, Lieberman HB, Jung H, Fan HY, Feng H, Zhou H, Kim HJ, Greig IR, Caliandro I, Corvo I, Arozarena I, Mungrue IN, Verhamme IM, Qureshi IA, Lotsaris I, Cakir I, Perry JJP, Kwiatkowski J, Boorman J, Ferreira J, Fries J, Kratz JM, Miner J, Siqueira-Neto JL, Granneman JG, Ng J, Shorter J, Voss JH, Gebauer JM, Chuah J, Mousa JJ, Maynes JT, Evans JD, Dickhout J, MacKeigan JP, Jossart JN, Zhou J, Lin J, Xu J, Wang J, Zhu J, Liao J, Xu J, Zhao J, Lin J, Lee J, Reis J, Stetefeld J, Bruning JB, Bruning JB, Coles JG, Tanner JJ, Pascal JM, So J, Pederick JL, Costoya JA, Rayman JB, Maciag 
JJ, Nasburg JA, Gruber JJ, Finkelstein JM, Watkins J, Rodríguez-Frade JM, Arias JAS, Lasarte JJ, Oyarzabal J, Milosavljevic J, Cools J, Lescar J, Bogomolovas J, Wang J, Kee JM, Kee JM, Liao J, Sistla JC, Abrahão JS, Sishtla K, Francisco KR, Hansen KB, Molyneaux KA, Cunningham KA, Martin KR, Gadar K, Ojo KK, Wong KS, Wentworth KL, Lai K, Lobb KA, Hopkins KM, Parang K, Machaca K, Pham K, Ghilarducci K, Sugamori KS, McManus KJ, Musta K, Faller KME, Nagamori K, Mostert KJ, Korotkov KV, Liu K, Smith KS, Sarosiek K, Rohde KH, Kim KK, Lee KH, Pusztai L, Lehtiö L, Haupt LM, Cowen LE, Byrne LJ, Su L, Wert-Lamas L, Puchades-Carrasco L, Chen L, Malkas LH, Zhuo L, Hedstrom L, Hedstrom L, Walensky LD, Antonelli L, Iommarini L, Whitesell L, Randall LM, Fathallah MD, Nagai MH, Kilkenny ML, Ben-Johny M, Lussier MP, Windisch MP, Lolicato M, Lolli ML, Vleminckx M, Caroleo MC, Macias MJ, Valli M, Barghash MM, Mellado M, Tye MA, Wilson MA, Hannink M, Ashton MR, Cerna MVC, Giorgis M, Safo MK, Maurice MS, McDowell MA, Pasquali M, Mehedi M, Serafim MSM, Soellner MB, Alteen MG, Champion MM, Skorodinsky M, O’Mara ML, Bedi M, Rizzi M, Levin M, Mowat M, Jackson MR, Paige M, Al-Yozbaki M, Giardini MA, Maksimainen MM, De Luise M, Hussain MS, Christodoulides M, Stec N, Zelinskaya N, Van Pelt N, Merrill NM, Singh N, Kootstra NA, Singh N, Gandhi NS, Chan NL, Trinh NM, Schneider NO, Matovic N, Horstmann N, Longo N, Bharambe N, Rouzbeh N, Mahmoodi N, Gumede NJ, Anastasio NC, Khalaf NB, Rabal O, Kandror O, Escaffre O, Silvennoinen O, Bishop OT, Iglesias P, Sobrado P, Chuong P, O’Connell P, Martin-Malpartida P, Mellor P, Fish PV, Moreira POL, Zhou P, Liu P, Liu P, Wu P, Agogo-Mawuli P, Jones PL, Ngoi P, Toogood P, Ip P, von Hundelshausen P, Lee PH, Rowswell-Turner RB, Balaña-Fouce R, Rocha REO, Guido RVC, Ferreira RS, Agrawal RK, Harijan RK, Ramachandran R, Verma R, Singh RK, Tiwari RK, Mazitschek R, Koppisetti RK, Dame RT, Douville RN, Austin RC, Taylor RE, Moore RG, Ebright RH, Angell RM, Yan R, 
Kejriwal R, Batey RA, Blelloch R, Vandenberg RJ, Hickey RJ, Kelm RJ, Lake RJ, Bradley RK, Blumenthal RM, Solano R, Gierse RM, Viola RE, McCarthy RR, Reguera RM, Uribe RV, do Monte-Neto RL, Gorgoglione R, Cullinane RT, Katyal S, Hossain S, Phadke S, Shelburne SA, Geden SE, Johannsen S, Wazir S, Legare S, Landfear SM, Radhakrishnan SK, Ammendola S, Dzhumaev S, Seo SY, Li S, Zhou S, Chu S, Chauhan S, Maruta S, Ashkar SR, Shyng SL, Conticello SG, Buroni S, Garavaglia S, White SJ, Zhu S, Tsimbalyuk S, Chadni SH, Byun SY, Park S, Xu SQ, Banerjee S, Zahler S, Espinoza S, Gustincich S, Sainas S, Celano SL, Capuzzi SJ, Waggoner SN, Poirier S, Olson SH, Marx SO, Van Doren SR, Sarilla S, Brady-Kalnay SM, Dallman S, Azeem SM, Teramoto T, Mehlman T, Swart T, Abaffy T, Akopian T, Haikarainen T, Moreda TL, Ikegami T, Teixeira TR, Jayasinghe TD, Gillingwater TH, Kampourakis T, Richardson TI, Herdendorf TJ, Kotzé TJ, O’Meara TR, Corson TW, Hermle T, Ogunwa TH, Lan T, Su T, Banjo T, O’Mara TA, Chou T, Chou TF, Baumann U, Desai UR, Pai VP, Thai VC, Tandon V, Banerji V, Robinson VL, Gunasekharan V, Namasivayam V, Segers VFM, Maranda V, Dolce V, Maltarollo VG, Scoffone VC, Woods VA, Ronchi VP, Van Hung Le V, Clayton WB, Lowther WT, Houry WA, Li W, Tang W, Zhang W, Van Voorhis WC, Donaldson WA, Hahn WC, Kerr WG, Gerwick WH, Bradshaw WJ, Foong WE, Blanchet X, Wu X, Lu X, Qi X, Xu X, Yu X, Qin X, Wang X, Yuan X, Zhang X, Zhang YJ, Hu Y, Aldhamen YA, Chen Y, Li Y, Sun Y, Zhu Y, Gupta YK, Pérez-Pertejo Y, Li Y, Tang Y, He Y, Tse-Dinh YC, Sidorova YA, Yen Y, Li Y, Frangos ZJ, Chung Z, Su Z, Wang Z, Zhang Z, Liu Z, Inde Z, Artía Z, Heifets A. AI is a viable alternative to high throughput screening: a 318-target study. Sci Rep 2024; 14:7526. 
[PMID: 38565852 PMCID: PMC10987645 DOI: 10.1038/s41598-024-54655-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Accepted: 02/15/2024] [Indexed: 04/04/2024] Open
Abstract
High throughput screening (HTS) is routinely used to identify bioactive small molecules. This requires physical compounds, which limits coverage of accessible chemical space. Computational approaches combined with vast on-demand chemical libraries can access far greater chemical space, provided that the predictive accuracy is sufficient to identify useful molecules. Through the largest and most diverse virtual HTS campaign reported to date, comprising 318 individual projects, we demonstrate that our AtomNet® convolutional neural network successfully finds novel hits across every major therapeutic area and protein class. We address historical limitations of computational screening by demonstrating success for target proteins without known binders, high-quality X-ray crystal structures, or manual cherry-picking of compounds. We show that the molecules selected by the AtomNet® model are novel drug-like scaffolds rather than minor modifications to known bioactive compounds. Our empirical results suggest that computational methods can substantially replace HTS as the first step of small-molecule drug discovery.
|
17
|
Ajmal A, Alkhatabi HA, Alreemi RM, Alamri MA, Khalid A, Abdalla AN, Alotaibi BS, Wadood A. Prospective virtual screening combined with bio-molecular simulation enabled identification of new inhibitors for the KRAS drug target. BMC Chem 2024; 18:57. [PMID: 38528576 DOI: 10.1186/s13065-024-01152-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Accepted: 02/26/2024] [Indexed: 03/27/2024] Open
Abstract
Lung cancer has a high mortality rate and is the leading cause of cancer death globally. Approximately 12-14% of non-small cell lung cancers are caused by mutations in KRAS G12C, one of the most prevalent mutants in lung cancer patients. KRAS was initially considered undruggable. Sotorasib and adagrasib are recently approved drugs that selectively target KRAS G12C and offer new treatment approaches to improve patient outcomes; however, drug resistance frequently arises. Drug development is a challenging, expensive, and time-consuming process. Recently, machine-learning-based virtual screening has been used for the development of new drugs. In this study, we performed machine-learning-based virtual screening followed by molecular docking, all-atom molecular dynamics (MD) simulation, and binding energy calculations to identify new inhibitors of the KRAS G12C mutant. Four machine learning models were used: random forest, k-nearest neighbors, Gaussian naïve Bayes, and support vector machine. The developed models were validated using an external dataset and 5-fold cross-validation. Among all the models, the random forest (RF) model performed best on the train/test dataset and the external dataset. The random forest model was then used for the virtual screening of the ZINC15 database, an in-house database, Pakistani phytochemicals, and the South African Natural Products database. A total of 100 ns of MD simulation was performed for the four complexes with the best docking scores as well as the standard compound in complex with KRAS G12C. Furthermore, the top four hits showed greater stability and greater binding affinity for KRAS G12C than the standard drug. These new hits have the potential to inhibit KRAS G12C and may help to prevent KRAS-associated lung cancer. All the datasets used in this study are freely available at https://github.com/Amar-Ajmal/Datasets-for-KRAS.
Affiliation(s)
- Amar Ajmal
- Department of Biochemistry, Abdul Wali Khan University Mardan, Mardan, 23200, Pakistan
| | - Hind A Alkhatabi
- Department of Biochemistry, College of Science, University of Jeddah, Jeddah, 21959, Saudi Arabia
| | - Roaa M Alreemi
- Department of Biochemistry, College of Science, University of Jeddah, Jeddah, 21959, Saudi Arabia
| | - Mubarak A Alamri
- Department of Pharmaceutical Chemistry, College of Pharmacy, Prince Sattam Bin Abdulaziz University, Al-Kharj, 11942, Saudi Arabia
| | - Asaad Khalid
- Substance Abuse and Toxicology Research Center, Jazan University, P.O. Box: 114, Jazan, 45142, Saudi Arabia
| | - Ashraf N Abdalla
- Department of Pharmacology and Toxicology, College of Pharmacy, Umm Al-Qura University, Makkah, 21955, Saudi Arabia
| | - Bader S Alotaibi
- Department of Clinical Laboratory Sciences, College of Applied Medical Sciences, Shaqra University, Al-Quwayiyah, Riyadh, Saudi Arabia
| | - Abdul Wadood
- Department of Biochemistry, Abdul Wali Khan University Mardan, Mardan, 23200, Pakistan
| |
|
18
|
Brocidiacono M, Popov KI, Tropsha A. An Improved Metric and Benchmark for Assessing the Performance of Virtual Screening Models. ARXIV 2024:arXiv:2403.10478v1. [PMID: 38560736 PMCID: PMC10980085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Structure-based virtual screening (SBVS) is a key workflow in computational drug discovery. SBVS models are assessed by measuring the enrichment of known active molecules over decoys in retrospective screens. However, the standard formula for enrichment cannot estimate model performance on very large libraries. Additionally, current screening benchmarks cannot easily be used with machine learning (ML) models due to data leakage. We propose an improved formula for calculating VS enrichment and introduce the BayesBind benchmarking set composed of protein targets that are structurally dissimilar to those in the BigBind training set. We assess current models on this benchmark and find that none perform appreciably better than a KNN baseline. We publicly release the BayesBind benchmark at https://github.com/molecularmodelinglab/bigbind.
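For orientation, the conventional enrichment factor that this abstract refers to can be sketched in a few lines of Python. This is the commonly used textbook definition (hit rate in the top-ranked fraction divided by the hit rate in the whole library), not the improved formula proposed by the authors, which is not given in the abstract. The function name, the rounding rule for the size of the top fraction, and the toy data are assumptions for illustration.

```python
from typing import Sequence


def enrichment_factor(labels: Sequence[int], scores: Sequence[float], fraction: float) -> float:
    """Conventional enrichment factor at a top-ranked fraction of the library.

    EF = (actives in top fraction / size of top fraction)
         / (actives in library / size of library)
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    n_total = len(labels)
    n_actives = sum(labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need a non-empty library with at least one active")
    # Assumed rounding rule: truncate, but always select at least one compound.
    n_top = max(1, int(fraction * n_total))
    # Rank compounds from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    return (hits_top / n_top) / (n_actives / n_total)


# Hypothetical toy screen: 10 compounds, 2 of them active (label 1).
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.30, 0.70, 0.60, 0.50, 0.40, 0.20, 0.10]

ef_20 = enrichment_factor(labels, scores, fraction=0.2)
ef_max = len(labels) / sum(labels)  # upper bound: library size / number of actives
print(ef_20, ef_max)
```

One property of this ratio is visible directly in the code: because the hit rate in the top fraction cannot exceed 1, the enrichment factor can never exceed the library size divided by the number of actives, so the largest attainable value is set by the composition of the benchmark rather than by the model.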
|
19
|
Isert C, Atz K, Riniker S, Schneider G. Exploring protein-ligand binding affinity prediction with electron density-based geometric deep learning. RSC Adv 2024; 14:4492-4502. [PMID: 38312732 PMCID: PMC10835705 DOI: 10.1039/d3ra08650j] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/18/2023] [Accepted: 01/19/2024] [Indexed: 02/06/2024] Open
Abstract
Rational structure-based drug design relies on accurate predictions of protein-ligand binding affinity from structural molecular information. Although deep learning-based methods for predicting binding affinity have shown promise in computational drug design, certain approaches have faced criticism for their potential to inadequately capture the fundamental physical interactions between ligands and their macromolecular targets or for being susceptible to dataset biases. Herein, we propose to include bond-critical points based on the electron density of a protein-ligand complex as a fundamental physical representation of protein-ligand interactions. Employing a geometric deep learning model, we explore the usefulness of these bond-critical points to predict absolute binding affinities of protein-ligand complexes, benchmark model performance against existing methods, and provide a critical analysis of this new approach. The models achieved root-mean-squared errors of 1.4-1.8 log units on the PDBbind dataset, and 1.0-1.7 log units on the PDE10A dataset, not indicating significant advantages over benchmark methods, and thus rendering the utility of electron density for deep learning models context-dependent. The relationship between intermolecular electron density and corresponding binding affinity was analyzed, and Pearson correlation coefficients r > 0.7 were obtained for several macromolecular targets.
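The two statistics quoted in this abstract, root-mean-squared error in log units and the Pearson correlation coefficient r, can be computed as in the following minimal Python sketch. The function names and the toy affinity values are hypothetical, and treating the values as pK-style log-scale affinities is an assumption; the snippet does not reproduce the authors' evaluation code or data.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-squared error between measured and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


# Hypothetical affinities on a log scale (for example pK values).
measured = [5.0, 6.0, 7.0, 8.0]
predicted = [6.0, 6.0, 8.0, 8.0]

error = rmse(measured, predicted)
r = pearson_r(measured, predicted)
print(round(error, 3), round(r, 3))
```

The two numbers answer different questions: RMSE measures the absolute size of the prediction error in log units, while r only measures how well predictions track the ordering and linear trend of the measurements, so a model can have a high r and still a large RMSE.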
Affiliation(s)
- Clemens Isert
- ETH Zurich, Department of Chemistry and Applied Biosciences Vladimir-Prelog-Weg 4 8093 Zurich Switzerland +41 44 633 73 27
| | - Kenneth Atz
- ETH Zurich, Department of Chemistry and Applied Biosciences Vladimir-Prelog-Weg 4 8093 Zurich Switzerland +41 44 633 73 27
| | - Sereina Riniker
- ETH Zurich, Department of Chemistry and Applied Biosciences Vladimir-Prelog-Weg 4 8093 Zurich Switzerland +41 44 633 73 27
| | - Gisbert Schneider
- ETH Zurich, Department of Chemistry and Applied Biosciences Vladimir-Prelog-Weg 4 8093 Zurich Switzerland +41 44 633 73 27
| |
|
20
|
Weng G, Zhao H, Nie D, Zhang H, Liu L, Hou T, Kang Y. RediscMol: Benchmarking Molecular Generation Models in Biological Properties. J Med Chem 2024; 67:1533-1543. [PMID: 38181194 DOI: 10.1021/acs.jmedchem.3c02051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/07/2024]
Abstract
Deep learning-based molecular generative models have garnered emerging attention for their capability to generate molecules with novel structures and desired physicochemical properties. However, the evaluation of these models, particularly in a biological context, remains insufficient. To address the limitations of existing metrics and emulate practical application scenarios, we construct the RediscMol benchmark that comprises active molecules extracted from 5 kinase and 3 GPCR data sets. A set of rediscovery- and similarity-related metrics are introduced to assess the performance of 8 representative generative models (CharRNN, VAE, Reinvent, AAE, ORGAN, RNNAttn, TransVAE, and GraphAF). Our findings based on the RediscMol benchmark differ from those of previous evaluations. CharRNN, VAE, and Reinvent exhibit a greater ability to reproduce known active molecules, while RNNAttn, TransVAE, and GraphAF struggle in this aspect despite their notable performance on commonly used distribution-learning metrics. Our evaluation framework may provide valuable guidance for advancing generative models in real-world drug design scenarios.
Affiliation(s)
- Gaoqi Weng
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Huifeng Zhao
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Dou Nie
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Haotian Zhang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Liwei Liu
- Advanced Computing and Storage Laboratory, Central Research Institute, 2012 Laboratories, Huawei Technologies Co., Ltd., Shenzhen 518129, Guangdong, China
| | - Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| | - Yu Kang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
| |
|
21
|
Li Y, Fan Z, Rao J, Chen Z, Chu Q, Zheng M, Li X. An overview of recent advances and challenges in predicting compound-protein interaction (CPI). Medical Review (2021) 2023; 3:465-486. [PMID: 38282802 PMCID: PMC10808869 DOI: 10.1515/mr-2023-0030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Accepted: 08/30/2023] [Indexed: 01/30/2024]
Abstract
Compound-protein interactions (CPIs) are critical in drug discovery for identifying therapeutic targets, drug side effects, and repurposing existing drugs. Machine learning (ML) algorithms have emerged as powerful tools for CPI prediction, offering notable advantages in cost-effectiveness and efficiency. This review provides an overview of recent advances in both structure-based and non-structure-based CPI prediction ML models, highlighting their performance and achievements. It also offers insights into CPI prediction-related datasets and evaluation benchmarks. Lastly, the article presents a comprehensive assessment of the current landscape of CPI prediction, elucidating the challenges faced and outlining emerging trends to advance the field.
Affiliation(s)
- Yanbei Li
- School of Pharmaceutical Science and Technology, Hangzhou Institute for Advanced Study, UCAS, Hangzhou, Zhejiang Province, China
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Zhehuan Fan
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Jingxin Rao
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Zhiyi Chen
- School of Pharmaceutical Science and Technology, Hangzhou Institute for Advanced Study, UCAS, Hangzhou, Zhejiang Province, China
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Qinyu Chu
- School of Pharmaceutical Science and Technology, Hangzhou Institute for Advanced Study, UCAS, Hangzhou, Zhejiang Province, China
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Mingyue Zheng
- School of Pharmaceutical Science and Technology, Hangzhou Institute for Advanced Study, UCAS, Hangzhou, Zhejiang Province, China
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Xutong Li
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- University of Chinese Academy of Sciences, Beijing, China
| |
|
22
|
McNutt A, Bisiriyu F, Song S, Vyas A, Hutchison GR, Koes DR. Conformer Generation for Structure-Based Drug Design: How Many and How Good? J Chem Inf Model 2023; 63:6598-6607. [PMID: 37903507 PMCID: PMC10647020 DOI: 10.1021/acs.jcim.3c01245] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Revised: 10/18/2023] [Accepted: 10/19/2023] [Indexed: 11/01/2023]
Abstract
Conformer generation, the assignment of realistic 3D coordinates to a small molecule, is fundamental to structure-based drug design. Conformational ensembles are required for rigid-body matching algorithms, such as shape-based or pharmacophore approaches, and even methods that treat the ligand flexibly, such as docking, are dependent on the quality of the provided conformations due to not sampling all degrees of freedom (e.g., only sampling torsions). Here, we empirically elucidate some general principles about the size, diversity, and quality of the conformational ensembles needed to get the best performance in common structure-based drug discovery tasks. In many cases, our findings may parallel "common knowledge" well-known to practitioners of the field. Nonetheless, we feel that it is valuable to quantify these conformational effects while reproducing and expanding upon previous studies. Specifically, we investigate the performance of a state-of-the-art generative deep learning approach versus a more classical geometry-based approach, the effect of energy minimization as a postprocessing step, the effect of ensemble size (maximum number of conformers), and construction (filtering by root-mean-square deviation for diversity) and how these choices influence the ability to recapitulate bioactive conformations and perform pharmacophore screening and molecular docking.
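The ensemble construction step mentioned here, capping the number of conformers and filtering by root-mean-square deviation for diversity, can be illustrated with a greedy filter. This is a minimal sketch under stated assumptions: conformers are lists of (x, y, z) tuples in the same atom order, already superimposed and sorted from most to least preferred; no alignment or symmetry handling is performed. The function names and coordinates are hypothetical and do not reproduce the authors' pipeline.

```python
import math


def rmsd(coords_a, coords_b):
    """Plain coordinate RMSD between two pre-aligned conformers (no superposition)."""
    if len(coords_a) != len(coords_b) or len(coords_a) == 0:
        raise ValueError("conformers must have the same non-zero number of atoms")
    squared = sum(
        (p - q) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))


def filter_conformers(conformers, rmsd_threshold, max_conformers):
    """Greedily keep conformers at least rmsd_threshold away from all kept ones."""
    kept = []
    for conformer in conformers:
        if len(kept) >= max_conformers:
            break
        if all(rmsd(conformer, other) >= rmsd_threshold for other in kept):
            kept.append(conformer)
    return kept


# Three hypothetical two-atom conformers (coordinates in angstroms, made up).
conf_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
conf_b = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0)]  # nearly identical to conf_a
conf_c = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]  # clearly different

ensemble = filter_conformers([conf_a, conf_b, conf_c], rmsd_threshold=0.5, max_conformers=10)
print(len(ensemble))
```

Raising the threshold yields a smaller, more diverse ensemble; lowering it keeps near-duplicates, which is the trade-off between ensemble size and diversity that the abstract sets out to quantify.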
Affiliation(s)
- Andrew T. McNutt
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15213, United States
| | - Fatimah Bisiriyu
- The Neighborhood Academy, Pittsburgh, Pennsylvania 15206, United States
| | - Sophia Song
- Upper St. Clair High School, Pittsburgh, Pennsylvania 15241, United States
| | - Ananya Vyas
- Taylor Allderdice High School, Pittsburgh, Pennsylvania 15217, United States
| | - Geoffrey R. Hutchison
- Department of Chemistry, University of Pittsburgh, Pittsburgh, Pennsylvania 15213, United States
- Department of Chemical and Petroleum Engineering, University of Pittsburgh, Pittsburgh, Pennsylvania 15213, United States
| | - David Ryan Koes
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15213, United States
| |
|
23
|
Özçelik R, Bağ A, Atil B, Barsbey M, Özgür A, Ozkirimli E. A Framework for Improving the Generalizability of Drug-Target Affinity Prediction Models. J Comput Biol 2023; 30:1226-1239. [PMID: 37988395 DOI: 10.1089/cmb.2023.0208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2023] Open
Abstract
Statistical models that accurately predict the binding affinity of an input ligand-protein pair can greatly accelerate drug discovery. Such models are trained on available ligand-protein interaction data sets, which may contain biases that lead the predictor models to learn data set-specific, spurious patterns instead of generalizable relationships. This leads the prediction performances of these models to drop dramatically for previously unseen biomolecules. Various approaches that aim to improve model generalizability either have limited applicability or introduce the risk of degrading overall prediction performance. In this article, we present DebiasedDTA, a novel training framework for drug-target affinity (DTA) prediction models that addresses data set biases to improve the generalizability of such models. DebiasedDTA relies on reweighting the training samples to achieve robust generalization, and is thus applicable to most DTA prediction models. Extensive experiments with different biomolecule representations, model architectures, and data sets demonstrate that DebiasedDTA achieves improved generalizability in predicting drug-target affinities.
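The general idea of reweighting training samples can be sketched as a weighted loss in which each ligand-protein pair contributes in proportion to an assigned weight. The abstract does not say how DebiasedDTA derives its weights, so they are treated here as given inputs. The function name and all values are hypothetical, and this is a generic weighted mean-squared error rather than the framework's actual objective.

```python
def weighted_mse(y_true, y_pred, weights):
    """Mean-squared error in which each sample is scaled by a non-negative weight."""
    if not (len(y_true) == len(y_pred) == len(weights)) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one weight must be positive")
    weighted = sum(w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, weights))
    return weighted / total_weight


# Two hypothetical training pairs: the first is mispredicted by one log unit.
affinities = [7.0, 5.0]
predictions = [6.0, 5.0]

uniform = weighted_mse(affinities, predictions, [1.0, 1.0])
upweighted = weighted_mse(affinities, predictions, [3.0, 1.0])
print(uniform, upweighted)
```

Because the weights only rescale per-sample losses, a scheme of this kind can in principle be attached to any model trained by loss minimization, which is consistent with the abstract's claim of applicability to most DTA prediction models.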
Affiliation(s)
- Riza Özçelik
- Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey
| | - Alperen Bağ
- Technical University of Munich, Munich, Germany
| | - Berk Atil
- Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey
| | - Melih Barsbey
- Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey
| | - Arzucan Özgür
- Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey
| | - Elif Ozkirimli
- Roche Informatics, F. Hoffmann-La Roche AG, Basel, Switzerland
| |
|
24
|
Hadfield TE, Scantlebury J, Deane CM. Exploring the ability of machine learning-based virtual screening models to identify the functional groups responsible for binding. J Cheminform 2023; 15:84. [PMID: 37726844 PMCID: PMC10509074 DOI: 10.1186/s13321-023-00755-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2023] [Accepted: 08/25/2023] [Indexed: 09/21/2023] Open
Abstract
Many recently proposed structure-based virtual screening models appear to be able to accurately distinguish high affinity binders from non-binders. However, several recent studies have shown that they often do so by exploiting ligand-specific biases in the dataset, rather than identifying favourable intermolecular interactions in the input protein-ligand complex. In this work we propose a novel approach for assessing the extent to which machine learning-based virtual screening models are able to identify the functional groups responsible for binding. To sidestep the difficulty in establishing the ground truth importance of each atom of a large scale set of protein-ligand complexes, we propose a protocol for generating synthetic data. Each ligand in the dataset is surrounded by a randomly sampled point cloud of pharmacophores, and the label assigned to the synthetic protein-ligand complex is determined by a 3-dimensional deterministic binding rule. This allows us to precisely quantify the ground truth importance of each atom and compare it to the model generated attributions. Using our generated datasets, we demonstrate that a recently proposed deep learning-based virtual screening model, PointVS, identified the most important functional groups with 39% more efficiency than a fingerprint-based random forest, suggesting that it would generalise more effectively to new examples. In addition, we found that ligand-specific biases, such as those present in widely used virtual screening datasets, substantially impaired the ability of all ML models to identify the most important functional groups. We have made our synthetic data generation framework available to facilitate the benchmarking of new virtual screening models. Code is available at https://github.com/tomhadfield95/synthVS .
Affiliation(s)
- Thomas E Hadfield
- Oxford Protein Informatics Group, Department of Statistics, University of Oxford, Oxford, UK
| | - Jack Scantlebury
- Oxford Protein Informatics Group, Department of Statistics, University of Oxford, Oxford, UK
| | - Charlotte M Deane
- Oxford Protein Informatics Group, Department of Statistics, University of Oxford, Oxford, UK
| |
|
25
|
Schaller D, Christ CD, Chodera JD, Volkamer A. Benchmarking Cross-Docking Strategies for Structure-Informed Machine Learning in Kinase Drug Discovery. bioRxiv 2023:2023.09.11.557138. [PMID: 37745489 PMCID: PMC10515787 DOI: 10.1101/2023.09.11.557138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/26/2023]
Abstract
In recent years machine learning has transformed many aspects of the drug discovery process including small molecule design for which the prediction of the bioactivity is an integral part. Leveraging structural information about the interactions between a small molecule and its protein target has great potential for downstream machine learning scoring approaches, but is fundamentally limited by the accuracy with which protein:ligand complex structures can be predicted in a reliable and automated fashion. With the goal of finding practical approaches to generating useful kinase:inhibitor complex geometries for downstream machine learning scoring approaches, we present a kinase-centric docking benchmark assessing the performance of different classes of docking and pose selection strategies to assess how well experimentally observed binding modes are recapitulated in a realistic cross-docking scenario. The assembled benchmark data set focuses on the well-studied protein kinase family and comprises a subset of 589 protein structures co-crystallized with 423 ATP-competitive ligands. We find that the docking methods biased by the co-crystallized ligand (utilizing shape overlap with or without maximum common substructure matching) are more successful in recovering binding poses than standard physics-based docking alone. Also, docking into multiple structures significantly increases the chance to generate a low RMSD docking pose. Docking utilizing an approach that combines all three methods (Posit) into structures with the most similar co-crystallized ligands according to shape and electrostatics proved to be the most efficient way to reproduce binding poses, achieving a success rate of 66.9% across all included systems. The studied docking and pose selection strategies, which utilize the OpenEye Toolkit, were implemented into pipelines of the KinoML framework, allowing automated and reliable protein:ligand complex generation for future downstream machine learning tasks. Although focused on protein kinases, we believe the general findings can also be transferred to other protein families.
Affiliation(s)
- David Schaller
- In Silico Toxicology and Structural Bioinformatics, Institute of Physiology, Charité-Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
- Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
| | - Clara D. Christ
- Molecular Design, Research and Development, Pharmaceuticals, Bayer AG, 13342 Berlin, Germany
| | - John D. Chodera
- Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
| | - Andrea Volkamer
- In Silico Toxicology and Structural Bioinformatics, Institute of Physiology, Charité-Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Augustenburger Platz 1, 13353 Berlin, Germany
- Data Driven Drug Design, Faculty of Mathematics and Computer Sciences, Saarland University, Saarbrücken, Germany
| |
|
26
|
Liu C, Kutchukian P, Nguyen ND, AlQuraishi M, Sorger PK. A Hybrid Structure-Based Machine Learning Approach for Predicting Kinase Inhibition by Small Molecules. J Chem Inf Model 2023; 63:5457-5472. [PMID: 37595065 PMCID: PMC10498990 DOI: 10.1021/acs.jcim.3c00347] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 03/06/2023] [Indexed: 08/20/2023]
Abstract
Kinases have been the focus of drug discovery programs for three decades leading to over 70 therapeutic kinase inhibitors and biophysical affinity measurements for over 130,000 kinase-compound pairs. Nonetheless, the precise target spectrum for many kinases remains only partly understood. In this study, we describe a computational approach to unlocking qualitative and quantitative kinome-wide binding measurements for structure-based machine learning. Our study has three components: (i) a Kinase Inhibitor Complex (KinCo) data set comprising in silico predicted kinase structures paired with experimental binding constants, (ii) a machine learning loss function that integrates qualitative and quantitative data for model training, and (iii) a structure-based machine learning model trained on KinCo. We show that our approach outperforms methods trained on crystal structures alone in predicting binary and quantitative kinase-compound interaction affinities; relative to structure-free methods, our approach also captures known kinase biochemistry and more successfully generalizes to distant kinase sequences and compound scaffolds.
Affiliation(s)
- Changchang Liu
- Laboratory of Systems Pharmacology, Department of Systems Biology, Harvard Program in Therapeutic Science, Harvard Medical School, Boston, Massachusetts 02115, United States
| | - Peter Kutchukian
- Novartis Institutes for Biomedical Research, Cambridge, Massachusetts 02139, United States
| | - Nhan D. Nguyen
- Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
| | - Mohammed AlQuraishi
- Department of Systems Biology, Columbia University, New York, New York 10032, United States
| | - Peter K. Sorger
- Laboratory of Systems Pharmacology, Department of Systems Biology, Harvard Program in Therapeutic Science, Harvard Medical School, Boston, Massachusetts 02115, United States
| |
|
27
|
Shen C, Zhang X, Hsieh CY, Deng Y, Wang D, Xu L, Wu J, Li D, Kang Y, Hou T, Pan P. A generalized protein-ligand scoring framework with balanced scoring, docking, ranking and screening powers. Chem Sci 2023; 14:8129-8146. [PMID: 37538816 PMCID: PMC10395315 DOI: 10.1039/d3sc02044d] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 04/20/2023] [Accepted: 07/03/2023] [Indexed: 08/05/2023] Open
Abstract
Applying machine learning algorithms to protein-ligand scoring functions has aroused widespread attention in recent years due to the high predictive accuracy and affordable computational cost. Nevertheless, most machine learning-based scoring functions are only applicable to a specific task, e.g., binding affinity prediction, binding pose prediction or virtual screening, suggesting that the development of a scoring function with balanced performance in all critical tasks remains a grand challenge. To this end, we propose a novel parameterization strategy by introducing an adjustable binding affinity term that represents the correlation between the predicted outcomes and experimental data into the training of mixture density network. The resulting residue-atom distance likelihood potential not only retains the superior docking and screening power over all the other state-of-the-art approaches, but also achieves a remarkable improvement in scoring and ranking performance. We emphatically explore the impacts of several key elements on prediction accuracy as well as the task preference, and demonstrate that the performance of scoring/ranking and docking/screening tasks of a certain model could be well balanced through an appropriate manner. Overall, our study highlights the potential utility of our innovative parameterization strategy as well as the resulting scoring framework in future structure-based drug design.
Affiliation(s)
- Chao Shen
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
- State Key Lab of CAD&CG, Zhejiang University Hangzhou 310058 Zhejiang China
- School of Public Health, Zhejiang University Hangzhou 310058 Zhejiang China
- CarbonSilicon AI Technology Co., Ltd Hangzhou 310018 Zhejiang China
| | - Xujun Zhang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Chang-Yu Hsieh
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Yafeng Deng
- CarbonSilicon AI Technology Co., Ltd Hangzhou 310018 Zhejiang China
| | - Dong Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Lei Xu
- Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology Changzhou 213001 China
| | - Jian Wu
- School of Public Health, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Dan Li
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Yu Kang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
- State Key Lab of CAD&CG, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Peichen Pan
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
| |
|
28
|
Zhang X, Shen C, Jiang D, Zhang J, Ye Q, Xu L, Hou T, Pan P, Kang Y. TB-IECS: an accurate machine learning-based scoring function for virtual screening. J Cheminform 2023; 15:63. [PMID: 37403155 DOI: 10.1186/s13321-023-00731-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 02/07/2023] [Accepted: 06/18/2023] [Indexed: 07/06/2023] Open
Abstract
Machine learning-based scoring functions (MLSFs) have shown potential for improving virtual screening capabilities over classical scoring functions (SFs). Due to the high computational cost in the process of feature generation, the numbers of descriptors used in MLSFs and the characterization of protein-ligand interactions are always limited, which may affect the overall accuracy and efficiency. Here, we propose a new SF called TB-IECS (theory-based interaction energy component score), which combines energy terms from Smina and NNScore version 2, and utilizes the eXtreme Gradient Boosting (XGBoost) algorithm for model training. In this study, the energy terms decomposed from 15 traditional SFs were firstly categorized based on their formulas and physicochemical principles, and 324 feature combinations were generated accordingly. Five best feature combinations were selected for further evaluation of the model performance in regard to the selection of feature vectors with various length, interaction types and ML algorithms. The virtual screening power of TB-IECS was assessed on the datasets of DUD-E and LIT-PCBA, as well as seven target-specific datasets from the ChemDiv database. The results showed that TB-IECS outperformed classical SFs including Glide SP and Dock, and effectively balanced the efficiency and accuracy for practical virtual screening.
Affiliation(s)
- Xujun Zhang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
| | - Chao Shen
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
| | - Dejun Jiang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
| | - Jintu Zhang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
| | - Qing Ye
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
| | - Lei Xu
- Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou, 213001, China
| | - Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
| | - Peichen Pan
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
| | - Yu Kang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, China
| |
|
29
|
Scantlebury J, Vost L, Carbery A, Hadfield TE, Turnbull OM, Brown N, Chenthamarakshan V, Das P, Grosjean H, von Delft F, Deane CM. A Small Step Toward Generalizability: Training a Machine Learning Scoring Function for Structure-Based Virtual Screening. J Chem Inf Model 2023; 63:2960-2974. [PMID: 37166179 PMCID: PMC10207375 DOI: 10.1021/acs.jcim.3c00322] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 03/01/2023] [Indexed: 05/12/2023]
Abstract
Over the past few years, many machine learning-based scoring functions for predicting the binding of small molecules to proteins have been developed. Their objective is to approximate the distribution which takes two molecules as input and outputs the energy of their interaction. Only a scoring function that accounts for the interatomic interactions involved in binding can accurately predict binding affinity on unseen molecules. However, many scoring functions make predictions based on data set biases rather than an understanding of the physics of binding. These scoring functions perform well when tested on similar targets to those in the training set but fail to generalize to dissimilar targets. To test what a machine learning-based scoring function has learned, input attribution, a technique for learning which features are important to a model when making a prediction on a particular data point, can be applied. If a model successfully learns something beyond data set biases, attribution should give insight into the important binding interactions that are taking place. We built a machine learning-based scoring function that aimed to avoid the influence of bias via thorough train and test data set filtering and show that it achieves comparable performance on the Comparative Assessment of Scoring Functions, 2016 (CASF-2016) benchmark to other leading methods. We then use the CASF-2016 test set to perform attribution and find that the bonds identified as important by PointVS, unlike those extracted from other scoring functions, have a high correlation with those found by a distance-based interaction profiler. We then show that attribution can be used to extract important binding pharmacophores from a given protein target when supplied with a number of bound structures. We use this information to perform fragment elaboration and see improvements in docking scores compared to using structural information from a traditional, data-based approach. 
This not only provides definitive proof that the scoring function has learned to identify some important binding interactions but also constitutes the first deep learning-based method for extracting structural information from a target for molecule design.
Affiliation(s)
- Jack Scantlebury
- Department of Statistics, University of Oxford, Oxford OX1 2JD, United Kingdom
| | - Lucy Vost
- Department of Statistics, University of Oxford, Oxford OX1 2JD, United Kingdom
| | - Anna Carbery
- Department of Statistics, University of Oxford, Oxford OX1 2JD, United Kingdom
- Diamond Light Source Ltd., Harwell Science and Innovation Campus, Didcot OX11 0DE, United Kingdom
| | - Thomas E. Hadfield
- Department of Statistics, University of Oxford, Oxford OX1 2JD, United Kingdom
| | - Oliver M. Turnbull
- Department of Statistics, University of Oxford, Oxford OX1 2JD, United Kingdom
| | | | | | - Payel Das
- IBM Thomas J. Watson Research Center, Yorktown Heights, New York 10598, United States
| | - Harold Grosjean
- Structural Genomics Consortium, University of Oxford, Oxford OX3 7DQ, United Kingdom
| | - Frank von Delft
- Diamond Light Source Ltd., Harwell Science and Innovation Campus, Didcot OX11 0DE, United Kingdom
- Centre for Medicines Discovery, University of Oxford, Oxford OX3 7DQ, United Kingdom
- Department of Biochemistry, University of Johannesburg, Johannesburg 2006, South Africa
- Research Complex at Harwell, Harwell Science and Innovation Campus, Didcot OX11 0FA, United Kingdom
| | - Charlotte M. Deane
- Department of Statistics, University of Oxford, Oxford OX1 2JD, United Kingdom
| |
|
30
|
Díaz-Rovira AM, Martín H, Beuming T, Díaz L, Guallar V, Ray SS. Are Deep Learning Structural Models Sufficiently Accurate for Virtual Screening? Application of Docking Algorithms to AlphaFold2 Predicted Structures. J Chem Inf Model 2023; 63:1668-1674. [PMID: 36892986 DOI: 10.1021/acs.jcim.2c01270] [Citation(s) in RCA: 17] [Impact Index Per Article: 17.0] [Indexed: 03/10/2023]
Abstract
Machine learning-based protein structure prediction algorithms, such as RosettaFold and AlphaFold2, have greatly impacted the structural biology field, arousing a fair amount of discussion around their potential role in drug discovery. While there are a few preliminary studies addressing the usage of these models in virtual screening, none of them focus on the prospect of hit-finding in a real-world virtual screen with a model based on low prior structural information. In order to address this, we have developed an AlphaFold2 version where we exclude all structural templates with more than 30% sequence identity from the model-building process. In a previous study, we used those models in conjunction with state-of-the-art free energy perturbation methods and demonstrated that it is possible to obtain quantitatively accurate results. In this work, we focus on using these structures in rigid receptor-ligand docking studies. Our results indicate that using out-of-the-box AlphaFold2 models is not an ideal scenario for virtual screening campaigns; in fact, we strongly recommend including some post-processing modeling to drive the binding site into a more realistic holo model.
Affiliation(s)
- Anna M Díaz-Rovira
  - Barcelona Supercomputing Center, Jordi Girona 29, E-08034 Barcelona, Spain
- Thijs Beuming
  - Latham Biopharm Group, 101 Main Street, Suite 1400, Cambridge, Massachusetts 02142, United States
- Lucía Díaz
  - Nostrum Biodiscovery S.L., E-08029 Barcelona, Spain
- Victor Guallar
  - Barcelona Supercomputing Center, Jordi Girona 29, E-08034 Barcelona, Spain
  - Nostrum Biodiscovery S.L., E-08029 Barcelona, Spain
  - ICREA, Passeig Lluís Companys 23, E-08010 Barcelona, Spain
- Soumya S Ray
  - RA Capital, 200 Berkeley Street, Boston, Massachusetts 02116, United States
  - 3-Dimensional Consulting, 134 Franklin Avenue, Quincy, Massachusetts 02170, United States
|
31
|
New avenues in artificial-intelligence-assisted drug discovery. Drug Discov Today 2023; 28:103516. [PMID: 36736583 DOI: 10.1016/j.drudis.2023.103516] [Citation(s) in RCA: 23] [Impact Index Per Article: 23.0] [Received: 08/07/2022] [Revised: 12/08/2022] [Accepted: 01/26/2023] [Indexed: 02/05/2023]
Abstract
Over the past decade, the amount of biomedical data available has grown at unprecedented rates. Increased automation technology and larger data volumes have encouraged the use of machine learning (ML) or artificial intelligence (AI) techniques for mining such data and extracting useful patterns. Because the identification of chemical entities with desired biological activity is a crucial task in drug discovery, AI technologies have the potential to accelerate this process and support decision making. In addition, the advent of deep learning (DL) has shown great promise in addressing diverse problems in drug discovery, such as de novo molecular design. Herein, we will appraise the current state-of-the-art in AI-assisted drug discovery, discussing the recent applications covering generative models for chemical structure generation, scoring functions to improve binding affinity and pose prediction, and molecular dynamics to assist in the parametrization, featurization and generalization tasks. Finally, we will discuss current hurdles and the strategies to overcome them, as well as potential future directions.
|
32
|
Wang Z, Zheng L, Wang S, Lin M, Wang Z, Kong AWK, Mu Y, Wei Y, Li W. A fully differentiable ligand pose optimization framework guided by deep learning and a traditional scoring function. Brief Bioinform 2023; 24:6887112. [PMID: 36502369 DOI: 10.1093/bib/bbac520] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Received: 08/09/2022] [Revised: 10/17/2022] [Accepted: 10/31/2022] [Indexed: 12/14/2022] Open
Abstract
The recently reported machine learning- or deep learning-based scoring functions (SFs) have shown exciting performance in predicting protein-ligand binding affinities with fruitful application prospects. However, differentiating between highly similar ligand conformations, including the native binding pose (the global energy minimum state), remains challenging, even though doing so could greatly enhance docking. In this work, we propose a fully differentiable, end-to-end framework for ligand pose optimization based on a hybrid SF called DeepRMSD+Vina, which combines a multi-layer perceptron (DeepRMSD) with the traditional AutoDock Vina SF. DeepRMSD+Vina, which combines (1) the root mean square deviation (RMSD) of the docking pose with respect to the native pose and (2) the AutoDock Vina score, is fully differentiable and is thus capable of optimizing the ligand binding pose to the lowest-energy conformation. Evaluated on the CASF-2016 docking power dataset, DeepRMSD+Vina reaches a success rate of 94.4%, which outperforms most reported SFs to date. We evaluated the ligand conformation optimization framework in practical molecular docking scenarios (redocking and cross-docking tasks), revealing the high potential of this framework in drug design and discovery. Structural analysis shows that this framework has the ability to identify key physical interactions in protein-ligand binding, such as hydrogen bonding. Our work provides a paradigm for optimizing ligand conformations based on deep learning algorithms. The DeepRMSD+Vina model and the optimization framework are available at the GitHub repository https://github.com/zchwang/DeepRMSD-Vina_Optimization.
Affiliation(s)
- Zechen Wang
  - School of Physics, Shandong University, Jinan, Shandong 250100, China
- Liangzhen Zheng
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, China
  - Shanghai Zelixir Biotech Company Ltd., Shanghai 200030, China
- Sheng Wang
  - Shanghai Zelixir Biotech Company Ltd., Shanghai 200030, China
- Mingzhi Lin
  - Shanghai Zelixir Biotech Company Ltd., Shanghai 200030, China
- Zhihao Wang
  - School of Physics, Shandong University, Jinan, Shandong 250100, China
- Adams Wai-Kin Kong
  - Rolls-Royce Corporate Lab, Nanyang Technological University, Singapore 637551, Singapore
- Yuguang Mu
  - School of Biological Sciences, Nanyang Technological University, Singapore 637551, Singapore
- Yanjie Wei
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, China
- Weifeng Li
  - School of Physics, Shandong University, Jinan, Shandong 250100, China
|
33
|
Wang L, Shi SH, Li H, Zeng XX, Liu SY, Liu ZQ, Deng YF, Lu AP, Hou TJ, Cao DS. Reducing false positive rate of docking-based virtual screening by active learning. Brief Bioinform 2023; 24:6987822. [PMID: 36642412 DOI: 10.1093/bib/bbac626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2022] [Revised: 12/10/2022] [Accepted: 12/20/2022] [Indexed: 01/17/2023] Open
Abstract
Machine learning-based scoring functions (MLSFs) have become a very favorable alternative to classical scoring functions because of their potential superior screening performance. However, the information of negative data used to construct MLSFs was rarely reported in the literature, and meanwhile the putative inactive molecules recorded in existing databases usually have obvious bias from active molecules. Here we proposed an easy-to-use method named AMLSF that combines active learning using negative molecular selection strategies with MLSF, which can iteratively improve the quality of inactive sets and thus reduce the false positive rate of virtual screening. We chose energy auxiliary terms learning as the MLSF and validated our method on eight targets in the diverse subset of DUD-E. For each target, we screened the IterBioScreen database by AMLSF and compared the screening results with those of the four control models. The results illustrate that the number of active molecules in the top 1000 molecules identified by AMLSF was significantly higher than those identified by the control models. In addition, the free energy calculation results for the top 10 molecules screened out by the AMLSF, null model and control models based on DUD-E also proved that more active molecules can be identified, and the false positive rate can be reduced by AMLSF.
Affiliation(s)
- Lei Wang
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410013, Hunan, China
- Shao-Hua Shi
  - Institute for Advancing Translational Medicine in Bone and Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR, China
- Hui Li
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410013, Hunan, China
- Xiang-Xiang Zeng
  - Department of Computer Science, Hunan University, Changsha 410082, Hunan, China
- Su-You Liu
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410013, Hunan, China
- Zhao-Qian Liu
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410013, Hunan, China
- Ya-Feng Deng
  - CarbonSilicon AI Technology Co., Ltd, Hangzhou, Zhejiang 310018, China
- Ai-Ping Lu
  - Institute for Advancing Translational Medicine in Bone and Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR, China
- Ting-Jun Hou
  - Hangzhou Institute of Innovative Medicine, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
- Dong-Sheng Cao
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410013, Hunan, China
  - Institute for Advancing Translational Medicine in Bone and Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR, China
|
34
|
Kanakala G, Aggarwal R, Nayar D, Priyakumar UD. Latent Biases in Machine Learning Models for Predicting Binding Affinities Using Popular Data Sets. ACS Omega 2023; 8:2389-2397. [PMID: 36687059 PMCID: PMC9850481 DOI: 10.1021/acsomega.2c06781] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 10/20/2022] [Accepted: 11/21/2022] [Indexed: 06/17/2023]
Abstract
Drug design involves the process of identifying and designing molecules that bind well to a given receptor. A vital computational component of this process is the protein-ligand interaction scoring functions that evaluate the binding ability of various molecules or ligands with a given protein receptor binding pocket reasonably accurately. With the publicly available protein-ligand binding affinity data sets in both sequential and structural forms, machine learning methods have gained traction as a top choice for developing such scoring functions. While the performance shown by these models is optimistic, there are several hidden biases present in these data sets themselves that affect the utility of such models for practical purposes such as virtual screening. In this work, we use published methods to systematically investigate several such factors or biases present in these data sets. In our analysis, we highlight the importance of considering sequence, protein-ligand interaction, and pocket structure similarity while constructing data splits and provide an explanation for good protein-only and ligand-only performances in some data sets. Through this study, we provide to the community several pointers for the design of binding affinity predictors and data sets for reliable applicability.
Affiliation(s)
- Rishal Aggarwal
  - International Institute of Information Technology, Hyderabad 500 032, India
- Divya Nayar
  - Department of Materials Science and Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India
- U. Deva Priyakumar
  - International Institute of Information Technology, Hyderabad 500 032, India
|
35
|
Abstract
The discovery of new hits through ligand-based virtual screening in drug discovery is essentially a low-data problem, as data acquisition is both difficult and expensive. The requirement for large amounts of training data hinders the application of conventional machine learning techniques to this problem domain. This work explores few-shot machine learning for hit discovery and lead optimization. We build on the state-of-the-art and introduce two new metric-based meta-learning techniques, Prototypical and Relation Networks, to this problem domain. We also explore using different embeddings, namely, extended-connectivity fingerprints (ECFP) and embeddings generated through graph convolutional networks (GCN), as inputs to neural networks for classification. This study shows that learned embeddings through GCNs consistently perform better than extended-connectivity fingerprints for toxicity and LBVS experiments. We conclude that the effectiveness of few-shot learning is highly dependent on the nature of the data. Few-shot learning models struggle to perform consistently on MUV and DUD-E data, in which the active compounds are structurally distinct. However, on Tox21 data, the few-shot models perform well, and we find that Prototypical Networks outperform the state-of-the-art, which is based on the Matching Networks architecture. Additionally, training these networks is substantially faster (up to 190%) and therefore takes a fraction of the time to train for comparable, or better, results.
Affiliation(s)
- Daniel Vella
  - Department of Artificial Intelligence, University of Malta, Msida MSD 2080, Malta
- Jean-Paul Ebejer
  - Department of Artificial Intelligence, University of Malta, Msida MSD 2080, Malta
  - Centre for Molecular Medicine and Biobanking, University of Malta, Msida MSD 2080, Malta
|
36
|
Protein-ligand binding affinity prediction with edge awareness and supervised attention. iScience 2022; 26:105892. [PMID: 36691617 PMCID: PMC9860494 DOI: 10.1016/j.isci.2022.105892] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/21/2022] [Revised: 11/12/2022] [Accepted: 12/23/2022] [Indexed: 12/29/2022] Open
Abstract
Accurate prediction of protein-ligand binding affinity is crucial in structure-based drug design, but some challenges remain even with recent advances in deep learning: (1) existing methods neglect the edge information in protein and ligand structure data; (2) current attention mechanisms struggle to capture true binding interactions in small datasets. Herein, we proposed SEGSA_DTA, a SuperEdge Graph convolution-based and Supervised Attention-based Drug-Target Affinity prediction method, where the super edge graph convolution can comprehensively utilize node and edge information and the multi-supervised attention module can efficiently learn the attention distribution consistent with real protein-ligand interactions. Results on multiple datasets show that SEGSA_DTA outperforms current state-of-the-art methods. We also applied SEGSA_DTA in repurposing FDA-approved drugs to identify potential coronavirus disease 2019 (COVID-19) treatments. In addition, by using SHapley Additive exPlanations (SHAP), we found that SEGSA_DTA is interpretable and further provides a new quantitative analytical solution for structure-based lead optimization.
|
37
|
Simple nearest-neighbour analysis meets the accuracy of compound potency predictions using complex machine learning models. Nat Mach Intell 2022. [DOI: 10.1038/s42256-022-00581-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/15/2022]
|
38
|
van Tilborg D, Alenicheva A, Grisoni F. Exposing the Limitations of Molecular Machine Learning with Activity Cliffs. J Chem Inf Model 2022; 62:5938-5951. [PMID: 36456532 PMCID: PMC9749029 DOI: 10.1021/acs.jcim.2c01073] [Citation(s) in RCA: 50] [Impact Index Per Article: 25.0] [Received: 08/23/2022] [Indexed: 12/03/2022]
Abstract
Machine learning has become a crucial tool in drug discovery and chemistry at large, e.g., to predict molecular properties, such as bioactivity, with high accuracy. However, activity cliffs (pairs of molecules that are highly similar in their structure but exhibit large differences in potency) have received limited attention for their effect on model performance. Not only are these edge cases informative for molecule discovery and optimization but also models that are well equipped to accurately predict the potency of activity cliffs have increased potential for prospective applications. Our work aims to fill the current knowledge gap on best-practice machine learning methods in the presence of activity cliffs. We benchmarked a total of 24 machine and deep learning approaches on curated bioactivity data from 30 macromolecular targets for their performance on activity cliff compounds. While all methods struggled in the presence of activity cliffs, machine learning approaches based on molecular descriptors outperformed more complex deep learning methods. Our findings highlight large case-by-case differences in performance, advocating for (a) the inclusion of dedicated "activity-cliff-centered" metrics during model development and evaluation and (b) the development of novel algorithms to better predict the properties of activity cliffs. To this end, the methods, metrics, and results of this study have been encapsulated into an open-access benchmarking platform named MoleculeACE (Activity Cliff Estimation, available on GitHub at: https://github.com/molML/MoleculeACE). MoleculeACE is designed to steer the community toward addressing the pressing but overlooked limitation of molecular machine learning models posed by activity cliffs.
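The activity-cliff definition in this abstract (structurally similar molecule pairs with large potency differences) can be illustrated with a minimal Python sketch. The representation of fingerprints as plain sets of feature identifiers, the Tanimoto threshold of 0.9, and the 10-fold potency cut-off are assumptions chosen for illustration; none of them is stated in the abstract, and the function names are hypothetical rather than part of MoleculeACE.

```python
from itertools import combinations


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint features."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_activity_cliffs(molecules, sim_threshold=0.9, fold_threshold=10.0):
    """Return name pairs that are structurally similar but differ strongly in potency.

    molecules: dict mapping name -> (feature_set, potency); hypothetical layout.
    Potencies must be positive (e.g. Ki in nM). Both thresholds are
    illustrative assumptions, not values taken from the paper.
    """
    cliffs = []
    for (name_a, (fp_a, pot_a)), (name_b, (fp_b, pot_b)) in combinations(
        molecules.items(), 2
    ):
        similar = tanimoto(fp_a, fp_b) >= sim_threshold
        fold_change = max(pot_a, pot_b) / min(pot_a, pot_b)
        if similar and fold_change >= fold_threshold:
            cliffs.append((name_a, name_b))
    return cliffs


# Toy data: mol_1 and mol_2 share 19 of 21 features but differ 160-fold in potency.
toy = {
    "mol_1": (set(range(20)), 5.0),
    "mol_2": (set(range(19)) | {99}, 800.0),
    "mol_3": (set(range(50, 70)), 6.0),
}
cliff_pairs = find_activity_cliffs(toy)
```

In practice the feature sets would come from a cheminformatics toolkit and the similarity measure might combine several notions of structural similarity; the pairwise loop above is quadratic and only suitable for small sets.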
Affiliation(s)
- Derek van Tilborg
  - Institute for Complex Molecular Systems and Dept. Biomedical Engineering, Eindhoven University of Technology, 5612 AZ Eindhoven, The Netherlands
  - Centre for Living Technologies, Alliance TU/e, WUR, UU, UMC Utrecht, 3584 CB Utrecht, The Netherlands
- Francesca Grisoni
  - Institute for Complex Molecular Systems and Dept. Biomedical Engineering, Eindhoven University of Technology, 5612 AZ Eindhoven, The Netherlands
  - Centre for Living Technologies, Alliance TU/e, WUR, UU, UMC Utrecht, 3584 CB Utrecht, The Netherlands
|
39
|
Morris CJ, Stern JA, Stark B, Christopherson M, Della Corte D. MILCDock: Machine Learning Enhanced Consensus Docking for Virtual Screening in Drug Discovery. J Chem Inf Model 2022; 62:5342-5350. [PMID: 36342217 DOI: 10.1021/acs.jcim.2c00705] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 11/09/2022]
Abstract
Molecular docking tools are regularly used to computationally identify new molecules in virtual screening for drug discovery. However, docking tools suffer from inaccurate scoring functions with widely varying performance on different proteins. To enable more accurate ranking of active over inactive ligands in virtual screening, we created a machine learning consensus docking tool, MILCDock, that uses predictions from five traditional molecular docking tools to predict the probability a ligand binds to a protein. MILCDock was trained and tested on data from both the DUD-E and LIT-PCBA docking datasets and shows improved performance over traditional molecular docking tools and other consensus docking methods on the DUD-E dataset. LIT-PCBA targets proved to be difficult for all methods tested. We also find that DUD-E data, although biased, can be effective in training machine learning tools if care is taken to avoid DUD-E's biases during training.
Affiliation(s)
- Connor J Morris
  - Department of Physics and Astronomy, Brigham Young University, Provo, Utah 84602, United States
- Jacob A Stern
  - Department of Physics and Astronomy, Brigham Young University, Provo, Utah 84602, United States
  - Department of Computer Science, Brigham Young University, Provo, Utah 84602, United States
- Brenden Stark
  - Department of Physics and Astronomy, Brigham Young University, Provo, Utah 84602, United States
- Max Christopherson
  - Department of Physics and Astronomy, Brigham Young University, Provo, Utah 84602, United States
- Dennis Della Corte
  - Department of Physics and Astronomy, Brigham Young University, Provo, Utah 84602, United States
|
40
|
Li Y, Zhou D, Zheng G, Li X, Wu D, Yuan Y. DyScore: A Boosting Scoring Method with Dynamic Properties for Identifying True Binders and Nonbinders in Structure-Based Drug Discovery. J Chem Inf Model 2022; 62:5550-5567. [PMID: 36327102 PMCID: PMC9983328 DOI: 10.1021/acs.jcim.2c00926] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 11/06/2022]
Abstract
The accurate prediction of protein-ligand binding affinity is critical for the success of computer-aided drug discovery. However, the accuracy of current scoring functions is usually unsatisfactory due to their rough approximation or sometimes even omission of many factors involved in protein-ligand binding. For instance, the intrinsic dynamics of the protein-ligand binding state is usually disregarded in scoring functions because these rapid binding affinity prediction approaches are only based on a representative complex structure of the protein and ligand in the binding state. That is, the dynamic protein-ligand binding complex ensembles are simplified as a static snapshot in calculation. In this study, two novel features were proposed for characterizing the dynamic properties of protein-ligand binding based on the static structure of the complex, which is expected to be a valuable complement to the current scoring functions. The two features demonstrate the geometry-shape matching between a protein and a ligand as well as the dynamic stability of protein-ligand binding. We further combined these two novel features with several classical scoring functions to develop a binary classification model called DyScore that uses the Extreme Gradient Boosting algorithm to classify compound poses as binders or non-binders. We have found that DyScore achieves state-of-the-art performance in distinguishing active and decoy ligands on both the enhanced DUD data set and external test sets, with both proposed novel features showing significant contributions to the improved performance. In particular, DyScore exhibits superior performance on early recognition, a crucial requirement for success in virtual screening and de novo drug design. The standalone versions of DyScore and DyScore-MF are freely available to all at: https://github.com/YanjunLi-CS/dyscore.
Affiliation(s)
- Yanjun Li
  - NSF Center for Big Learning, University of Florida, Gainesville, Florida 32611, United States
  - Baidu Research USA, Sunnyvale, California 94089, United States
- Daohong Zhou
  - Department of Pharmacodynamics, University of Florida, Gainesville, Florida 32611, United States
- Guangrong Zheng
  - Department of Medicinal Chemistry, University of Florida, Gainesville, Florida 32611, United States
- Xiaolin Li
  - Cognization Lab, Palo Alto, California 94306, United States
- Dapeng Wu
  - NSF Center for Big Learning, University of Florida, Gainesville, Florida 32611, United States
- Yaxia Yuan
  - Department of Pharmacodynamics, University of Florida, Gainesville, Florida 32611, United States
|
41
|
Zhu H, Yang J, Huang N. Assessment of the Generalization Abilities of Machine-Learning Scoring Functions for Structure-Based Virtual Screening. J Chem Inf Model 2022; 62:5485-5502. [PMID: 36268980 DOI: 10.1021/acs.jcim.2c01149] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 12/13/2022]
Abstract
In structure-based virtual screening (SBVS), it is critical that scoring functions capture protein-ligand atomic interactions. By focusing on the local domains of ligand binding pockets, a standardized pocket Pfam-based clustering (Pfam-cluster) approach was developed to assess the cross-target generalization ability of machine-learning scoring functions (MLSFs). Subsequently, 12 typical MLSFs were evaluated using random cross-validation (Random-CV), protein sequence similarity-based cross-validation (Seq-CV), and pocket Pfam-based cross-validation (Pfam-CV) methods. Surprisingly, all of the tested models showed decreased performances from Random-CV to Seq-CV to Pfam-CV experiments, not showing satisfactory generalization capacity. Our interpretable analysis suggested that the predictions on novel targets by MLSFs were dependent on buried solvent-accessible surface area (SASA)-related features of complex structures, with greater predicted binding affinities on complexes owning larger protein-ligand interfaces. By combining buried SASA-related features with target-specific patterns that were only shared among structurally similar compounds in the same cluster, the random forest (RF)-Score attained a good performance in the Random-CV test. Based on these findings, we strongly advise assessing the generalization ability of MLSFs with the Pfam-cluster approach and being cautious with the features learned by MLSFs.
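The difference between random and cluster-based cross-validation described in this abstract comes down to keeping every cluster entirely on one side of each split. A minimal Python sketch of such a grouped split follows; it assumes cluster labels (for example a Pfam family or a sequence-similarity cluster) have already been assigned to each complex, and the function name and greedy balancing strategy are illustrative choices, not the authors' Pfam-cluster implementation.

```python
from collections import defaultdict


def cluster_kfold(cluster_ids, n_folds=3):
    """Yield (train_idx, test_idx) so that no cluster appears on both sides.

    cluster_ids: one cluster label per sample. How the labels are derived
    is outside the scope of this sketch.
    """
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    # Greedy balancing: place the largest clusters first, each into the
    # currently smallest fold.
    folds = [[] for _ in range(n_folds)]
    for cid in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        min(folds, key=len).extend(members[cid])

    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


# Toy example with hypothetical family labels.
labels = ["kinase", "kinase", "gpcr", "protease", "gpcr", "kinase", "nhr"]
splits = list(cluster_kfold(labels, n_folds=3))
```

With a random split, members of the same family would typically land in both training and test sets, which is the setting in which the abstract reports the most optimistic performance.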
Affiliation(s)
- Hui Zhu
  - Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing 102206, China
  - National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
- Jincai Yang
  - National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
- Niu Huang
  - Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing 102206, China
  - National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
|
42
|
Kontoyianni M. Library size in virtual screening: is it truly a number's game? Expert Opin Drug Discov 2022; 17:1177-1179. [PMID: 36196482 DOI: 10.1080/17460441.2022.2130244] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Indexed: 01/11/2023]
Affiliation(s)
- Maria Kontoyianni
  - Department of Pharmaceutical Sciences, Southern Illinois University Edwardsville, Edwardsville, IL, USA
|
43
|
Shimizu H, Kodama M, Matsumoto M, Orba Y, Sasaki M, Sato A, Sawa H, Nakayama KI. LIGHTHOUSE illuminates therapeutics for a variety of diseases including COVID-19. iScience 2022; 25:105314. [PMID: 36246574 PMCID: PMC9549714 DOI: 10.1016/j.isci.2022.105314] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/15/2022] [Revised: 08/08/2022] [Accepted: 10/05/2022] [Indexed: 11/26/2022] Open
Abstract
One of the bottlenecks in the application of basic research findings to patients is the enormous cost, time, and effort required for high-throughput screening of potential drugs for given therapeutic targets. Here we have developed LIGHTHOUSE, a graph-based deep learning approach for discovery of the hidden principles underlying the association of small-molecule compounds with target proteins. Without any 3D structural information for proteins or chemicals, LIGHTHOUSE estimates protein-compound scores that incorporate known evolutionary relations and available experimental data. It identified therapeutics for cancer, lifestyle-related disease, and bacterial infection. Moreover, LIGHTHOUSE predicted ethoxzolamide as a therapeutic for coronavirus disease 2019 (COVID-19), and this agent was indeed effective against alpha, beta, gamma, and delta variants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) that are rampant worldwide. We envision that LIGHTHOUSE will help accelerate drug discovery and fill the gap between bench side and bedside.
- LIGHTHOUSE discovers therapeutics solely on the basis of the primary sequence
- The predictions of LIGHTHOUSE against multiple diseases were experimentally correct
- LIGHTHOUSE facilitates optimization of lead compounds as well
Affiliation(s)
- Hideyuki Shimizu (corresponding author)
  - Department of Molecular and Cellular Biology, Medical Institute of Bioregulation, Kyushu University, Fukuoka 812-8582, Japan
  - Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA
  - Wyss Institute for Biologically Inspired Engineering, Harvard Medical School, Boston, MA 02115, USA
  - Department of AI Systems Medicine, M&D Data Science Center, Tokyo Medical and Dental University, Tokyo 113-8510, Japan
- Manabu Kodama
  - Department of Molecular and Cellular Biology, Medical Institute of Bioregulation, Kyushu University, Fukuoka 812-8582, Japan
- Masaki Matsumoto
  - Department of Omics and Systems Biology, Niigata University Graduate School of Medical and Dental Sciences, Niigata 951-8510, Japan
- Yasuko Orba
  - Division of Molecular Pathobiology, International Institute for Zoonosis Control, Hokkaido University, Sapporo 060-8638, Japan
- Michihito Sasaki
  - Division of Molecular Pathobiology, International Institute for Zoonosis Control, Hokkaido University, Sapporo 060-8638, Japan
- Akihiko Sato
  - Division of Molecular Pathobiology, International Institute for Zoonosis Control, Hokkaido University, Sapporo 060-8638, Japan
  - Drug Discovery and Disease Research Laboratory, Shionogi & Co. Ltd., Osaka 561-0825, Japan
- Hirofumi Sawa
  - Division of Molecular Pathobiology, International Institute for Zoonosis Control, Hokkaido University, Sapporo 060-8638, Japan
  - International Collaboration Unit, International Institute for Zoonosis Control, Hokkaido University, Sapporo 060-8638, Japan
  - One Health Research Center, Hokkaido University, Sapporo 060-8638, Japan
  - Global Virus Network, Baltimore, MD 21201, USA
  - Hokkaido University, Institute for Vaccine Research and Development (HU-IVReD)
- Keiichi I. Nakayama (corresponding author)
  - Department of Molecular and Cellular Biology, Medical Institute of Bioregulation, Kyushu University, Fukuoka 812-8582, Japan
|
44
|
Krasoulis A, Antonopoulos N, Pitsikalis V, Theodorakis S. DENVIS: Scalable and High-Throughput Virtual Screening Using Graph Neural Networks with Atomic and Surface Protein Pocket Features. J Chem Inf Model 2022; 62:4642-4659. [PMID: 36154119 DOI: 10.1021/acs.jcim.2c01057] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/30/2022]
Abstract
Computational methods for virtual screening can dramatically accelerate early-stage drug discovery by identifying potential hits for a specified target. Docking algorithms traditionally use physics-based simulations to address this challenge by estimating the binding orientation of a query protein-ligand pair and a corresponding binding affinity score. In recent years, classical and modern machine learning architectures have shown potential for outperforming traditional docking algorithms. Nevertheless, most learning-based algorithms still rely on the availability of the protein-ligand complex binding pose, typically estimated via docking simulations, which leads to a severe slowdown of the overall virtual screening process. A family of algorithms that process target information at the amino acid sequence level avoids this requirement, but at the cost of processing protein data at a higher representation level. We introduce deep neural virtual screening (DENVIS), an end-to-end pipeline for virtual screening using graph neural networks (GNNs). By performing experiments on two benchmark databases, we show that our method performs competitively with several docking-based, machine learning-based, and hybrid docking/machine learning-based algorithms. By avoiding the intermediate docking step, DENVIS exhibits screening times several orders of magnitude faster (i.e., higher throughput) than both docking-based and hybrid models. When compared to an amino acid sequence-based machine learning model with comparable screening times, DENVIS achieves dramatically better performance. Some key elements of our approach include protein pocket modeling using a combination of atomic and surface features, the use of model ensembles, and data augmentation via artificial negative sampling during model training.
In summary, DENVIS achieves virtual screening performance competitive with the state of the art, while offering the potential to scale to billions of molecules using minimal computational resources.
|
45
|
Yaseen A, Amin I, Akhter N, Ben-Hur A, Minhas F. Insights into performance evaluation of compound-protein interaction prediction methods. Bioinformatics 2022; 38:ii75-ii81. [PMID: 36124806 DOI: 10.1093/bioinformatics/btac496] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/25/2022] Open
Abstract
MOTIVATION Machine-learning-based prediction of compound-protein interactions (CPIs) is important for drug design, screening and repurposing. Despite numerous recent publications with increasing methodological sophistication claiming consistent improvements in predictive accuracy, we have observed a number of fundamental issues in experiment design that produce overoptimistic estimates of model performance. RESULTS We systematically analyze the impact of several factors affecting generalization performance of CPI predictors that are overlooked in existing work: (i) similarity between training and test examples in cross-validation; (ii) synthesizing negative examples in the absence of experimentally verified negative examples and (iii) alignment of evaluation protocol and performance metrics with real-world use of CPI predictors in screening large compound libraries. Using both state-of-the-art approaches by other researchers as well as a simple kernel-based baseline, we have found that effective assessment of generalization performance of CPI predictors requires careful control over similarity between training and test examples. We show that, under stringent performance assessment protocols, a simple kernel-based approach can exceed the predictive performance of existing state-of-the-art methods. We also show that random pairing for generating synthetic negative examples for training and performance evaluation results in models with better generalization in comparison to more sophisticated strategies used in existing studies. Our analyses indicate that using proposed experiment design strategies can offer significant improvements for CPI prediction leading to effective target compound screening for drug repurposing and discovery of putative chemical ligands of SARS-CoV-2-Spike and Human-ACE2 proteins. AVAILABILITY AND IMPLEMENTATION Code and supplementary material available at https://github.com/adibayaseen/HKRCPI.
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Adiba Yaseen
- Department of Computer and Information Sciences (DCIS), Pakistan Institute of Engineering and Applied Sciences (PIEAS), Islamabad 45650, Pakistan
| | - Imran Amin
- National Institute for Biotechnology and Genetic Engineering, Faisalabad 38000, Pakistan
| | - Naeem Akhter
- Department of Computer and Information Sciences (DCIS), Pakistan Institute of Engineering and Applied Sciences (PIEAS), Islamabad 45650, Pakistan
| | - Asa Ben-Hur
- Department of Computer Science, Colorado State University, Fort Collins, CO 80523, USA
| | - Fayyaz Minhas
- Department of Computer Science, University of Warwick, Coventry CV4 7AL, UK
|
46
|
Guterres H, Park S, Zhang H, Perone T, Kim J, Im W. CHARMM‐GUI high‐throughput simulator for efficient evaluation of protein–ligand interactions with different force fields. Protein Sci 2022. [DOI: 10.1002/pro.4413] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/11/2022]
Affiliation(s)
- Hugo Guterres
- Departments of Biological Sciences, Chemistry, Bioengineering, and Computer Science and Engineering, Lehigh University, Bethlehem, Pennsylvania, USA
| | - Sang‐Jun Park
- Departments of Biological Sciences, Chemistry, Bioengineering, and Computer Science and Engineering, Lehigh University, Bethlehem, Pennsylvania, USA
| | - Han Zhang
- Departments of Biological Sciences, Chemistry, Bioengineering, and Computer Science and Engineering, Lehigh University, Bethlehem, Pennsylvania, USA
| | - Thomas Perone
- Departments of Biological Sciences, Chemistry, Bioengineering, and Computer Science and Engineering, Lehigh University, Bethlehem, Pennsylvania, USA
| | - Jongtaek Kim
- Department of Physics and Chemistry, Korea Air Force Academy, Cheongju, South Korea
| | - Wonpil Im
- Departments of Biological Sciences, Chemistry, Bioengineering, and Computer Science and Engineering, Lehigh University, Bethlehem, Pennsylvania, USA
|
47
|
Wong F, Krishnan A, Zheng EJ, Stärk H, Manson AL, Earl AM, Jaakkola T, Collins JJ. Benchmarking AlphaFold-enabled molecular docking predictions for antibiotic discovery. Mol Syst Biol 2022; 18:e11081. [PMID: 36065847 PMCID: PMC9446081 DOI: 10.15252/msb.202211081] [Citation(s) in RCA: 78] [Impact Index Per Article: 39.0] [Received: 04/18/2022] [Revised: 06/12/2022] [Accepted: 07/26/2022] [Indexed: 11/25/2022] Open
Abstract
Efficient identification of drug mechanisms of action remains a challenge. Computational docking approaches have been widely used to predict drug binding targets; yet, such approaches depend on existing protein structures, and accurate structural predictions have only recently become available from AlphaFold2. Here, we combine AlphaFold2 with molecular docking simulations to predict protein-ligand interactions between 296 proteins spanning Escherichia coli's essential proteome, and 218 active antibacterial compounds and 100 inactive compounds, respectively, pointing to widespread compound and protein promiscuity. We benchmark model performance by measuring enzymatic activity for 12 essential proteins treated with each antibacterial compound. We confirm extensive promiscuity, but find that the average area under the receiver operating characteristic curve (auROC) is 0.48, indicating weak model performance. We demonstrate that rescoring of docking poses using machine learning-based approaches improves model performance, resulting in average auROCs as large as 0.63, and that ensembles of rescoring functions improve prediction accuracy and the ratio of true-positive rate to false-positive rate. This work indicates that advances in modeling protein-ligand interactions, particularly using machine learning-based approaches, are needed to better harness AlphaFold2 for drug discovery.
Affiliation(s)
- Felix Wong
- Institute for Medical Engineering & Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Aarti Krishnan
- Institute for Medical Engineering & Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Erica J Zheng
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Program in Chemical Biology, Harvard University, Cambridge, MA, USA
| | - Hannes Stärk
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Abigail L Manson
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Ashlee M Earl
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Tommi Jaakkola
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - James J Collins
- Institute for Medical Engineering & Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, USA
|
48
|
García-Ortegón M, Simm GNC, Tripp AJ, Hernández-Lobato JM, Bender A, Bacallado S. DOCKSTRING: Easy Molecular Docking Yields Better Benchmarks for Ligand Design. J Chem Inf Model 2022; 62:3486-3502. [PMID: 35849793 PMCID: PMC9364321 DOI: 10.1021/acs.jcim.1c01334] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 11/03/2021] [Indexed: 01/05/2023]
Abstract
The field of machine learning for drug discovery is witnessing an explosion of novel methods. These methods are often benchmarked on simple physicochemical properties such as solubility or general druglikeness, which can be readily computed. However, these properties are poor representatives of objective functions in drug design, mainly because they do not depend on the candidate compound's interaction with the target. By contrast, molecular docking is a widely applied method in drug discovery to estimate binding affinities. However, docking studies require a significant amount of domain knowledge to set up correctly, which hampers adoption. Here, we present dockstring, a bundle for meaningful and robust comparison of ML models using docking scores. dockstring consists of three components: (1) an open-source Python package for straightforward computation of docking scores, (2) an extensive dataset of docking scores and poses of more than 260,000 molecules for 58 medically relevant targets, and (3) a set of pharmaceutically relevant benchmark tasks such as virtual screening or de novo design of selective kinase inhibitors. The Python package implements a robust ligand and target preparation protocol that allows nonexperts to obtain meaningful docking scores. Our dataset is the first to include docking poses, as well as the first of its size that is a full matrix, thus facilitating experiments in multiobjective optimization and transfer learning. Overall, our results indicate that docking scores are a more realistic evaluation objective than simple physicochemical properties, yielding benchmark tasks that are more challenging and more closely related to real problems in drug discovery.
Affiliation(s)
- Miguel García-Ortegón
- Statistical Laboratory, Centre for Mathematical Sciences, University of Cambridge, Wilberforce Rd., Cambridge CB3 0WB, United Kingdom
| | - Gregor N. C. Simm
- Department of Engineering, University of Cambridge, Trumpington St., Cambridge CB2 1PZ, United Kingdom
| | - Austin J. Tripp
- Department of Engineering, University of Cambridge, Trumpington St., Cambridge CB2 1PZ, United Kingdom
| | | | - Andreas Bender
- Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Rd., Cambridge CB2 1EW, United Kingdom
| | - Sergio Bacallado
- Statistical Laboratory, Centre for Mathematical Sciences, University of Cambridge, Wilberforce Rd., Cambridge CB3 0WB, United Kingdom
|
49
|
Shen C, Zhang X, Deng Y, Gao J, Wang D, Xu L, Pan P, Hou T, Kang Y. Boosting Protein-Ligand Binding Pose Prediction and Virtual Screening Based on Residue-Atom Distance Likelihood Potential and Graph Transformer. J Med Chem 2022; 65:10691-10706. [PMID: 35917397 DOI: 10.1021/acs.jmedchem.2c00991] [Citation(s) in RCA: 42] [Impact Index Per Article: 21.0] [Indexed: 02/08/2023]
Abstract
The past few years have witnessed enormous progress toward applying machine learning approaches to the development of protein-ligand scoring functions. However, the robust performance and wide applicability of scoring functions remain a big challenge for increasing the success rate of docking-based virtual screening. Herein, a novel scoring function named RTMScore was developed by introducing a tailored residue-based graph representation strategy and several graph transformer layers for the learning of protein and ligand representations, followed by a mixture density network to obtain residue-atom distance likelihood potential. Our approach was resolutely validated on the CASF-2016 benchmark, and the results indicate that RTMScore can outperform almost all of the other state-of-the-art methods in terms of both the docking and screening powers. Further evaluation confirms the robustness of our approach that can not only retain its docking power on cross-docked poses but also achieve improved performance as a rescoring tool in larger-scale virtual screening.
Affiliation(s)
- Chao Shen
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China; State Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang 310058, China; CarbonSilicon AI Technology Co., Ltd, Hangzhou, Zhejiang 310018, China
| | - Xujun Zhang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China
| | - Yafeng Deng
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, Zhejiang 310018, China
| | - Junbo Gao
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China
| | - Dong Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China
| | - Lei Xu
- Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
| | - Peichen Pan
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China
| | - Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China; State Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang 310058, China
| | - Yu Kang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China
|
50
|
McGibbon M, Money-Kyrle S, Blay V, Houston DR. SCORCH: Improving structure-based virtual screening with machine learning classifiers, data augmentation, and uncertainty estimation. J Adv Res 2022; 46:135-147. [PMID: 35901959 PMCID: PMC10105235 DOI: 10.1016/j.jare.2022.07.001] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 03/05/2022] [Revised: 07/08/2022] [Accepted: 07/09/2022] [Indexed: 11/17/2022] Open
Abstract
INTRODUCTION The discovery of a new drug is a costly and lengthy endeavour. The computational prediction of which small molecules can bind to a protein target can accelerate this process if the predictions are fast and accurate enough. Recent machine-learning scoring functions re-evaluate the output of molecular docking to achieve more accurate predictions. However, previous scoring functions were trained on crystalised protein-ligand complexes and datasets of decoys. The limited availability of crystal structures and biases in the decoy datasets can lower the performance of scoring functions. OBJECTIVES To address key limitations of previous scoring functions and thus improve the predictive performance of structure-based virtual screening. METHODS A novel machine-learning scoring function was created, named SCORCH (Scoring COnsensus for RMSD-based Classification of Hits). To develop SCORCH, training data is augmented by considering multiple ligand poses and labelling poses based on their RMSD from the native pose. Decoy bias is addressed by generating property-matched decoys for each ligand and using the same methodology for preparing and docking decoys and ligands. A consensus of 3 different machine learning approaches is also used to improve performance. RESULTS We find that multi-pose augmentation in SCORCH improves its docking power and screening power on independent benchmark datasets. SCORCH outperforms an equivalent scoring function trained on single poses, with a 1% enrichment factor (EF) of 13.78 vs. 10.86 on 18 DEKOIS 2.0 targets and a mean native pose rank of 5.9 vs 30.4 on CSAR 2014. Additionally, SCORCH outperforms widely used scoring functions in virtual screening and pose prediction on independent benchmark datasets. CONCLUSION By rationally addressing key limitations of previous scoring functions, SCORCH improves the performance of virtual screening. 
SCORCH also provides an estimate of its uncertainty, which can help reduce the cost and time required for drug discovery.
Affiliation(s)
- Miles McGibbon
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, UK
| | - Sam Money-Kyrle
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, UK
| | - Vincent Blay
- Department of Microbiology and Environmental Toxicology, University of California at Santa Cruz, Santa Cruz, CA 95064, USA; Institute for Integrative Systems Biology (I2SysBio), Universitat de València and Spanish Research Council (CSIC), 46980 Valencia, Spain
| | - Douglas R Houston
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, UK
|