1
Kashafutdinova IM, Poyezzhayeva A, Gimadiev T, Madzhidov T. Active learning approaches in molecule pKi prediction. Mol Inform 2024:e202400154. [PMID: 39105614 DOI: 10.1002/minf.202400154] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2024] [Revised: 06/21/2024] [Accepted: 06/23/2024] [Indexed: 08/07/2024]
Abstract
During the early stages of drug design, identifying compounds with suitable bioactivities is crucial. Given the vast array of potential drug databases, it's feasible to assay only a limited subset of candidates. The optimal method for selecting the candidates, aiming to minimize the overall number of assays, involves an active learning (AL) approach. In this work, we benchmarked a range of AL strategies with two main objectives: (1) to identify a strategy that ensures high model performance and (2) to select molecules with desired properties using minimal assays. To evaluate the different AL strategies, we employed a simulated AL workflow based on "virtual" experiments. These experiments leveraged ChEMBL datasets, which come with known biological activity values for the molecules. Furthermore, for classification tasks, we proposed a hybrid selection strategy that unifies exploration and exploitation AL strategies into a single acquisition function, defined by parameters n and c. We have also shown that the popular minimal margin and maximal variance approaches for exploration selection correspond to minimization of the hybrid acquisition function with n=1 and n=2, respectively. The balance between the exploration and exploitation strategies can be adjusted using a coefficient (c), making the optimal strategy selection straightforward. The primary strength of the hybrid selection method lies in its adaptability; it offers the flexibility to adjust the criteria for molecule selection based on the specific task by modifying the value of the contribution coefficient. Our analysis revealed that, in regression tasks, AL strategies didn't succeed at ensuring high model performance; however, they were successful in selecting molecules with desired properties using a minimal number of tests. In analogous experiments in classification tasks, the exploration strategy and the hybrid selection function with a constant c<1 (for n=1) and c≤0.2 (for n=2) were effective in achieving the goal of constructing a high-performance predictive model using minimal data. When searching for molecules with desired properties, exploitation and the hybrid function with c≥1 (n=1) and c≥0.7 (n=2) identified molecules in fewer iterations than the random selection method. Notably, when the hybrid function was set to an intermediate coefficient value (c=0.7), it successfully addressed both tasks simultaneously.
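The selection rules named in this abstract can be illustrated with a minimal Python sketch. It assumes a binary classifier that returns a predicted probability of activity for each unassayed molecule; the function name `select_batch` and the strategy labels are hypothetical, and the paper's hybrid acquisition function (with parameters n and c) is not reproduced here because the abstract does not give its form.

```python
def select_batch(probabilities, strategy, batch_size):
    """Return indices of the molecules to assay next.

    probabilities: predicted probability of being active, one per molecule.
    strategy: "margin" or "variance" (exploration), or "exploitation".
    All names here are illustrative assumptions, not the authors' code.
    """
    indices = range(len(probabilities))
    if strategy == "margin":
        # Minimal margin: prefer predictions closest to the 0.5 boundary.
        key = lambda i: abs(probabilities[i] - 0.5)
    elif strategy == "variance":
        # Maximal variance: prefer the largest Bernoulli variance p * (1 - p).
        key = lambda i: -(probabilities[i] * (1.0 - probabilities[i]))
    elif strategy == "exploitation":
        # Exploitation: prefer the molecules most confidently predicted active.
        key = lambda i: -probabilities[i]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # sorted() is stable, so ties keep their original pool order.
    return sorted(indices, key=key)[:batch_size]


pool = [0.05, 0.45, 0.90, 0.60, 0.30]
uncertain = select_batch(pool, "margin", 2)
promising = select_batch(pool, "exploitation", 2)
```

Taken on their own, the margin and variance scores rank a binary pool identically, because p(1-p) = 0.25 - (p-0.5)^2; any difference between them can only show up once they are combined with another term.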
Affiliation(s)
- I M Kashafutdinova
  - A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kazan, 420008, Russia
- A Poyezzhayeva
  - A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kazan, 420008, Russia
- T Gimadiev
  - A.M. Butlerov Institute of Chemistry, Kazan Federal University, Kazan, 420008, Russia
- T Madzhidov
  - Chemistry Solutions, Elsevier, London, EC2Y 5AS, UK
2
Wang L, Zhou Z, Yang X, Shi S, Zeng X, Cao D. The present state and challenges of active learning in drug discovery. Drug Discov Today 2024; 29:103985. [PMID: 38642700 DOI: 10.1016/j.drudis.2024.103985] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2024] [Revised: 04/08/2024] [Accepted: 04/15/2024] [Indexed: 04/22/2024]
Abstract
Active learning (AL) is an iterative feedback process that efficiently identifies valuable data within vast chemical space, even with limited labeled data. This characteristic makes it a valuable approach for tackling the ongoing challenges in drug discovery, such as the ever-expanding exploration space and the scarcity of labeled data. Consequently, AL is gaining prominence in the field of drug development. In this paper, we comprehensively review the application of AL at all stages of drug discovery, including compound-target interaction prediction, virtual screening, molecular generation and optimization, and molecular property prediction. Additionally, we discuss the challenges and prospects associated with the current applications of AL in drug discovery.
Affiliation(s)
- Lei Wang
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410013, Hunan, China
- Zhenran Zhou
  - Department of Computer Science, Hunan University, Changsha 410082, Hunan, China
- Xixi Yang
  - Department of Computer Science, Hunan University, Changsha 410082, Hunan, China
- Shaohua Shi
  - Institute for Advancing Translational Medicine in Bone and Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR, China
- Xiangxiang Zeng
  - Department of Computer Science, Hunan University, Changsha 410082, Hunan, China
- Dongsheng Cao
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410013, Hunan, China
3
Gorantla R, Kubincová A, Suutari B, Cossins BP, Mey ASJS. Benchmarking Active Learning Protocols for Ligand-Binding Affinity Prediction. J Chem Inf Model 2024; 64:1955-1965. [PMID: 38446131 PMCID: PMC10966646 DOI: 10.1021/acs.jcim.4c00220] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2024] [Accepted: 02/23/2024] [Indexed: 03/07/2024]
Abstract
Active learning (AL) has become a powerful tool in computational drug discovery, enabling the identification of top binders from vast molecular libraries. To design a robust AL protocol, it is important to understand the influence of AL parameters, as well as the features of the data sets, on the outcomes. We use four affinity data sets for different targets (TYK2, USP7, D2R, Mpro) to systematically evaluate the performance of machine learning models [Gaussian process (GP) model and Chemprop model], sample selection protocols, and the batch size based on metrics describing the overall predictive power of the model (R², Spearman rank, root-mean-square error) as well as the accurate identification of top 2%/5% binders (Recall, F1 score). Both models have a comparable Recall of top binders on large data sets, but the GP model surpasses the Chemprop model when training data are sparse. A larger initial batch size, especially on diverse data sets, increased the Recall of both models as well as overall correlation metrics. However, for subsequent cycles, smaller batch sizes of 20 or 30 compounds proved to be desirable. Furthermore, adding artificial Gaussian noise to the data up to a certain threshold still allowed the model to identify clusters with top-scoring compounds. However, excessive noise (>1σ) did impact the model's predictive and exploitative capabilities.
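As a rough illustration of the top-binder Recall metric mentioned in this abstract, the Python sketch below computes the share of the true top-ranked compounds that an AL run has acquired. It assumes that a higher value means stronger binding (as with pKi) and that the top set is simply the best `fraction` of the library; the function names are hypothetical, and the paper's exact definition may differ.

```python
def top_fraction_indices(values, fraction):
    """Indices of the highest-valued `fraction` of entries (at least one).

    Assumes higher value = stronger binder; flip the sign of the inputs
    for quantities such as binding free energies.
    """
    k = max(1, round(fraction * len(values)))
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(ranked[:k])


def top_binder_recall(true_affinity, acquired, fraction=0.02):
    """Share of the true top binders found among the acquired compounds."""
    top = top_fraction_indices(true_affinity, fraction)
    return len(top & set(acquired)) / len(top)


# Toy library of ten compounds; the top 20% are indices 1 and 3.
affinity = [5.1, 9.2, 6.0, 8.7, 4.3, 7.5, 6.6, 5.9, 8.1, 7.0]
recall = top_binder_recall(affinity, acquired=[1, 2, 8], fraction=0.2)
```

Here only one of the two top compounds (index 1) was acquired, so the Recall for this toy run is one half.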
Affiliation(s)
- Rohan Gorantla
  - School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, U.K.
  - EaStCHEM School of Chemistry, University of Edinburgh, Edinburgh EH9 3FJ, U.K.
  - Exscientia, Schrödinger Building, Oxford OX4 4GE, U.K.
4
Joshi RP, Kumar N. Artificial Intelligence for Autonomous Molecular Design: A Perspective. Molecules 2021; 26:6761. [PMID: 34833853 PMCID: PMC8619999 DOI: 10.3390/molecules26226761] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 08/16/2021] [Revised: 10/23/2021] [Accepted: 10/29/2021] [Indexed: 11/23/2022]
Abstract
Domain-aware artificial intelligence has been increasingly adopted in recent years to expedite molecular design in various applications, including drug design and discovery. Recent advances in areas such as physics-informed machine learning and reasoning, software engineering, high-end hardware development, and computing infrastructures are providing opportunities to build scalable and explainable AI molecular discovery systems. This could improve a design hypothesis through feedback analysis, data integration that can provide a basis for the introduction of end-to-end automation for compound discovery and optimization, and enable more intelligent searches of chemical space. Several state-of-the-art ML architectures are predominantly and independently used for predicting the properties of small molecules, their high throughput synthesis, and screening, iteratively identifying and optimizing lead therapeutic candidates. However, such deep learning and ML approaches also raise considerable conceptual, technical, scalability, and end-to-end error quantification challenges, as well as skepticism about the current AI hype to build automated tools. To this end, synergistically and intelligently using these individual components along with robust quantum physics-based molecular representation and data generation tools in a closed-loop holds enormous promise for accelerated therapeutic design to critically analyze the opportunities and challenges for their more widespread application. This article aims to identify the most recent technology and breakthrough achieved by each of the components and discusses how such autonomous AI and ML workflows can be integrated to radically accelerate the protein target or disease model-based probe design that can be iteratively validated experimentally. Taken together, this could significantly reduce the timeline for end-to-end therapeutic discovery and optimization upon the arrival of any novel zoonotic transmission event. Our article serves as a guide for medicinal, computational chemistry and biology, analytical chemistry, and the ML community to practice autonomous molecular design in precision medicine and drug discovery.
Affiliation(s)
- Neeraj Kumar
  - Computational Biology Group, Biological Science Division, Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, WA 99352, USA
5
Guest EE, Pickett SD, Hirst JD. Structural variation of protein-ligand complexes of the first bromodomain of BRD4. Org Biomol Chem 2021; 19:5632-5641. [PMID: 34105560 DOI: 10.1039/d1ob00658d] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 12/27/2022]
Abstract
The bromodomain-containing protein 4 (BRD4), a member of the bromodomain and extra-terminal domain (BET) family, plays a key role in several diseases, especially cancers. With increased interest in BRD4 as a therapeutic target, many X-ray crystal structures of the protein in complex with small molecule inhibitors have become publicly available over the past decade. In this study, we use this structural information to investigate the conformations of the first bromodomain (BD1) of BRD4. Structural alignment of 297 BRD4-BD1 complexes shows a high level of similarity between the structures of BRD4-BD1, regardless of the bound ligand. We employ WONKA, a tool for detailed analyses of protein binding sites, to compare the active sites of over 100 of these crystal structures. The positions of key binding site residues show a high level of conformational similarity, with the exception of Trp81. A focused analysis of the highly conserved water network in the binding site of BRD4-BD1 is performed to identify the positions of these water molecules across the crystal structures. The importance of the water network is illustrated using molecular docking and absolute free energy perturbation simulations. 82% of the ligand poses were better predicted when including water molecules as part of the receptor. Our analysis provides guidance for the design of new BRD4-BD1 inhibitors and the selection of the best structure of BRD4-BD1 to use in structure-based drug design, an important approach for faster and more cost-efficient lead discovery.
Affiliation(s)
- Ellen E Guest
  - School of Chemistry, University of Nottingham, University Park, Nottingham, NG7 2RD, UK
- Stephen D Pickett
  - GlaxoSmithKline R&D Pharmaceuticals, Computational Chemistry, Stevenage, UK
- Jonathan D Hirst
  - School of Chemistry, University of Nottingham, University Park, Nottingham, NG7 2RD, UK
6
Reker D. Practical considerations for active machine learning in drug discovery. Drug Discov Today Technol 2020; 32-33:73-79. [PMID: 33386097 DOI: 10.1016/j.ddtec.2020.06.001] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Received: 04/15/2020] [Revised: 06/01/2020] [Accepted: 06/10/2020] [Indexed: 02/01/2023]
Abstract
Active machine learning enables the automated selection of the most valuable next experiments to improve predictive modelling and hasten active retrieval in drug discovery. Although a long established theoretical concept and introduced to drug discovery approximately 15 years ago, the deployment of active learning technology in the discovery pipelines across academia and industry remains slow. With the recent re-discovered enthusiasm for artificial intelligence as well as improved flexibility of laboratory automation, active learning is expected to surge and become a key technology for molecular optimizations. This review recapitulates key findings from previous active learning studies to highlight the challenges and opportunities of applying adaptive machine learning to drug discovery. Specifically, considerations regarding implementation, infrastructural integration, and expected benefits are discussed. By focusing on these practical aspects of active learning, this review aims at providing insights for scientists planning to implement active learning workflows in their discovery pipelines.
Affiliation(s)
- Daniel Reker
  - Koch Institute for Integrative Cancer Research and MIT-IBM Watson AI Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Division of Gastroenterology, Hepatology and Endoscopy, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
7
Kumar SP, Patel CN, Rawal RM, Pandya HA. Energetic contributions of amino acid residues and its cross‐talk to delineate ligand‐binding mechanism. Proteins 2020; 88:1207-1225. [DOI: 10.1002/prot.25894] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 10/18/2019] [Revised: 02/20/2020] [Accepted: 04/03/2020] [Indexed: 02/02/2023]
Affiliation(s)
- Chirag N. Patel
  - Department of Botany, Bioinformatics, and Climate Change Impacts Management, University School of Sciences, Gujarat University, Ahmedabad, India
- Rakesh M. Rawal
  - Department of Life Sciences, University School of Sciences, Gujarat University, Ahmedabad, India
- Himanshu A. Pandya
  - Department of Life Sciences, University School of Sciences, Gujarat University, Ahmedabad, India
  - Department of Botany, Bioinformatics, and Climate Change Impacts Management, University School of Sciences, Gujarat University, Ahmedabad, India