1
|
Loeffler HH, Wan S, Klähn M, Bhati AP, Coveney PV. Optimal Molecular Design: Generative Active Learning Combining REINVENT with Precise Binding Free Energy Ranking Simulations. J Chem Theory Comput 2024; 20. [PMID: 39225482 PMCID: PMC11428133 DOI: 10.1021/acs.jctc.4c00576] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2024] [Revised: 08/08/2024] [Accepted: 08/08/2024] [Indexed: 09/04/2024]
Abstract
Active learning (AL) is a specific instance of sequential experimental design and uses machine learning to intelligently choose the next data point or batch of molecular structures to be evaluated. In this sense, it closely mimics the iterative design-make-test-analysis cycle of laboratory experiments to find optimized compounds for a given design task. Here, we describe an AL protocol which combines generative molecular AI, using REINVENT, and physics-based absolute binding free energy molecular dynamics simulation, using ESMACS, to discover new ligands for two different target proteins, 3CLpro and TNKS2. We have deployed our generative active learning (GAL) protocol on Frontier, the world's only exa-scale machine. We show that the protocol can find higher-scoring molecules compared to the baseline, a surrogate ML docking model for 3CLpro and compounds with experimentally determined binding affinities for TNKS2. The ligands found are also chemically diverse and occupy a different chemical space than the baseline. We vary the batch sizes that are put forward for free energy assessment in each GAL cycle to assess the impact on their efficiency on the GAL protocol and recommend their optimal values in different scenarios. Overall, we demonstrate a powerful capability of the combination of physics-based and AI methods which yields effective chemical space sampling at an unprecedented scale and is of immediate and direct relevance to modern, data-driven drug discovery.
Collapse
Affiliation(s)
- Hannes H. Loeffler
- Molecular
AI, Discovery Sciences, R&D, AstraZeneca, Mölndal 431 83, Sweden
| | - Shunzhou Wan
- Centre
for Computational Science, Department of Chemistry, University College London, London WC1H 0AJ, U.K.
| | - Marco Klähn
- Molecular
AI, Discovery Sciences, R&D, AstraZeneca, Mölndal 431 83, Sweden
| | - Agastya P. Bhati
- Centre
for Computational Science, Department of Chemistry, University College London, London WC1H 0AJ, U.K.
| | - Peter V. Coveney
- Centre
for Computational Science, Department of Chemistry, University College London, London WC1H 0AJ, U.K.
- Advanced
Research Computing Centre, University College
London, London WC1H 0AJ, U.K.
- Institute
for Informatics, Faculty of Science, University
of Amsterdam, Amsterdam 1098XH, The Netherlands
| |
Collapse
|
2
|
Lavecchia A. Navigating the frontier of drug-like chemical space with cutting-edge generative AI models. Drug Discov Today 2024; 29:104133. [PMID: 39103144 DOI: 10.1016/j.drudis.2024.104133] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2024] [Revised: 07/20/2024] [Accepted: 07/31/2024] [Indexed: 08/07/2024]
Abstract
Deep generative models (GMs) have transformed the exploration of drug-like chemical space (CS) by generating novel molecules through complex, nontransparent processes, bypassing direct structural similarity. This review examines five key architectures for CS exploration: recurrent neural networks (RNNs), variational autoencoders (VAEs), generative adversarial networks (GANs), normalizing flows (NF), and Transformers. It discusses molecular representation choices, training strategies for focused CS exploration, evaluation criteria for CS coverage, and related challenges. Future directions include refining models, exploring new notations, improving benchmarks, and enhancing interpretability to better understand biologically relevant molecular properties.
Collapse
Affiliation(s)
- Antonio Lavecchia
- 'Drug Discovery' Laboratory, Department of Pharmacy, University of Naples Federico II, I-80131 Naples, Italy.
| |
Collapse
|
3
|
Samanipour S, Barron LP, van Herwerden D, Praetorius A, Thomas KV, O’Brien JW. Exploring the Chemical Space of the Exposome: How Far Have We Gone? JACS AU 2024; 4:2412-2425. [PMID: 39055136 PMCID: PMC11267556 DOI: 10.1021/jacsau.4c00220] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/08/2024] [Revised: 05/29/2024] [Accepted: 05/31/2024] [Indexed: 07/27/2024]
Abstract
Around two-thirds of chronic human disease can not be explained by genetics alone. The Lancet Commission on Pollution and Health estimates that 16% of global premature deaths are linked to pollution. Additionally, it is now thought that humankind has surpassed the safe planetary operating space for introducing human-made chemicals into the Earth System. Direct and indirect exposure to a myriad of chemicals, known and unknown, poses a significant threat to biodiversity and human health, from vaccine efficacy to the rise of antimicrobial resistance as well as autoimmune diseases and mental health disorders. The exposome chemical space remains largely uncharted due to the sheer number of possible chemical structures, estimated at over 1060 unique forms. Conventional methods have cataloged only a fraction of the exposome, overlooking transformation products and often yielding uncertain results. In this Perspective, we have reviewed the latest efforts in mapping the exposome chemical space and its subspaces. We also provide our view on how the integration of data-driven approaches might be able to bridge the identified gaps.
Collapse
Affiliation(s)
- Saer Samanipour
- Van’t
Hoff Institute for Molecular Sciences (HIMS), University of Amsterdam, Amsterdam 1090 GD, The Netherlands
- UvA
Data Science Center, University of Amsterdam, Amsterdam 1090 GD, The Netherlands
- Queensland
Alliance for Environmental Health Sciences (QAEHS), The University of Queensland, Cornwall Street, Woolloongabba, Queensland 4102, Australia
| | - Leon Patrick Barron
- Van’t
Hoff Institute for Molecular Sciences (HIMS), University of Amsterdam, Amsterdam 1090 GD, The Netherlands
- MRC
Centre for Environment and Health, Environmental Research Group, School
of Public Health, Faculty of Medicine, Imperial
College London, London W12 0BZ, United Kingdom
| | - Denice van Herwerden
- Van’t
Hoff Institute for Molecular Sciences (HIMS), University of Amsterdam, Amsterdam 1090 GD, The Netherlands
| | - Antonia Praetorius
- Institute
for Biodiversity and Ecosystem Dynamics (IBED), University of Amsterdam, Amsterdam 1090 GD, The Netherlands
| | - Kevin V. Thomas
- Queensland
Alliance for Environmental Health Sciences (QAEHS), The University of Queensland, Cornwall Street, Woolloongabba, Queensland 4102, Australia
| | - Jake William O’Brien
- Van’t
Hoff Institute for Molecular Sciences (HIMS), University of Amsterdam, Amsterdam 1090 GD, The Netherlands
- Queensland
Alliance for Environmental Health Sciences (QAEHS), The University of Queensland, Cornwall Street, Woolloongabba, Queensland 4102, Australia
| |
Collapse
|
4
|
Ai C, Yang H, Liu X, Dong R, Ding Y, Guo F. MTMol-GPT: De novo multi-target molecular generation with transformer-based generative adversarial imitation learning. PLoS Comput Biol 2024; 20:e1012229. [PMID: 38924082 PMCID: PMC11233020 DOI: 10.1371/journal.pcbi.1012229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2023] [Revised: 07/09/2024] [Accepted: 06/03/2024] [Indexed: 06/28/2024] Open
Abstract
De novo drug design is crucial in advancing drug discovery, which aims to generate new drugs with specific pharmacological properties. Recently, deep generative models have achieved inspiring progress in generating drug-like compounds. However, the models prioritize a single target drug generation for pharmacological intervention, neglecting the complicated inherent mechanisms of diseases, and influenced by multiple factors. Consequently, developing novel multi-target drugs that simultaneously target specific targets can enhance anti-tumor efficacy and address issues related to resistance mechanisms. To address this issue and inspired by Generative Pre-trained Transformers (GPT) models, we propose an upgraded GPT model with generative adversarial imitation learning for multi-target molecular generation called MTMol-GPT. The multi-target molecular generator employs a dual discriminator model using the Inverse Reinforcement Learning (IRL) method for a concurrently multi-target molecular generation. Extensive results show that MTMol-GPT generates various valid, novel, and effective multi-target molecules for various complex diseases, demonstrating robustness and generalization capability. In addition, molecular docking and pharmacophore mapping experiments demonstrate the drug-likeness properties and effectiveness of generated molecules potentially improve neuropsychiatric interventions. Furthermore, our model's generalizability is exemplified by a case study focusing on the multi-targeted drug design for breast cancer. As a broadly applicable solution for multiple targets, MTMol-GPT provides new insight into future directions to enhance potential complex disease therapeutics by generating high-quality multi-target molecules in drug discovery.
Collapse
Affiliation(s)
- Chengwei Ai
- School of computer science and engineering, Central South University, Changsha, China
| | - Hongpeng Yang
- Department of computer science and engineering, University of South Carolina, Columbia, South Carolina, United States of America
| | - Xiaoyi Liu
- School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing, China
- Ministry of Education, Engineering Research Center for Pharmaceutics of Chinese Materia Medica and New Drug Development, Beijing, China
| | - Ruihan Dong
- Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Fei Guo
- School of computer science and engineering, Central South University, Changsha, China
| |
Collapse
|
5
|
Gangwal A, Lavecchia A. Unleashing the power of generative AI in drug discovery. Drug Discov Today 2024; 29:103992. [PMID: 38663579 DOI: 10.1016/j.drudis.2024.103992] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2024] [Revised: 03/22/2024] [Accepted: 04/18/2024] [Indexed: 05/04/2024]
Abstract
Artificial intelligence (AI) is revolutionizing drug discovery by enhancing precision, reducing timelines and costs, and enabling AI-driven computer-aided drug design. This review focuses on recent advancements in deep generative models (DGMs) for de novo drug design, exploring diverse algorithms and their profound impact. It critically analyses the challenges that are intricately interwoven into these technologies, proposing strategies to unlock their full potential. It features case studies of both successes and failures in advancing drugs to clinical trials with AI assistance. Last, it outlines a forward-looking plan for optimizing DGMs in de novo drug design, thereby fostering faster and more cost-effective drug development.
Collapse
Affiliation(s)
- Amit Gangwal
- Department of Natural Product Chemistry, Shri Vile Parle Kelavani Mandal's Institute of Pharmacy, Dhule 424001, Maharashtra, India
| | - Antonio Lavecchia
- "Drug Discovery" Laboratory, Department of Pharmacy, University of Naples Federico II, I-80131 Naples, Italy.
| |
Collapse
|
6
|
Luginina AP, Khnykin AN, Khorn PA, Moiseeva OV, Safronova NA, Pospelov VA, Dashevskii DE, Belousov AS, Borschevskiy VI, Mishin AV. Rational Design of Drugs Targeting G-Protein-Coupled Receptors: Ligand Search and Screening. BIOCHEMISTRY. BIOKHIMIIA 2024; 89:958-972. [PMID: 38880655 DOI: 10.1134/s0006297924050158] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/10/2024] [Revised: 02/22/2024] [Accepted: 02/23/2024] [Indexed: 06/18/2024]
Abstract
G protein-coupled receptors (GPCRs) are transmembrane proteins that participate in many physiological processes and represent major pharmacological targets. Recent advances in structural biology of GPCRs have enabled the development of drugs based on the receptor structure (structure-based drug design, SBDD). SBDD utilizes information about the receptor-ligand complex to search for suitable compounds, thus expanding the chemical space of possible receptor ligands without the need for experimental screening. The review describes the use of structure-based virtual screening (SBVS) for GPCR ligands and approaches for the functional testing of potential drug compounds, as well as discusses recent advances and successful examples in the application of SBDD for the identification of GPCR ligands.
Collapse
Affiliation(s)
- Aleksandra P Luginina
- Research Center for Molecular Mechanisms of Aging and Age-Related Diseases, Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, 141701, Russia
| | - Andrey N Khnykin
- Research Center for Molecular Mechanisms of Aging and Age-Related Diseases, Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, 141701, Russia
| | - Polina A Khorn
- Research Center for Molecular Mechanisms of Aging and Age-Related Diseases, Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, 141701, Russia
| | - Olga V Moiseeva
- Research Center for Molecular Mechanisms of Aging and Age-Related Diseases, Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, 141701, Russia
- Skryabin Institute of Biochemistry and Physiology of Microorganisms, Russian Academy of Sciences, Pushchino, Moscow Region, 142290, Russia
| | - Nadezhda A Safronova
- Research Center for Molecular Mechanisms of Aging and Age-Related Diseases, Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, 141701, Russia
| | - Vladimir A Pospelov
- Research Center for Molecular Mechanisms of Aging and Age-Related Diseases, Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, 141701, Russia
| | - Dmitrii E Dashevskii
- Research Center for Molecular Mechanisms of Aging and Age-Related Diseases, Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, 141701, Russia
| | - Anatolii S Belousov
- Research Center for Molecular Mechanisms of Aging and Age-Related Diseases, Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, 141701, Russia
| | - Valentin I Borschevskiy
- Research Center for Molecular Mechanisms of Aging and Age-Related Diseases, Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, 141701, Russia.
- Frank Laboratory of Neutron Physics, Joint Institute for Nuclear Research, Dubna, Moscow Region, 141980, Russia
| | - Alexey V Mishin
- Research Center for Molecular Mechanisms of Aging and Age-Related Diseases, Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, 141701, Russia.
| |
Collapse
|
7
|
Vogt M. Chemoinformatic approaches for navigating large chemical spaces. Expert Opin Drug Discov 2024; 19:403-414. [PMID: 38300511 DOI: 10.1080/17460441.2024.2313475] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2023] [Accepted: 01/30/2024] [Indexed: 02/02/2024]
Abstract
INTRODUCTION Large chemical spaces (CSs) include traditional large compound collections, combinatorial libraries covering billions to trillions of molecules, DNA-encoded chemical libraries comprising complete combinatorial CSs in a single mixture, and virtual CSs explored by generative models. The diverse nature of these types of CSs require different chemoinformatic approaches for navigation. AREAS COVERED An overview of different types of large CSs is provided. Molecular representations and similarity metrics suitable for large CS exploration are discussed. A summary of navigation of CSs in generative models is provided. Methods for characterizing and comparing CSs are discussed. EXPERT OPINION The size of large CSs might restrict navigation to specialized algorithms and limit it to considering neighborhoods of structurally similar molecules. Efficient navigation of large CSs not only requires methods that scale with size but also requires smart approaches that focus on better but not necessarily larger molecule selections. Deep generative models aim to provide such approaches by implicitly learning features relevant for targeted biological properties. It is unclear whether these models can fulfill this ideal as validation is difficult as long as the covered CSs remain mainly virtual without experimental verification.
Collapse
Affiliation(s)
- Martin Vogt
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany
| |
Collapse
|
8
|
Humayun F, Khan F, Khan A, Alshammari A, Ji J, Farhan A, Fawad N, Alam W, Ali A, Wei DQ. De novo generation of dual-target ligands for the treatment of SARS-CoV-2 using deep learning, virtual screening, and molecular dynamic simulations. J Biomol Struct Dyn 2024; 42:3019-3029. [PMID: 37449757 DOI: 10.1080/07391102.2023.2234481] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/26/2022] [Accepted: 04/30/2023] [Indexed: 07/18/2023]
Abstract
De novo generation of molecules with the necessary features offers a promising opportunity for artificial intelligence, such as deep generative approaches. However, creating novel compounds having biological activities toward two distinct targets continues to be a very challenging task. In this study, we develop a unique computational framework for the de novo synthesis of bioactive compounds directed at two predetermined therapeutic targets. This framework is referred to as the dual-target ligand generative network. Our approach uses a stochastic policy to explore chemical spaces called a sequence-based simple molecular input line entry system (SMILES) generator. The steps in the high-level workflow would be to gather and prepare the training data for both targets' molecules, build a neural network model and train it to make molecules, create new molecules using generative AI, and then virtually screen the newly validated molecules against the SARS-CoV-2 PLpro and 3CLpro drug targets. Results shows that novel molecules generated have higher binding affinity with both targets than the conventional drug i.e. Remdesivir being used for the treatment of SARS-CoV-2.Communicated by Ramaswamy H. Sarma.
Collapse
Affiliation(s)
- Fahad Humayun
- Department of Bioinformatics and Biological Statistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, PR China
- State Key Laboratory of Microbial Metabolism and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, PR China
| | - Fatima Khan
- National Institute of Health, Islamabad, Pakistan
| | - Abbas Khan
- Department of Bioinformatics and Biological Statistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, PR China
- State Key Laboratory of Microbial Metabolism and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, PR China
| | - Abdulrahman Alshammari
- Department of Pharmacology and Toxicology, College of Pharmacy, King Saud University, Riyadh, Saudi Arabia
| | - Jun Ji
- Henan Provincial Engineering and Technology Center of Health Products for Livestock and Poultry, Henan Provincial Engineering and Technology Center of Animal Disease Diagnosis and Integrated Control, Nanyang Normal University, Nanyang, PR China
| | - Ali Farhan
- Department of Chemistry, Chung Yuan Christian University, Taoyuan, Taiwan
| | - Nasim Fawad
- Poultry Research Institute, Rawalpindi, Pakistan
| | - Waheed Alam
- National Institute of Health, Islamabad, Pakistan
| | - Arif Ali
- Department of Bioinformatics and Biological Statistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, PR China
- State Key Laboratory of Microbial Metabolism and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, PR China
| | - Dong-Qing Wei
- Department of Bioinformatics and Biological Statistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, PR China
- State Key Laboratory of Microbial Metabolism and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, PR China
- Centre for Research in Molecular Modeling, Concordia University, Québec, Canada
| |
Collapse
|
9
|
Helal H, Firoz J, Bilbrey JA, Sprueill H, Herman KM, Krell MM, Murray T, Roldan ML, Kraus M, Li A, Das P, Xantheas SS, Choudhury S. Acceleration of Graph Neural Network-Based Prediction Models in Chemistry via Co-Design Optimization on Intelligence Processing Units. J Chem Inf Model 2024; 64:1568-1580. [PMID: 38382011 DOI: 10.1021/acs.jcim.3c01312] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/23/2024]
Abstract
Atomic structure prediction and associated property calculations are the bedrock of chemical physics. Since high-fidelity ab initio modeling techniques for computing the structure and properties can be prohibitively expensive, this motivates the development of machine-learning (ML) models that make these predictions more efficiently. Training graph neural networks over large atomistic databases introduces unique computational challenges, such as the need to process millions of small graphs with variable size and support communication patterns that are distinct from learning over large graphs, such as social networks. We demonstrate a novel hardware-software codesign approach to scale up the training of atomistic graph neural networks (GNN) for structure and property prediction. First, to eliminate redundant computation and memory associated with alternative padding techniques and to improve throughput via minimizing communication, we formulate the effective coalescing of the batches of variable-size atomistic graphs as the bin packing problem and introduce a hardware-agnostic algorithm to pack these batches. In addition, we propose hardware-specific optimizations, including a planner and vectorization for the gather-scatter operations targeted for Graphcore's Intelligence Processing Unit (IPU), as well as model-specific optimizations such as merged communication collectives and optimized softplus. Putting these all together, we demonstrate the effectiveness of the proposed codesign approach by providing an implementation of a well-established atomistic GNN on the Graphcore IPUs. We evaluate the training performance on multiple atomistic graph databases with varying degrees of graph counts, sizes, and sparsity. We demonstrate that such a codesign approach can reduce the training time of atomistic GNNs and can improve their performance by up to 1.5× compared to the baseline implementation of the model on the IPUs. Additionally, we compare our IPU implementation with a Nvidia GPU-based implementation and show that our atomistic GNN implementation on the IPUs can run 1.8× faster on average compared to the execution time on the GPUs.
Collapse
Affiliation(s)
- Hatem Helal
- Graphcore, Kett House, Station Rd, Cambridge CB1 2JH, U.K
| | - Jesun Firoz
- Advanced Computing, Mathematics and Data Division, Pacific Northwest National Laboratory, 1100 Dexter Ave N, Seattle, Washington 98109, United States
| | - Jenna A Bilbrey
- Artificial Intelligence and Data Analytics Division, Pacific Northwest National Laboratory, 902 Battelle Boulevard, Richland, Washington 99352, United States
| | - Henry Sprueill
- Artificial Intelligence and Data Analytics Division, Pacific Northwest National Laboratory, 902 Battelle Boulevard, Richland, Washington 99352, United States
| | - Kristina M Herman
- Department of Chemistry, University of Washington, Seattle, Washington 98185, United States
| | | | - Tom Murray
- Graphcore, Kett House, Station Rd, Cambridge CB1 2JH, U.K
| | | | - Mike Kraus
- Graphcore, Kett House, Station Rd, Cambridge CB1 2JH, U.K
| | - Ang Li
- Advanced Computing, Mathematics and Data Division, Pacific Northwest National Laboratory, 902 Battelle Boulevard, Richland, Washington 99352, United States
| | - Payel Das
- IBM Research, Yorktown Heights, New York 10598, United States
| | - Sotiris S Xantheas
- Department of Chemistry, University of Washington, Seattle, Washington 98185, United States
- Advanced Computing, Mathematics and Data Division, Pacific Northwest National Laboratory, 902 Battelle Boulevard, Richland, Washington 99352, United States
| | - Sutanay Choudhury
- Advanced Computing, Mathematics and Data Division, Pacific Northwest National Laboratory, 902 Battelle Boulevard, Richland, Washington 99352, United States
| |
Collapse
|
10
|
Loeffler HH, He J, Tibo A, Janet JP, Voronov A, Mervin LH, Engkvist O. Reinvent 4: Modern AI-driven generative molecule design. J Cheminform 2024; 16:20. [PMID: 38383444 PMCID: PMC10882833 DOI: 10.1186/s13321-024-00812-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2023] [Accepted: 02/09/2024] [Indexed: 02/23/2024] Open
Abstract
REINVENT 4 is a modern open-source generative AI framework for the design of small molecules. The software utilizes recurrent neural networks and transformer architectures to drive molecule generation. These generators are seamlessly embedded within the general machine learning optimization algorithms, transfer learning, reinforcement learning and curriculum learning. REINVENT 4 enables and facilitates de novo design, R-group replacement, library design, linker design, scaffold hopping and molecule optimization. This contribution gives an overview of the software and describes its design. Algorithms and their applications are discussed in detail. REINVENT 4 is a command line tool which reads a user configuration in either TOML or JSON format. The aim of this release is to provide reference implementations for some of the most common algorithms in AI based molecule generation. An additional goal with the release is to create a framework for education and future innovation in AI based molecular design. The software is available from https://github.com/MolecularAI/REINVENT4 and released under the permissive Apache 2.0 license. Scientific contribution. The software provides an open-source reference implementation for generative molecular design where the software is also being used in production to support in-house drug discovery projects. The publication of the most common machine learning algorithms in one code and full documentation thereof will increase transparency of AI and foster innovation, collaboration and education.
Collapse
Affiliation(s)
- Hannes H Loeffler
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden.
| | - Jiazhen He
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
| | - Alessandro Tibo
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
| | - Jon Paul Janet
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
| | - Alexey Voronov
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
| | - Lewis H Mervin
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK
| | - Ola Engkvist
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
| |
Collapse
|
11
|
Kyro GW, Morgunov A, Brent RI, Batista VS. ChemSpaceAL: An Efficient Active Learning Methodology Applied to Protein-Specific Molecular Generation. J Chem Inf Model 2024; 64:653-665. [PMID: 38287889 DOI: 10.1021/acs.jcim.3c01456] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/31/2024]
Abstract
The incredible capabilities of generative artificial intelligence models have inevitably led to their application in the domain of drug discovery. Within this domain, the vastness of chemical space motivates the development of more efficient methods for identifying regions with molecules that exhibit desired characteristics. In this work, we present a computationally efficient active learning methodology and demonstrate its applicability to targeted molecular generation. When applied to c-Abl kinase, a protein with FDA-approved small-molecule inhibitors, the model learns to generate molecules similar to the inhibitors without prior knowledge of their existence and even reproduces two of them exactly. We also show that the methodology is effective for a protein without any commercially available small-molecule inhibitors, the HNH domain of the CRISPR-associated protein 9 (Cas9) enzyme. To facilitate implementation and reproducibility, we made all of our software available through the open-source ChemSpaceAL Python package.
Collapse
Affiliation(s)
- Gregory W Kyro
- Department of Chemistry, Yale University, New Haven, Connecticut 06511-8499, United States
| | - Anton Morgunov
- Department of Chemistry, Yale University, New Haven, Connecticut 06511-8499, United States
| | - Rafael I Brent
- Department of Chemistry, Yale University, New Haven, Connecticut 06511-8499, United States
| | - Victor S Batista
- Department of Chemistry, Yale University, New Haven, Connecticut 06511-8499, United States
| |
Collapse
|
12
|
Olmedo DA, Durant-Archibold AA, López-Pérez JL, Medina-Franco JL. Design and Diversity Analysis of Chemical Libraries in Drug Discovery. Comb Chem High Throughput Screen 2024; 27:502-515. [PMID: 37409545 DOI: 10.2174/1386207326666230705150110] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2023] [Revised: 05/30/2023] [Accepted: 05/30/2023] [Indexed: 07/07/2023]
Abstract
Chemical libraries and compound data sets are among the main inputs to start the drug discovery process at universities, research institutes, and the pharmaceutical industry. The approach used in the design of compound libraries, the chemical information they possess, and the representation of structures, play a fundamental role in the development of studies: chemoinformatics, food informatics, in silico pharmacokinetics, computational toxicology, bioinformatics, and molecular modeling to generate computational hits that will continue the optimization process of drug candidates. The prospects for growth in drug discovery and development processes in chemical, biotechnological, and pharmaceutical companies began a few years ago by integrating computational tools with artificial intelligence methodologies. It is anticipated that it will increase the number of drugs approved by regulatory agencies shortly.
Collapse
Affiliation(s)
- Dionisio A Olmedo
- Centro de Investigaciones Farmacognósticas de la Flora Panameña (CIFLORPAN), Facultad de Farmacia, Universidad de Panamá, Ciudad de Panamá, Apartado, 0824-00178, Panamá
- Sistema Nacional de Investigación (SNI), Secretaria Nacional de Ciencia, Tecnología e Innovación (SENACYT), Ciudad del Saber, Clayton, Panamá
| | - Armando A Durant-Archibold
- Centro de Biodiversidad y Descubrimiento de Drogas, Instituto de Investigaciones Científicas y Servicios de Alta Tecnología (INDICASAT AIP), Apartado, 0843-01103, Panamá
- Departamento de Bioquímica, Facultad de Ciencias Naturales, Exactas y Tecnología, Universidad de Panamá, Ciudad de Panamá, Panamá
| | - José Luis López-Pérez
- CESIFAR, Departamento de Farmacología, Facultad de Medicina, Universidad de Panamá, Ciudad de Panamá, Panamá
- Departamento de Ciencias Farmacéuticas, Facultad de Farmacia, Universidad de Salamanca, Avda. Campo Charro s/n, 37071 Salamanca, España
| | - José Luis Medina-Franco
- DIFACQUIM Grupo de Investigación, Departamento de Farmacia, Escuela de Química, Universidad Nacional Autónoma de México, Ciudad de México, Apartado, 04510, México
| |
Collapse
|
13
|
Qin R, Zhang H, Huang W, Shao Z, Lei J. Deep learning-based design and screening of benzimidazole-pyrazine derivatives as adenosine A 2B receptor antagonists. J Biomol Struct Dyn 2023:1-17. [PMID: 38133953 DOI: 10.1080/07391102.2023.2295974] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2023] [Accepted: 12/11/2023] [Indexed: 12/24/2023]
Abstract
The Adenosine A2B receptor (A2BAR) is considered a novel potential target for the immunotherapy of cancer, and A2BAR antagonists have an inhibitory effect on tumor growth, proliferation, and metastasis. In our previous studies, we identified a class of benzimidazole-pyrazine scaffolds whose derivatives exhibited the antagonistic effect but lacked subtype selectivity towards A2BAR. In this work, we developed a scaffold-based protocol that incorporates a deep generative model and multilayer virtual screening to design benzimidazole-pyrazine derivatives as potential selective A2BAR antagonists. By utilizing a generative model with reported A2BAR antagonists as the training set, we built up a scaffold-focused library of benzimidazole-pyrazine derivatives and processed a virtual screening protocol to discover potential A2BAR antagonists. Finally, five molecules with different Bemis-Murcko scaffolds were identified and exhibited higher binding free energies than the reference molecule 12o. Further computational analysis revealed that the 3-benzyl derivative ABA-1266 presented high selectivity toward A2BAR and showed preferred draggability, providing future potent development of selective A2BAR antagonists.Communicated by Ramaswamy H. Sarma.
Collapse
Affiliation(s)
- Rui Qin
- School of Pharmaceutical Sciences, Sun Yat-sen University, Guangzhou, China
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Hao Zhang
- School of Pharmaceutical Sciences, Sun Yat-sen University, Guangzhou, China
| | - Weifeng Huang
- School of Pharmaceutical Sciences, Sun Yat-sen University, Guangzhou, China
| | - Zhenglin Shao
- School of Pharmaceutical Sciences, Sun Yat-sen University, Guangzhou, China
| | - Jinping Lei
- School of Pharmaceutical Sciences, Sun Yat-sen University, Guangzhou, China
| |
Collapse
|
14
|
Kyro GW, Morgunov A, Brent RI, Batista VS. ChemSpaceAL: An Efficient Active Learning Methodology Applied to Protein-Specific Molecular Generation. ARXIV 2023:arXiv:2309.05853v2. [PMID: 37744464 PMCID: PMC10516108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Subscribe] [Scholar Register] [Indexed: 09/26/2023]
Abstract
The incredible capabilities of generative artificial intelligence models have inevitably led to their application in the domain of drug discovery. Within this domain, the vastness of chemical space motivates the development of more efficient methods for identifying regions with molecules that exhibit desired characteristics. In this work, we present a computationally efficient active learning methodology that requires evaluation of only a subset of the generated data in the constructed sample space to successfully align a generative model with respect to a specified objective. We demonstrate the applicability of this methodology to targeted molecular generation by fine-tuning a GPT-based molecular generator toward a protein with FDA-approved small-molecule inhibitors, c-Abl kinase. Remarkably, the model learns to generate molecules similar to the inhibitors without prior knowledge of their existence, and even reproduces two of them exactly. We also show that the methodology is effective for a protein without any commercially available small-molecule inhibitors, the HNH domain of the CRISPR-associated protein 9 (Cas9) enzyme. We believe that the inherent generality of this method ensures that it will remain applicable as the exciting field of in silico molecular generation evolves. To facilitate implementation and reproducibility, we have made all of our software available through the open-source ChemSpaceAL Python package.
Collapse
|
15
|
Sauer S, Matter H, Hessler G, Grebner C. Integrating Reaction Schemes, Reagent Databases, and Virtual Libraries into Fragment-Based Design by Reinforcement Learning. J Chem Inf Model 2023; 63:5709-5726. [PMID: 37668352 DOI: 10.1021/acs.jcim.3c00735] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/06/2023]
Abstract
Lead optimization supported by artificial intelligence (AI)-based generative models has become increasingly important in drug design. Success factors are reagent availability, novelty, and the optimization of multiple properties. Directed fragment-replacement is particularly attractive, as it mimics medicinal chemistry tactics. Here, we present variations of fragment-based reinforcement learning using an actor-critic model. Novel features include freezing fragments and using reagents as the fragment source. Splitting molecules according to reaction schemes improves synthesizability, while tuning network output probabilities allows us to balance novelty versus diversity. Combining fragment-based optimization with virtual library encodings allows the exploration of large chemical spaces with synthesizable ideas. Collectively, these enhancements influence design toward high-quality molecules with favorable profiles. A validation study using 15 pharmaceutically relevant targets reveals that novel structures are obtained for most cases, which are identical or related to independent validation sets for each target. Hence, these modifications significantly increase the value of fragment-based reinforcement learning for drug design. The code is available on GitHub: https://github.com/Sanofi-Public/IDD-papers-fragrl.
Collapse
Affiliation(s)
- Susanne Sauer
- Synthetic Molecular Design, Integrated Drug Discovery, Sanofi-Aventis Deutschland GmbH, 65926 Frankfurt am Main, Germany
| | - Hans Matter
- Synthetic Molecular Design, Integrated Drug Discovery, Sanofi-Aventis Deutschland GmbH, 65926 Frankfurt am Main, Germany
| | - Gerhard Hessler
- Synthetic Molecular Design, Integrated Drug Discovery, Sanofi-Aventis Deutschland GmbH, 65926 Frankfurt am Main, Germany
| | - Christoph Grebner
- Synthetic Molecular Design, Integrated Drug Discovery, Sanofi-Aventis Deutschland GmbH, 65926 Frankfurt am Main, Germany
| |
Collapse
|
16
|
Dost K, Pullar-Strecker Z, Brydon L, Zhang K, Hafner J, Riddle PJ, Wicker JS. Combatting over-specialization bias in growing chemical databases. J Cheminform 2023; 15:53. [PMID: 37208694 DOI: 10.1186/s13321-023-00716-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2022] [Accepted: 03/25/2023] [Indexed: 05/21/2023] Open
Abstract
BACKGROUND Predicting in advance the behavior of new chemical compounds can support the design process of new products by directing the research toward the most promising candidates and ruling out others. Such predictive models can be data-driven using Machine Learning or based on researchers' experience and depend on the collection of past results. In either case: models (or researchers) can only make reliable assumptions about compounds that are similar to what they have seen before. Therefore, consequent usage of these predictive models shapes the dataset and causes a continuous specialization shrinking the applicability domain of all trained models on this dataset in the future, and increasingly harming model-based exploration of the space. PROPOSED SOLUTION In this paper, we propose CANCELS (CounterActiNg Compound spEciaLization biaS), a technique that helps to break the dataset specialization spiral. Aiming for a smooth distribution of the compounds in the dataset, we identify areas in the space that fall short and suggest additional experiments that help bridge the gap. Thereby, we generally improve the dataset quality in an entirely unsupervised manner and create awareness of potential flaws in the data. CANCELS does not aim to cover the entire compound space and hence retains a desirable degree of specialization to a specified research domain. RESULTS An extensive set of experiments on the use-case of biodegradation pathway prediction not only reveals that the bias spiral can indeed be observed but also that CANCELS produces meaningful results. Additionally, we demonstrate that mitigating the observed bias is crucial as it cannot only intervene with the continuous specialization process, but also significantly improves a predictor's performance while reducing the number of required experiments. Overall, we believe that CANCELS can support researchers in their experimentation process to not only better understand their data and potential flaws, but also to grow the dataset in a sustainable way. All code is available under github.com/KatDost/Cancels .
Collapse
Affiliation(s)
- Katharina Dost
- School of Computer Science, University of Auckland, 38 Princes Street, 1010, Auckland, New Zealand.
- enviPath UG & Co. KG, In den Graswiesen 13, 55437, Ockenheim, Germany.
| | - Zac Pullar-Strecker
- School of Computer Science, University of Auckland, 38 Princes Street, 1010, Auckland, New Zealand
| | - Liam Brydon
- School of Computer Science, University of Auckland, 38 Princes Street, 1010, Auckland, New Zealand
| | - Kunyang Zhang
- Eawag-Swiss Federal Institute of Aquatic Science and Technology, Überlandstrasse 133, 8600, Dübendorf, Switzerland
| | - Jasmin Hafner
- Eawag-Swiss Federal Institute of Aquatic Science and Technology, Überlandstrasse 133, 8600, Dübendorf, Switzerland
| | - Patricia J Riddle
- School of Computer Science, University of Auckland, 38 Princes Street, 1010, Auckland, New Zealand
| | - Jörg S Wicker
- School of Computer Science, University of Auckland, 38 Princes Street, 1010, Auckland, New Zealand
- enviPath UG & Co. KG, In den Graswiesen 13, 55437, Ockenheim, Germany
| |
Collapse
|
17
|
Andrianov AM, Shuldau MA, Furs KV, Yushkevich AM, Tuzikov AV. AI-Driven De Novo Design and Molecular Modeling for Discovery of Small-Molecule Compounds as Potential Drug Candidates Targeting SARS-CoV-2 Main Protease. Int J Mol Sci 2023; 24:ijms24098083. [PMID: 37175788 PMCID: PMC10178971 DOI: 10.3390/ijms24098083] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2023] [Revised: 04/21/2023] [Accepted: 04/26/2023] [Indexed: 05/15/2023] Open
Abstract
Over the past three years, significant progress has been made in the development of novel promising drug candidates against COVID-19. However, SARS-CoV-2 mutations resulting in the emergence of new viral strains that can be resistant to the drugs used currently in the clinic necessitate the development of novel potent and broad therapeutic agents targeting different vulnerable spots of the viral proteins. In this study, two deep learning generative models were developed and used in combination with molecular modeling tools for de novo design of small molecule compounds that can inhibit the catalytic activity of SARS-CoV-2 main protease (Mpro), an enzyme critically important for mediating viral replication and transcription. As a result, the seven best scoring compounds that exhibited low values of binding free energy comparable with those calculated for two potent inhibitors of Mpro, via the same computational protocol, were selected as the most probable inhibitors of the enzyme catalytic site. In light of the data obtained, the identified compounds are assumed to present promising scaffolds for the development of new potent and broad-spectrum drugs inhibiting SARS-CoV-2 Mpro, an attractive therapeutic target for anti-COVID-19 agents.
Collapse
Affiliation(s)
- Alexander M Andrianov
- Institute of Bioorganic Chemistry, National Academy of Sciences of Belarus, 220141 Minsk, Belarus
| | - Mikita A Shuldau
- United Institute of Informatics Problems, National Academy of Sciences of Belarus, 220012 Minsk, Belarus
| | - Konstantin V Furs
- United Institute of Informatics Problems, National Academy of Sciences of Belarus, 220012 Minsk, Belarus
| | - Artsemi M Yushkevich
- United Institute of Informatics Problems, National Academy of Sciences of Belarus, 220012 Minsk, Belarus
| | - Alexander V Tuzikov
- United Institute of Informatics Problems, National Academy of Sciences of Belarus, 220012 Minsk, Belarus
| |
Collapse
|
18
|
Johnston RC, Yao K, Kaplan Z, Chelliah M, Leswing K, Seekins S, Watts S, Calkins D, Chief Elk J, Jerome SV, Repasky MP, Shelley JC. Epik: p Ka and Protonation State Prediction through Machine Learning. J Chem Theory Comput 2023; 19:2380-2388. [PMID: 37023332 DOI: 10.1021/acs.jctc.3c00044] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/08/2023]
Abstract
Epik version 7 is a software program that uses machine learning for predicting the pKa values and protonation state distribution of complex, druglike molecules. Using an ensemble of atomic graph convolutional neural networks (GCNNs) trained on over 42,000 pKa values across broad chemical space from both experimental and computed origins, the model predicts pKa values with 0.42 and 0.72 pKa unit median absolute and root mean square errors, respectively, across seven test sets. Epik version 7 also generates protonation states and recovers 95% of the most populated protonation states compared to previous versions. Requiring on average only 47 ms per ligand, Epik version 7 is rapid and accurate enough to evaluate protonation states for crucial molecules and prepare ultra-large libraries of compounds to explore vast regions of chemical space. The simplicity and time required for the training allow for the generation of highly accurate models customized to a program's specific chemistry.
Collapse
Affiliation(s)
- Ryne C Johnston
- Schrödinger, Inc., 101 SW Main Street, Suite 1300, Portland, Oregon 97204, United States
| | - Kun Yao
- Schrödinger, Inc., 1540 Broadway Street, 24th Floor, New York, New York 10036, United States
| | - Zachary Kaplan
- Schrödinger, Inc., 1540 Broadway Street, 24th Floor, New York, New York 10036, United States
| | - Monica Chelliah
- Schrödinger, Inc., 1540 Broadway Street, 24th Floor, New York, New York 10036, United States
| | - Karl Leswing
- Schrödinger, Inc., 1540 Broadway Street, 24th Floor, New York, New York 10036, United States
| | - Sean Seekins
- Schrödinger, Inc., 101 SW Main Street, Suite 1300, Portland, Oregon 97204, United States
| | - Shawn Watts
- Schrödinger, Inc., 101 SW Main Street, Suite 1300, Portland, Oregon 97204, United States
| | - David Calkins
- Schrödinger, Inc., 101 SW Main Street, Suite 1300, Portland, Oregon 97204, United States
| | - Jackson Chief Elk
- Schrödinger, Inc., 101 SW Main Street, Suite 1300, Portland, Oregon 97204, United States
| | - Steven V Jerome
- Schrödinger, Inc., 9171 Towne Centre Drive, San Diego, California 92122, United States
| | - Matthew P Repasky
- Schrödinger, Inc., 101 SW Main Street, Suite 1300, Portland, Oregon 97204, United States
| | - John C Shelley
- Schrödinger, Inc., 101 SW Main Street, Suite 1300, Portland, Oregon 97204, United States
| |
Collapse
|
19
|
Mokaya M, Imrie F, van Hoorn WP, Kalisz A, Bradley AR, Deane CM. Testing the limits of SMILES-based de novo molecular generation with curriculum and deep reinforcement learning. NAT MACH INTELL 2023. [DOI: 10.1038/s42256-023-00636-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/05/2023]
|
20
|
Molecule generation toward target protein (SARS-CoV-2) using reinforcement learning-based graph neural network via knowledge graph. NETWORK MODELING AND ANALYSIS IN HEALTH INFORMATICS AND BIOINFORMATICS 2023; 12:13. [PMID: 36627927 PMCID: PMC9817447 DOI: 10.1007/s13721-023-00409-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 06/27/2022] [Revised: 11/23/2022] [Accepted: 12/31/2022] [Indexed: 01/07/2023]
Abstract
AI-driven approaches are widely used in drug discovery, where candidate molecules are generated and tested on a target protein for binding affinity prediction. However, generating new compounds with desirable molecular properties such as Quantitative Estimate of Drug-likeness (QED) and Dopamine Receptor D2 activity (DRD2) while adhering to distinct chemical laws is challenging. To address these challenges, we proposed a graph-based deep learning framework to generate potential therapeutic drugs targeting the SARS-CoV-2 protein. Our proposed framework consists of two modules: a novel reinforcement learning (RL)-based graph generative module with knowledge graph (KG) and a graph early fusion approach (GEFA) for binding affinity prediction. The first module uses a gated graph neural network (GGNN) model under the RL environment for generating novel molecular compounds with desired properties and a custom-made KG for molecule screening. The second module uses GEFA to predict binding affinity scores between the generated compounds and target proteins. Experiments show how fine-tuning the GGNN model under the RL environment enhances the molecules with desired properties to generate 100 % valid and 100 % unique compounds using different scoring functions. Additionally, KG-based screening reduces the search space of generated candidate molecules by 96.64 % while retaining 95.38 % of promising binding molecules against SARS-CoV-2 protein, i.e., 3C-like protease (3CLpro). We achieved a binding affinity score of 8.185 from the top rank of generated compound. In addition, we compared top-ranked generated compounds to Indinavir on different parameters, including drug-likeness and medicinal chemistry, for qualitative analysis from a drug development perspective. Supplementary Information The online version contains supplementary material available at 10.1007/s13721-023-00409-2.
Collapse
|
21
|
Zhang Y, Jiang Q, Li L, Li Z, Xu Z, Chen Y, Sun Y, Liu C, Mao Z, Chen F, Li H, Cao Y, Pian C. Predicting the structure of unexplored novel fentanyl analogues by deep learning model. Brief Bioinform 2022; 23:6741166. [PMID: 36184256 DOI: 10.1093/bib/bbac418] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2022] [Revised: 08/21/2022] [Accepted: 08/30/2022] [Indexed: 12/14/2022] Open
Abstract
Fentanyl and its analogues are psychoactive substances and the concern of fentanyl abuse has been existed in decades. Because the structure of fentanyl is easy to be modified, criminals may synthesize new fentanyl analogues to avoid supervision. The drug supervision is based on the structure matching to the database and too few kinds of fentanyl analogues are included in the database, so it is necessary to find out more potential fentanyl analogues and expand the sample space of fentanyl analogues. In this study, we introduced two deep generative models (SeqGAN and MolGPT) to generate potential fentanyl analogues, and a total of 11 041 valid molecules were obtained. The results showed that not only can we generate molecules with similar property distribution of original data, but the generated molecules also contain potential fentanyl analogues that are not pretty similar to any of original data. Ten molecules based on the rules of fentanyl analogues were selected for NMR, MS and IR validation. The results indicated that these molecules are all unreported fentanyl analogues. Furthermore, this study is the first to apply the deep learning to the generation of fentanyl analogues, greatly expands the exploring space of fentanyl analogues and provides help for the supervision of fentanyl.
Collapse
Affiliation(s)
| | | | | | - Zutan Li
- Bioinformatics Doctoral Student at Nanjing Agricultural University, China
| | - Zhihui Xu
- Researcher in Simcere Diagnostics Co., Ltd, China
| | - Yuanyuan Chen
- College of Sciences at Nanjing Agricultural University, China
| | - Yang Sun
- Nanjing Medical University, China
| | - Cheng Liu
- Department of Forensic Medicine, College of Basic Medical Science at Nanjing Medical University, China
| | - Zhengsheng Mao
- Forensic Science Department at Nanjing Medical University, China
| | | | - Hualan Li
- Bioinformatics Master Student at Nanjing Agricultural University, China
| | - Yue Cao
- Department of Forensic Medicine, Nanjing Medical University, Nanjing, Jiangsu, China
| | - Cong Pian
- College of Sciences, Nanjing Agricultural University, Nanjing, JiangsuChina
| |
Collapse
|
22
|
Medina‐Franco JL, Chávez‐Hernández AL, López‐López E, Saldívar‐González FI. Chemical Multiverse: An Expanded View of Chemical Space. Mol Inform 2022; 41:e2200116. [PMID: 35916110 PMCID: PMC9787733 DOI: 10.1002/minf.202200116] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2022] [Accepted: 08/01/2022] [Indexed: 12/30/2022]
Abstract
Technological advances and practical applications of the chemical space concept in drug discovery, natural product research, and other research areas have attracted the scientific community's attention. The large- and ultra-large chemical spaces are associated with the significant increase in the number of compounds that can potentially be made and exist and the increasing number of experimental and calculated descriptors, that are emerging that encode the molecular structure and/or property aspects of the molecules. Due to the importance and continued evolution of compound libraries, herein, we discuss definitions proposed in the literature for chemical space and emphasize the convenience, discussed in the literature to use complementary descriptors to obtain a comprehensive view of the chemical space of compound data sets. In this regard, we introduce the term chemical multiverse to refer to the comprehensive analysis of compound data sets through several chemical spaces, each defined by a different set of chemical representations. The chemical multiverse is contrasted with a related idea: consensus chemical space.
Collapse
Affiliation(s)
- José L. Medina‐Franco
- DIFACQUIM research group, Department of Pharmacy, School of ChemistryNational Autonomous University of MexicoMexico City04510Mexico
| | - Ana L. Chávez‐Hernández
- DIFACQUIM research group, Department of Pharmacy, School of ChemistryNational Autonomous University of MexicoMexico City04510Mexico
| | - Edgar López‐López
- Department of PharmacologyCenter for Research and Advanced Studies of the National Polytechnic Institute (CINVESTAV)Mexico City07360Mexico
| | - Fernanda I. Saldívar‐González
- DIFACQUIM research group, Department of Pharmacy, School of ChemistryNational Autonomous University of MexicoMexico City04510Mexico
| |
Collapse
|
23
|
Sauer S, Matter H, Hessler G, Grebner C. Optimizing interactions to protein binding sites by integrating docking-scoring strategies into generative AI methods. Front Chem 2022; 10:1012507. [PMID: 36339033 PMCID: PMC9629386 DOI: 10.3389/fchem.2022.1012507] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2022] [Accepted: 09/20/2022] [Indexed: 11/14/2022] Open
Abstract
The identification and optimization of promising lead molecules is essential for drug discovery. Recently, artificial intelligence (AI) based generative methods provided complementary approaches for generating molecules under specific design constraints of relevance in drug design. The goal of our study is to incorporate protein 3D information directly into generative design by flexible docking plus an adapted protein-ligand scoring function, thereby moving towards automated structure-based design. First, the protein-ligand scoring function RFXscore integrating individual scoring terms, ligand descriptors, and combined terms was derived using the PDBbind database and internal data. Next, design results for different workflows are compared to solely ligand-based reward schemes. Our newly proposed, optimal workflow for structure-based generative design is shown to produce promising results, especially for those exploration scenarios, where diverse structures fitting to a protein binding site are requested. Best results are obtained using docking followed by RFXscore, while, depending on the exact application scenario, it was also found useful to combine this approach with other metrics that bias structure generation into "drug-like" chemical space, such as target-activity machine learning models, respectively.
Collapse
Affiliation(s)
| | | | | | - Christoph Grebner
- Synthetic Molecular Design, Integrated Drug Discovery, Sanofi, Frankfurt, Germany
| |
Collapse
|
24
|
Abdel-Aty H, Gould IR. Large-Scale Distributed Training of Transformers for Chemical Fingerprinting. J Chem Inf Model 2022; 62:4852-4862. [PMID: 36195574 PMCID: PMC9597661 DOI: 10.1021/acs.jcim.2c00715] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
![]()
Transformer models
have become a popular choice for various
machine
learning tasks due to their often outstanding performance. Recently,
transformers have been used in chemistry for classifying reactions,
reaction prediction, physiochemical property prediction, and more.
These models require huge amounts of data and localized compute to
train effectively. In this work, we demonstrate that these models
can successfully be trained for chemical problems in a distributed
manner across many computers—a more common scenario for chemistry
institutions. We introduce MFBERT: Molecular Fingerprints through
Bidirectional Encoder Representations from Transformers. We use distributed
computing to pre-train a transformer model on one of the largest aggregate
datasets in chemical literature and achieve state-of-the-art scores
on a virtual screening benchmark for molecular fingerprints. We then
fine-tune our model on smaller, more specific datasets to generate
more targeted fingerprints and assess their quality. We utilize a
SentencePiece tokenization model, where the whole procedure from raw
molecular representation to molecular fingerprints becomes data-driven,
with no explicit tokenization rules.
Collapse
Affiliation(s)
- Hisham Abdel-Aty
- Department of Chemistry and Institute of Chemical Biology, Imperial College London, Molecular Sciences Research Hub, Shepherd's Bush, LondonW12 0BZ, UK
| | - Ian R Gould
- Department of Chemistry and Institute of Chemical Biology, Imperial College London, Molecular Sciences Research Hub, Shepherd's Bush, LondonW12 0BZ, UK
| |
Collapse
|
25
|
Li C, Wang C, Sun M, Zeng Y, Yuan Y, Gou Q, Wang G, Guo Y, Pu X. Correlated RNN Framework to Quickly Generate Molecules with Desired Properties for Energetic Materials in the Low Data Regime. J Chem Inf Model 2022; 62:4873-4887. [PMID: 35998331 DOI: 10.1021/acs.jcim.2c00997] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/10/2023]
Abstract
Motivated by the challenging of deep learning on the low data regime and the urgent demand for intelligent design on highly energetic materials, we explore a correlated deep learning framework, which consists of three recurrent neural networks (RNNs) correlated by the transfer learning strategy, to efficiently generate new energetic molecules with a high detonation velocity in the case of very limited data available. To avoid the dependence on the external big data set, data augmentation by fragment shuffling of 303 energetic compounds is utilized to produce 500,000 molecules to pretrain RNN, through which the model can learn sufficient structure knowledge. Then the pretrained RNN is fine-tuned by focusing on the 303 energetic compounds to generate 7153 molecules similar to the energetic compounds. In order to more reliably screen the molecules with a high detonation velocity, the SMILE enumeration augmentation coupled with the pretrained knowledge is utilized to build an RNN-based prediction model, through which R2 is boosted from 0.4446 to 0.9572. The comparable performance with the transfer learning strategy based on an existing big database (ChEMBL) to produce the energetic molecules and drug-like ones further supports the effectiveness and generality of our strategy in the low data regime. High-precision quantum mechanics calculations further confirm that 35 new molecules present a higher detonation velocity and lower synthetic accessibility than the classic explosive RDX, along with good thermal stability. In particular, three new molecules are comparable to caged CL-20 in the detonation velocity. All the source codes and the data set are freely available at https://github.com/wangchenghuidream/RNNMGM.
Collapse
Affiliation(s)
- Chuan Li
- College of Computer Science, Sichuan University, Chengdu 610064, China
| | - Chenghui Wang
- College of Computer Science, Sichuan University, Chengdu 610064, China
| | - Ming Sun
- College of Chemistry, Sichuan University, Chengdu 610064, China
| | - Yan Zeng
- College of Computer Science, Sichuan University, Chengdu 610064, China
| | - Yuan Yuan
- College of Management, Southwest University for Nationalities, Chengdu 610041, China
| | - Qiaolin Gou
- College of Chemistry, Sichuan University, Chengdu 610064, China
| | - Guangchuan Wang
- College of Computer Science, Sichuan University, Chengdu 610064, China
| | - Yanzhi Guo
- College of Chemistry, Sichuan University, Chengdu 610064, China
| | - Xuemei Pu
- College of Chemistry, Sichuan University, Chengdu 610064, China
| |
Collapse
|
26
|
Zhang J, Chen H. De Novo Molecule Design Using Molecular Generative Models Constrained by Ligand-Protein Interactions. J Chem Inf Model 2022; 62:3291-3306. [PMID: 35793555 DOI: 10.1021/acs.jcim.2c00177] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
In recent years, molecular deep generative models have attracted much attention for its application in de novo drug design. The data-driven molecular deep generative model approximates the high dimensional distribution of the chemical space through learning from a large number of molecular structural data. So far, most of the molecular generative models rely on purely 2D ligand information in structure generation. Here, we propose a novel molecular deep generative model which adopts a recurrent neural network architecture coupled with a ligand-protein interaction fingerprint as constraints. The fingerprint was constructed on ligand docking poses and represents the 3D binding mode of ligands in the protein pocket. In the current work, generative models constrained with interaction fingerprints were trained and compared with normal RNN models. It has been shown that models trained with constraints of ligand-protein interaction fingerprint have a clear tendency to generating compounds maintaining similar binding modes. Our results demonstrate the potential application of the interaction fingerprint-constrained generative model for the targeted molecule generation and guided exploration on the drug-like chemical space.
Collapse
Affiliation(s)
- Jie Zhang
- Guangdong Provincial Key Laboratory of Laboratory Animals, Guangdong Laboratory Animals Monitoring Institute, Guangzhou 510663, P. R. China.,State Key Laboratory of Respiratory Disease, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou 510530, P. R. China.,Bioland Laboratory (Guangzhou Regenerative Medicine and Health─Guangdong Laboratory), Guangzhou 510530, P. R. China
| | - Hongming Chen
- Bioland Laboratory (Guangzhou Regenerative Medicine and Health─Guangdong Laboratory), Guangzhou 510530, P. R. China.,Guangzhou International Bio Island, Guangzhou Laboratory, No. 9 XinDaoHuanBei Road, Guangzhou 510005, China
| |
Collapse
|
27
|
Yu G. PageRank Topic Finder based Algorithm for Multimedia Resources in Preschool Education. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2022:4173243. [PMID: 35909855 PMCID: PMC9329012 DOI: 10.1155/2022/4173243] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/24/2022] [Revised: 05/07/2022] [Accepted: 06/29/2022] [Indexed: 11/17/2022]
Abstract
In traditional preschool education, it is time-consuming and laborious to acquire effective materials by using artificial search method. However, with the development of Internet technology, a variety of preschool education institutions or individuals have released their own preschool education resources on the Internet. At present, multimedia technology has been popularized in many schools, and it plays a more and more significant role in teaching. In preschool education teaching, teachers use multimedia resources not only conducive to improve children's learning efficiency but also make the teaching quality from the whole to a higher level. However, some kindergarten teachers rely too much on multimedia in teaching and do not effectively combine it with traditional teaching methods. Sometimes they even use video and related multimedia teaching resources throughout the class, which makes preschool children lack knowledge and knowledge. Therefore, this paper designs a multimedia resource retrieval system based on the theme of preschool education, which mainly achieves the extraction of multimedia resources from web pages and the analysis of multimedia-related text information. In order to design a high-performance topic search algorithm, we must first carry out page parsing, Chinese and English word segmentation, and other page preprocessing. The research results show that it is found that the text-based automatic classification of multimedia resources in preschool education and the filtering of multimedia noise in web pages can provide relevant personnel in the field of preschool education with the retrieval service of multimedia resources.
Collapse
Affiliation(s)
- Guiping Yu
- Normal College, Eastern Liaoning University, Dandong, Liaoning 118000, China
| |
Collapse
|
28
|
Saldívar-González FI, Medina-Franco JL. Approaches for enhancing the analysis of chemical space for drug discovery. Expert Opin Drug Discov 2022; 17:789-798. [PMID: 35640229 DOI: 10.1080/17460441.2022.2084608] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022]
Abstract
INTRODUCTION Chemical space is a powerful, general, and practical conceptual framework in drug discovery and other areas in chemistry that addresses the diversity of molecules and it has various applications. Moreover, chemical space is a cornerstone of chemoinformatics as a scientific discipline. In response to the increase in the set of chemical compounds in databases, generators of chemical structures, and tools to calculate molecular descriptors, novel approaches to generate visual representations of chemical space in low dimensions are emerging and evolving. Such approaches include a wide range of commercial and free applications, software, and open-source methods. AREAS COVERED The current state of chemical space in drug design and discovery is reviewed. The topics discussed herein include advances for efficient navigation in chemical space, the use of this concept in assessing the diversity of different data sets, exploring structure-property/activity relationships for one or multiple endpoints, and compound library design. Recent advances in methodologies for generating visual representations of chemical space have been highlighted, thereby emphasizing open-source methods. EXPERT OPINION Quantitative and qualitative generation and analysis of chemical space require novel approaches for handling the increasing number of molecules and their information available in chemical databases (including emerging ultra-large libraries). In addition, it is of utmost importance to note that chemical space is a conceptual framework that goes beyond visual representation in low dimensions. However, the graphical representation of chemical space has several practical applications in drug discovery and beyond.
Collapse
Affiliation(s)
- Fernanda I Saldívar-González
- DIFACQUIM Research Group, Department of Pharmacy, School of Chemistry, Universidad Nacional Autónoma de México, Avenida Universidad 3000, Mexico City 04510, Mexico
| | - José L Medina-Franco
- DIFACQUIM Research Group, Department of Pharmacy, School of Chemistry, Universidad Nacional Autónoma de México, Avenida Universidad 3000, Mexico City 04510, Mexico
| |
Collapse
|
29
|
Bender A, Schneider N, Segler M, Patrick Walters W, Engkvist O, Rodrigues T. Evaluation guidelines for machine learning tools in the chemical sciences. Nat Rev Chem 2022; 6:428-442. [PMID: 37117429 DOI: 10.1038/s41570-022-00391-9] [Citation(s) in RCA: 29] [Impact Index Per Article: 14.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 04/13/2022] [Indexed: 02/07/2023]
Abstract
Machine learning (ML) promises to tackle the grand challenges in chemistry and speed up the generation, improvement and/or ordering of research hypotheses. Despite the overarching applicability of ML workflows, one usually finds diverse evaluation study designs. The current heterogeneity in evaluation techniques and metrics leads to difficulty in (or the impossibility of) comparing and assessing the relevance of new algorithms. Ultimately, this may delay the digitalization of chemistry at scale and confuse method developers, experimentalists, reviewers and journal editors. In this Perspective, we critically discuss a set of method development and evaluation guidelines for different types of ML-based publications, emphasizing supervised learning. We provide a diverse collection of examples from various authors and disciplines in chemistry. While taking into account varying accessibility across research groups, our recommendations focus on reporting completeness and standardizing comparisons between tools. We aim to further contribute to improved ML transparency and credibility by suggesting a checklist of retro-/prospective tests and dissecting their importance. We envisage that the wide adoption and continuous update of best practices will encourage an informed use of ML on real-world problems related to the chemical sciences.
Collapse
|
30
|
Warr WA, Nicklaus MC, Nicolaou CA, Rarey M. Exploration of Ultralarge Compound Collections for Drug Discovery. J Chem Inf Model 2022; 62:2021-2034. [PMID: 35421301 DOI: 10.1021/acs.jcim.2c00224] [Citation(s) in RCA: 46] [Impact Index Per Article: 23.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]
Abstract
Designing new medicines more cheaply and quickly is tightly linked to the quest of exploring chemical space more widely and efficiently. Chemical space is monumentally large, but recent advances in computer software and hardware have enabled researchers to navigate virtual chemical spaces containing billions of chemical structures. This review specifically concerns collections of many millions or even billions of enumerated chemical structures as well as even larger chemical spaces that are not fully enumerated. We present examples of chemical libraries and spaces and the means used to construct them, and we discuss new technologies for searching huge libraries and for searching combinatorially in chemical space. We also cover space navigation techniques and consider new approaches to de novo drug design and the impact of the "autonomous laboratory" on synthesis of designed compounds. Finally, we summarize some other challenges and opportunities for the future.
Collapse
Affiliation(s)
- Wendy A Warr
- Wendy Warr & Associates, 6 Berwick Court, Holmes Chapel, Crewe, Cheshire CW4 7HZ, United Kingdom
| | - Marc C Nicklaus
- NCI, NIH, CADD Group, NCI-Frederick, Frederick, Maryland 21702, United States
| | - Christos A Nicolaou
- Discovery Chemistry, Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
| | - Matthias Rarey
- Universität Hamburg, ZBH Center for Bioinformatics, 20146 Hamburg, Germany
| |
Collapse
|
31
|
Choudhury C, Arul Murugan N, Deva Priyakumar U. Structure-based drug repurposing: traditional and advanced AI/ML-aided methods. Drug Discov Today 2022; 27:1847-1861. [PMID: 35301148 PMCID: PMC8920090 DOI: 10.1016/j.drudis.2022.03.006] [Citation(s) in RCA: 32] [Impact Index Per Article: 16.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2021] [Revised: 02/16/2022] [Accepted: 03/10/2022] [Indexed: 02/08/2023]
Abstract
The current global health emergency in the form of the Coronavirus 2019 (COVID-19) pandemic has highlighted the need for fast, accurate, and efficient drug discovery pipelines. Traditional drug discovery projects relying on in vitro high-throughput screening (HTS) involve large investments and sophisticated experimental set-ups, affordable only to big biopharmaceutical companies. In this scenario, application of efficient state-of-the-art computational methods and modern artificial intelligence (AI)-based algorithms for rapid screening of repurposable chemical space [approved drugs and natural products (NPs) with proven pharmacokinetic profiles] to identify the initial leads is a powerful option to save resources and time. Structure-based drug repurposing is a popular in silico repurposing approach. In this review, we discuss traditional and modern AI-based computational methods and tools applied at various stages for structure-based drug discovery (SBDD) pipelines. Additionally, we highlight the role of generative models in generating molecules with scaffolds from repurposable chemical space. Teaser: This review highlights the importance of repurposable chemical space, and the contributions of conventional in silico approaches and modern machine-learning algorithms for rapid structure-based drug repurposing.
Collapse
Affiliation(s)
- Chinmayee Choudhury
- Department of Experimental Medicine and Biotechnology, Postgraduate Institute of Medical Education and Research, Sector-12, Chandigarh 160012, India
| | - N Arul Murugan
- Department of Computer Science, School of Electrical Engineering and Computer Sciences, KTH Royal Institute of Technology, S-100 44, Stockholm, Sweden; Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi 110020, India.
| | - U Deva Priyakumar
- Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad 500 032, India
| |
Collapse
|
32
|
Kalikadien AV, Pidko EA, Sinha V. ChemSpaX: exploration of chemical space by automated functionalization of molecular scaffold. DIGITAL DISCOVERY 2022; 1:8-25. [PMID: 35340336 PMCID: PMC8887922 DOI: 10.1039/d1dd00017a] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 10/02/2021] [Accepted: 12/23/2021] [Indexed: 12/19/2022]
Abstract
Exploration of the local chemical space of molecular scaffolds by post-functionalization (PF) is a promising route to discover novel molecules with desired structure and function. PF with rationally chosen substituents based on known electronic and steric properties is a commonly used experimental and computational strategy in screening, design and optimization of catalytic scaffolds. Automated generation of reasonably accurate geometric representations of post-functionalized molecular scaffolds is highly desirable for data-driven applications. However, automated PF of transition metal (TM) complexes remains challenging. In this work a Python-based workflow, ChemSpaX, that is aimed at automating the PF of a given molecular scaffold with special emphasis on TM complexes, is introduced. In three representative applications of ChemSpaX by comparing with DFT and DFT-B calculations, we show that the generated structures have a reasonable quality for use in computational screening applications. Furthermore, we show that ChemSpaX generated geometries can be used in machine learning applications to accurately predict DFT computed HOMO-LUMO gaps for transition metal complexes. ChemSpaX is open-source and aims to bolster and democratize the efforts of the scientific community towards data-driven chemical discovery.
Collapse
Affiliation(s)
- Adarsh V Kalikadien
- Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology Van der Maasweg 9 2629 HZ Delft The Netherlands
| | - Evgeny A Pidko
- Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology Van der Maasweg 9 2629 HZ Delft The Netherlands
| | - Vivek Sinha
- Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology Van der Maasweg 9 2629 HZ Delft The Netherlands
| |
Collapse
|
33
|
Abstract
INTRODUCTION The popularity and success of advanced AI methods like deep neural networks has led to novel ways for exploring chemical space. Their opaque nature poses challenges for model evaluation regarding novelty, uniqueness, and distribution of the chemical space covered. However, these methods also promise to be able to explore uncharted chemical space in novel ways that do not rely directly on structural similarity. AREAS COVERED This review provides an overview of popular deep learning methods for chemical space exploration. Crucial aspects like choice of molecular representation, training for focused chemical space exploration, and criteria for assessing and validating chemical space coverage are discussed. EXPERT OPINION Deep learning offers great potential for chemical space exploration beyond conventional fragment-based methods. Given the rarity of prospective applications and considering the difficulty in assessing representativeness and comprehensiveness of chemical space covered, developing criteria for assessing and validating generative models is of great significance. Latent space models like variational autoencoders are conceptually appealing for inverse QSAR/QSPR approaches as neighborhood relationships in latent space can be trained to reflect property similarities. Future research in understanding and interpreting generative models might lead to a better understanding of biologically relevant properties of molecules.
Collapse
Affiliation(s)
- Martin Vogt
- Department of Life Science Informatics, B-it, Limes Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich Wilhelms-Universität, Bonn, Germany
| |
Collapse
|
34
|
Grebner C, Matter H, Hessler G. Artificial Intelligence in Compound Design. Methods Mol Biol 2021; 2390:349-382. [PMID: 34731477 DOI: 10.1007/978-1-0716-1787-8_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/10/2023]
Abstract
Artificial intelligence has seen an incredibly fast development in recent years. Many novel technologies for property prediction of drug molecules as well as for the design of novel molecules were introduced by different research groups. These artificial intelligence-based design methods can be applied for suggesting novel chemical motifs in lead generation or scaffold hopping as well as for optimization of desired property profiles during lead optimization. In lead generation, broad sampling of the chemical space for identification of novel motifs is required, while in the lead optimization phase, a detailed exploration of the chemical neighborhood of a current lead series is advantageous. These different requirements for successful design outcomes render different combinations of artificial intelligence technologies useful. Overall, we observe that a combination of different approaches with tailored scoring and evaluation schemes appears beneficial for efficient artificial intelligence-based compound design.
Collapse
Affiliation(s)
- Christoph Grebner
- Sanofi-Aventis Deutschland GmbH, R&D, Integrated Drug Discovery, Frankfurt am Main, Germany
| | - Hans Matter
- Sanofi-Aventis Deutschland GmbH, R&D, Integrated Drug Discovery, Frankfurt am Main, Germany
| | - Gerhard Hessler
- Sanofi-Aventis Deutschland GmbH, R&D, Integrated Drug Discovery, Frankfurt am Main, Germany.
| |
Collapse
|
35
|
Shrivastava AD, Swainston N, Samanta S, Roberts I, Wright Muelas M, Kell DB. MassGenie: A Transformer-Based Deep Learning Method for Identifying Small Molecules from Their Mass Spectra. Biomolecules 2021; 11:1793. [PMID: 34944436 PMCID: PMC8699281 DOI: 10.3390/biom11121793] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2021] [Revised: 11/14/2021] [Accepted: 11/27/2021] [Indexed: 12/15/2022] Open
Abstract
The 'inverse problem' of mass spectrometric molecular identification ('given a mass spectrum, calculate/predict the 2D structure of the molecule whence it came') is largely unsolved, and is especially acute in metabolomics where many small molecules remain unidentified. This is largely because the number of experimentally available electrospray mass spectra of small molecules is quite limited. However, the forward problem ('calculate a small molecule's likely fragmentation and hence at least some of its mass spectrum from its structure alone') is much more tractable, because the strengths of different chemical bonds are roughly known. This kind of molecular identification problem may be cast as a language translation problem in which the source language is a list of high-resolution mass spectral peaks and the 'translation' a representation (for instance in SMILES) of the molecule. It is thus suitable for attack using the deep neural networks known as transformers. We here present MassGenie, a method that uses a transformer-based deep neural network, trained on ~6 million chemical structures with augmented SMILES encoding and their paired molecular fragments as generated in silico, explicitly including the protonated molecular ion. This architecture (containing some 400 million elements) is used to predict the structure of a molecule from the various fragments that may be expected to be observed when some of its bonds are broken. Despite being given essentially no detailed nor explicit rules about molecular fragmentation methods, isotope patterns, rearrangements, neutral losses, and the like, MassGenie learns the effective properties of the mass spectral fragment and valency space, and can generate candidate molecular structures that are very close or identical to those of the 'true' molecules. We also use VAE-Sim, a previously published variational autoencoder, to generate candidate molecules that are 'similar' to the top hit. In addition to using the 'top hits' directly, we can produce a rank order of these by 'round-tripping' candidate molecules and comparing them with the true molecules, where known. As a proof of principle, we confine ourselves to positive electrospray mass spectra from molecules with a molecular mass of 500Da or lower, including those in the last CASMI challenge (for which the results are known), getting 49/93 (53%) precisely correct. The transformer method, applied here for the first time to mass spectral interpretation, works extremely effectively both for mass spectra generated in silico and on experimentally obtained mass spectra from pure compounds. It seems to act as a Las Vegas algorithm, in that it either gives the correct answer or simply states that it cannot find one. The ability to create and to 'learn' millions of fragmentation patterns in silico, and therefrom generate candidate structures (that do not have to be in existing libraries) directly, thus opens up entirely the field of de novo small molecule structure prediction from experimental mass spectra.
Collapse
Affiliation(s)
- Aditya Divyakant Shrivastava
- Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, Faculty of Health and Life Sciences, University of Liverpool, Crown St, Liverpool L69 7ZB, UK; (A.D.S.); (N.S.); (S.S.); (I.R.); (M.W.M.)
- Department of Computer Science and Engineering, Nirma University, Ahmedabad 382481, India
| | - Neil Swainston
- Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, Faculty of Health and Life Sciences, University of Liverpool, Crown St, Liverpool L69 7ZB, UK; (A.D.S.); (N.S.); (S.S.); (I.R.); (M.W.M.)
- Mellizyme Biotechnology Ltd., Liverpool Science Park IC1, 131 Mount Pleasant, Liverpool L3 5TF, UK
| | - Soumitra Samanta
- Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, Faculty of Health and Life Sciences, University of Liverpool, Crown St, Liverpool L69 7ZB, UK; (A.D.S.); (N.S.); (S.S.); (I.R.); (M.W.M.)
| | - Ivayla Roberts
- Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, Faculty of Health and Life Sciences, University of Liverpool, Crown St, Liverpool L69 7ZB, UK; (A.D.S.); (N.S.); (S.S.); (I.R.); (M.W.M.)
| | - Marina Wright Muelas
- Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, Faculty of Health and Life Sciences, University of Liverpool, Crown St, Liverpool L69 7ZB, UK; (A.D.S.); (N.S.); (S.S.); (I.R.); (M.W.M.)
| | - Douglas B. Kell
- Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, Faculty of Health and Life Sciences, University of Liverpool, Crown St, Liverpool L69 7ZB, UK; (A.D.S.); (N.S.); (S.S.); (I.R.); (M.W.M.)
- Mellizyme Biotechnology Ltd., Liverpool Science Park IC1, 131 Mount Pleasant, Liverpool L3 5TF, UK
- Novo Nordisk Foundation Centre for Biosustainability, Technical University of Denmark, Building 220, Kemitorvet, 2800 Kongens Lyngby, Denmark
| |
Collapse
|
36
|
Sousa T, Correia J, Pereira V, Rocha M. Generative Deep Learning for Targeted Compound Design. J Chem Inf Model 2021; 61:5343-5361. [PMID: 34699719 DOI: 10.1021/acs.jcim.0c01496] [Citation(s) in RCA: 43] [Impact Index Per Article: 14.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/10/2023]
Abstract
In the past few years, de novo molecular design has increasingly been using generative models from the emergent field of Deep Learning, proposing novel compounds that are likely to possess desired properties or activities. De novo molecular design finds applications in different fields ranging from drug discovery and materials sciences to biotechnology. A panoply of deep generative models, including architectures as Recurrent Neural Networks, Autoencoders, and Generative Adversarial Networks, can be trained on existing data sets and provide for the generation of novel compounds. Typically, the new compounds follow the same underlying statistical distributions of properties exhibited on the training data set Additionally, different optimization strategies, including transfer learning, Bayesian optimization, reinforcement learning, and conditional generation, can direct the generation process toward desired aims, regarding their biological activities, synthesis processes or chemical features. Given the recent emergence of these technologies and their relevance, this work presents a systematic and critical review on deep generative models and related optimization methods for targeted compound design, and their applications.
Collapse
Affiliation(s)
- Tiago Sousa
- Centre of Biological Engineering, Campus Gualtar, University of Minho, 4710-057 Braga, Portugal
| | - João Correia
- Centre of Biological Engineering, Campus Gualtar, University of Minho, 4710-057 Braga, Portugal
| | - Vítor Pereira
- Centre of Biological Engineering, Campus Gualtar, University of Minho, 4710-057 Braga, Portugal
| | - Miguel Rocha
- Centre of Biological Engineering, Campus Gualtar, University of Minho, 4710-057 Braga, Portugal
| |
Collapse
|
37
|
Molecular generation by Fast Assembly of (Deep)SMILES fragments. J Cheminform 2021; 13:88. [PMID: 34775976 PMCID: PMC8591910 DOI: 10.1186/s13321-021-00566-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2021] [Accepted: 11/02/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In recent years, in silico molecular design is regaining interest. To generate on a computer molecules with optimized properties, scoring functions can be coupled with a molecular generator to design novel molecules with a desired property profile. RESULTS In this article, a simple method is described to generate only valid molecules at high frequency ([Formula: see text] molecule/s using a single CPU core), given a molecular training set. The proposed method generates diverse SMILES (or DeepSMILES) encoded molecules while also showing some propensity at training set distribution matching. When working with DeepSMILES, the method reaches peak performance ([Formula: see text] molecule/s) because it relies almost exclusively on string operations. The "Fast Assembly of SMILES Fragments" software is released as open-source at https://github.com/UnixJunkie/FASMIFRA . Experiments regarding speed, training set distribution matching, molecular diversity and benchmark against several other methods are also shown.
Collapse
|
38
|
Li Y, Pei J, Lai L. Structure-based de novo drug design using 3D deep generative models. Chem Sci 2021; 12:13664-13675. [PMID: 34760151 PMCID: PMC8549794 DOI: 10.1039/d1sc04444c] [Citation(s) in RCA: 57] [Impact Index Per Article: 19.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2021] [Accepted: 09/09/2021] [Indexed: 12/14/2022] Open
Abstract
Deep generative models are attracting much attention in the field of de novo molecule design. Compared to traditional methods, deep generative models can be trained in a fully data-driven way with little requirement for expert knowledge. Although many models have been developed to generate 1D and 2D molecular structures, 3D molecule generation is less explored, and the direct design of drug-like molecules inside target binding sites remains challenging. In this work, we introduce DeepLigBuilder, a novel deep learning-based method for de novo drug design that generates 3D molecular structures in the binding sites of target proteins. We first developed Ligand Neural Network (L-Net), a novel graph generative model for the end-to-end design of chemically and conformationally valid 3D molecules with high drug-likeness. Then, we combined L-Net with Monte Carlo tree search to perform structure-based de novo drug design tasks. In the case study of inhibitor design for the main protease of SARS-CoV-2, DeepLigBuilder suggested a list of drug-like compounds with novel chemical structures, high predicted affinity, and similar binding features to those of known inhibitors. The current version of L-Net was trained on drug-like compounds from ChEMBL, which could be easily extended to other molecular datasets with desired properties based on users' demands and applied in functional molecule generation. Merging deep generative models with atomic-level interaction evaluation, DeepLigBuilder provides a state-of-the-art model for structure-based de novo drug design and lead optimization. DeepLigBuilder, a novel deep generative model for structure-based de novo drug design, directly generates 3D structures of drug-like compounds in the target binding site.![]()
Collapse
Affiliation(s)
- Yibo Li
- Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University Beijing 100871 China
| | - Jianfeng Pei
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University Beijing 100871 China
| | - Luhua Lai
- Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University Beijing 100871 China .,Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University Beijing 100871 China .,BNLMS, College of Chemistry and Molecular Engineering, Peking University Beijing 100871 China
| |
Collapse
|
39
|
Tong X, Liu X, Tan X, Li X, Jiang J, Xiong Z, Xu T, Jiang H, Qiao N, Zheng M. Generative Models for De Novo Drug Design. J Med Chem 2021; 64:14011-14027. [PMID: 34533311 DOI: 10.1021/acs.jmedchem.1c00927] [Citation(s) in RCA: 55] [Impact Index Per Article: 18.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/10/2023]
Abstract
Artificial intelligence (AI) is booming. Among various AI approaches, generative models have received much attention in recent years. Inspired by these successes, researchers are now applying generative model techniques to de novo drug design, which has been considered as the "holy grail" of drug discovery. In this Perspective, we first focus on describing models such as recurrent neural network, autoencoder, generative adversarial network, transformer, and hybrid models with reinforcement learning. Next, we summarize the applications of generative models to drug design, including generating various compounds to expand the compound library and designing compounds with specific properties, and we also list a few publicly available molecular design tools based on generative models which can be used directly to generate molecules. In addition, we also introduce current benchmarks and metrics frequently used for generative models. Finally, we discuss the challenges and prospects of using generative models to aid drug design.
Collapse
Affiliation(s)
- Xiaochu Tong
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China.,University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
| | - Xiaohong Liu
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China.,University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
| | - Xiaoqin Tan
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China.,University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
| | - Xutong Li
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China.,University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
| | - Jiaxin Jiang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
| | - Zhaoping Xiong
- Laboratory of Health Intelligence, Huawei Technologies Co., Ltd, Shenzhen 518100, China
| | | | - Hualiang Jiang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China.,University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
| | - Nan Qiao
- Laboratory of Health Intelligence, Huawei Technologies Co., Ltd, Shenzhen 518100, China
| | - Mingyue Zheng
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China.,University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
| |
Collapse
|
40
|
Kell DB. The Transporter-Mediated Cellular Uptake and Efflux of Pharmaceutical Drugs and Biotechnology Products: How and Why Phospholipid Bilayer Transport Is Negligible in Real Biomembranes. Molecules 2021; 26:5629. [PMID: 34577099 PMCID: PMC8470029 DOI: 10.3390/molecules26185629] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2021] [Revised: 09/03/2021] [Accepted: 09/14/2021] [Indexed: 12/12/2022] Open
Abstract
Over the years, my colleagues and I have come to realise that the likelihood of pharmaceutical drugs being able to diffuse through whatever unhindered phospholipid bilayer may exist in intact biological membranes in vivo is vanishingly low. This is because (i) most real biomembranes are mostly protein, not lipid, (ii) unlike purely lipid bilayers that can form transient aqueous channels, the high concentrations of proteins serve to stop such activity, (iii) natural evolution long ago selected against transport methods that just let any undesirable products enter a cell, (iv) transporters have now been identified for all kinds of molecules (even water) that were once thought not to require them, (v) many experiments show a massive variation in the uptake of drugs between different cells, tissues, and organisms, that cannot be explained if lipid bilayer transport is significant or if efflux were the only differentiator, and (vi) many experiments that manipulate the expression level of individual transporters as an independent variable demonstrate their role in drug and nutrient uptake (including in cytotoxicity or adverse drug reactions). This makes such transporters valuable both as a means of targeting drugs (not least anti-infectives) to selected cells or tissues and also as drug targets. The same considerations apply to the exploitation of substrate uptake and product efflux transporters in biotechnology. We are also beginning to recognise that transporters are more promiscuous, and antiporter activity is much more widespread, than had been realised, and that such processes are adaptive (i.e., were selected by natural evolution). The purpose of the present review is to summarise the above, and to rehearse and update readers on recent developments. These developments lead us to retain and indeed to strengthen our contention that for transmembrane pharmaceutical drug transport "phospholipid bilayer transport is negligible".
Collapse
Affiliation(s)
- Douglas B. Kell
- Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Crown St, Liverpool L69 7ZB, UK;
- Novo Nordisk Foundation Centre for Biosustainability, Technical University of Denmark, Building 220, Kemitorvet, 2800 Kgs Lyngby, Denmark
- Mellizyme Biotechnology Ltd., IC1, Liverpool Science Park, Mount Pleasant, Liverpool L3 5TF, UK
| |
Collapse
|
41
|
Fialková V, Zhao J, Papadopoulos K, Engkvist O, Bjerrum EJ, Kogej T, Patronov A. LibINVENT: Reaction-based Generative Scaffold Decoration for in Silico Library Design. J Chem Inf Model 2021; 62:2046-2063. [PMID: 34460269 DOI: 10.1021/acs.jcim.1c00469] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Because of the strong relationship between the desired molecular activity and its structural core, the screening of focused, core-sharing chemical libraries is a key step in lead optimization. Despite the plethora of current research focused on in silico methods for molecule generation, to our knowledge, no tool capable of designing such libraries has been proposed. In this work, we present a novel tool for de novo drug design called LibINVENT. It is capable of rapidly proposing chemical libraries of compounds sharing the same core while maximizing a range of desirable properties. To further help the process of designing focused libraries, the user can list specific chemical reactions that can be used for the library creation. LibINVENT is therefore a flexible tool for generating virtual chemical libraries for lead optimization in a broad range of scenarios. Additionally, the shared core ensures that the compounds in the library are similar, possess desirable properties, and can also be synthesized under the same or similar conditions. The LibINVENT code is freely available in our public repository at https://github.com/MolecularAI/Lib-INVENT. The code necessary for data preprocessing is further available at: https://github.com/MolecularAI/Lib-INVENT-dataset.
Collapse
Affiliation(s)
- Vendy Fialková
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg 43183, Sweden
| | - Jiaxi Zhao
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg 43183, Sweden.,Department of Pharmaceutical Biosciences, Uppsala University, Uppsala 75237, Sweden
| | - Kostas Papadopoulos
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg 43183, Sweden
| | - Ola Engkvist
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg 43183, Sweden.,Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg 41756, Sweden
| | | | - Thierry Kogej
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg 43183, Sweden
| | - Atanas Patronov
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg 43183, Sweden
| |
Collapse
|
42
|
Wang F, Feng X, Guo X, Xu L, Xie L, Chang S. Improving de novo Molecule Generation by Embedding LSTM and Attention Mechanism in CycleGAN. Front Genet 2021; 12:709500. [PMID: 34422013 PMCID: PMC8376287 DOI: 10.3389/fgene.2021.709500] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2021] [Accepted: 07/19/2021] [Indexed: 11/13/2022] Open
Abstract
The application of deep learning in the field of drug discovery brings the development and expansion of molecular generative models along with new challenges in this field. One of challenges in de novo molecular generation is how to produce new reasonable molecules with desired pharmacological, physical, and chemical properties. To improve the similarity between the generated molecule and the starting molecule, we propose a new molecule generation model by embedding Long Short-Term Memory (LSTM) and Attention mechanism in CycleGAN architecture, LA-CycleGAN. The network layer of the generator in CycleGAN is fused head and tail to improve the similarity of the generated structure. The embedded LSTM and Attention mechanism can overcome long-term dependency problems in treating the normally used SMILES input. From our quantitative evaluation, we present that LA-CycleGAN expands the chemical space of the molecules and improves the ability of structure conversion. The generated molecules are highly similar to the starting compound structures while obtaining expected molecular properties during cycle generative adversarial network learning, which comprehensively improves the performance of the generative model.
Collapse
Affiliation(s)
- Feng Wang
- Changzhou University Huaide College, Taizhou, China.,School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, Changzhou University, Changzhou, China
| | | | - Xiao Guo
- Changzhou University Huaide College, Taizhou, China
| | - Lei Xu
- Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou, China
| | - Liangxu Xie
- Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou, China
| | - Shan Chang
- Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou, China
| |
Collapse
|
43
|
|
44
|
Nonadditivity in public and inhouse data: implications for drug design. J Cheminform 2021; 13:47. [PMID: 34215341 PMCID: PMC8254291 DOI: 10.1186/s13321-021-00525-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2020] [Accepted: 06/09/2021] [Indexed: 11/10/2022] Open
Abstract
Numerous ligand-based drug discovery projects are based on structure-activity relationship (SAR) analysis, such as Free-Wilson (FW) or matched molecular pair (MMP) analysis. Intrinsically they assume linearity and additivity of substituent contributions. These techniques are challenged by nonadditivity (NA) in protein-ligand binding where the change of two functional groups in one molecule results in much higher or lower activity than expected from the respective single changes. Identifying nonlinear cases and possible underlying explanations is crucial for a drug design project since it might influence which lead to follow. By systematically analyzing all AstraZeneca (AZ) inhouse compound data and publicly available ChEMBL25 bioactivity data, we show significant NA events in almost every second assay among the inhouse and once in every third assay in public data sets. Furthermore, 9.4% of all compounds of the AZ database and 5.1% from public sources display significant additivity shifts indicating important SAR features or fundamental measurement errors. Using NA data in combination with machine learning showed that nonadditive data is challenging to predict and even the addition of nonadditive data into training did not result in an increase in predictivity. Overall, NA analysis should be applied on a regular basis in many areas of computational chemistry and can further improve rational drug design.
Collapse
|
45
|
Bilsland AE, McAulay K, West R, Pugliese A, Bower J. Automated Generation of Novel Fragments Using Screening Data, a Dual SMILES Autoencoder, Transfer Learning and Syntax Correction. J Chem Inf Model 2021; 61:2547-2559. [PMID: 34029470 DOI: 10.1021/acs.jcim.0c01226] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022]
Abstract
Fragment-based hit identification (FBHI) allows proportionately greater coverage of chemical space using fewer molecules than traditional high-throughput screening approaches. However, effectively exploiting this advantage is highly dependent on the library design. Solubility, stability, chemical complexity, chemical/shape diversity, and synthetic tractability for fragment elaboration are all critical aspects, and molecule design remains a time-consuming task for computational and medicinal chemists. Artificial neural networks have attracted considerable attention in automated de novo design applications and could also prove useful for fragment library design. Chemical autoencoders are neural networks consisting of encoder and decoder parts, which respectively compress and decompress molecular representations. The decoder is applied to samples drawn from the space of compressed representations to generate novel molecules that can be scored for properties of interest. Here, we report an autoencoder model using a recurrent neural network architecture, which was trained using 486,565 fragments curated from commercial sources, to simultaneously reconstruct both SMILES and chemical fingerprints. To explore its utility in fragment design, we applied transfer learning to the fingerprint decoder layers to train a classifier using 66 frequent hitter fragments identified from our screening campaigns. Using a particle swarm optimization sampling approach, we compare the performance of this "dual" model to an architecture encoding SMILES only. The dual model produced valid SMILES with improved features, considering a range of properties including aromatic ring counts, heavy atom count, synthetic accessibility, and a new fragment complexity score we term Feature Complexity (FeCo). Additionally, we demonstrate that generative performance is further enhanced by use of a simple syntax-correction procedure during training, in which invalid and undesirable SMILES are spiked into the training set. Finally, we used the syntax-corrected model to generate a library of novel candidate privileged fragments.
Collapse
Affiliation(s)
- Alan E Bilsland
- Beatson Drug Discovery Unit, Cancer Research UK Beatson Institute, Garscube Estate, Switchback Road, Bearsden, Glasgow, G61 1BD, U.K
| | - Kirsten McAulay
- Beatson Drug Discovery Unit, Cancer Research UK Beatson Institute, Garscube Estate, Switchback Road, Bearsden, Glasgow, G61 1BD, U.K
| | - Ryan West
- Beatson Drug Discovery Unit, Cancer Research UK Beatson Institute, Garscube Estate, Switchback Road, Bearsden, Glasgow, G61 1BD, U.K
| | - Angelo Pugliese
- Beatson Drug Discovery Unit, Cancer Research UK Beatson Institute, Garscube Estate, Switchback Road, Bearsden, Glasgow, G61 1BD, U.K
- BioAscent Discovery Ltd., Bo'Ness Road, Newhouse, Lanarkshire ML1 5UH, U.K
| | - Justin Bower
- Beatson Drug Discovery Unit, Cancer Research UK Beatson Institute, Garscube Estate, Switchback Road, Bearsden, Glasgow, G61 1BD, U.K
| |
Collapse
|
46
|
Blaschke T, Bajorath J. Fine-tuning of a generative neural network for designing multi-target compounds. J Comput Aided Mol Des 2021; 36:363-371. [PMID: 34046745 PMCID: PMC9325839 DOI: 10.1007/s10822-021-00392-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2021] [Accepted: 05/23/2021] [Indexed: 12/20/2022]
Abstract
Exploring the origin of multi-target activity of small molecules and designing new multi-target compounds are highly topical issues in pharmaceutical research. We have investigated the ability of a generative neural network to create multi-target compounds. Data sets of experimentally confirmed multi-target, single-target, and consistently inactive compounds were extracted from public screening data considering positive and negative assay results. These data sets were used to fine-tune the REINVENT generative model via transfer learning to systematically recognize multi-target compounds, distinguish them from single-target or inactive compounds, and construct new multi-target compounds. During fine-tuning, the model showed a clear tendency to increasingly generate multi-target compounds and structural analogs. Our findings indicate that generative models can be adopted for de novo multi-target compound design.
Collapse
Affiliation(s)
- Thomas Blaschke
- Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, 53115, Bonn, Germany
| | - Jürgen Bajorath
- Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, 53115, Bonn, Germany.
| |
Collapse
|
47
|
Zhang J, Mercado R, Engkvist O, Chen H. Comparative Study of Deep Generative Models on Chemical Space Coverage. J Chem Inf Model 2021; 61:2572-2581. [PMID: 34015916 DOI: 10.1021/acs.jcim.0c01328] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
Abstract
In recent years, deep molecular generative models have emerged as promising methods for de novo molecular design. Thanks to the rapid advance of deep learning techniques, deep learning architectures such as recurrent neural networks, variational autoencoders, and adversarial networks have been successfully employed for constructing generative models. Recently, quite a few metrics have been proposed to evaluate these deep generative models. However, many of these metrics cannot evaluate the chemical space coverage of sampled molecules. This work presents a novel and complementary metric for evaluating deep molecular generative models. The metric is based on the chemical space coverage of a reference dataset-GDB-13. The performance of seven different molecular generative models was compared by calculating what fraction of the structures, ring systems, and functional groups could be reproduced from the largely unseen reference set when using only a small fraction of GDB-13 for training. The results show that the performance of the generative models studied varies significantly using the benchmark metrics introduced herein, such that the generalization capabilities of the generative models can be clearly differentiated. In addition, the coverages of GDB-13 ring systems and functional groups were compared between the models. Our study provides a useful new metric that can be used for evaluating and comparing generative models.
Collapse
Affiliation(s)
- Jie Zhang
- Guangdong Provincial Key Laboratory of Laboratory Animals, Guangdong Laboratory Animals Monitoring Institute, Guangzhou 510663, P. R. China.,State Key Laboratory of Respiratory Disease, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou 510530, P. R. China.,Bioland Laboratory (Guangzhou Regenerative Medicine and Health-Guangdong Laboratory), Guangzhou 510530, P. R. China
| | - Rocío Mercado
- Discovery Sciences, R&D, AstraZeneca, Gothenburg 43183, Sweden
| | - Ola Engkvist
- Discovery Sciences, R&D, AstraZeneca, Gothenburg 43183, Sweden
| | - Hongming Chen
- Bioland Laboratory (Guangzhou Regenerative Medicine and Health-Guangdong Laboratory), Guangzhou 510530, P. R. China
| |
Collapse
|
48
|
Thomas M, Smith RT, O'Boyle NM, de Graaf C, Bender A. Comparison of structure- and ligand-based scoring functions for deep generative models: a GPCR case study. J Cheminform 2021; 13:39. [PMID: 33985583 PMCID: PMC8117600 DOI: 10.1186/s13321-021-00516-0] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2021] [Accepted: 05/02/2021] [Indexed: 12/14/2022] Open
Abstract
Deep generative models have shown the ability to devise both valid and novel chemistry, which could significantly accelerate the identification of bioactive compounds. Many current models, however, use molecular descriptors or ligand-based predictive methods to guide molecule generation towards a desirable property space. This restricts their application to relatively data-rich targets, neglecting those where little data is available to sufficiently train a predictor. Moreover, ligand-based approaches often bias molecule generation towards previously established chemical space, thereby limiting their ability to identify truly novel chemotypes. In this work, we assess the ability of using molecular docking via Glide-a structure-based approach-as a scoring function to guide the deep generative model REINVENT and compare model performance and behaviour to a ligand-based scoring function. Additionally, we modify the previously published MOSES benchmarking dataset to remove any induced bias towards non-protonatable groups. We also propose a new metric to measure dataset diversity, which is less confounded by the distribution of heavy atom count than the commonly used internal diversity metric. With respect to the main findings, we found that when optimizing the docking score against DRD2, the model improves predicted ligand affinity beyond that of known DRD2 active molecules. In addition, generated molecules occupy complementary chemical and physicochemical space compared to the ligand-based approach, and novel physicochemical space compared to known DRD2 active molecules. Furthermore, the structure-based approach learns to generate molecules that satisfy crucial residue interactions, which is information only available when taking protein structure into account. Overall, this work demonstrates the advantage of using molecular docking to guide de novo molecule generation over ligand-based predictors with respect to predicted affinity, novelty, and the ability to identify key interactions between ligand and protein target. Practically, this approach has applications in early hit generation campaigns to enrich a virtual library towards a particular target, and also in novelty-focused projects, where de novo molecule generation either has no prior ligand knowledge available or should not be biased by it.
Collapse
Affiliation(s)
- Morgan Thomas
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, CB2 1EW, UK
| | - Robert T Smith
- Computational Chemistry, Sosei Heptares, Steinmetz Building, Granta Park, Great Abington, Cambridge, CB21 6DG, UK
| | - Noel M O'Boyle
- Computational Chemistry, Sosei Heptares, Steinmetz Building, Granta Park, Great Abington, Cambridge, CB21 6DG, UK
| | - Chris de Graaf
- Computational Chemistry, Sosei Heptares, Steinmetz Building, Granta Park, Great Abington, Cambridge, CB21 6DG, UK.
| | - Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, CB2 1EW, UK.
| |
Collapse
|
49
|
Andrianov AM, Nikolaev GI, Shuldov NA, Bosko IP, Anischenko AI, Tuzikov AV. Application of deep learning and molecular modeling to identify small drug-like compounds as potential HIV-1 entry inhibitors. J Biomol Struct Dyn 2021; 40:7555-7573. [PMID: 33855929 DOI: 10.1080/07391102.2021.1905559] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
Abstract
A generative adversarial autoencoder for the rational design of potential HIV-1 entry inhibitors able to block CD4-binding site of the viral envelope protein gp120 was developed. To do this, the following studies were carried out: (i) an autoencoder architecture was constructed; (ii) a virtual compound library of potential anti-HIV-1 agents for training the neural network was formed by the concept of click chemistry allowing one to generate a large number of drug candidates by their assembly from small modular units; (iii) molecular docking of all compounds from this library with gp120 was made and calculations of the values of binding free energy were performed; (iv) molecular fingerprints of chemical compounds from the training dataset were generated; (v) training of the developed autoencoder was implemented followed by the validation of this neural network using more than 21 million molecules from the ZINC15 database. As a result, three small drug-like compounds that exhibited the high-affinity binding to gp120 were identified. According to the data from molecular docking, machine learning, quantum chemical calculations, and molecular dynamics simulations, these compounds show the low values of binding free energy in the complexes with gp120 similar to those calculated using the same computational protocols for the HIV-1 entry inhibitors NBD-11021 and NBD-14010, highly potent and broad anti-HIV-1 agents presenting a new generation of the viral CD4 antagonists. The identified CD4-mimetic candidates are suggested to present good scaffolds for the design of novel antiviral drugs inhibiting the early stages of HIV-1 infection.Communicated by Ramaswamy H. Sarma.
Collapse
Affiliation(s)
- Alexander M Andrianov
- Institute of Bioorganic Chemistry, National Academy of Sciences of Belarus, Minsk, Republic of Belarus
| | - Grigory I Nikolaev
- United Institute of Informatics Problems, National Academy of Sciences of Belarus, Minsk, Republic of Belarus
| | - Nikita A Shuldov
- Faculty of Applied Mathematics & Computer Science, Belarusian State University, Minsk, Republic of Belarus
| | - Ivan P Bosko
- United Institute of Informatics Problems, National Academy of Sciences of Belarus, Minsk, Republic of Belarus
| | - Arseny I Anischenko
- Faculty of Applied Mathematics & Computer Science, Belarusian State University, Minsk, Republic of Belarus
| | - Alexander V Tuzikov
- United Institute of Informatics Problems, National Academy of Sciences of Belarus, Minsk, Republic of Belarus
| |
Collapse
|
50
|
Yu P, Sterling AJ, Hein J. A Novel Automated Screening Method for Combinatorially Generated Small Molecules. J Chem Inf Model 2021; 61:1637-1646. [PMID: 33844913 DOI: 10.1021/acs.jcim.0c01462] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
A main challenge in the enumeration of small-molecule chemical spaces for drug design is to quickly and accurately differentiate between possible and impossible molecules. Current approaches for screening enumerated molecules (e.g., 2D heuristics and 3D force fields) have not been able to achieve a balance between accuracy and speed. We have developed a new automated approach for fast and high-quality screening of small molecules, with the following steps: (1) for each molecule in the set, an ensemble of 2D descriptors as feature encoding is computed; (2) on a random small subset, classification (feasible/infeasible) targets via a 3D-based approach are generated; (3) a classification dataset with the computed features and targets is formed and a machine learning model for predicting the 3D approach's decisions is trained; and (4) the trained model is used to screen the remainder of the enumerated set. Our approach is ≈8× (7.96× to 8.84×) faster than screening via 3D simulations without significantly sacrificing accuracy; while compared to 2D-based pruning rules, this approach is more accurate, with better coverage of known feasible molecules. Once the topological features and 3D conformer evaluation methods are established, the process can be fully automated, without any additional chemistry expertise.
Collapse
Affiliation(s)
- Pingshi Yu
- Department of Statistics, University of Oxford, 29 St Giles', Oxford OX1 2JD, U.K.,Department of Computer Science, University of Oxford, 15 Parks Road, Oxford OX1 3QD, U.K
| | - Alistair J Sterling
- Department of Chemistry, University of Oxford, Mansfield Road, Oxford OX1 3TA, U.K
| | - Jotun Hein
- Department of Statistics, University of Oxford, 29 St Giles', Oxford OX1 2JD, U.K
| |
Collapse
|