1
|
Gao C, Bao W, Wang S, Zheng J, Wang L, Ren Y, Jiao L, Wang J, Wang X. DockingGA: enhancing targeted molecule generation using transformer neural network and genetic algorithm with docking simulation. Brief Funct Genomics 2024; 23:595-606. [PMID: 38582610 DOI: 10.1093/bfgp/elae011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2023] [Revised: 02/25/2024] [Accepted: 03/13/2024] [Indexed: 04/08/2024] Open
Abstract
Generative molecular models generate novel molecules with desired properties by searching chemical space. Traditional combinatorial optimization methods, such as genetic algorithms, have demonstrated superior performance in various molecular optimization tasks. However, these methods do not utilize docking simulation to inform the design process, depend heavily on the quality and quantity of available data, and their outputs require additional structural optimization to become candidate drugs. To address these limitations, we propose a novel model named DockingGA that combines Transformer neural networks and genetic algorithms to generate molecules with better binding affinity for specific targets. In order to generate high-quality molecules, we chose the Self-referencing Chemical Structure Strings to represent the molecules and optimized their binding affinity to different targets. Compared to other baseline models, DockingGA proves to be the optimal model in all docking results for the top 1, 10 and 100 molecules, while maintaining 100% novelty. Furthermore, the distribution of physicochemical properties demonstrates the ability of DockingGA to generate molecules with favorable and appropriate properties. This innovation creates new opportunities for the application of generative models in practical drug discovery.
Affiliation(s)
- Changnan Gao
- College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China
| | - Wenjie Bao
- Guanghua School of Management, Peking University, Beijing 100091, China
| | - Shuang Wang
- College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China
| | - Jianyang Zheng
- College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China
| | - Lulu Wang
- College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China
| | - Yongqi Ren
- College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China
| | - Linfang Jiao
- College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China
| | - Jianmin Wang
- The Interdisciplinary Graduate Program in Integrative Biotechnology, Yonsei University, Incheon 21983, Republic of Korea
| | - Xun Wang
- College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China
- High Performance Computer Research Center, Institute of Computing Technology, CAS, Beijing 100190, China
| |
|
2
|
Kneiding H, Balcells D. Augmenting genetic algorithms with machine learning for inverse molecular design. Chem Sci 2024:d4sc02934h. [PMID: 39296997 PMCID: PMC11404003 DOI: 10.1039/d4sc02934h] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2024] [Accepted: 09/09/2024] [Indexed: 09/21/2024] Open
Abstract
Evolutionary and machine learning methods have been successfully applied to the generation of molecules and materials exhibiting desired properties. The combination of these two paradigms in inverse design tasks can yield powerful methods that explore massive chemical spaces more efficiently, improving the quality of the generated compounds. However, such synergistic approaches are still an incipient area of research and appear underexplored in the literature. This perspective covers different ways of incorporating machine learning approaches into evolutionary learning frameworks, with the overall goal of increasing the optimization efficiency of genetic algorithms. In particular, machine learning surrogate models for faster fitness function evaluation, discriminator models to control population diversity on-the-fly, machine learning based crossover operations, and evolution in latent space are discussed. The further potential of these synergistic approaches in generative tasks is also assessed, outlining promising directions for future developments.
Affiliation(s)
- Hannes Kneiding
- Hylleraas Centre for Quantum Molecular Sciences, Department of Chemistry, University of Oslo P.O. Box 1033, Blindern 0315 Oslo Norway
| | - David Balcells
- Hylleraas Centre for Quantum Molecular Sciences, Department of Chemistry, University of Oslo P.O. Box 1033, Blindern 0315 Oslo Norway
| |
|
3
|
Renz P, Luukkonen S, Klambauer G. Diverse Hits in De Novo Molecule Design: Diversity-Based Comparison of Goal-Directed Generators. J Chem Inf Model 2024; 64:5756-5761. [PMID: 39029090 PMCID: PMC11323242 DOI: 10.1021/acs.jcim.4c00519] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2024] [Revised: 07/10/2024] [Accepted: 07/11/2024] [Indexed: 07/21/2024]
Abstract
Since the rise of generative AI models, many goal-directed molecule generators have been proposed as tools for discovering novel drug candidates. However, molecule generators often produce highly similar molecules and tend to overemphasize conformity to an imperfect scoring function rather than capturing the true underlying properties sought. We rectify these two shortcomings by offering diversity-based evaluations using the #Circles metric and considering constraints on scoring function calls or computation time. Our findings highlight the superior performance of SMILES-based autoregressive models in generating diverse sets of desired molecules compared to graph-based models or genetic algorithms.
Affiliation(s)
- Philipp Renz
- Johannes Kepler University Linz, Altenbergerstraße 69, Linz, AT 4040, Austria
| | - Sohvi Luukkonen
- Johannes Kepler University Linz, ELLIS Unit Linz, LIT AI Lab, Institute for Machine Learning, Altenbergerstraße 69, Linz, AT 4040, Austria
| | - Günter Klambauer
- Johannes Kepler University Linz, ELLIS Unit Linz, LIT AI Lab, Institute for Machine Learning, Altenbergerstraße 69, Linz, AT 4040, Austria
| |
|
4
|
Fallani A, Medrano Sandonas L, Tkatchenko A. Inverse mapping of quantum properties to structures for chemical space of small organic molecules. Nat Commun 2024; 15:6061. [PMID: 39025883 PMCID: PMC11258234 DOI: 10.1038/s41467-024-50401-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Accepted: 07/01/2024] [Indexed: 07/20/2024] Open
Abstract
Computer-driven molecular design combines the principles of chemistry, physics, and artificial intelligence to identify chemical compounds with tailored properties. While quantum-mechanical (QM) methods, coupled with machine learning, already offer a direct mapping from 3D molecular structures to their properties, effective methodologies for the inverse mapping in chemical space remain elusive. We address this challenge by demonstrating the possibility of parametrizing a chemical space with a finite set of QM properties. Our proof-of-concept implementation achieves an approximate property-to-structure mapping, the QIM model (which stands for "Quantum Inverse Mapping"), by forcing a variational auto-encoder with a property encoder to obtain a common internal representation for both structures and properties. After validating this mapping for small drug-like molecules, we illustrate its capabilities with an explainability study as well as by the generation of de novo molecular structures with targeted properties and transition pathways between conformational isomers. Our findings thus provide a proof-of-principle demonstration aiming to enable the inverse property-to-structure design in diverse chemical spaces.
Affiliation(s)
- Alessio Fallani
- Department of Physics and Materials Science, University of Luxembourg, L-1511, Luxembourg City, Luxembourg.
| | - Leonardo Medrano Sandonas
- Department of Physics and Materials Science, University of Luxembourg, L-1511, Luxembourg City, Luxembourg.
- Institute for Materials Science and Max Bergmann Center of Biomaterials, TU Dresden, 01062, Dresden, Germany.
| | - Alexandre Tkatchenko
- Department of Physics and Materials Science, University of Luxembourg, L-1511, Luxembourg City, Luxembourg.
| |
|
5
|
Xia X, Liu Y, Zheng C, Zhang X, Wu Q, Gao X, Zeng X, Su Y. Evolutionary Multiobjective Molecule Optimization in an Implicit Chemical Space. J Chem Inf Model 2024; 64:5161-5174. [PMID: 38870455 PMCID: PMC11235097 DOI: 10.1021/acs.jcim.4c00031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2024] [Revised: 05/08/2024] [Accepted: 05/13/2024] [Indexed: 06/15/2024]
Abstract
Optimization techniques play a pivotal role in advancing drug development, serving as the foundation of numerous generative methods tailored to efficiently design optimized molecules derived from existing lead compounds. However, existing methods often encounter difficulties in generating diverse, novel, and high-property molecules that simultaneously optimize multiple drug properties. To overcome this bottleneck, we propose a multiobjective molecule optimization framework (MOMO). MOMO employs a specially designed Pareto-based multiproperty evaluation strategy at the molecular sequence level to guide the evolutionary search in an implicit chemical space. A comparative analysis of MOMO with five state-of-the-art methods across two benchmark multiproperty molecule optimization tasks reveals that MOMO markedly outperforms them in terms of diversity, novelty, and optimized properties. The practical applicability of MOMO in drug discovery has also been validated on four challenging tasks in the real-world discovery problem. These results suggest that MOMO can provide a useful tool to facilitate molecule optimization problems with multiple properties.
Affiliation(s)
- Xin Xia
- The Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University, Hefei 230601, China
- Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, 5089 Wangjiang West Road, Hefei 230088, Anhui, China
| | - Yiping Liu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410012, China
| | - Chunhou Zheng
- The Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University, Hefei 230601, China
| | - Xingyi Zhang
- The Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University, Hefei 230601, China
| | - Qingwen Wu
- The Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University, Hefei 230601, China
| | - Xin Gao
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Xiangxiang Zeng
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410012, China
| | - Yansen Su
- The Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University, Hefei 230601, China
- Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, 5089 Wangjiang West Road, Hefei 230088, Anhui, China
| |
|
6
|
Alberga D, Lamanna G, Graziano G, Delre P, Lomuscio MC, Corriero N, Ligresti A, Siliqi D, Saviano M, Contino M, Stefanachi A, Mangiatordi GF. DeLA-DrugSelf: Empowering multi-objective de novo design through SELFIES molecular representation. Comput Biol Med 2024; 175:108486. [PMID: 38653065 DOI: 10.1016/j.compbiomed.2024.108486] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2024] [Revised: 04/08/2024] [Accepted: 04/15/2024] [Indexed: 04/25/2024]
Abstract
In this paper, we introduce DeLA-DrugSelf, an upgraded version of DeLA-Drug [J. Chem. Inf. Model. 62 (2022) 1411-1424], which incorporates essential advancements for automated multi-objective de novo design. Unlike its predecessor, which relies on SMILES notation for molecular representation, DeLA-DrugSelf employs a novel and robust molecular representation string named SELFIES (SELF-referencing Embedded String). The generation process in DeLA-DrugSelf not only involves substitutions to the initial string representing the starting query molecule but also incorporates insertions and deletions. This enhancement makes DeLA-DrugSelf significantly more adept at executing data-driven scaffold decoration and lead optimization strategies. Remarkably, DeLA-DrugSelf explicitly addresses the SELFIES-related collapse issue, considering only collapse-free compounds during generation. These compounds undergo a rigorous quality metrics evaluation, highlighting substantial advancements in terms of drug-likeness, uniqueness, and novelty compared to the molecules generated by the previous version of the algorithm. To evaluate the potential of DeLA-DrugSelf as a mutational operator within a genetic algorithm framework for multi-objective optimization, we employed a fitness function based on Pareto dominance. Our objectives focused on target-oriented properties aimed at optimizing known cannabinoid receptor 2 (CB2R) ligands. The results obtained indicate that DeLA-DrugSelf, available as a user-friendly web platform (https://www.ba.ic.cnr.it/softwareic/delaself/), can effectively contribute to the data-driven optimization of starting bioactive molecules based on user-defined parameters.
Affiliation(s)
- Domenico Alberga
- CNR - Institute of Crystallography, Via Amendola 122/o, 70126, Bari, Italy
| | - Giuseppe Lamanna
- CNR - Institute of Crystallography, Via Amendola 122/o, 70126, Bari, Italy
| | - Giovanni Graziano
- Department of Pharmacy - Pharmaceutical Sciences, University of Bari "Aldo Moro", via E. Orabona, 4, I-70125, Bari, Italy
| | - Pietro Delre
- CNR - Institute of Crystallography, Via Amendola 122/o, 70126, Bari, Italy
| | | | - Nicola Corriero
- CNR - Institute of Crystallography, Via Amendola 122/o, 70126, Bari, Italy
| | - Alessia Ligresti
- CNR - Institute of Biomolecular Chemistry, Via Campi Flegrei 34, 80078, Pozzuoli, Italy
| | - Dritan Siliqi
- CNR - Institute of Crystallography, Via Amendola 122/o, 70126, Bari, Italy
| | - Michele Saviano
- CNR - Institute of Crystallography, Via Vivaldi 43, 81100, Caserta, Italy
| | - Marialessandra Contino
- Department of Pharmacy - Pharmaceutical Sciences, University of Bari "Aldo Moro", via E. Orabona, 4, I-70125, Bari, Italy
| | - Angela Stefanachi
- Department of Pharmacy - Pharmaceutical Sciences, University of Bari "Aldo Moro", via E. Orabona, 4, I-70125, Bari, Italy
| | | |
|
7
|
Lamens A, Bajorath J. Systematic generation and analysis of counterfactuals for compound activity predictions using multi-task models. RSC Med Chem 2024; 15:1547-1555. [PMID: 38784468 PMCID: PMC11110787 DOI: 10.1039/d4md00128a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2024] [Accepted: 04/05/2024] [Indexed: 05/25/2024] Open
Abstract
Most machine learning (ML) methods produce predictions that are hard or impossible to understand. The black box nature of predictive models obscures potential learning bias and makes it difficult to recognize and trace problems. Moreover, the inability to rationalize model decisions causes reluctance to accept predictions for experimental design. For ML, limited trust in predictions presents a substantial problem and continues to limit its impact in interdisciplinary research, including early-phase drug discovery. As a desirable remedy, approaches from explainable artificial intelligence (XAI) are increasingly applied to shed light on the ML black box and help to rationalize predictions. Among these is the concept of counterfactuals (CFs), which are best understood as test cases with small modifications yielding opposing prediction outcomes (such as different class labels in object classification). For ML applications in medicinal chemistry, for example, compound activity predictions, CFs are particularly intuitive because these hypothetical molecules enable immediate comparisons with actual test compounds that do not require expert ML knowledge and are accessible to practicing chemists. Such comparisons often reveal structural moieties in compounds that determine their predictions and can be further investigated. Herein, we adapt and extend a recently introduced concept for the systematic generation of molecular CFs to multi-task predictions of different classes of protein kinase inhibitors, analyze CFs in detail, rationalize the origins of CF formation in multi-task modeling, and present exemplary explanations of predictions.
Affiliation(s)
- Alec Lamens
- Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität Friedrich-Hirzebruch-Allee 5/6 D-53115 Bonn Germany
| | - Jürgen Bajorath
- Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität Friedrich-Hirzebruch-Allee 5/6 D-53115 Bonn Germany
- Lamarr Institute for Machine Learning and Artificial Intelligence, Rheinische Friedrich-Wilhelms-Universität Bonn Friedrich-Hirzebruch-Allee 5/6 D-53115 Bonn Germany
| |
|
8
|
Luo Y, Zhang Y, Zhu J, Tian X, Liu G, Feng Z, Pan L, Liu X, Han N, Tan R. Material Engineering Strategies for Efficient Hydrogen Evolution Reaction Catalysts. Small Methods 2024:e2400158. [PMID: 38745530 DOI: 10.1002/smtd.202400158] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Revised: 03/27/2024] [Indexed: 05/16/2024]
Abstract
Water electrolysis, a key enabler of hydrogen energy production, presents significant potential as a strategy for achieving net-zero emissions. However, the widespread deployment of water electrolysis is currently limited by the high cost and scarcity of noble metal electrocatalysts for the hydrogen evolution reaction (HER). Given this challenge, the design and synthesis of cost-effective and high-performance alternative catalysts have become a research focus, which necessitates an insightful understanding of HER fundamentals and material engineering strategies. Distinct from typical reviews that concentrate only on summarizing recent catalyst materials, this review article shifts focus to material engineering strategies for developing efficient HER catalysts. In-depth analyses of key material design approaches for HER catalysts, such as doping, vacancy defect creation, phase engineering, and metal-support engineering, are illustrated along with typical research cases. A special emphasis is placed on designing noble metal-free catalysts, with a brief discussion of recent advancements in electrocatalytic water-splitting technology. The article also delves into important descriptors, reliable evaluation parameters and characterization techniques, aiming to link the fundamental mechanisms of HER with its catalytic performance. In conclusion, it explores future trends in HER catalysts by integrating theoretical, experimental and industrial perspectives, while acknowledging the challenges that remain.
Affiliation(s)
- Yue Luo
- School of Resources, Environment and Materials, Guangxi University, Nanning, 530004, China
| | - Yulong Zhang
- College of Mechatronical and Electrical Engineering, Hebei Agricultural University, Baoding, 07001, China
| | - Jiayi Zhu
- Warwick Electrochemical Engineering, WMG, University of Warwick, Coventry, CV4 7AL, UK
| | - Xingpeng Tian
- Warwick Electrochemical Engineering, WMG, University of Warwick, Coventry, CV4 7AL, UK
| | - Gang Liu
- IDTECH (Suzhou) Co. Ltd., Suzhou, 215217, China
| | - Zhiming Feng
- Department of Chemical Engineering, Imperial College London, London, SW7 2AZ, UK
| | - Liwen Pan
- School of Resources, Environment and Materials, Guangxi University, Nanning, 530004, China
- Education Department of Guangxi Zhuang Autonomous Region, Key Laboratory of High Performance Structural Materials and Thermo-surface Processing (Guangxi University), Nanning, 530004, China
| | - Xinhua Liu
- School of Transportation Science and Engineering, Beihang University, Beijing, 100191, China
| | - Ning Han
- Department of Materials Engineering, KU Leuven, Kasteelpark Arenberg 44, bus 2450, Heverlee, B-3001, Belgium
| | - Rui Tan
- Warwick Electrochemical Engineering, WMG, University of Warwick, Coventry, CV4 7AL, UK
- Department of Chemical Engineering, Swansea University, Swansea, SA1 8EN, United Kingdom
| |
|
9
|
Chandraghatgi R, Ji HF, Rosen GL, Sokhansanj BA. Streamlining Computational Fragment-Based Drug Discovery through Evolutionary Optimization Informed by Ligand-Based Virtual Prescreening. J Chem Inf Model 2024; 64:3826-3840. [PMID: 38696451 PMCID: PMC11197033 DOI: 10.1021/acs.jcim.4c00234] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2024] [Revised: 04/18/2024] [Accepted: 04/19/2024] [Indexed: 05/04/2024]
Abstract
Recent advances in computational methods provide the promise of dramatically accelerating drug discovery. While mathematical modeling and machine learning have become vital in predicting drug-target interactions and properties, there is untapped potential in computational drug discovery due to the vast and complex chemical space. This paper builds on our recently published computational fragment-based drug discovery (FBDD) method called fragment databases from screened ligand drug discovery (FDSL-DD). FDSL-DD uses in silico screening to identify ligands from a vast library, fragmenting them while attaching specific attributes based on predicted binding affinity and interaction with the target subdomain. In this paper, we further propose a two-stage optimization method that utilizes the information from prescreening to optimize computational ligand synthesis. We hypothesize that using prescreening information for optimization shrinks the search space and focuses on promising regions, thereby improving the optimization for candidate ligands. The first optimization stage assembles these fragments into larger compounds using genetic algorithms, followed by a second stage of iterative refinement to produce compounds with enhanced bioactivity. To demonstrate broad applicability, the methodology is demonstrated on three diverse protein targets found in human solid cancers, bacterial antimicrobial resistance, and the SARS-CoV-2 virus. Combined, the proposed FDSL-DD and a two-stage optimization approach yield high-affinity ligand candidates more efficiently than other state-of-the-art computational FBDD methods. We further show that a multiobjective optimization method accounting for drug-likeness can still produce potential candidate ligands with a high binding affinity. 
Overall, the results demonstrate that integrating detailed chemical information with a constrained search framework can markedly optimize the initial drug discovery process, offering a more precise and efficient route to developing new therapeutics.
Affiliation(s)
- Rohan Chandraghatgi
- Department of Biology, Drexel University, Philadelphia, Pennsylvania 19104, United States
| | - Hai-Feng Ji
- Department of Chemistry, Drexel University, Philadelphia, Pennsylvania 19104, United States
| | - Gail L. Rosen
- Department of Electrical & Computer Engineering, Drexel University, Philadelphia, Pennsylvania 19104, United States
| | - Bahrad A. Sokhansanj
- Department of Electrical & Computer Engineering, Drexel University, Philadelphia, Pennsylvania 19104, United States
| |
|
10
|
Choi S, Lee J, Seo J, Han SW, Lee SH, Seo JH, Seok J. Automated BigSMILES conversion workflow and dataset for homopolymeric macromolecules. Sci Data 2024; 11:371. [PMID: 38605036 PMCID: PMC11009387 DOI: 10.1038/s41597-024-03212-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Accepted: 04/02/2024] [Indexed: 04/13/2024] Open
Abstract
The simplified molecular-input line-entry system (SMILES) has been utilized in a variety of artificial intelligence analyses owing to its capability of representing chemical structures using line notation. However, its ease of representation is limited, which has led to the proposal of BigSMILES as an alternative method suitable for the representation of macromolecules. Nevertheless, research on BigSMILES remains limited due to its preprocessing requirements. Thus, this study proposes a conversion workflow for BigSMILES, focusing on its automated generation from SMILES representations of homopolymers. BigSMILES representations for 4,927,181 records are provided, thereby enabling their immediate use for various research and development applications. Our study presents detailed descriptions of a validation process to ensure the accuracy, interchangeability, and robustness of the conversion. Additionally, a systematic overview of the utilized codes and functions that emphasizes their relevance in the context of BigSMILES generation is provided. This advancement is anticipated to significantly aid researchers and facilitate further studies in BigSMILES representation, including potential applications in deep learning and further extension to complex structures such as copolymers.
Affiliation(s)
- Sunho Choi
- School of Electrical Engineering, Korea University, Seoul, South Korea
| | - Joonbum Lee
- Department of Materials Science and Engineering, Korea University, Seoul, South Korea
| | - Jangwon Seo
- School of Electrical Engineering, Korea University, Seoul, South Korea
| | - Sung Won Han
- School of Industrial Management Engineering, Korea University, Seoul, South Korea
| | - Sang Hyun Lee
- School of Electrical Engineering, Korea University, Seoul, South Korea
| | - Ji-Hun Seo
- Department of Materials Science and Engineering, Korea University, Seoul, South Korea
| | - Junhee Seok
- School of Electrical Engineering, Korea University, Seoul, South Korea.
| |
|
11
|
Zhang X, Sheng Y, Liu X, Yang J, Goddard III WA, Ye C, Zhang W. Polymer-Unit Graph: Advancing Interpretability in Graph Neural Network Machine Learning for Organic Polymer Semiconductor Materials. J Chem Theory Comput 2024; 20:2908-2920. [PMID: 38551455 DOI: 10.1021/acs.jctc.3c01385] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/10/2024]
Abstract
The graph representation of complex materials plays a crucial role in the field of inorganic and organic materials investigations for developing data-centric materials science, such as those using graph neural networks (GNNs). However, the currently prevalent GNN models are primarily employed for investigating periodic crystals and organic small molecule data, yet they still encounter challenges in terms of interpretability and computational efficiency when applied to polymer monomers and organic macromolecules data. There is still a lack of graph representations of organic polymers and macromolecules specifically tailored for GNN models to explore the structural characteristics. The Polymer-unit Graph, a novel coarse-grained graph representation method introduced in this study, is dedicated to expressing and analyzing polymers and macromolecules. By incorporating the Polymer-unit Graph into the GNN models and analyzing the organic semiconductor (OSC) materials database, it becomes possible to uncover intricate structure-property relationships involving branched-chain engineering, fluoridation substitution, and donor-acceptor combination effects on the elementary structure of OSC polymers. Furthermore, the Polymer-unit Graph enables visualizing the relationship between target properties and polymer units while reducing training time by an impressive 98% and minimizing molecular graph representation models. In conclusion, the Polymer-unit Graph successfully integrates the concept of Polymer-unit into the field of GNNs, enabling more accurate analysis and understanding of organic polymers and macromolecules.
Affiliation(s)
- Xinyue Zhang
- Department of Materials Science and Engineering & Guangdong Provincial Key Laboratory of Computational Science and Material Design, Southern University of Science and Technology, Shenzhen 518055, PR China
| | - Ye Sheng
- Department of Materials Science and Engineering & Guangdong Provincial Key Laboratory of Computational Science and Material Design, Southern University of Science and Technology, Shenzhen 518055, PR China
| | - Xiumin Liu
- Department of Materials Science and Engineering & Guangdong Provincial Key Laboratory of Computational Science and Material Design, Southern University of Science and Technology, Shenzhen 518055, PR China
- Key Laboratory of Soft Chemistry and Functional Materials of MOE, School of Chemistry and Chemical Engineering, Nanjing University of Science and Technology, Nanjing 210094, PR China
| | - Jiong Yang
- Materials Genome Institute, Shanghai University, Shanghai 200444, PR China
- William A Goddard III
- Materials and Process Simulation Center (MSC), California Institute of Technology, Pasadena, California 91125, United States
| | - Caichao Ye
- Department of Materials Science and Engineering & Guangdong Provincial Key Laboratory of Computational Science and Material Design, Southern University of Science and Technology, Shenzhen 518055, PR China
- Academy for Advanced Interdisciplinary Studies, Southern University of Science and Technology, Shenzhen 518055, PR China
| | - Wenqing Zhang
- Department of Materials Science and Engineering & Guangdong Provincial Key Laboratory of Computational Science and Material Design, Southern University of Science and Technology, Shenzhen 518055, PR China
| |
|
12
|
Kneiding H, Nova A, Balcells D. Directional multiobjective optimization of metal complexes at the billion-system scale. Nature Computational Science 2024; 4:263-273. [PMID: 38553635 DOI: 10.1038/s43588-024-00616-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2023] [Accepted: 02/29/2024] [Indexed: 04/14/2024]
Abstract
The discovery of transition metal complexes (TMCs) with optimal properties requires large ligand libraries and efficient multiobjective optimization algorithms. Here we provide the tmQMg-L library, containing 30k diverse and synthesizable ligands with robustly assigned charges and metal coordination modes. tmQMg-L enabled the generation of 1.37 million palladium TMCs, which were used to develop and benchmark the Pareto-Lighthouse multiobjective genetic algorithm (PL-MOGA). With fine control over aim and scope, this algorithm maximized both the polarizability and highest occupied molecular orbital-lowest unoccupied molecular orbital gap of the TMCs within selected regions of the Pareto front, without requiring prior knowledge on the objective limits. Instead of genetic operations on small ligand fragments, the PL-MOGA did whole-ligand mutation and crossover operations, which in chemical spaces containing billions of systems, yielded thousands of highly diverse TMCs in an interpretable manner.
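Illustrative sketch (not part of the cited abstract): the Pareto front mentioned above is the set of candidates that no other candidate beats in every objective. The snippet below shows plain non-dominated filtering for two maximized objectives, here labeled polarizability and HOMO-LUMO gap. It is not the PL-MOGA algorithm itself; the function names and the toy values are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical candidate: (polarizability, HOMO-LUMO gap), both to be maximized.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b in every objective and strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Toy values, invented for the example.
candidates = [(1.0, 5.0), (2.0, 4.0), (1.5, 3.0), (3.0, 1.0)]
front = pareto_front(candidates)
```

Here (1.5, 3.0) is dropped because (2.0, 4.0) is better in both objectives; the other three points are mutually non-dominated.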
Affiliation(s)
- Hannes Kneiding
- Hylleraas Centre for Quantum Molecular Sciences, Department of Chemistry, University of Oslo, Oslo, Norway
| | - Ainara Nova
- Hylleraas Centre for Quantum Molecular Sciences, Department of Chemistry, University of Oslo, Oslo, Norway
- Centre for Materials Science and Nanotechnology, Department of Chemistry, University of Oslo, Oslo, Norway
| | - David Balcells
- Hylleraas Centre for Quantum Molecular Sciences, Department of Chemistry, University of Oslo, Oslo, Norway.
| |
|
13
|
Korolev V, Mitrofanov A. Coarse-Grained Crystal Graph Neural Networks for Reticular Materials Design. J Chem Inf Model 2024; 64:1919-1931. [PMID: 38456446 DOI: 10.1021/acs.jcim.3c02083] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/09/2024]
Abstract
Reticular materials, including metal-organic frameworks and covalent organic frameworks, combine the relative ease of synthesis and an impressive range of applications in various fields from gas storage to biomedicine. Diverse properties arise from the variation of building units (metal centers and organic linkers) in almost infinite chemical space. Such variation substantially complicates the experimental design and promotes the use of computational methods. In particular, the most successful artificial intelligence algorithms for predicting the properties of reticular materials are atomic-level graph neural networks, which optionally incorporate domain knowledge. Nonetheless, the data-driven inverse design involving these models suffers from the incorporation of irrelevant and redundant features such as a full atomistic graph and network topology. In this study, we propose a new way of representing materials, aiming to overcome the limitations of existing methods; the message passing is performed on a coarse-grained crystal graph that comprises molecular building units. To highlight the merits of our approach, we assessed the predictive performance and energy efficiency of neural networks built on different materials representations, including composition-based and crystal-structure-aware models. Coarse-grained crystal graph neural networks showed decent accuracy at low computational costs, making them a valuable alternative to omnipresent atomic-level algorithms. Moreover, the presented models can be successfully integrated into an inverse materials design pipeline as estimators of the objective function. Overall, the coarse-grained crystal graph framework is aimed at challenging the prevailing atom-centric perspective on reticular materials design.
Affiliation(s)
- Vadim Korolev
- Department of Chemistry, Lomonosov Moscow State University, Moscow 119991, Russia
- MSU Institute for Artificial Intelligence, Lomonosov Moscow State University, Moscow 119192, Russia
| | - Artem Mitrofanov
- Department of Chemistry, Lomonosov Moscow State University, Moscow 119991, Russia
- MSU Institute for Artificial Intelligence, Lomonosov Moscow State University, Moscow 119192, Russia
| |
|
14
|
Nigam A, Pollice R, Friederich P, Aspuru-Guzik A. Artificial design of organic emitters via a genetic algorithm enhanced by a deep neural network. Chem Sci 2024; 15:2618-2639. [PMID: 38362419 PMCID: PMC10866360 DOI: 10.1039/d3sc05306g] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2023] [Accepted: 01/10/2024] [Indexed: 02/17/2024] Open
Abstract
The design of molecules requires multi-objective optimizations in high-dimensional chemical space with often conflicting target properties. To navigate this space, classical workflows rely on the domain knowledge and creativity of human experts, which can be the bottleneck in high-throughput approaches. Herein, we present an artificial molecular design workflow relying on a genetic algorithm and a deep neural network to find a new family of organic emitters with inverted singlet-triplet gaps and appreciable fluorescence rates. We combine high-throughput virtual screening and inverse design infused with domain knowledge and artificial intelligence to accelerate molecular generation significantly. This enabled us to explore more than 800 000 potential emitter molecules and find more than 10 000 candidates estimated to have inverted singlet-triplet gaps (INVEST) and appreciable fluorescence rates, many of which likely emit blue light. This class of molecules has the potential to realize a new generation of organic light-emitting diodes.
Affiliation(s)
- AkshatKumar Nigam
- Chemical Physics Theory Group, Department of Chemistry, University of Toronto 80 St. George St Toronto Ontario M5S 3H6 Canada
- Department of Computer Science, University of Toronto 40 St. George St Toronto Ontario M5S 2E4 Canada
| | - Robert Pollice
- Chemical Physics Theory Group, Department of Chemistry, University of Toronto 80 St. George St Toronto Ontario M5S 3H6 Canada
- Department of Computer Science, University of Toronto 40 St. George St Toronto Ontario M5S 2E4 Canada
| | - Pascal Friederich
- Chemical Physics Theory Group, Department of Chemistry, University of Toronto 80 St. George St Toronto Ontario M5S 3H6 Canada
- Department of Computer Science, University of Toronto 40 St. George St Toronto Ontario M5S 2E4 Canada
- Institute of Nanotechnology, Karlsruhe Institute of Technology Hermann-von-Helmholtz-Platz 1 76344 Eggenstein-Leopoldshafen Germany
- Institute of Theoretical Informatics, Karlsruhe Institute of Technology Am Fasanengarten 5 76131 Karlsruhe Germany
| | - Alán Aspuru-Guzik
- Chemical Physics Theory Group, Department of Chemistry, University of Toronto 80 St. George St Toronto Ontario M5S 3H6 Canada
- Department of Computer Science, University of Toronto 40 St. George St Toronto Ontario M5S 2E4 Canada
- Vector Institute for Artificial Intelligence 661 University Ave Suite 710 Toronto Ontario M5G 1M1 Canada
- Department of Chemical Engineering & Applied Chemistry, University of Toronto 200 College St. Ontario M5S 3E5 Canada
- Department of Materials Science & Engineering, University of Toronto, 184 College St. Ontario M5S 3E4 Canada
- Lebovic Fellow, Canadian Institute for Advanced Research (CIFAR) 661 University Ave Toronto Ontario M5G Canada
- Acceleration Consortium Toronto Ontario M5G 3H6 Canada
| |
|
15
|
Lamens A, Bajorath J. Generation of Molecular Counterfactuals for Explainable Machine Learning Based on Core-Substituent Recombination. ChemMedChem 2024; 19:e202300586. [PMID: 37983655 DOI: 10.1002/cmdc.202300586] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Revised: 11/20/2023] [Accepted: 11/20/2023] [Indexed: 11/22/2023]
Abstract
The use of black box machine learning models whose decisions cannot be understood limits the acceptance of predictions in interdisciplinary research and camouflages artificial learning characteristics leading to predictions for other than anticipated reasons. Consequently, there is increasing interest in explainable artificial intelligence to rationalize predictions and uncover potential pitfalls. Among others, relevant approaches include feature attribution methods to identify molecular structures determining predictions and counterfactuals (CFs) or contrastive explanations. CFs are defined as variants of test instances with minimal modifications leading to opposing predictions. In medicinal chemistry, CFs have thus far only been little investigated although they are particularly intuitive from a chemical perspective. We introduce a new methodology for the systematic generation of CFs that is centered on well-defined structural analogues of test compounds. The approach is transparent, computationally straightforward, and shown to provide a wealth of CFs for test sets. The method is made freely available.
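Illustrative sketch (not part of the cited abstract): the definition of a counterfactual given above, a variant with minimal modifications that flips the prediction, can be written as a small search over a pool of analogues. This is not the authors' core-substituent recombination method. The feature-set encoding, the `toy_model` classifier, and all labels below are hypothetical stand-ins.

```python
from typing import Callable, FrozenSet, Iterable, List

# Hypothetical encoding: a compound is a set of core/substituent labels.
Features = FrozenSet[str]


def counterfactuals(
    test: Features,
    analogues: Iterable[Features],
    predict: Callable[[Features], int],
) -> List[Features]:
    """Return analogues whose predicted class differs from the test compound's
    and that differ from it by the fewest features."""
    base = predict(test)
    flipped = [a for a in analogues if predict(a) != base]
    if not flipped:
        return []

    def distance(a: Features) -> int:
        # Number of features present in exactly one of the two compounds.
        return len(a ^ test)

    best = min(distance(a) for a in flipped)
    return [a for a in flipped if distance(a) == best]


def toy_model(f: Features) -> int:
    """Toy stand-in for a trained classifier: class 1 only if 'NO2' is present."""
    return int("NO2" in f)


test_compound = frozenset({"core", "NO2", "Cl"})
analogue_pool = [
    frozenset({"core", "NO2", "F"}),  # same prediction, so not a counterfactual
    frozenset({"core", "Cl"}),        # prediction flips, one feature changed
    frozenset({"core", "F"}),         # prediction flips, three features changed
]
cfs = counterfactuals(test_compound, analogue_pool, toy_model)
```

Only the analogue that removes the nitro label and changes nothing else is returned, since it flips the toy prediction with the smallest modification.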
Affiliation(s)
- Alec Lamens
- Department of Life Science Informatics and Data Science B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany
| | - Jürgen Bajorath
- Department of Life Science Informatics and Data Science B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany
- Lamarr Institute for Machine Learning and Artificial Intelligence, Rheinische Friedrich-Wilhelms-Universität Bonn, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany
| |
|
16
|
Qian Y, Shi M, Zhang Q. CONSMI: Contrastive Learning in the Simplified Molecular Input Line Entry System Helps Generate Better Molecules. Molecules 2024; 29:495. [PMID: 38276573 PMCID: PMC10821140 DOI: 10.3390/molecules29020495] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Revised: 01/12/2024] [Accepted: 01/16/2024] [Indexed: 01/27/2024] Open
Abstract
In recent years, the application of deep learning in molecular de novo design has gained significant attention. One successful approach involves using SMILES representations of molecules and treating the generation task as a text generation problem, yielding promising results. However, the generation of more effective and novel molecules remains a key research area. Due to the fact that a molecule can have multiple SMILES representations, it is not sufficient to consider only one of them for molecular generation. To make up for this deficiency, and also motivated by the advancements in contrastive learning in natural language processing, we propose a contrastive learning framework called CONSMI to learn more comprehensive SMILES representations. This framework leverages different SMILES representations of the same molecule as positive examples and other SMILES representations as negative examples for contrastive learning. The experimental results of generation tasks demonstrate that CONSMI significantly enhances the novelty of generated molecules while maintaining a high validity. Moreover, the generated molecules have similar chemical properties compared to the original dataset. Additionally, we find that CONSMI can achieve favorable results in classifier tasks, such as the compound-protein interaction task.
Affiliation(s)
| | | | - Qian Zhang
- School of Computer Science and Technology, Shanghai Frontiers Science Center of Molecule Intelligent Syntheses, East China Normal University, 3663 North Zhongshan Road, Putuo District, Shanghai 200062, China; (Y.Q.); (M.S.)
| |
|
17
|
Karandashev K, Weinreich J, Heinen S, Arismendi Arrieta DJ, von Rudorff GF, Hermansson K, von Lilienfeld OA. Evolutionary Monte Carlo of QM Properties in Chemical Space: Electrolyte Design. J Chem Theory Comput 2023; 19:8861-8870. [PMID: 38009856 PMCID: PMC10720348 DOI: 10.1021/acs.jctc.3c00822] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2023] [Revised: 10/29/2023] [Accepted: 10/30/2023] [Indexed: 11/29/2023]
Abstract
Optimizing a target function over the space of organic molecules is an important problem appearing in many fields of applied science but also a very difficult one due to the vast number of possible molecular systems. We propose an evolutionary Monte Carlo algorithm for solving such problems which is capable of straightforwardly tuning both exploration and exploitation characteristics of an optimization procedure while retaining favorable properties of genetic algorithms. The method, dubbed MOSAiCS (Metropolis Optimization by Sampling Adaptively in Chemical Space), is tested on problems related to optimizing components of battery electrolytes, namely, minimizing solvation energy in water or maximizing dipole moment while enforcing a lower bound on the HOMO-LUMO gap; optimization was carried out over sets of molecular graphs inspired by QM9 and Electrolyte Genome Project (EGP) data sets. MOSAiCS reliably generated molecular candidates with good target quantity values, which were in most cases better than the ones found in QM9 or EGP. While the optimization results presented in this work sometimes required up to 10^6 QM calculations and were thus feasible only thanks to computationally efficient ab initio approximations of properties of interest, we discuss possible strategies for accelerating MOSAiCS using machine learning approaches.
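Illustrative sketch (not part of the cited abstract): the "Metropolis" in the method's name refers to the textbook Metropolis acceptance rule, shown below for a minimization target. The abstract does not say how the lower bound on the HOMO-LUMO gap is enforced, so the infinite-cost penalty in `constrained_cost` is purely an assumption, as are all function names and numbers. This is not the MOSAiCS implementation.

```python
import math
import random


def metropolis_accept(f_old: float, f_new: float, temperature: float, rng: random.Random) -> bool:
    """Textbook Metropolis rule for minimization: always accept improvements,
    accept a worse move with probability exp(-(f_new - f_old) / temperature)."""
    if f_new <= f_old:
        return True
    return rng.random() < math.exp(-(f_new - f_old) / temperature)


def constrained_cost(target: float, gap: float, min_gap: float) -> float:
    """Assumed constraint handling: a candidate whose gap falls below the bound
    gets infinite cost, so the Metropolis rule never accepts it over a feasible one."""
    return target if gap >= min_gap else math.inf


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# A move that lowers the target is always accepted.
downhill = metropolis_accept(1.0, 0.5, temperature=0.1, rng=rng)

# A move that violates the gap bound is rejected, however good its target value.
infeasible = metropolis_accept(
    1.0, constrained_cost(0.2, gap=0.1, min_gap=0.5), temperature=0.1, rng=rng
)
```

Raising `temperature` makes worse moves more likely to be accepted (more exploration); lowering it makes the search greedier (more exploitation).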
Affiliation(s)
| | - Jan Weinreich
- Faculty of Physics, University of Vienna, Kolingasse 14-16, AT-1090 Wien, Austria
| | - Stefan Heinen
- Vector Institute for Artificial Intelligence, Toronto, M5S 1M1 Ontario, Canada
| | | | - Guido Falk von Rudorff
- Department of Chemistry, University Kassel, Heinrich-Plett-Str.40, 34132 Kassel, Germany
- Center for Interdisciplinary Nanostructure Science and Technology (CINSaT), Heinrich-Plett-Straße 40, 34132 Kassel, Germany
| | - Kersti Hermansson
- Department of Chemistry-Ångström Laboratory, Uppsala University, Box 538, SE-75121 Uppsala, Sweden
| | - O. Anatole von Lilienfeld
- Vector Institute for Artificial Intelligence, Toronto, M5S 1M1 Ontario, Canada
- Departments of Chemistry, Materials Science and Engineering, and Physics, University of Toronto, St. George Campus, Toronto, M5S 1A1 Ontario, Canada
- Machine Learning Group, Technische Universität Berlin and Institute for the Foundations of Learning and Data, 10587 Berlin, Germany
| |
|
18
|
Lo A, Pollice R, Nigam A, White AD, Krenn M, Aspuru-Guzik A. Recent advances in the self-referencing embedded strings (SELFIES) library. Digital Discovery 2023; 2:897-908. [PMID: 38013816 PMCID: PMC10408573 DOI: 10.1039/d3dd00044c] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2023] [Accepted: 06/23/2023] [Indexed: 11/29/2023]
Abstract
String-based molecular representations play a crucial role in cheminformatics applications, and with the growing success of deep learning in chemistry, have been readily adopted into machine learning pipelines. However, traditional string-based representations such as SMILES are often prone to syntactic and semantic errors when produced by generative models. To address these problems, a novel representation, SELF-referencing embedded strings (SELFIES), was proposed that is inherently 100% robust, alongside an accompanying open-source implementation called selfies. Since then, we have generalized SELFIES to support a wider range of molecules and semantic constraints, and streamlined its underlying grammar. We have implemented this updated representation in subsequent versions of selfies, where we have also made major advances with respect to design, efficiency, and supported features. Hence, we present the current status of selfies (version 2.1.1) in this manuscript. Our library, selfies, is available at GitHub (https://github.com/aspuru-guzik-group/selfies).
Affiliation(s)
- Alston Lo
- Department of Computer Science, University of Toronto Canada
| | - Robert Pollice
- Department of Computer Science, University of Toronto Canada
- Chemical Physics Theory Group, Department of Chemistry, University of Toronto Canada
- Stratingh Institute for Chemistry, University of Groningen The Netherlands
| | | | - Andrew D White
- Department of Chemical Engineering, University of Rochester USA
| | - Mario Krenn
- Max Planck Institute for the Science of Light (MPL) Erlangen Germany
| | - Alán Aspuru-Guzik
- Department of Computer Science, University of Toronto Canada
- Chemical Physics Theory Group, Department of Chemistry, University of Toronto Canada
- Vector Institute for Artificial Intelligence Toronto Canada
- Canadian Institute for Advanced Research (CIFAR) Lebovic Fellow Toronto Canada
| |
|
19
|
Lamens A, Bajorath J. Explaining Multiclass Compound Activity Predictions Using Counterfactuals and Shapley Values. Molecules 2023; 28:5601. [PMID: 37513472 PMCID: PMC10383571 DOI: 10.3390/molecules28145601] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/15/2023] [Revised: 07/18/2023] [Accepted: 07/21/2023] [Indexed: 07/30/2023] Open
Abstract
Most machine learning (ML) models produce black box predictions that are difficult, if not impossible, to understand. In pharmaceutical research, black box predictions work against the acceptance of ML models for guiding experimental work. Hence, there is increasing interest in approaches for explainable ML, which is a part of explainable artificial intelligence (XAI), to better understand prediction outcomes. Herein, we have devised a test system for the rationalization of multiclass compound activity prediction models that combines two approaches from XAI for feature relevance or importance analysis, including counterfactuals (CFs) and Shapley additive explanations (SHAP). For compounds with different single- and dual-target activities, we identified small compound modifications that induce feature changes inverting class label predictions. In combination with feature mapping, CFs and SHAP value calculations provide chemically intuitive explanations for model decisions.
Affiliation(s)
- Alec Lamens
- Department of Life Science Informatics, B-IT, LIMES Program, Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, D-53115 Bonn, Germany
| | - Jürgen Bajorath
- Department of Life Science Informatics, B-IT, LIMES Program, Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, D-53115 Bonn, Germany
| |
|
20
|
Demir H, Daglar H, Gulbalkan HC, Aksu GO, Keskin S. Recent advances in computational modeling of MOFs: From molecular simulations to machine learning. Coord Chem Rev 2023. [DOI: 10.1016/j.ccr.2023.215112] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/03/2023]
|
21
|
Yang Y, Hsieh CY, Kang Y, Hou T, Liu H, Yao X. Deep Generation Model Guided by the Docking Score for Active Molecular Design. J Chem Inf Model 2023; 63:2983-2991. [PMID: 37163364 DOI: 10.1021/acs.jcim.3c00572] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 05/12/2023]
Abstract
A deep generation model, as a novel drug design and discovery tool, shows obvious advantages in generating compounds with novel backbones and has been applied successfully in the field of drug discovery. However, it is still a challenge to generate molecules with expected properties, especially high activity. Here, to obtain compounds both with novelty and high activity to a target, we proposed a conditional molecular generation model COMG by considering the docking score and 3D pharmacophore matching during molecular generation. The proposed model was based on the conditional variational autoencoder architecture constrained by the pharmacophore matching score. During Bayesian optimization, the docking score was applied to enhance the target relevance of generated compounds. Furthermore, to overcome the problem of high structural similarity caused by Bayesian optimization, the idea of the scaffold memory unit was also introduced. The evaluation results of COMG show that our model not only can improve the structural diversity of generated molecules but also can effectively improve the proportion of target-related drug-active molecules. The obtained results indicate that our proposed model COMG is a useful drug design tool.
Affiliation(s)
- Yuwei Yang
- Faculty of Applied Sciences, Macao Polytechnic University, Macao (SAR) 999078, P. R. China
- School of Pharmacy, Lanzhou University, Lanzhou 730000, Gansu, P. R. China
| | - Chang-Yu Hsieh
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, P. R. China
| | - Yu Kang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, P. R. China
| | - Tingjun Hou
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, P. R. China
| | - Huanxiang Liu
- Faculty of Applied Sciences, Macao Polytechnic University, Macao (SAR) 999078, P. R. China
| | - Xiaojun Yao
- State Key Laboratory of Quality Research in Chinese Medicine, Macau University of Science and Technology, Taipa, 999078 Macau (SAR), P. R. China
| |
|
22
|
Wu Z, Wang J, Du H, Jiang D, Kang Y, Li D, Pan P, Deng Y, Cao D, Hsieh CY, Hou T. Chemistry-intuitive explanation of graph neural networks for molecular property prediction with substructure masking. Nat Commun 2023; 14:2585. [PMID: 37142585 PMCID: PMC10160109 DOI: 10.1038/s41467-023-38192-3] [Citation(s) in RCA: 22] [Impact Index Per Article: 22.0] [Received: 09/17/2022] [Accepted: 04/12/2023] [Indexed: 05/06/2023] Open
Abstract
Graph neural networks (GNNs) have been widely used in molecular property prediction, but explaining their black-box predictions is still a challenge. Most existing explanation methods for GNNs in chemistry focus on attributing model predictions to individual nodes, edges or fragments that are not necessarily derived from a chemically meaningful segmentation of molecules. To address this challenge, we propose a method named substructure mask explanation (SME). SME is based on well-established molecular segmentation methods and provides an interpretation that aligns with the understanding of chemists. We apply SME to elucidate how GNNs learn to predict aqueous solubility, genotoxicity, cardiotoxicity and blood-brain barrier permeation for small molecules. SME provides interpretation that is consistent with the understanding of chemists, alerts them to unreliable performance, and guides them in structural optimization for target properties. Hence, we believe that SME empowers chemists to confidently mine structure-activity relationship (SAR) from reliable GNNs through a transparent inspection on how GNNs pick up useful signals when learning from data.
Affiliation(s)
- Zhenxing Wu
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, P.R. China
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, 310018, Zhejiang, P.R. China
| | - Jike Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, P.R. China
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, 310018, Zhejiang, P.R. China
- National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, 430072, Hubei, P.R. China
| | - Hongyan Du
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, P.R. China
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, 310018, Zhejiang, P.R. China
| | - Dejun Jiang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, P.R. China
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, 310018, Zhejiang, P.R. China
| | - Yu Kang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, P.R. China
| | - Dan Li
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, P.R. China
| | - Peichen Pan
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, P.R. China
| | - Yafeng Deng
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, 310018, Zhejiang, P.R. China
| | - Dongsheng Cao
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, 410004, Hunan, P.R. China.
| | - Chang-Yu Hsieh
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, P.R. China.
| | - Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, Zhejiang, P.R. China.
| |
|
23
|
Bhat V, Callaway CP, Risko C. Computational Approaches for Organic Semiconductors: From Chemical and Physical Understanding to Predicting New Materials. Chem Rev 2023. [PMID: 37141497 DOI: 10.1021/acs.chemrev.2c00704] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Indexed: 05/06/2023]
Abstract
While a complete understanding of organic semiconductor (OSC) design principles remains elusive, computational methods, ranging from techniques based in classical and quantum mechanics to more recent data-enabled models, can complement experimental observations and provide deep physicochemical insights into OSC structure-processing-property relationships, offering new capabilities for in silico OSC discovery and design. In this Review, we trace the evolution of these computational methods and their application to OSCs, beginning with early quantum-chemical methods to investigate resonance in benzene and building to recent machine-learning (ML) techniques and their application to ever more sophisticated OSC scientific and engineering challenges. Along the way, we highlight the limitations of the methods and how sophisticated physical and mathematical frameworks have been created to overcome those limitations. We illustrate applications of these methods to a range of specific challenges in OSCs derived from π-conjugated polymers and molecules, including predicting charge-carrier transport, modeling chain conformations and bulk morphology, estimating thermomechanical properties, and describing phonons and thermal transport, to name a few. Through these examples, we demonstrate how advances in computational methods accelerate the deployment of OSCs in wide-ranging technologies, such as organic photovoltaics (OPVs), organic light-emitting diodes (OLEDs), organic thermoelectrics, organic batteries, and organic (bio)sensors. We conclude by providing an outlook for the future development of computational techniques to discover and assess the properties of high-performing OSCs with greater accuracy.
Affiliation(s)
- Vinayak Bhat
- Department of Chemistry & Center for Applied Energy Research, University of Kentucky, Lexington, Kentucky 40506-0055, United States
| | - Connor P Callaway
- Department of Chemistry & Center for Applied Energy Research, University of Kentucky, Lexington, Kentucky 40506-0055, United States
| | - Chad Risko
- Department of Chemistry & Center for Applied Energy Research, University of Kentucky, Lexington, Kentucky 40506-0055, United States
| |
|
24
|
Chen L, Shen Q, Lou J. Magicmol: a light-weighted pipeline for drug-like molecule evolution and quick chemical space exploration. BMC Bioinformatics 2023; 24:173. [PMID: 37101113 PMCID: PMC10132416 DOI: 10.1186/s12859-023-05286-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2022] [Accepted: 04/13/2023] [Indexed: 04/28/2023] Open
Abstract
The flourishing of machine learning and deep learning methods has boosted the development of cheminformatics, especially in drug discovery and new material exploration. Lower time and space expenses make it possible for scientists to search the enormous chemical space. Recently, some work combined reinforcement learning strategies with recurrent neural network (RNN)-based models to optimize the properties of generated small molecules, which notably improved a batch of critical factors for these candidates. However, a common problem among these RNN-based methods is that several generated molecules are difficult to synthesize despite having better desired properties such as binding affinity. Nevertheless, RNN-based frameworks reproduce the molecule distribution of the training set better than other categories of models during molecule exploration tasks. Thus, to optimize the whole exploration process and make it contribute to the optimization of specified molecules, we devised a light-weighted pipeline called Magicmol; this pipeline has a re-mastered RNN network and utilizes the SELFIES representation instead of SMILES. Our backbone model achieved extraordinary performance while reducing the training cost; moreover, we devised reward truncation strategies to eliminate the model collapse problem. Additionally, adopting the SELFIES representation made it possible to combine STONED-SELFIES as a post-processing procedure for specified molecule optimization and quick chemical space exploration.
Affiliation(s)
- Lin Chen
- Yangtze Delta Region (Huzhou) Institute of Intelligent Transportation, Huzhou University, Huzhou, China
- Zhejiang Province Key Laboratory of Smart Management and Application of Modern Agricultural Resources, School of Information Engineering, Huzhou University, Huzhou, China
| | - Qing Shen
- Yangtze Delta Region (Huzhou) Institute of Intelligent Transportation, Huzhou University, Huzhou, China
- School of Electronic Information, Huzhou College, Huzhou, China
| | - Jungang Lou
- Yangtze Delta Region (Huzhou) Institute of Intelligent Transportation, Huzhou University, Huzhou, China.
- Zhejiang Province Key Laboratory of Smart Management and Application of Modern Agricultural Resources, School of Information Engineering, Huzhou University, Huzhou, China.
| |
|
25
|
Wellawatte G, Gandhi HA, Seshadri A, White AD. A Perspective on Explanations of Molecular Prediction Models. J Chem Theory Comput 2023; 19:2149-2160. [PMID: 36972469 PMCID: PMC10134429 DOI: 10.1021/acs.jctc.2c01235] [Citation(s) in RCA: 12] [Impact Index Per Article: 12.0] [Received: 12/06/2022] [Indexed: 03/29/2023]
Abstract
Chemists can be skeptical in using deep learning (DL) in decision making, due to the lack of interpretability in "black-box" models. Explainable artificial intelligence (XAI) is a branch of artificial intelligence (AI) which addresses this drawback by providing tools to interpret DL models and their predictions. We review the principles of XAI in the domain of chemistry and emerging methods for creating and evaluating explanations. Then, we focus on methods developed by our group and their applications in predicting solubility, blood-brain barrier permeability, and the scent of molecules. We show that XAI methods like chemical counterfactuals and descriptor explanations can explain DL predictions while giving insight into structure-property relationships. Finally, we discuss how a two-step process of developing a black-box model and explaining predictions can uncover structure-property relationships.
Affiliation(s)
- Geemi P. Wellawatte
- Department of Chemistry, University of Rochester, Rochester, New York 14627, United States
| | - Heta A. Gandhi
- Department of Chemical Engineering, University of Rochester, Rochester, New York 14627, United States
| | - Aditi Seshadri
- Department of Chemical Engineering, University of Rochester, Rochester, New York 14627, United States
| | - Andrew D. White
- Department of Chemical Engineering, University of Rochester, Rochester, New York 14627, United States
| |
|
26
|
Luukkonen S, van den Maagdenberg HW, Emmerich MTM, van Westen GJP. Artificial intelligence in multi-objective drug design. Curr Opin Struct Biol 2023; 79:102537. [PMID: 36774727 DOI: 10.1016/j.sbi.2023.102537] [Citation(s) in RCA: 19] [Impact Index Per Article: 19.0] [Received: 10/19/2022] [Revised: 12/21/2022] [Accepted: 01/03/2023] [Indexed: 02/12/2023]
Abstract
The factors determining a drug's success are manifold, making de novo drug design an inherently multi-objective optimisation (MOO) problem. With the advent of machine learning and optimisation methods, the field of multi-objective compound design has seen a rapid increase in developments and applications. Population-based metaheuristics and deep reinforcement learning are the most commonly used artificial intelligence methods in the field, but recently conditional learning methods are gaining popularity. The former approaches are coupled with a MOO strategy which is most commonly an aggregation function, but Pareto-based strategies are widespread too. Besides these and conditional learning, various innovative approaches to tackle MOO in drug design have been proposed. Here we provide a brief overview of the field and the latest innovations.
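The two MOO strategies named in this abstract, an aggregation function and a Pareto-based selection, can be contrasted with a minimal sketch. This is an illustration only and is not taken from the review: the function names, the equal weights, and the four two-objective candidate scores are hypothetical, and both objectives are assumed to be maximised.

```python
from typing import List, Sequence, Tuple


def weighted_sum(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Aggregation strategy: collapse several objectives into one scalar."""
    return sum(w * s for w, s in zip(weights, scores))


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Pareto dominance: a is no worse than b on every objective
    and strictly better on at least one (all objectives maximised)."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical candidates scored on two objectives, e.g. (affinity, drug-likeness).
candidates = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5)]

# Aggregation returns a single winner that depends on the chosen weights.
best_by_sum = max(candidates, key=lambda c: weighted_sum(c, (0.5, 0.5)))

# Pareto selection returns the whole set of trade-off solutions.
front = pareto_front(candidates)
```

With these assumed scores, aggregation picks one compromise candidate, while the Pareto front also keeps the two extreme trade-offs and drops only the candidate that is beaten on both objectives.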
Affiliation(s)
- Sohvi Luukkonen
- Leiden Academic Centre for Drug Research, Leiden University, Einsteinweg 55, Leiden, 2333 CC, the Netherlands. https://twitter.com/sohvi_luukkonen
| | - Helle W van den Maagdenberg
- Leiden Academic Centre for Drug Research, Leiden University, Einsteinweg 55, Leiden, 2333 CC, the Netherlands
| | - Michael T M Emmerich
- Leiden Institute of Advanced Computer Science, Leiden University, Niels Bohrweg 1, Leiden, 2333 CC, the Netherlands
| | - Gerard J P van Westen
- Leiden Academic Centre for Drug Research, Leiden University, Einsteinweg 55, Leiden, 2333 CC, the Netherlands.
| |
|
27
|
Brinkhaus HO, Rajan K, Schaub J, Zielesny A, Steinbeck C. Open data and algorithms for open science in AI-driven molecular informatics. Curr Opin Struct Biol 2023; 79:102542. [PMID: 36805192 DOI: 10.1016/j.sbi.2023.102542] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2022] [Revised: 01/10/2023] [Accepted: 01/13/2023] [Indexed: 02/19/2023]
Abstract
Recent years have seen a sharp increase in the development of deep learning and artificial intelligence-based molecular informatics. There has been a growing interest in applying deep learning to several subfields, including the digital transformation of synthetic chemistry, extraction of chemical information from the scientific literature, and AI in natural product-based drug discovery. The application of AI to molecular informatics is still constrained by the fact that most of the data used for training and testing deep learning models are not available as FAIR and open data. As open science practices continue to grow in popularity, initiatives which support FAIR and open data as well as open-source software have emerged. It is becoming increasingly important for researchers in the field of molecular informatics to embrace open science and to submit data and software in open repositories. With the advent of open-source deep learning frameworks and cloud computing platforms, academic researchers are now able to deploy and test their own deep learning models with ease. With the development of new and faster hardware for deep learning and the increasing number of initiatives towards digital research data management infrastructures, as well as a culture promoting open data, open source, and open science, AI-driven molecular informatics will continue to grow. This review examines the current state of open data and open algorithms in molecular informatics, as well as ways in which they could be improved in future.
Affiliation(s)
- Henning Otto Brinkhaus
- Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessingstr. 8, 07743 Jena, Germany
| | - Kohulan Rajan
- Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessingstr. 8, 07743 Jena, Germany
| | - Jonas Schaub
- Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessingstr. 8, 07743 Jena, Germany
| | - Achim Zielesny
- Institute for Bioinformatics and Chemoinformatics, Westphalian University of Applied Sciences, August-Schmidt-Ring 10, 45665 Recklinghausen, Germany
| | - Christoph Steinbeck
- Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessingstr. 8, 07743 Jena, Germany.
| |
|
28
|
Danel T, Łęski J, Podlewska S, Podolak IT. Docking-based generative approaches in the search for new drug candidates. Drug Discov Today 2023; 28:103439. [PMID: 36372330 DOI: 10.1016/j.drudis.2022.103439] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 06/30/2022] [Revised: 10/08/2022] [Accepted: 11/08/2022] [Indexed: 11/13/2022]
Abstract
Despite the popularity of virtual screening (VS) of existing compound libraries, the search for new potential drug candidates also takes advantage of generative protocols, where new compound suggestions are enumerated using various algorithms. To increase the activity potency of generative approaches, they have recently been coupled with molecular docking, a leading methodology of structure-based drug design (SBDD). In this review, we summarize progress since docking-based generative models emerged. We propose a new taxonomy for these methods and discuss their importance for the field of computer-aided drug design (CADD). In addition, we discuss the most promising directions for the further development of generative protocols coupled with docking.
Affiliation(s)
- Tomasz Danel
- Faculty of Mathematics and Computer Science, Jagiellonian University, 6 Łojasiewicza Street, 30-348 Kraków, Poland.
| | - Jan Łęski
- Faculty of Mathematics and Computer Science, Jagiellonian University, 6 Łojasiewicza Street, 30-348 Kraków, Poland
| | - Sabina Podlewska
- Maj Institute of Pharmacology, Polish Academy of Sciences, Department of Medicinal Chemistry, 31-343 Kraków, Smętna Street 12, Poland
| | - Igor T Podolak
- Faculty of Mathematics and Computer Science, Jagiellonian University, 6 Łojasiewicza Street, 30-348 Kraków, Poland
| |
|
29
|
Urbina F, Ekins S. The Commoditization of AI for Molecule Design. ARTIFICIAL INTELLIGENCE IN THE LIFE SCIENCES 2022; 2:100031. [PMID: 36211981 PMCID: PMC9541920 DOI: 10.1016/j.ailsci.2022.100031] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 02/03/2023]
Abstract
Anyone involved in designing or finding molecules in the life sciences over the past few years has witnessed a dramatic change in how we now work due to the COVID-19 pandemic. Computational technologies like artificial intelligence (AI) seemed to become ubiquitous in 2020 and have been increasingly applied as scientists worked from home and were separated from the laboratory and their colleagues. This shift may be more permanent as the future of molecule design across different industries will increasingly require machine learning models for design and optimization of molecules as they become "designed by AI". AI and machine learning have essentially become a commodity within the pharmaceutical industry. This perspective will briefly describe our personal opinions of how machine learning has evolved and is being applied to model different molecule properties that cross industries in their utility, which ultimately suggests the potential for tight integration of AI into equipment and automated experimental pipelines. It will also describe how many groups have implemented generative models covering different architectures for de novo design of molecules. We also highlight some of the companies at the forefront of using AI to demonstrate how machine learning has impacted and influenced our work. Finally, we will peer into the future and suggest some of the areas that represent the most interesting technologies that may shape the future of molecule design, highlighting how we can help increase the efficiency of the design-make-test cycle, which is currently a major focus across industries.
Affiliation(s)
- Fabio Urbina
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC 27606, USA
| | - Sean Ekins
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC 27606, USA
| |
|
30
|
Bon M, Bilsland A, Bower J, McAulay K. Fragment-based drug discovery-the importance of high-quality molecule libraries. Mol Oncol 2022; 16:3761-3777. [PMID: 35749608 PMCID: PMC9627785 DOI: 10.1002/1878-0261.13277] [Citation(s) in RCA: 33] [Impact Index Per Article: 16.5] [Received: 01/28/2022] [Revised: 05/16/2022] [Accepted: 06/23/2022] [Indexed: 12/24/2022] Open
Abstract
Fragment-based drug discovery (FBDD) is now established as a complementary approach to high-throughput screening (HTS). Contrary to HTS, where large libraries of drug-like molecules are screened, FBDD screens involve smaller and less complex molecules which, despite a low affinity to protein targets, display more 'atom-efficient' binding interactions than larger molecules. Fragment hits can, therefore, serve as a more efficient start point for subsequent optimisation, particularly for hard-to-drug targets. Since the number of possible molecules increases exponentially with molecular size, small fragment libraries allow for a proportionately greater coverage of their respective 'chemical space' compared with larger HTS libraries comprising larger molecules. However, good library design is essential to ensure optimal chemical and pharmacophore diversity, molecular complexity, and physicochemical characteristics. In this review, we describe our views on fragment library design, and on what constitutes a good fragment from a medicinal and computational chemistry perspective. We highlight emerging chemical and computational technologies in FBDD and discuss strategies for optimising fragment hits. The impact of novel FBDD approaches is already being felt, with the recent approval of the covalent KRAS G12C inhibitor sotorasib highlighting the utility of FBDD against targets that were long considered undruggable.
Affiliation(s)
- Marta Bon
- Cancer Research Horizons, Cancer Research UK Beatson Institute, Glasgow, UK
| | - Alan Bilsland
- Cancer Research Horizons, Cancer Research UK Beatson Institute, Glasgow, UK
| | - Justin Bower
- Cancer Research Horizons, Cancer Research UK Beatson Institute, Glasgow, UK
| | - Kirsten McAulay
- Cancer Research Horizons, Cancer Research UK Beatson Institute, Glasgow, UK
| |
|
31
|
Krenn M, Ai Q, Barthel S, Carson N, Frei A, Frey NC, Friederich P, Gaudin T, Gayle AA, Jablonka KM, Lameiro RF, Lemm D, Lo A, Moosavi SM, Nápoles-Duarte JM, Nigam A, Pollice R, Rajan K, Schatzschneider U, Schwaller P, Skreta M, Smit B, Strieth-Kalthoff F, Sun C, Tom G, Falk von Rudorff G, Wang A, White AD, Young A, Yu R, Aspuru-Guzik A. SELFIES and the future of molecular string representations. PATTERNS (NEW YORK, N.Y.) 2022; 3:100588. [PMID: 36277819 PMCID: PMC9583042 DOI: 10.1016/j.patter.2022.100588] [Citation(s) in RCA: 42] [Impact Index Per Article: 21.0] [Indexed: 11/06/2022]
Abstract
Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings; most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELF-referencing embedded strings (SELFIES). SELFIES has since simplified and enabled numerous new applications in chemistry. In this perspective, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete future projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages, and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.
Affiliation(s)
- Mario Krenn
- Max Planck Institute for the Science of Light (MPL), Erlangen, Germany
| | - Qianxiang Ai
- Department of Chemistry, Fordham University, The Bronx, NY, USA
| | - Senja Barthel
- Department of Mathematics, Vrije Universiteit Amsterdam, Amsterdam, the Netherlands
| | - Nessa Carson
- Syngenta Jealott’s Hill International Research Centre, Bracknell, Berkshire, UK
| | - Angelo Frei
- Department of Chemistry, Imperial College London, Molecular Sciences Research Hub, White City Campus, Wood Lane, London, UK
| | - Nathan C. Frey
- Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Pascal Friederich
- Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Institute of Nanotechnology, Karlsruhe Institute of Technology, Eggenstein-Leopoldshafen, Germany
| | - Théophile Gaudin
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- IBM Research Europe, Zürich, Switzerland
| | | | - Kevin Maik Jablonka
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL), Sion, Valais, Switzerland
| | - Rafael F. Lameiro
- Medicinal and Biological Chemistry Group, São Carlos Institute of Chemistry, University of São Paulo, São Paulo, Brazil
| | - Dominik Lemm
- Faculty of Physics, University of Vienna, Vienna, Austria
| | - Alston Lo
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
| | - Seyed Mohamad Moosavi
- Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
| | | | - AkshatKumar Nigam
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Robert Pollice
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, ON, Canada
| | - Kohulan Rajan
- Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller Universität Jena, Jena, Germany
| | - Ulrich Schatzschneider
- Institut für Anorganische Chemie, Julius-Maximilians-Universität Würzburg, Würzburg, Germany
| | - Philippe Schwaller
- IBM Research Europe, Zürich, Switzerland
- Laboratory of Artificial Chemical Intelligence (LIAC), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
- National Centre of Competence in Research (NCCR) Catalysis, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
| | - Marta Skreta
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Vector Institute for Artificial Intelligence, Toronto, ON, Canada
| | - Berend Smit
- Laboratory of Molecular Simulation (LSMO), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL), Sion, Valais, Switzerland
| | - Felix Strieth-Kalthoff
- Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, ON, Canada
| | - Chong Sun
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
| | - Gary Tom
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, ON, Canada
| | | | - Andrew Wang
- Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, ON, Canada
- Solar Fuels Group, Department of Chemistry, University of Toronto, Toronto, ON, Canada
| | - Andrew D. White
- Department of Chemical Engineering, University of Rochester, Rochester, NY, USA
| | - Adamo Young
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Vector Institute for Artificial Intelligence, Toronto, ON, Canada
| | - Rose Yu
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA
| | - Alán Aspuru-Guzik
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, ON, Canada
- Vector Institute for Artificial Intelligence, Toronto, ON, Canada
- Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, ON, Canada
- Department of Materials Science, University of Toronto, Toronto, ON, Canada
- Canadian Institute for Advanced Research (CIFAR) Lebovic Fellow, Toronto, ON, Canada
| |
|
32
|
Thomas M, O’Boyle NM, Bender A, de Graaf C. Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation. J Cheminform 2022; 14:68. [PMID: 36192789 PMCID: PMC9531503 DOI: 10.1186/s13321-022-00646-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2022] [Accepted: 09/23/2022] [Indexed: 11/10/2022] Open
Abstract
A plethora of AI-based techniques now exists to conduct de novo molecule generation that can devise molecules conditioned towards a particular endpoint in the context of drug design. One popular approach is using reinforcement learning to update a recurrent neural network or language-based de novo molecule generator. However, reinforcement learning can be inefficient, sometimes requiring up to 10^5 molecules to be sampled to optimize more complex objectives, which poses a limitation when using computationally expensive scoring functions like docking or computer-aided synthesis planning models. In this work, we propose a reinforcement learning strategy called Augmented Hill-Climb based on a simple, hypothesis-driven hybrid between REINVENT and Hill-Climb that improves sample-efficiency by addressing the limitations of both currently used strategies. We compare its ability to optimize several docking tasks with REINVENT and benchmark this strategy against other commonly used reinforcement learning strategies including REINFORCE, REINVENT (version 1 and 2), Hill-Climb and best agent reminder. We find that optimization ability is improved ~1.5-fold and sample-efficiency is improved ~45-fold compared to REINVENT while still delivering appealing chemistry as output. Diversity filters were used, and their parameters were tuned to overcome observed failure modes that take advantage of certain diversity filter configurations. We find that Augmented Hill-Climb outperforms the other reinforcement learning strategies used on six tasks, especially in the early stages of training or for more difficult objectives. Lastly, we show improved performance not only on recurrent neural networks but also on a reinforcement learning stabilized transformer architecture. Overall, we show that Augmented Hill-Climb improves sample-efficiency for language-based de novo molecule generation conditioning via reinforcement learning, compared to the current state-of-the-art. This makes more computationally expensive scoring functions, such as docking, more accessible on a relevant timescale.
Affiliation(s)
- Morgan Thomas
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, CB2 1EW UK
| | - Noel M. O’Boyle
- Computational Chemistry, Sosei Heptares, Steinmetz Building, Granta Park, Great Abington, Cambridge, CB21 6DG UK
| | - Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, CB2 1EW UK
| | - Chris de Graaf
- Computational Chemistry, Sosei Heptares, Steinmetz Building, Granta Park, Great Abington, Cambridge, CB21 6DG UK
| |
|
33
|
Nigam A, Pollice R, Aspuru-Guzik A. Parallel tempered genetic algorithm guided by deep neural networks for inverse molecular design. DIGITAL DISCOVERY 2022; 1:390-404. [PMID: 36091415 PMCID: PMC9358752 DOI: 10.1039/d2dd00003b] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 01/19/2022] [Accepted: 05/03/2022] [Indexed: 12/30/2022]
Abstract
Inverse molecular design involves algorithms that sample molecules with specific target properties from a multitude of candidates and can be posed as an optimization problem. High-dimensional optimization tasks in the natural sciences are commonly tackled via population-based metaheuristic optimization algorithms such as evolutionary algorithms. However, often unavoidable expensive property evaluation can limit the widespread use of such approaches as the associated cost can become prohibitive. Herein, we present JANUS, a genetic algorithm inspired by parallel tempering. It propagates two populations, one for exploration and another for exploitation, improving optimization by reducing property evaluations. JANUS is augmented by a deep neural network that approximates molecular properties and relies on active learning for enhanced molecular sampling. It uses the SELFIES representation and the STONED algorithm for the efficient generation of structures, and outperforms other generative models in common inverse molecular design tasks achieving state-of-the-art target metrics across multiple benchmarks. As neither most of the benchmarks nor the structure generator in JANUS account for synthesizability, a significant fraction of the proposed molecules is synthetically infeasible demonstrating that this aspect needs to be considered when evaluating the performance of molecular generative models.
Affiliation(s)
- AkshatKumar Nigam
- Department of Computer Science, Stanford University, USA
- Department of Computer Science, University of Toronto, Canada
- Department of Chemistry, University of Toronto, Canada
| | - Robert Pollice
- Department of Computer Science, University of Toronto, Canada
- Department of Chemistry, University of Toronto, Canada
| | - Alán Aspuru-Guzik
- Department of Computer Science, University of Toronto, Canada
- Department of Chemistry, University of Toronto, Canada
- Vector Institute for Artificial Intelligence, Toronto, Canada
- Lebovic Fellow, Canadian Institute for Advanced Research (CIFAR), 661 University Ave, Toronto, Ontario M5G, Canada
| |
|
34
|
García-Ortegón M, Simm GNC, Tripp AJ, Hernández-Lobato JM, Bender A, Bacallado S. DOCKSTRING: Easy Molecular Docking Yields Better Benchmarks for Ligand Design. J Chem Inf Model 2022; 62:3486-3502. [PMID: 35849793 PMCID: PMC9364321 DOI: 10.1021/acs.jcim.1c01334] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 11/03/2021] [Indexed: 01/05/2023]
Abstract
The field of machine learning for drug discovery is witnessing an explosion of novel methods. These methods are often benchmarked on simple physicochemical properties such as solubility or general druglikeness, which can be readily computed. However, these properties are poor representatives of objective functions in drug design, mainly because they do not depend on the candidate compound's interaction with the target. By contrast, molecular docking is a widely applied method in drug discovery to estimate binding affinities. However, docking studies require a significant amount of domain knowledge to set up correctly, which hampers adoption. Here, we present dockstring, a bundle for meaningful and robust comparison of ML models using docking scores. dockstring consists of three components: (1) an open-source Python package for straightforward computation of docking scores, (2) an extensive dataset of docking scores and poses of more than 260,000 molecules for 58 medically relevant targets, and (3) a set of pharmaceutically relevant benchmark tasks such as virtual screening or de novo design of selective kinase inhibitors. The Python package implements a robust ligand and target preparation protocol that allows nonexperts to obtain meaningful docking scores. Our dataset is the first to include docking poses, as well as the first of its size that is a full matrix, thus facilitating experiments in multiobjective optimization and transfer learning. Overall, our results indicate that docking scores are a more realistic evaluation objective than simple physicochemical properties, yielding benchmark tasks that are more challenging and more closely related to real problems in drug discovery.
Affiliation(s)
- Miguel García-Ortegón
- Statistical Laboratory, Centre for Mathematical Sciences, University of Cambridge, Wilberforce Rd., Cambridge CB3 0WB, United Kingdom
| | - Gregor N. C. Simm
- Department of Engineering, University of Cambridge, Trumpington St., Cambridge CB2 1PZ, United Kingdom
| | - Austin J. Tripp
- Department of Engineering, University of Cambridge, Trumpington St., Cambridge CB2 1PZ, United Kingdom
| | | | - Andreas Bender
- Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Rd., Cambridge CB2 1EW, United Kingdom
| | - Sergio Bacallado
- Statistical Laboratory, Centre for Mathematical Sciences, University of Cambridge, Wilberforce Rd., Cambridge CB3 0WB, United Kingdom
| |
|
35
|
Guo M, Shou W, Makatura L, Erps T, Foshey M, Matusik W. Polygrammar: Grammar for Digital Polymer Representation and Generation. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2022; 9:e2101864. [PMID: 35678650 PMCID: PMC9376847 DOI: 10.1002/advs.202101864] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 05/05/2021] [Revised: 12/04/2021] [Indexed: 05/22/2023]
Abstract
Polymers are widely studied materials with diverse properties and applications determined by molecular structures. It is essential to represent these structures clearly and explore the full space of achievable chemical designs. However, existing approaches cannot offer comprehensive design models for polymers because of their inherent scale and structural complexity. Here, a parametric, context-sensitive grammar designed specifically for polymers (PolyGrammar) is proposed. Using the symbolic hypergraph representation and 14 simple production rules, PolyGrammar can represent and generate all valid polyurethane structures. An algorithm is presented to translate any polyurethane structure from the popular Simplified Molecular-Input Line-entry System (SMILES) string format into the PolyGrammar representation. The representative power of PolyGrammar is tested by translating a dataset of over 600 polyurethane samples collected from the literature. Furthermore, it is shown that PolyGrammar can be easily extended to other copolymers and homopolymers. By offering a complete, explicit representation scheme and an explainable generative model with validity guarantees, PolyGrammar takes an essential step toward a more comprehensive and practical system for polymer discovery and exploration. As the first bridge between formal languages and chemistry, PolyGrammar also serves as a critical blueprint to inform the design of similar grammars for other chemistries, including organic and inorganic molecules.
Affiliation(s)
- Minghao Guo
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- CUHK Multimedia Lab, The Chinese University of Hong Kong, Sha Tin, Hong Kong
| | - Wan Shou
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Liane Makatura
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Timothy Erps
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Michael Foshey
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Wojciech Matusik
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| |
|
36
|
Weinreich J, Lemm D, von Rudorff GF, von Lilienfeld OA. Ab initio machine learning of phase space averages. J Chem Phys 2022; 157:024303. [PMID: 35840379 DOI: 10.1063/5.0095674] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022] Open
Abstract
Equilibrium structures determine material properties and biochemical functions. We here propose to machine learn phase space averages, conventionally obtained by ab initio or force-field-based molecular dynamics (MD) or Monte Carlo (MC) simulations. In analogy to ab initio MD, our ab initio machine learning (AIML) model does not require bond topologies and, therefore, enables a general machine learning pathway to obtain ensemble properties throughout the chemical compound space. We demonstrate AIML for predicting Boltzmann averaged structures after training on hundreds of MD trajectories. The AIML output is subsequently used to train machine learning models of free energies of solvation using experimental data and to reach competitive prediction errors (mean absolute error ∼0.8 kcal/mol) for out-of-sample molecules, within milliseconds. As such, AIML effectively bypasses the need for MD or MC-based phase space sampling, enabling exploration campaigns of Boltzmann averages throughout the chemical compound space at a much accelerated pace. We contextualize our findings by comparison to state-of-the-art methods resulting in a Pareto plot for the free energy of solvation predictions in terms of accuracy and time.
Affiliation(s)
- Jan Weinreich
- Faculty of Physics, University of Vienna, Kolingasse 14-16, AT-1090 Wien, Austria
- Dominik Lemm
- Faculty of Physics, University of Vienna, Kolingasse 14-16, AT-1090 Wien, Austria
37
Verhellen J. Graph-based molecular Pareto optimisation. Chem Sci 2022; 13:7526-7535. [PMID: 35872811 PMCID: PMC9241971 DOI: 10.1039/d2sc00821a] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 02/08/2022] [Accepted: 06/02/2022] [Indexed: 12/02/2022]
Abstract
Computer-assisted design of small molecules has experienced a resurgence in academic and industrial interest due to the widespread use of data-driven techniques such as deep generative models. While the ability to generate molecules that fulfil required chemical properties is encouraging, the use of deep learning models requires significant, if not prohibitive, amounts of data and computational power. At the same time, open-sourcing of more traditional techniques such as graph-based genetic algorithms for molecular optimisation [Jensen, Chem. Sci., 2019, 12, 3567-3572] has shown that simple and training-free algorithms can be efficient and robust alternatives. Further research alleviated the common genetic algorithm issue of evolutionary stagnation by enforcing molecular diversity during optimisation [Van den Abeele, Chem. Sci., 2020, 42, 11485-11491]. The crucial lesson distilled from the simultaneous development of deep generative models and advanced genetic algorithms has been the importance of chemical space exploration [Aspuru-Guzik, Chem. Sci., 2021, 12, 7079-7090]. For single-objective optimisation problems, chemical space exploration had to be discovered as a useable resource but in multi-objective optimisation problems, an exploration of trade-offs between conflicting objectives is inherently present. In this paper we provide state-of-the-art and open-source implementations of two generations of graph-based non-dominated sorting genetic algorithms (NSGA-II, NSGA-III) for molecular multi-objective optimisation. We provide the results of a series of benchmarks for the inverse design of small molecule drugs for both the NSGA-II and NSGA-III algorithms. In addition, we introduce the dominated hypervolume and extended fingerprint based internal similarity as novel metrics for these benchmarks. 
By design, NSGA-II, and NSGA-III outperform a single optimisation method baseline in terms of dominated hypervolume, but remarkably our results show they do so without relying on a greater internal chemical diversity.
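The two ideas this abstract leans on, non-dominated (Pareto) filtering and the dominated hypervolume, can be sketched in a few lines. This is a minimal illustration for two maximised objectives, not the paper's NSGA-II/NSGA-III implementation; the function names, the tuple representation of candidates, and the reference point `(0.0, 0.0)` are assumptions of the example.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that no other point dominates (input order is kept)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def hypervolume_2d(points, reference=(0.0, 0.0)):
    """Area dominated by `points` relative to `reference` for two maximised objectives.

    Points are expected to be 2-tuples so that duplicates can be removed with a set.
    """
    # On the front, sorting by the first objective (descending) makes the
    # second objective ascending, so the area is a sum of horizontal slabs.
    front = sorted(set(pareto_front(points)), key=lambda p: p[0], reverse=True)
    area = 0.0
    previous_y = reference[1]
    for x, y in front:
        if x <= reference[0] or y <= previous_y:
            continue  # contributes nothing beyond the reference or earlier slabs
        area += (x - reference[0]) * (y - previous_y)
        previous_y = y
    return area


candidates = [(1, 3), (2, 2), (3, 1), (1, 1), (2, 1)]
print(pareto_front(candidates))
print(hypervolume_2d(candidates))
```

A larger dominated hypervolume means the front covers more of the objective space, which is why it can serve as a single-number summary when comparing multi-objective optimisers.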
Affiliation(s)
- Jonas Verhellen
- Centre for Integrative Neuroplasticity, University of Oslo, N-0316 Oslo, Norway
38
Liu T, Johnson KR, Jansone-Popova S, Jiang DE. Advancing Rare-Earth Separation by Machine Learning. JACS Au 2022; 2:1428-1434. [PMID: 35783179 PMCID: PMC9241157 DOI: 10.1021/jacsau.2c00122] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 02/22/2022] [Revised: 05/24/2022] [Accepted: 06/01/2022] [Indexed: 05/24/2023]
Abstract
Constituting the bulk of rare-earth elements, lanthanides need to be separated to fully realize their potential as critical materials in many important technologies. The discovery of new ligands for improving rare-earth separations by solvent extraction, the most practical rare-earth separation process, is still largely based on trial and error, a low-throughput and inefficient approach. A predictive model that allows high-throughput screening of ligands is needed to identify suitable ligands to achieve enhanced separation performance. Here, we show that deep neural networks, trained on the available experimental data, can be used to predict accurate distribution coefficients for solvent extraction of lanthanide ions, thereby opening the door to high-throughput screening of ligands for rare-earth separations. One innovative approach that we employed is a combined representation of ligands with both molecular physicochemical descriptors and atomic extended-connectivity fingerprints, which greatly boosts the accuracy of the trained model. More importantly, we synthesized four new ligands and found that the predicted distribution coefficients from our trained machine-learning model match well with the measured values. Therefore, our machine-learning approach paves the way for accelerating the discovery of new ligands for rare-earth separations.
Affiliation(s)
- Tongyu Liu
- Department of Chemistry, University of California, Riverside, California 92521, United States
- Katherine R. Johnson
- Chemical Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States
- Santa Jansone-Popova
- Chemical Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States
- De-en Jiang
- Department of Chemistry, University of California, Riverside, California 92521, United States
39
Saldívar-González FI, Medina-Franco JL. Approaches for enhancing the analysis of chemical space for drug discovery. Expert Opin Drug Discov 2022; 17:789-798. [PMID: 35640229 DOI: 10.1080/17460441.2022.2084608] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 12/22/2022]
Abstract
INTRODUCTION Chemical space is a powerful, general, and practical conceptual framework in drug discovery and other areas in chemistry that addresses the diversity of molecules and it has various applications. Moreover, chemical space is a cornerstone of chemoinformatics as a scientific discipline. In response to the increase in the set of chemical compounds in databases, generators of chemical structures, and tools to calculate molecular descriptors, novel approaches to generate visual representations of chemical space in low dimensions are emerging and evolving. Such approaches include a wide range of commercial and free applications, software, and open-source methods. AREAS COVERED The current state of chemical space in drug design and discovery is reviewed. The topics discussed herein include advances for efficient navigation in chemical space, the use of this concept in assessing the diversity of different data sets, exploring structure-property/activity relationships for one or multiple endpoints, and compound library design. Recent advances in methodologies for generating visual representations of chemical space have been highlighted, thereby emphasizing open-source methods. EXPERT OPINION Quantitative and qualitative generation and analysis of chemical space require novel approaches for handling the increasing number of molecules and their information available in chemical databases (including emerging ultra-large libraries). In addition, it is of utmost importance to note that chemical space is a conceptual framework that goes beyond visual representation in low dimensions. However, the graphical representation of chemical space has several practical applications in drug discovery and beyond.
Affiliation(s)
- Fernanda I Saldívar-González
- DIFACQUIM Research Group, Department of Pharmacy, School of Chemistry, Universidad Nacional Autónoma de México, Avenida Universidad 3000, Mexico City 04510, Mexico
- José L Medina-Franco
- DIFACQUIM Research Group, Department of Pharmacy, School of Chemistry, Universidad Nacional Autónoma de México, Avenida Universidad 3000, Mexico City 04510, Mexico
40
Lu C, Liu S, Shi W, Yu J, Zhou Z, Zhang X, Lu X, Cai F, Xia N, Wang Y. Systemic evolutionary chemical space exploration for drug discovery. J Cheminform 2022; 14:19. [PMID: 35365231 PMCID: PMC8973791 DOI: 10.1186/s13321-022-00598-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2021] [Accepted: 03/11/2022] [Indexed: 11/29/2022]
Abstract
Chemical space exploration is a major task of the hit-finding process during the pursuit of novel chemical entities. Compared with other screening technologies, computational de novo design has become a popular approach to overcome the limitation of current chemical libraries. Here, we reported a de novo design platform named systemic evolutionary chemical space explorer (SECSE). The platform was conceptually inspired by fragment-based drug design, that miniaturized a “lego-building” process within the pocket of a certain target. The key to virtual hits generation was then turned into a computational search problem. To enhance search and optimization, human intelligence and deep learning were integrated. Application of SECSE against phosphoglycerate dehydrogenase (PHGDH), proved its potential in finding novel and diverse small molecules that are attractive starting points for further validation. This platform is open-sourced and the code is available at http://github.com/KeenThera/SECSE.
Affiliation(s)
- Chong Lu
- Keen Therapeutics Co., Ltd., Shanghai, China
- Shien Liu
- Keen Therapeutics Co., Ltd., Shanghai, China
- Weihua Shi
- Keen Therapeutics Co., Ltd., Shanghai, China
- Jun Yu
- Keen Therapeutics Co., Ltd., Shanghai, China
- Zhou Zhou
- Keen Therapeutics Co., Ltd., Shanghai, China
- Xiaoli Lu
- Keen Therapeutics Co., Ltd., Shanghai, China
- Faji Cai
- Keen Therapeutics Co., Ltd., Shanghai, China
- Yikai Wang
- Keen Therapeutics Co., Ltd., Shanghai, China
41
Wellawatte GP, Seshadri A, White AD. Model agnostic generation of counterfactual explanations for molecules. Chem Sci 2022; 13:3697-3705. [PMID: 35432902 PMCID: PMC8966631 DOI: 10.1039/d1sc05259d] [Citation(s) in RCA: 34] [Impact Index Per Article: 17.0] [Received: 09/22/2021] [Accepted: 02/06/2022] [Indexed: 11/25/2022]
Abstract
An outstanding challenge in deep learning in chemistry is its lack of interpretability. The inability of explaining why a neural network makes a prediction is a major barrier to deployment of AI models. This not only dissuades chemists from using deep learning predictions, but also has led to neural networks learning spurious correlations that are difficult to notice. Counterfactuals are a category of explanations that provide a rationale behind a model prediction with satisfying properties like providing chemical structure insights. Yet, counterfactuals have been previously limited to specific model architectures or required reinforcement learning as a separate process. In this work, we show a universal model-agnostic approach that can explain any black-box model prediction. We demonstrate this method on random forest models, sequence models, and graph neural networks in both classification and regression.
Affiliation(s)
- Aditi Seshadri
- Department of Chemical Engineering, University of Rochester, Rochester, NY, USA
- Andrew D White
- Department of Chemical Engineering, University of Rochester, Rochester, NY, USA
42
Bilodeau C, Jin W, Jaakkola T, Barzilay R, Jensen KF. Generative models for molecular discovery: Recent advances and challenges. WIREs Computational Molecular Science 2022. [DOI: 10.1002/wcms.1608] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 12/15/2022]
Affiliation(s)
- Camille Bilodeau
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Wengong Jin
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Tommi Jaakkola
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Regina Barzilay
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Klavs F. Jensen
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
43
Wigh DS, Goodman JM, Lapkin AA. A review of molecular representation in the age of machine learning. WIREs Computational Molecular Science 2022. [DOI: 10.1002/wcms.1603] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 01/08/2023]
Affiliation(s)
- Daniel S. Wigh
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge, UK
- Alexei A. Lapkin
- Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge, UK
44
Iovanac NC, MacKnight R, Savoie BM. Actively Searching: Inverse Design of Novel Molecules with Simultaneously Optimized Properties. J Phys Chem A 2022; 126:333-340. [PMID: 34985908 DOI: 10.1021/acs.jpca.1c08191] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 12/23/2022]
Abstract
Combining quantum chemistry characterizations with generative machine learning models has the potential to accelerate molecular discovery. In this paradigm, quantum chemistry acts as a relatively cost-effective oracle for evaluating the properties of particular molecules, while generative models provide a means of sampling chemical space based on learned structure-function relationships. For practical applications, multiple potentially orthogonal properties must be optimized in tandem during a discovery workflow. This carries additional difficulties associated with the specificity of the targets and the ability for the model to reconcile all properties simultaneously. Here, we demonstrate an active learning approach to improve the performance of multi-target generative chemical models. We first demonstrate the effectiveness of a set of baseline models trained on single property prediction tasks in generating novel compounds (i.e., not present in the training data) with various property targets, including both interpolative and extrapolative generation scenarios. For property ranges where accurate targeting proves difficult, the novel compounds suggested by the model are characterized using quantum chemistry and the new molecules closest to expressing the desired properties are fed back into the generative model for additional training. This gradually improves the generative models' understanding of targeted areas of chemical space and shifts the distribution of the generated compounds toward the targeted values. We then demonstrate the effectiveness of this active learning approach in generating compounds with multiple chemical constraints, including vertical ionization potential, electron affinity, and dipole moment targets, and validate the results at the ωB97X-D3/def2-TZVP level. 
This method requires no modifications to extant generative approaches, but rather utilizes their inherent generative and predictive aspects for self-refinement, and can be applied to situations where any number of properties with varying degrees of correlation must be optimized simultaneously.
Affiliation(s)
- Nicolae C Iovanac
- Charles D. Davidson School of Chemical Engineering, Purdue University, 480 Stadium Mall Drive, West Lafayette, Indiana 47906, United States
- Robert MacKnight
- Charles D. Davidson School of Chemical Engineering, Purdue University, 480 Stadium Mall Drive, West Lafayette, Indiana 47906, United States
- Brett M Savoie
- Charles D. Davidson School of Chemical Engineering, Purdue University, 480 Stadium Mall Drive, West Lafayette, Indiana 47906, United States
45
Steiner M, Reiher M. Autonomous Reaction Network Exploration in Homogeneous and Heterogeneous Catalysis. Top Catal 2022; 65:6-39. [PMID: 35185305 PMCID: PMC8816766 DOI: 10.1007/s11244-021-01543-9] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Accepted: 11/17/2021] [Indexed: 12/11/2022]
Abstract
Autonomous computations that rely on automated reaction network elucidation algorithms may pave the way to make computational catalysis on a par with experimental research in the field. Several advantages of this approach are key to catalysis: (i) automation allows one to consider orders of magnitude more structures in a systematic and open-ended fashion than what would be accessible by manual inspection. Eventually, full resolution in terms of structural varieties and conformations as well as with respect to the type and number of potentially important elementary reaction steps (including decomposition reactions that determine turnover numbers) may be achieved. (ii) Fast electronic structure methods with uncertainty quantification warrant high efficiency and reliability in order to not only deliver results quickly, but also to allow for predictive work. (iii) A high degree of autonomy reduces the amount of manual human work, processing errors, and human bias. Although being inherently unbiased, it is still steerable with respect to specific regions of an emerging network and with respect to the addition of new reactant species. This allows for a high fidelity of the formalization of some catalytic process and for surprising in silico discoveries. In this work, we first review the state of the art in computational catalysis to embed autonomous explorations into the general field from which it draws its ingredients. We then elaborate on the specific conceptual issues that arise in the context of autonomous computational procedures, some of which we discuss using an example catalytic system. Supplementary information: the online version contains supplementary material available at 10.1007/s11244-021-01543-9.
Affiliation(s)
- Miguel Steiner
- Laboratory of Physical Chemistry, ETH Zurich, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
- Markus Reiher
- Laboratory of Physical Chemistry, ETH Zurich, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
46
Molecular generation by Fast Assembly of (Deep)SMILES fragments. J Cheminform 2021; 13:88. [PMID: 34775976 PMCID: PMC8591910 DOI: 10.1186/s13321-021-00566-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 08/25/2021] [Accepted: 11/02/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND In recent years, in silico molecular design is regaining interest. To generate on a computer molecules with optimized properties, scoring functions can be coupled with a molecular generator to design novel molecules with a desired property profile. RESULTS In this article, a simple method is described to generate only valid molecules at high frequency ([Formula: see text] molecule/s using a single CPU core), given a molecular training set. The proposed method generates diverse SMILES (or DeepSMILES) encoded molecules while also showing some propensity at training set distribution matching. When working with DeepSMILES, the method reaches peak performance ([Formula: see text] molecule/s) because it relies almost exclusively on string operations. The "Fast Assembly of SMILES Fragments" software is released as open-source at https://github.com/UnixJunkie/FASMIFRA . Experiments regarding speed, training set distribution matching, molecular diversity and benchmark against several other methods are also shown.
47
Zagidullin B, Wang Z, Guan Y, Pitkänen E, Tang J. Comparative analysis of molecular fingerprints in prediction of drug combination effects. Brief Bioinform 2021; 22:bbab291. [PMID: 34401895 PMCID: PMC8574997 DOI: 10.1093/bib/bbab291] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 04/12/2021] [Revised: 06/01/2021] [Accepted: 07/07/2021] [Indexed: 12/18/2022]
Abstract
Application of machine and deep learning methods in drug discovery and cancer research has gained a considerable amount of attention in the past years. As the field grows, it becomes crucial to systematically evaluate the performance of novel computational solutions in relation to established techniques. To this end, we compare rule-based and data-driven molecular representations in prediction of drug combination sensitivity and drug synergy scores using standardized results of 14 high-throughput screening studies, comprising 64 200 unique combinations of 4153 molecules tested in 112 cancer cell lines. We evaluate the clustering performance of molecular representations and quantify their similarity by adapting the Centered Kernel Alignment metric. Our work demonstrates that to identify an optimal molecular representation type, it is necessary to supplement quantitative benchmark results with qualitative considerations, such as model interpretability and robustness, which may vary between and throughout preclinical drug development projects.
Affiliation(s)
- B Zagidullin
- Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Finland
- Z Wang
- Department of Electrical Engineering & Computer Science, University of Michigan, Ann Arbor, USA
- Y Guan
- Department of Computational Medicine & Bioinformatics, University of Michigan, Ann Arbor, USA
- E Pitkänen
- Institute for Molecular Medicine Finland (FIMM) & Applied Tumor Genomics Research Program, Research Programs Unit, University of Helsinki, Finland
- J Tang
- Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Finland
48
Li Y, Pei J, Lai L. Structure-based de novo drug design using 3D deep generative models. Chem Sci 2021; 12:13664-13675. [PMID: 34760151 PMCID: PMC8549794 DOI: 10.1039/d1sc04444c] [Citation(s) in RCA: 52] [Impact Index Per Article: 17.3] [Received: 08/13/2021] [Accepted: 09/09/2021] [Indexed: 12/14/2022]
Abstract
Deep generative models are attracting much attention in the field of de novo molecule design. Compared to traditional methods, deep generative models can be trained in a fully data-driven way with little requirement for expert knowledge. Although many models have been developed to generate 1D and 2D molecular structures, 3D molecule generation is less explored, and the direct design of drug-like molecules inside target binding sites remains challenging. In this work, we introduce DeepLigBuilder, a novel deep learning-based method for de novo drug design that generates 3D molecular structures in the binding sites of target proteins. We first developed Ligand Neural Network (L-Net), a novel graph generative model for the end-to-end design of chemically and conformationally valid 3D molecules with high drug-likeness. Then, we combined L-Net with Monte Carlo tree search to perform structure-based de novo drug design tasks. In the case study of inhibitor design for the main protease of SARS-CoV-2, DeepLigBuilder suggested a list of drug-like compounds with novel chemical structures, high predicted affinity, and similar binding features to those of known inhibitors. The current version of L-Net was trained on drug-like compounds from ChEMBL, which could be easily extended to other molecular datasets with desired properties based on users' demands and applied in functional molecule generation. Merging deep generative models with atomic-level interaction evaluation, DeepLigBuilder provides a state-of-the-art model for structure-based de novo drug design and lead optimization. DeepLigBuilder, a novel deep generative model for structure-based de novo drug design, directly generates 3D structures of drug-like compounds in the target binding site.
Affiliation(s)
- Yibo Li
- Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
- Jianfeng Pei
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
- Luhua Lai
- Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China; Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China; BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China
49
Nigam A, Pollice R, Hurley MFD, Hickman RJ, Aldeghi M, Yoshikawa N, Chithrananda S, Voelz VA, Aspuru-Guzik A. Assigning confidence to molecular property prediction. Expert Opin Drug Discov 2021; 16:1009-1023. [DOI: 10.1080/17460441.2021.1925247] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 12/11/2022]
Affiliation(s)
- AkshatKumar Nigam
- Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, Canada
- Department of Computer Science, University of Toronto, Toronto, Canada
- Robert Pollice
- Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, Canada
- Department of Computer Science, University of Toronto, Toronto, Canada
- Riley J. Hickman
- Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, Canada
- Department of Computer Science, University of Toronto, Toronto, Canada
- Matteo Aldeghi
- Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, Canada
- Department of Computer Science, University of Toronto, Toronto, Canada
- Vector Institute for Artificial Intelligence, University Ave Suite 710, Toronto, Canada
- Naruki Yoshikawa
- Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, Canada
- Department of Computer Science, University of Toronto, Toronto, Canada
- Alán Aspuru-Guzik
- Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, Canada
- Department of Computer Science, University of Toronto, Toronto, Canada
- Vector Institute for Artificial Intelligence, University Ave Suite 710, Toronto, Canada
- Canadian Institute for Advanced Research (CIFAR), University Ave, Toronto, Canada
50
Steinmann C, Jensen JH. Using a genetic algorithm to find molecules with good docking scores. PeerJ Physical Chemistry 2021. [DOI: 10.7717/peerj-pchem.18] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/20/2022]
Abstract
A graph-based genetic algorithm (GA) is used to identify molecules (ligands) with high absolute docking scores as estimated by the Glide software package, starting from randomly chosen molecules from the ZINC database, for four different targets: Bacillus subtilis chorismate mutase (CM), human β2-adrenergic G protein-coupled receptor (β2AR), the DDR1 kinase domain (DDR1), and β-cyclodextrin (BCD). By the combined use of functional group filters and a score modifier based on a heuristic synthetic accessibility (SA) score our approach identifies between ca 500 and 6,000 structurally diverse molecules with scores better than known binders by screening a total of 400,000 molecules starting from 8,000 randomly selected molecules from the ZINC database. Screening 250,000 molecules from the ZINC database identifies significantly more molecules with better docking scores than known binders, with the exception of CM, where the conventional screening approach only identifies 60 compounds compared to 511 with GA+Filter+SA. In the case of β2AR and DDR1, the GA+Filter+SA approach finds significantly more molecules with docking scores lower than −9.0 and −10.0. The GA+Filters+SA docking methodology is thus effective in generating a large and diverse set of synthetically accessible molecules with very good docking scores for a particular target. An early incarnation of the GA+Filter+SA approach was used to identify potential binders to the COVID-19 main protease and submitted to the early stages of the COVID Moonshot project, a crowd-sourced initiative to accelerate the development of a COVID antiviral.
Affiliation(s)
- Casper Steinmann
- Department of Chemistry and Bioscience, Aalborg University, Aalborg, Denmark
- Jan H. Jensen
- Department of Chemistry, University of Copenhagen, Copenhagen, Denmark