1
Zhai S, Tan Y, Zhu C, Zhang C, Gao Y, Mao Q, Zhang Y, Duan H, Yin Y. PepExplainer: An explainable deep learning model for selection-based macrocyclic peptide bioactivity prediction and optimization. Eur J Med Chem 2024; 275:116628. [PMID: 38944933 DOI: 10.1016/j.ejmech.2024.116628] [Received: 04/17/2024] [Revised: 06/21/2024] [Accepted: 06/24/2024] [Indexed: 07/02/2024]
Abstract
Macrocyclic peptides possess unique features, making them highly promising as a drug modality. However, evaluating their bioactivity through wet lab experiments is generally resource-intensive and time-consuming. Despite advancements in artificial intelligence (AI) for bioactivity prediction, challenges remain due to limited data availability and the interpretability issues in deep learning models, often leading to less-than-ideal predictions. To address these challenges, we developed PepExplainer, an explainable graph neural network based on substructure mask explanation (SME). This model excels at deciphering amino acid substructures, translating macrocyclic peptides into detailed molecular graphs at the atomic level, and efficiently handling non-canonical amino acids and complex macrocyclic peptide structures. PepExplainer's effectiveness is enhanced by utilizing the correlation between peptide enrichment data from selection-based focused library and bioactivity data, and employing transfer learning to improve bioactivity predictions of macrocyclic peptides against IL-17C/IL-17 RE interaction. Additionally, PepExplainer underwent further validation for bioactivity prediction using an additional set of thirteen newly synthesized macrocyclic peptides. Moreover, it enabled the optimization of the IC50 of a macrocyclic peptide, reducing it from 15 nM to 5.6 nM based on the contribution score provided by PepExplainer. This achievement underscores PepExplainer's skill in deciphering complex molecular patterns, highlighting its potential to accelerate the discovery and optimization of macrocyclic peptides.
Affiliation(s)
- Silong Zhai
- School of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, 310014, China
- Yahong Tan
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, Qingdao, 266237, China
- Cheng Zhu
- School of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, 310014, China
- Chengyun Zhang
- School of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, 310014, China
- Yan Gao
- Qilu Institute of Technology, Jinan, 250200, China
- Qingyi Mao
- School of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, 310014, China
- Youming Zhang
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, Qingdao, 266237, China
- Hongliang Duan
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, 999078, China.
- Yizhen Yin
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, Qingdao, 266237, China; Shandong Research Institute of Industrial Technology, Jinan, 250101, China.
2
Shi Z, Wang D, Li Y, Deng R, Lin J, Liu C, Li H, Wang R, Zhao M, Mao Z, Yuan Q, Liao X, Ma H. REME: an integrated platform for reaction enzyme mining and evaluation. Nucleic Acids Res 2024; 52:W299-W305. [PMID: 38769057 PMCID: PMC11223788 DOI: 10.1093/nar/gkae405] [Received: 03/10/2024] [Revised: 04/16/2024] [Accepted: 05/01/2024] [Indexed: 05/22/2024] Open
Abstract
A key challenge in pathway design is finding proper enzymes that can be engineered to catalyze a non-natural reaction. Although existing tools can identify potential enzymes based on similar reactions, these tools encounter several issues. Firstly, the calculated similar reactions may not even have the same reaction type. Secondly, the associated enzymes are often numerous and identifying the most promising candidate enzymes is difficult due to the lack of data for evaluation. Thirdly, existing web tools do not provide interactive functions that enable users to fine-tune results based on their expertise. Here, we present REME (https://reme.biodesign.ac.cn/), the first integrated web platform for reaction enzyme mining and evaluation. Combining atom-to-atom mapping, atom type change identification, and reaction similarity calculation enables quick ranking and visualization of reactions similar to an objective non-natural reaction. Additional functionality enables users to filter similar reactions by their specified functional groups and candidate enzymes can be further filtered (e.g. by organisms) or expanded by Enzyme Commission number (EC) or sequence homology. Afterward, enzyme attributes (such as kcat, Km, optimal temperature and pH) can be assessed with deep learning-based methods, facilitating the swift identification of potential enzymes that can catalyze the non-natural reaction.
Affiliation(s)
- Zhenkun Shi
- Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, PR China
- Dehang Wang
- Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, PR China
- College of Biotechnology, Tianjin University of Science and Technology, Tianjin 300457, PR China
- Yang Li
- Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, PR China
- University of Chinese Academy of Sciences, Beijing 101408, PR China
- Rui Deng
- Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, PR China
- College of Biotechnology, Tianjin University of Science and Technology, Tianjin 300457, PR China
- Jiawei Lin
- Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, PR China
- College of Biotechnology, Tianjin University of Science and Technology, Tianjin 300457, PR China
- Cui Liu
- Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, PR China
- Haoran Li
- Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, PR China
- Ruoyu Wang
- Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, PR China
- Muqiang Zhao
- Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, PR China
- Zhitao Mao
- Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, PR China
- Qianqian Yuan
- Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, PR China
- Xiaoping Liao
- Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, PR China
- Haihe Laboratory of Synthetic Biology, Tianjin 300308, PR China
- Hongwu Ma
- Biodesign Center, Key Laboratory of Engineering Biology for Low-carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin 300308, PR China
3
Sankaranarayanan K, Jensen KF. Similarity based functionalization for enumeration of synthetically plausible chemical libraries surrounding a target. Chem Sci 2024; 15:10221-10231. [PMID: 38966353 PMCID: PMC11220589 DOI: 10.1039/d4sc00523f] [Received: 01/22/2024] [Accepted: 05/22/2024] [Indexed: 07/06/2024] Open
Abstract
Functionalization of lead compounds to create analogs is a challenging step in discovering new molecules with desired properties and it is conducted throughout the chemical industry, including pharmaceuticals and agrochemicals. The process can be time-consuming and expensive, requiring expert intuition and experience. To help address synthesis planning challenges in late-stage functionalization, we have developed a molecular similarity approach that proposes single-step functionalization reactions based on analogy to precedent reactions. The developed approach mimics reaction strategies and suggests co-reactants defined implicitly by a corpus of known reactions. Using ca. 348 k reactions from the patent literature as a knowledge base, the recorded products or close analogs are among the top 20 proposed products in 74% of ∼44 k test reactions. The combinatorial growth inherent in recursive applications of the tool allows the enumeration of chemical libraries surrounding a target compound of interest. Moreover, each step of the resulting library synthesis leverages common chemical transformations reported in the literature accessible to most chemists.
Affiliation(s)
- Karthik Sankaranarayanan
- Department of Agriculture and Biological Engineering, Purdue University West Lafayette Indiana 47907 USA
- Department of Chemical Engineering, Massachusetts Institute of Technology 77 Massachusetts Avenue Cambridge Massachusetts 02139 USA
- Klavs F Jensen
- Department of Chemical Engineering, Massachusetts Institute of Technology 77 Massachusetts Avenue Cambridge Massachusetts 02139 USA
4
Singh S, Hernández-Lobato JM. Deep Kernel learning for reaction outcome prediction and optimization. Commun Chem 2024; 7:136. [PMID: 38877182 PMCID: PMC11178803 DOI: 10.1038/s42004-024-01219-x] [Received: 02/10/2024] [Accepted: 06/05/2024] [Indexed: 06/16/2024] Open
Abstract
Recent years have seen a rapid growth in the application of various machine learning methods for reaction outcome prediction. Deep learning models have gained popularity due to their ability to learn representations directly from the molecular structure. Gaussian processes (GPs), on the other hand, provide reliable uncertainty estimates but are unable to learn representations from the data. We combine the feature learning ability of neural networks (NNs) with uncertainty quantification of GPs in a deep kernel learning (DKL) framework to predict the reaction outcome. The DKL model is observed to obtain very good predictive performance across different input representations. It significantly outperforms standard GPs and provides comparable performance to graph neural networks, but with uncertainty estimation. Additionally, the uncertainty estimates on predictions provided by the DKL model facilitated its incorporation as a surrogate model for Bayesian optimization (BO). The proposed method, therefore, has a great potential towards accelerating reaction discovery by integrating accurate predictive models that provide reliable uncertainty estimates with BO.
Affiliation(s)
- Sukriti Singh
- Department of Engineering, University of Cambridge, Cambridge, UK.
5
Luong KD, Singh A. Application of Transformers in Cheminformatics. J Chem Inf Model 2024; 64:4392-4409. [PMID: 38815246 PMCID: PMC11167597 DOI: 10.1021/acs.jcim.3c02070] [Received: 12/28/2023] [Revised: 04/05/2024] [Accepted: 05/06/2024] [Indexed: 06/01/2024]
Abstract
By accelerating time-consuming processes with high efficiency, computing has become an essential part of many modern chemical pipelines. Machine learning is a class of computing methods that can discover patterns within chemical data and utilize this knowledge for a wide variety of downstream tasks, such as property prediction or substance generation. The complex and diverse chemical space requires complex machine learning architectures with great learning power. Recently, learning models based on transformer architectures have revolutionized multiple domains of machine learning, including natural language processing and computer vision. Naturally, there have been ongoing endeavors in adopting these techniques to the chemical domain, resulting in a surge of publications within a short period. The diversity of chemical structures, use cases, and learning models necessitate a comprehensive summarization of existing works. In this paper, we review recent innovations in adapting transformers to solve learning problems in chemistry. Because chemical data is diverse and complex, we structure our discussion based on chemical representations. Specifically, we highlight the strengths and weaknesses of each representation, the current progress of adapting transformer architectures, and future directions.
Affiliation(s)
- Kha-Dinh Luong
- Department of Computer Science, University of California Santa Barbara, Santa Barbara, CA 93106, United States
- Ambuj Singh
- Department of Computer Science, University of California Santa Barbara, Santa Barbara, CA 93106, United States
6
Das M, Ghosh A, Sunoj RB. Advances in machine learning with chemical language models in molecular property and reaction outcome predictions. J Comput Chem 2024; 45:1160-1176. [PMID: 38299229 DOI: 10.1002/jcc.27315] [Received: 11/22/2023] [Revised: 01/06/2024] [Accepted: 01/09/2024] [Indexed: 02/02/2024]
Abstract
Molecular properties and reactions form the foundation of chemical space. Over the years, innumerable molecules have been synthesized, a smaller fraction of them found immediate applications, while a larger proportion served as a testimony to creative and empirical nature of the domain of chemical science. With increasing emphasis on sustainable practices, it is desirable that a target set of molecules are synthesized preferably through a fewer empirical attempts instead of a larger library, to realize an active candidate. In this front, predictive endeavors using machine learning (ML) models built on available data acquire high timely significance. Prediction of molecular property and reaction outcome remain one of the burgeoning applications of ML in chemical science. Among several methods of encoding molecular samples for ML models, the ones that employ language like representations are gaining steady popularity. Such representations would additionally help adopt well-developed natural language processing (NLP) models for chemical applications. Given this advantageous background, herein we describe several successful chemical applications of NLP focusing on molecular property and reaction outcome predictions. From relatively simpler recurrent neural networks (RNNs) to complex models like transformers, different network architecture have been leveraged for tasks such as de novo drug design, catalyst generation, forward and retro-synthesis predictions. The chemical language model (CLM) provides promising avenues toward a broad range of applications in a time and cost-effective manner. While we showcase an optimistic outlook of CLMs, attention is also placed on the persisting challenges in reaction domain, which would optimistically be addressed by advanced algorithms tailored to chemical language and with increased availability of high-quality datasets.
Affiliation(s)
- Manajit Das
- Department of Chemistry, Indian Institute of Technology Bombay, Mumbai, India
- Ankit Ghosh
- Department of Chemistry, Indian Institute of Technology Bombay, Mumbai, India
- Raghavan B Sunoj
- Department of Chemistry, Indian Institute of Technology Bombay, Mumbai, India
- Centre for Machine Intelligence and Data Science, Indian Institute of Technology Bombay, Mumbai, India
7
Kotlyarov R, Papachristos K, Wood GPF, Goodman JM. Leveraging Language Model Multitasking To Predict C-H Borylation Selectivity. J Chem Inf Model 2024; 64:4286-4297. [PMID: 38708520 PMCID: PMC11134489 DOI: 10.1021/acs.jcim.4c00137] [Received: 02/01/2024] [Revised: 04/05/2024] [Accepted: 04/23/2024] [Indexed: 05/07/2024]
Abstract
C-H borylation is a high-value transformation in the synthesis of lead candidates for the pharmaceutical industry because a wide array of downstream coupling reactions is available. However, predicting its regioselectivity, especially in drug-like molecules that may contain multiple heterocycles, is not a trivial task. Using a data set of borylation reactions from Reaxys, we explored how a language model originally trained on USPTO_500_MT, a broad-scope set of patent data, can be used to predict the C-H borylation reaction product in different modes: product generation and site reactivity classification. Our fine-tuned T5Chem multitask language model can generate the correct product in 79% of cases. It can also classify the reactive aromatic C-H bonds with 95% accuracy and 88% positive predictive value, exceeding purpose-developed graph-based neural networks.
Affiliation(s)
- Ruslan Kotlyarov
- Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K.
- Geoffrey P. F. Wood
- Exscientia Plc, The Schrödinger Building, Oxford Science Park, Oxford OX4 4GE, U.K.
- Jonathan M. Goodman
- Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K.
8
van Gerwen P, Briling KR, Calvino Alonso Y, Franke M, Corminboeuf C. Benchmarking machine-readable vectors of chemical reactions on computed activation barriers. Digital Discovery 2024; 3:932-943. [PMID: 38756222 PMCID: PMC11094696 DOI: 10.1039/d3dd00175j] [Received: 09/06/2023] [Accepted: 02/28/2024] [Indexed: 05/18/2024]
Abstract
In recent years, there has been a surge of interest in predicting computed activation barriers, to enable the acceleration of the automated exploration of reaction networks. Consequently, various predictive approaches have emerged, ranging from graph-based models to methods based on the three-dimensional structure of reactants and products. In tandem, many representations have been developed to predict experimental targets, which may hold promise for barrier prediction as well. Here, we bring together all of these efforts and benchmark various methods (Morgan fingerprints, the DRFP, the CGR representation-based Chemprop, SLATMd, B2Rl2, EquiReact and language model BERT + RXNFP) for the prediction of computed activation barriers on three diverse datasets.
Affiliation(s)
- Puck van Gerwen
- Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne 1015 Lausanne Switzerland
- National Center for Competence in Research-Catalysis (NCCR-Catalysis), École Polytechnique Fédérale de Lausanne 1015 Lausanne Switzerland
- Ksenia R Briling
- Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne 1015 Lausanne Switzerland
- Yannick Calvino Alonso
- Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne 1015 Lausanne Switzerland
- Malte Franke
- Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne 1015 Lausanne Switzerland
- Clemence Corminboeuf
- Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne 1015 Lausanne Switzerland
- National Center for Competence in Research-Catalysis (NCCR-Catalysis), École Polytechnique Fédérale de Lausanne 1015 Lausanne Switzerland
9
Schlosser L, Rana D, Pflüger P, Katzenburg F, Glorius F. EnTdecker - A Machine Learning-Based Platform for Guiding Substrate Discovery in Energy Transfer Catalysis. J Am Chem Soc 2024; 146:13266-13275. [PMID: 38695558 DOI: 10.1021/jacs.4c01352] [Indexed: 05/16/2024]
Abstract
Due to the magnitude of chemical space, the discovery of novel substrates in energy transfer (EnT) catalysis remains a daunting task. Experimental and computational strategies to identify compounds that successfully undergo EnT-mediated reactions are limited by their time and cost efficiency. To accelerate the discovery process in EnT catalysis, we herein present the EnTdecker platform, which facilitates the large-scale virtual screening of potential substrates using machine-learning (ML) based predictions of their excited state properties. To achieve this, a data set is created containing more than 34,000 molecules aiming to cover a vast fraction of synthetically relevant compound space for EnT catalysis. Using this data predictive models are trained, and their aptitude for an in-lab application is demonstrated by rediscovering successful substrates from literature as well as experimental validation through luminescence-based screening. By reducing the computational effort needed to obtain excited state properties, the EnTdecker platform represents a tool to efficiently guide substrate selection and increase the experimental success rate for EnT catalysis. Moreover, through an easy-to-use web application, EnTdecker is made publicly accessible under entdecker.uni-muenster.de.
Affiliation(s)
- Leon Schlosser
- Organisch-Chemisches Institut, University of Münster, Corrensstraße 36, 48149 Münster, Germany
- Debanjan Rana
- Organisch-Chemisches Institut, University of Münster, Corrensstraße 36, 48149 Münster, Germany
- Philipp Pflüger
- Organisch-Chemisches Institut, University of Münster, Corrensstraße 36, 48149 Münster, Germany
- Felix Katzenburg
- Organisch-Chemisches Institut, University of Münster, Corrensstraße 36, 48149 Münster, Germany
- Frank Glorius
- Organisch-Chemisches Institut, University of Münster, Corrensstraße 36, 48149 Münster, Germany
10
M. Bran A, Cox S, Schilter O, Baldassari C, White AD, Schwaller P. Augmenting large language models with chemistry tools. Nat Mach Intell 2024; 6:525-535. [PMID: 38799228 PMCID: PMC11116106 DOI: 10.1038/s42256-024-00832-8] [Received: 09/13/2023] [Accepted: 03/27/2024] [Indexed: 05/29/2024]
Abstract
Large language models (LLMs) have shown strong performance in tasks across domains but struggle with chemistry-related problems. These models also lack access to external knowledge sources, limiting their usefulness in scientific applications. We introduce ChemCrow, an LLM chemistry agent designed to accomplish tasks across organic synthesis, drug discovery and materials design. By integrating 18 expert-designed tools and using GPT-4 as the LLM, ChemCrow augments the LLM performance in chemistry, and new capabilities emerge. Our agent autonomously planned and executed the syntheses of an insect repellent and three organocatalysts and guided the discovery of a novel chromophore. Our evaluation, including both LLM and expert assessments, demonstrates ChemCrow's effectiveness in automating a diverse set of chemical tasks. Our work not only aids expert chemists and lowers barriers for non-experts but also fosters scientific advancement by bridging the gap between experimental and computational chemistry.
Affiliation(s)
- Andres M. Bran
- Laboratory of Artificial Chemical Intelligence (LIAC), ISIC, EPFL, Lausanne, Switzerland
- National Centre of Competence in Research (NCCR) Catalysis, EPFL, Lausanne, Switzerland
- Sam Cox
- Department of Chemical Engineering, University of Rochester, Rochester, NY USA
- FutureHouse, San Francisco, CA USA
- Oliver Schilter
- Laboratory of Artificial Chemical Intelligence (LIAC), ISIC, EPFL, Lausanne, Switzerland
- National Centre of Competence in Research (NCCR) Catalysis, EPFL, Lausanne, Switzerland
- Accelerated Discovery, IBM Research – Europe, Rüschlikon, Switzerland
- Carlo Baldassari
- Accelerated Discovery, IBM Research – Europe, Rüschlikon, Switzerland
- Andrew D. White
- Department of Chemical Engineering, University of Rochester, Rochester, NY USA
- FutureHouse, San Francisco, CA USA
- Philippe Schwaller
- Laboratory of Artificial Chemical Intelligence (LIAC), ISIC, EPFL, Lausanne, Switzerland
- National Centre of Competence in Research (NCCR) Catalysis, EPFL, Lausanne, Switzerland
11
Rana D, Pflüger PM, Hölter NP, Tan G, Glorius F. Standardizing Substrate Selection: A Strategy toward Unbiased Evaluation of Reaction Generality. ACS Central Science 2024; 10:899-906. [PMID: 38680564 PMCID: PMC11046462 DOI: 10.1021/acscentsci.3c01638] [Received: 12/29/2023] [Revised: 03/14/2024] [Accepted: 03/18/2024] [Indexed: 05/01/2024]
Abstract
With over 10,000 new reaction protocols arising every year, only a handful of these procedures transition from academia to application. A major reason for this gap stems from the lack of comprehensive knowledge about a reaction's scope, i.e., to which substrates the protocol can or cannot be applied. Even though chemists invest substantial effort to assess the scope of new protocols, the resulting scope tables involve significant biases, reducing their expressiveness. Herein we report a standardized substrate selection strategy designed to mitigate these biases and evaluate the applicability, as well as the limits, of any chemical reaction. Unsupervised learning is utilized to map the chemical space of industrially relevant molecules. Subsequently, potential substrate candidates are projected onto this universal map, enabling the selection of a structurally diverse set of substrates with optimal relevance and coverage. By testing our methodology on different chemical reactions, we were able to demonstrate its effectiveness in finding general reactivity trends by using a few highly representative examples. The developed methodology empowers chemists to showcase the unbiased applicability of novel methodologies, facilitating their practical applications. We hope that this work will trigger interdisciplinary discussions about biases in synthetic chemistry, leading to improved data quality.
Affiliation(s)
- Debanjan Rana
- Universität Münster, Organisch-Chemisches Institut, Corrensstraße 36, 48149 Münster, Germany
- Philipp M. Pflüger
- Universität Münster, Organisch-Chemisches Institut, Corrensstraße 36, 48149 Münster, Germany
- Niklas P. Hölter
- Universität Münster, Organisch-Chemisches Institut, Corrensstraße 36, 48149 Münster, Germany
- Guangying Tan
- Universität Münster, Organisch-Chemisches Institut, Corrensstraße 36, 48149 Münster, Germany
- Frank Glorius
- Universität Münster, Organisch-Chemisches Institut, Corrensstraße 36, 48149 Münster, Germany
12
Ding Y, Qiang B, Chen Q, Liu Y, Zhang L, Liu Z. Exploring Chemical Reaction Space with Machine Learning Models: Representation and Feature Perspective. J Chem Inf Model 2024; 64:2955-2970. [PMID: 38489239 DOI: 10.1021/acs.jcim.4c00004] [Indexed: 03/17/2024]
Abstract
Chemical reactions serve as foundational building blocks for organic chemistry and drug design. In the era of large AI models, data-driven approaches have emerged to innovate the design of novel reactions, optimize existing ones for higher yields, and discover new pathways for synthesizing chemical structures comprehensively. To effectively address these challenges with machine learning models, it is imperative to derive robust and informative representations or engage in feature engineering using extensive data sets of reactions. This work aims to provide a comprehensive review of established reaction featurization approaches, offering insights into the selection of representations and the design of features for a wide array of tasks. The advantages and limitations of employing SMILES, molecular fingerprints, molecular graphs, and physics-based properties are meticulously elaborated. Solutions to bridge the gap between different representations will also be critically evaluated. Additionally, we introduce a new frontier in chemical reaction pretraining, holding promise as an innovative yet unexplored avenue.
Affiliation(s)
- Yuheng Ding
- Department of Pharmaceutical Science, Peking University, Beijing 100191, China
- Bo Qiang
- Department of Pharmaceutical Science, Peking University, Beijing 100191, China
- Qixuan Chen
- Department of Pharmaceutical Science, Peking University, Beijing 100191, China
- Yiqiao Liu
- Department of Pharmaceutical Science, Peking University, Beijing 100191, China
- Liangren Zhang
- Department of Pharmaceutical Science, Peking University, Beijing 100191, China
- Zhenming Liu
- Department of Pharmaceutical Science, Peking University, Beijing 100191, China
13
Su A, Cheng Y, Zhang C, Yang YF, She YB, Rajan K. An artificial intelligence platform for automated PFAS subgroup classification: A discovery tool for PFAS screening. The Science of the Total Environment 2024; 921:171229. [PMID: 38402985 DOI: 10.1016/j.scitotenv.2024.171229] [Received: 10/31/2023] [Revised: 01/27/2024] [Accepted: 02/21/2024] [Indexed: 02/27/2024]
Abstract
Since structural analyses and toxicity assessments have not been able to keep up with the discovery of unknown per- and polyfluoroalkyl substances (PFAS), there is an urgent need for effective categorization and grouping of PFAS. In this study, we present PFAS-Atlas, an artificial intelligence-based platform containing a rule-based automatic classification system and a machine learning-based grouping model. Compared with previously developed classification software, the platform's classification system follows the latest Organisation for Economic Co-operation and Development (OECD) definition of PFAS and reduces the number of uncategorized PFAS. In addition, the platform incorporates deep unsupervised learning models to visualize the chemical space of PFAS by clustering similar structures and linking related classes. Through real-world use cases, we demonstrate that PFAS-Atlas can rapidly screen for relationships between chemical structure and persistence, bioaccumulation, or toxicity data for PFAS. The platform can also guide the planning of the PFAS testing strategy by showing which PFAS classes urgently require further attention. Ultimately, the release of PFAS-Atlas will benefit both the PFAS research and regulation communities.
Affiliation(s)
- An Su
- State Key Laboratory Breeding Base of Green Chemistry-Synthesis Technology, Key Laboratory of Green Chemistry-Synthesis Technology of Zhejiang Province, College of Chemical Engineering, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, China; Key Laboratory of Pharmaceutical Engineering of Zhejiang Province, Collaborative Innovation Center of Yangtze River Delta Region Green Pharmaceuticals, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, PR China.
| | - Yingying Cheng
- State Key Laboratory Breeding Base of Green Chemistry-Synthesis Technology, Key Laboratory of Green Chemistry-Synthesis Technology of Zhejiang Province, College of Chemical Engineering, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, China; Key Laboratory of Pharmaceutical Engineering of Zhejiang Province, Collaborative Innovation Center of Yangtze River Delta Region Green Pharmaceuticals, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, PR China
| | - Chengwei Zhang
- State Key Laboratory Breeding Base of Green Chemistry-Synthesis Technology, Key Laboratory of Green Chemistry-Synthesis Technology of Zhejiang Province, College of Chemical Engineering, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, China
| | - Yun-Fang Yang
- State Key Laboratory Breeding Base of Green Chemistry-Synthesis Technology, Key Laboratory of Green Chemistry-Synthesis Technology of Zhejiang Province, College of Chemical Engineering, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, China
| | - Yuan-Bin She
- State Key Laboratory Breeding Base of Green Chemistry-Synthesis Technology, Key Laboratory of Green Chemistry-Synthesis Technology of Zhejiang Province, College of Chemical Engineering, Zhejiang University of Technology, Hangzhou, Zhejiang 310014, China.
| | - Krishna Rajan
- Department of Materials Design and Innovation, University at Buffalo, Buffalo, NY 14260-1660, United States.
|
14
|
Dobbelaere MR, Lengyel I, Stevens CV, Van Geem KM. Rxn-INSIGHT: fast chemical reaction analysis using bond-electron matrices. J Cheminform 2024; 16:37. [PMID: 38553720 PMCID: PMC10980627 DOI: 10.1186/s13321-024-00834-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2023] [Accepted: 03/23/2024] [Indexed: 04/02/2024] Open
Abstract
The challenge of devising pathways for organic synthesis remains a central issue in the field of medicinal chemistry. Over the span of six decades, computer-aided synthesis planning has given rise to a plethora of potent tools for formulating synthetic routes. Nevertheless, a significant expert task still looms: determining the appropriate solvent, catalyst, and reagents when provided with a set of reactants to achieve and optimize the desired product for a specific step in the synthesis process. Typically, chemists identify key functional groups and rings that exert crucial influences at the reaction center, classify reactions into categories, and may assign them names. This research introduces Rxn-INSIGHT, an open-source algorithm based on the bond-electron matrix approach, with the purpose of automating this endeavor. Rxn-INSIGHT not only streamlines the process but also facilitates extensive querying of reaction databases, effectively replicating the thought processes of an organic chemist. The core functions of the algorithm encompass the classification and naming of reactions, extraction of functional groups, rings, and scaffolds from the involved chemical entities. The provision of reaction condition recommendations based on the similarity and prevalence of reactions eventually arises as a side application. The performance of our rule-based model has been rigorously assessed against a carefully curated benchmark dataset, exhibiting an accuracy rate exceeding 90% in reaction classification and surpassing 95% in reaction naming. Notably, it has been discerned that a pivotal factor in selecting analogous reactions lies in the analysis of ring structures participating in the reactions. An examination of ring structures within the USPTO chemical reaction database reveals that with just 35 unique rings, a remarkable 75% of all rings found in nearly 1 million products can be encompassed. 
Furthermore, Rxn-INSIGHT is proficient in suggesting appropriate choices for solvents, catalysts, and reagents in entirely novel reactions, all within the span of a second, utilizing nothing more than an everyday laptop.
Affiliation(s)
- Maarten R Dobbelaere
- Laboratory for Chemical Technology, Department of Materials, Textiles and Chemical Engineering, Faculty of Engineering and Architecture, Ghent University, Technologiepark 125, 9052, Ghent, Belgium
| | - István Lengyel
- Laboratory for Chemical Technology, Department of Materials, Textiles and Chemical Engineering, Faculty of Engineering and Architecture, Ghent University, Technologiepark 125, 9052, Ghent, Belgium
- ChemInsights LLC, Dover, DE, 19901, USA
| | - Christian V Stevens
- SynBioC Research Group, Department of Green Chemistry and Technology, Faculty of Bioscience Engineering, Ghent University, Coupure Links 653, 9000, Ghent, Belgium
| | - Kevin M Van Geem
- Laboratory for Chemical Technology, Department of Materials, Textiles and Chemical Engineering, Faculty of Engineering and Architecture, Ghent University, Technologiepark 125, 9052, Ghent, Belgium.
|
15
|
Malashin I, Tynchenko V, Gantimurov A, Nelyub V, Borodulin A. Optimizing Neural Networks for Chemical Reaction Prediction: Insights from Methylene Blue Reduction Reactions. Int J Mol Sci 2024; 25:3860. [PMID: 38612671 PMCID: PMC11011334 DOI: 10.3390/ijms25073860] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2024] [Revised: 03/24/2024] [Accepted: 03/28/2024] [Indexed: 04/14/2024] Open
Abstract
This paper offers a thorough investigation of hyperparameter tuning for neural network architectures using datasets encompassing various combinations of methylene blue (MB) reduction by ascorbic acid (AA) reactions with different solvents and concentrations. The aim is to predict coefficients of decay plots for MB absorbance, shedding light on the complex dynamics of chemical reactions. Our findings reveal that the optimal model, determined through our investigation, consists of five hidden layers, each with sixteen neurons and employing the Swish activation function. This model yields NMSE values of 0.05, 0.03, and 0.04 for predicting the coefficients A, B, and C, respectively, in the exponential decay equation A + B · e^(-x/C). These findings contribute to the realm of drug design based on machine learning, providing valuable insights into optimizing chemical reaction predictions.
Affiliation(s)
| | - Vadim Tynchenko
- Artificial Intelligence Technology Scientific and Education Center, Bauman Moscow State Technical University, 105005 Moscow, Russia; (I.M.); (A.G.); (V.N.); (A.B.)
|
16
|
Xie J, Wang Y, Rao J, Zheng S, Yang Y. Self-Supervised Contrastive Molecular Representation Learning with a Chemical Synthesis Knowledge Graph. J Chem Inf Model 2024; 64:1945-1954. [PMID: 38484468 DOI: 10.1021/acs.jcim.4c00157] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/26/2024]
Abstract
Self-supervised molecular representation learning has demonstrated great promise in bridging machine learning and chemical science to accelerate the development of new drugs. Due to the limited reaction data, existing methods are mostly pretrained by augmenting the intrinsic topology of molecules without effectively incorporating chemical reaction prior information, which makes them difficult to generalize to chemical reaction-related tasks. To address this issue, we propose ReaKE, a reaction knowledge embedding framework, which formulates chemical reactions as a knowledge graph. Specifically, we constructed a chemical synthesis knowledge graph with reactants and products as nodes and reaction rules as the edges. Based on the knowledge graph, we further proposed novel contrastive learning at both molecule and reaction levels to capture the reaction-related functional group information within and between molecules. Extensive experiments demonstrate the effectiveness of ReaKE compared with state-of-the-art methods on several downstream tasks, including reaction classification, product prediction, and yield prediction.
Affiliation(s)
- Jiancong Xie
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
| | - Yi Wang
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
| | - Jiahua Rao
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
| | - Shuangjia Zheng
- Global Institute of Future Technology, Shanghai Jiao Tong University, Shanghai 200030, China
| | - Yuedong Yang
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
- Key Laboratory of Machine Intelligence and Advanced Computing, Sun Yat-sen University, Guangzhou 510006, China
|
17
|
Han J, Kwon Y, Choi YS, Kang S. Improving chemical reaction yield prediction using pre-trained graph neural networks. J Cheminform 2024; 16:25. [PMID: 38429787 PMCID: PMC10905905 DOI: 10.1186/s13321-024-00818-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2023] [Accepted: 02/19/2024] [Indexed: 03/03/2024] Open
Abstract
Graph neural networks (GNNs) have proven to be effective in the prediction of chemical reaction yields. However, their performance tends to deteriorate when they are trained using an insufficient training dataset in terms of quantity or diversity. A promising solution to alleviate this issue is to pre-train a GNN on a large-scale molecular database. In this study, we investigate the effectiveness of GNN pre-training in chemical reaction yield prediction. We present a novel GNN pre-training method for performance improvement. Given a molecular database consisting of a large number of molecules, we calculate molecular descriptors for each molecule and reduce the dimensionality of these descriptors by applying principal component analysis. We define a pre-text task by assigning a vector of principal component scores as the pseudo-label to each molecule in the database. A GNN is then pre-trained to perform the pre-text task of predicting the pseudo-label for the input molecule. For chemical reaction yield prediction, a prediction model is initialized using the pre-trained GNN and then fine-tuned with the training dataset containing chemical reactions and their yields. We demonstrate the effectiveness of the proposed method through experimental evaluation on benchmark datasets.
Affiliation(s)
- Jongmin Han
- Department of Industrial Engineering, Sungkyunkwan University, 2066 Seobu-ro, Jangan-gu, Suwon, Republic of Korea
| | - Youngchun Kwon
- Samsung Advanced Institute of Technology, Samsung Electronics Co. Ltd., 130 Samsung-ro, Yeongtong-gu, Suwon, Republic of Korea
| | - Youn-Suk Choi
- Samsung Advanced Institute of Technology, Samsung Electronics Co. Ltd., 130 Samsung-ro, Yeongtong-gu, Suwon, Republic of Korea.
| | - Seokho Kang
- Department of Industrial Engineering, Sungkyunkwan University, 2066 Seobu-ro, Jangan-gu, Suwon, Republic of Korea.
|
18
|
Kim S, Mollaei P, Antony A, Magar R, Barati Farimani A. GPCR-BERT: Interpreting Sequential Design of G Protein-Coupled Receptors Using Protein Language Models. J Chem Inf Model 2024; 64:1134-1144. [PMID: 38340054 PMCID: PMC10900288 DOI: 10.1021/acs.jcim.3c01706] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2023] [Revised: 01/29/2024] [Accepted: 01/29/2024] [Indexed: 02/12/2024]
Abstract
With the rise of transformers and large language models (LLMs) in chemistry and biology, new avenues for the design and understanding of therapeutics have been opened up to the scientific community. Protein sequences can be modeled as language and can take advantage of recent advances in LLMs, particularly given the abundance of accessible protein sequence data sets. In this letter, we developed the GPCR-BERT model for understanding the sequential design of G protein-coupled receptors (GPCRs). GPCRs are the target of over one-third of Food and Drug Administration-approved pharmaceuticals. However, there is a lack of comprehensive understanding regarding the relationship among amino acid sequence, ligand selectivity, and conformational motifs (such as NPxxY, CWxP, and E/DRY). By utilizing the pretrained protein model (Prot-Bert) and fine-tuning with prediction tasks of variations in the motifs, we were able to shed light on several relationships between residues in the binding pocket and some of the conserved motifs. To achieve this, we interpreted the attention weights and hidden states of the model to extract the extent to which each amino acid contributes to determining the type of the masked ones. The fine-tuned models demonstrated high accuracy in predicting hidden residues within the motifs. In addition, the embeddings were analyzed over 3D structures to elucidate the higher-order interactions within the conformations of the receptors.
Affiliation(s)
- Seongwon Kim
- Department of Chemical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
| | - Parisa Mollaei
- Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
| | - Akshay Antony
- Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
| | - Rishikesh Magar
- Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
| | - Amir Barati Farimani
- Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
- Department of Biomedical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
- Machine Learning Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
|
19
|
Shi R, Yu G, Huo X, Yang Y. Prediction of chemical reaction yields with large-scale multi-view pre-training. J Cheminform 2024; 16:22. [PMID: 38403627 PMCID: PMC10895839 DOI: 10.1186/s13321-024-00815-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2023] [Accepted: 02/14/2024] [Indexed: 02/27/2024] Open
Abstract
Developing machine learning models with high generalization capability for predicting chemical reaction yields is of significant interest and importance. The efficacy of such models depends heavily on the representation of chemical reactions, which has commonly been learned from SMILES or graphs of molecules using deep neural networks. However, the progression of chemical reactions is inherently determined by the molecular 3D geometric properties, which have been recently highlighted as crucial features in accurately predicting molecular properties and chemical reactions. Additionally, large-scale pre-training has been shown to be essential in enhancing the generalization capability of complex deep learning models. Based on these considerations, we propose the Reaction Multi-View Pre-training (ReaMVP) framework, which leverages self-supervised learning techniques and a two-stage pre-training strategy to predict chemical reaction yields. By incorporating multi-view learning with 3D geometric information, ReaMVP achieves state-of-the-art performance on two benchmark datasets. Notably, the experimental results indicate that ReaMVP has a significant advantage in predicting out-of-sample data, suggesting an enhanced generalization ability to predict new reactions. Scientific Contribution: This study presents the ReaMVP framework, which improves the generalization capability of machine learning models for predicting chemical reaction yields. By integrating sequential and geometric views and leveraging self-supervised learning techniques with a two-stage pre-training strategy, ReaMVP achieves state-of-the-art performance on benchmark datasets. The framework demonstrates superior predictive ability for out-of-sample data and enhances the prediction of new reactions.
Affiliation(s)
- Runhan Shi
- Department of Computer Science and Engineering, and Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Gufeng Yu
- Department of Computer Science and Engineering, and Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Xiaohong Huo
- Shanghai Key Laboratory for Molecular Engineering of Chiral Drugs, Frontiers Science Center for Transformative Molecules, School of Chemistry and Chemical Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Yang Yang
- Department of Computer Science and Engineering, and Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China.
|
20
|
Liu Y, Liu X, Cao B. Graph attention neural networks for mapping materials and molecules beyond short-range interatomic correlations. Journal of Physics. Condensed Matter: An Institute of Physics Journal 2024; 36:215901. [PMID: 38306704 DOI: 10.1088/1361-648x/ad2584] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/23/2023] [Accepted: 02/02/2024] [Indexed: 02/04/2024]
Abstract
Bringing advances in machine learning to chemical science is revolutionizing how materials discovery and atomic-scale simulations are accelerated. Currently, most successful machine learning schemes can be largely traced to the use of localized atomic environments in the structural representation of materials and molecules. However, this may undermine the reliability of machine learning models for mapping complex systems and describing long-range physical effects because of the lack of non-local correlations between atoms. To overcome such limitations, here we report a graph attention neural network as a unified framework to map materials and molecules into a generalizable and interpretable representation that combines local and non-local information of atomic environments from multiple scales. As an exemplary study, our model is applied to predict the electronic structure properties of metal-organic frameworks (MOFs), which have notable diversity in compositions and structures. The results show that our model achieves state-of-the-art performance. The clustering analysis further demonstrates that our model enables high-level identification of MOFs with spatial and chemical resolution, which would facilitate the rational design of promising reticular materials. Furthermore, the application of our model in predicting the heat capacity of complex nanoporous materials, a critical property in a carbon capture process, showcases its versatility and accuracy in handling diverse physical properties beyond electronic structures.
Affiliation(s)
- Yuanbin Liu
- Key Laboratory for Thermal Science and Power Engineering of Ministry of Education, Department of Engineering Mechanics, Tsinghua University, Beijing 100084, People's Republic of China
- Inorganic Chemistry Laboratory, Department of Chemistry, University of Oxford, Oxford, OX1 3QR, United Kingdom
| | - Xin Liu
- School of Chemical Engineering and Advanced Materials, The University of Adelaide, Adelaide, SA 5005, Australia
- Key Laboratory of Engineering Dielectric and Applications of Ministry of Education, School of Electrical and Electronic Engineering, Harbin University of Science and Technology, Harbin 150080, People's Republic of China
| | - Bingyang Cao
- Key Laboratory for Thermal Science and Power Engineering of Ministry of Education, Department of Engineering Mechanics, Tsinghua University, Beijing 100084, People's Republic of China
|
21
|
Chung Y, Green WH. Machine learning from quantum chemistry to predict experimental solvent effects on reaction rates. Chem Sci 2024; 15:2410-2424. [PMID: 38362410 PMCID: PMC10866337 DOI: 10.1039/d3sc05353a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Accepted: 01/04/2024] [Indexed: 02/17/2024] Open
Abstract
Fast and accurate prediction of solvent effects on reaction rates is crucial for kinetic modeling, chemical process design, and high-throughput solvent screening. Despite recent advances in machine learning, a scarcity of reliable data has hindered the development of predictive models that are generalizable for diverse reactions and solvents. In this work, we generate a large set of data with the COSMO-RS method for over 28 000 neutral reactions and 295 solvents and train a machine learning model to predict the solvation free energy and solvation enthalpy of activation (ΔΔG‡solv, ΔΔH‡solv) for a solution phase reaction. On unseen reactions, the model achieves mean absolute errors of 0.71 and 1.03 kcal mol⁻¹ for ΔΔG‡solv and ΔΔH‡solv, respectively, relative to the COSMO-RS calculations. The model also provides reliable predictions of relative rate constants within a factor of 4 when tested on experimental data. The presented model can provide nearly instantaneous predictions of kinetic solvent effects or relative rate constants for a broad range of neutral closed-shell or free radical reactions and solvents based only on atom-mapped reaction SMILES and solvent SMILES strings.
Affiliation(s)
- Yunsie Chung
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - William H Green
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
|
22
|
Xing H, Cai P, Liu D, Han M, Liu J, Le Y, Zhang D, Hu QN. High-throughput prediction of enzyme promiscuity based on substrate-product pairs. Brief Bioinform 2024; 25:bbae089. [PMID: 38487850 PMCID: PMC10940840 DOI: 10.1093/bib/bbae089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2023] [Revised: 01/20/2024] [Accepted: 02/03/2024] [Indexed: 03/18/2024] Open
Abstract
The screening of enzymes for catalyzing specific substrate-product pairs is often constrained in the realms of metabolic engineering and synthetic biology. Existing tools based on substrate and reaction similarity predominantly rely on prior knowledge, demonstrating limited extrapolative capabilities and an inability to incorporate custom candidate-enzyme libraries. Addressing these limitations, we have developed the Substrate-product Pair-based Enzyme Promiscuity Prediction (SPEPP) model. This innovative approach utilizes transfer learning and transformer architecture to predict enzyme promiscuity, thereby elucidating the intricate interplay between enzymes and substrate-product pairs. SPEPP exhibited robust predictive ability, eliminating the need for prior knowledge of reactions and allowing users to define their own candidate-enzyme libraries. It can be seamlessly integrated into various applications, including metabolic engineering, de novo pathway design, and hazardous material degradation. To better assist metabolic engineers in designing and refining biochemical pathways, particularly those without programming skills, we also designed EnzyPick, an easy-to-use web server for enzyme screening based on SPEPP. EnzyPick is accessible at http://www.biosynther.com/enzypick/.
Affiliation(s)
- Huadong Xing
- CAS Key Laboratory of Computational Biology, CAS Key Laboratory of Nutrition, Metabolism and Food Safety, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Pengli Cai
- CAS Key Laboratory of Computational Biology, CAS Key Laboratory of Nutrition, Metabolism and Food Safety, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Dongliang Liu
- CAS Key Laboratory of Computational Biology, CAS Key Laboratory of Nutrition, Metabolism and Food Safety, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Mengying Han
- CAS Key Laboratory of Computational Biology, CAS Key Laboratory of Nutrition, Metabolism and Food Safety, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Juan Liu
- Institute of Artificial Intelligence, School of Computer Science, Wuhan University, Wuhan 430072, China
| | - Yingying Le
- CAS Key Laboratory of Computational Biology, CAS Key Laboratory of Nutrition, Metabolism and Food Safety, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
| | - Dachuan Zhang
- Institute of Environmental Engineering, ETH Zurich, Laura-Hezner-Weg 7, 8093 Zurich, Switzerland
| | - Qian-Nan Hu
- CAS Key Laboratory of Computational Biology, CAS Key Laboratory of Nutrition, Metabolism and Food Safety, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
|
23
|
Baygi SF, Barupal DK. IDSL_MINT: a deep learning framework to predict molecular fingerprints from mass spectra. J Cheminform 2024; 16:8. [PMID: 38238779 PMCID: PMC10797927 DOI: 10.1186/s13321-024-00804-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2023] [Accepted: 01/14/2024] [Indexed: 01/22/2024] Open
Abstract
The majority of tandem mass spectrometry (MS/MS) spectra in untargeted metabolomics and exposomics studies lack any annotation. Our deep learning framework, Integrated Data Science Laboratory for Metabolomics and Exposomics-Mass INTerpreter (IDSL_MINT), can translate MS/MS spectra into molecular fingerprint descriptors. IDSL_MINT allows users to leverage the power of the transformer model for mass spectrometry data, similar to the large language models. Models are trained on user-provided reference MS/MS libraries via any customizable molecular fingerprint descriptors. IDSL_MINT was benchmarked using the LipidMaps database and improved the annotation rate of a test study for MS/MS spectra that were not originally annotated using existing mass spectral libraries. IDSL_MINT may improve the overall annotation rates in untargeted metabolomics and exposomics studies. The IDSL_MINT framework and tutorials are available in the GitHub repository at https://github.com/idslme/IDSL_MINT. Scientific contribution statement: Structural annotation of MS/MS spectra from untargeted metabolomics and exposomics datasets is a major bottleneck in gaining new biological insights. Machine learning models to convert spectra into molecular fingerprints can help in the annotation process. Here, we present IDSL_MINT, a new, easy-to-use and customizable deep-learning framework to train and utilize new models to predict molecular fingerprints from spectra for the compound annotation workflows.
Affiliation(s)
- Sadjad Fakouri Baygi
- Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, CAM Building, 3rd Floor, 17 E 102 St, New York, NY, 10029, USA
| | - Dinesh Kumar Barupal
- Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, CAM Building, 3rd Floor, 17 E 102 St, New York, NY, 10029, USA.
|
24
|
Yin X, Hsieh CY, Wang X, Wu Z, Ye Q, Bao H, Deng Y, Chen H, Luo P, Liu H, Hou T, Yao X. Enhancing Generic Reaction Yield Prediction through Reaction Condition-Based Contrastive Learning. Research (Washington, D.C.) 2024; 7:0292. [PMID: 38213662 PMCID: PMC10777739 DOI: 10.34133/research.0292] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Accepted: 12/06/2023] [Indexed: 01/13/2024]
Abstract
Deep learning (DL)-driven efficient synthesis planning may profoundly transform the paradigm for designing novel pharmaceuticals and materials. However, the progress of many DL-assisted synthesis planning (DASP) algorithms has suffered from the lack of reliable automated pathway evaluation tools. As a critical metric for evaluating chemical reactions, accurate prediction of reaction yields helps improve the practicality of DASP algorithms in the real-world scenarios. Currently, accurately predicting yields of interesting reactions still faces numerous challenges, mainly including the absence of high-quality generic reaction yield datasets and robust generic yield predictors. To compensate for the limitations of high-throughput yield datasets, we curated a generic reaction yield dataset containing 12 reaction categories and rich reaction condition information. Subsequently, by utilizing 2 pretraining tasks based on chemical reaction masked language modeling and contrastive learning, we proposed a powerful bidirectional encoder representations from transformers (BERT)-based reaction yield predictor named Egret. It achieved comparable or even superior performance to the best previous models on 4 benchmark datasets and established state-of-the-art performance on the newly curated dataset. We found that reaction-condition-based contrastive learning enhances the model's sensitivity to reaction conditions, and Egret is capable of capturing subtle differences between reactions involving identical reactants and products but different reaction conditions. Furthermore, we proposed a new scoring function that incorporated Egret into the evaluation of multistep synthesis routes. Test results showed that yield-incorporated scoring facilitated the prioritization of literature-supported high-yield reaction pathways for target molecules. 
In addition, through a meta-learning strategy, we further improved the reliability of the model's predictions for reaction types with limited data and lower data quality. Our results suggest that Egret holds the potential to become an essential component of the next-generation DASP tools.
Affiliation(s)
- Xiaodan Yin
- Dr. Neher’s Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao 999078, China
- CarbonSilicon AI Technology Co. Ltd, Hangzhou, Zhejiang 310018, China
| | - Chang-Yu Hsieh
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Xiaorui Wang
- Dr. Neher’s Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine,
Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao 999078, China
- CarbonSilicon AI Technology Co. Ltd, Hangzhou, Zhejiang 310018, China
| | - Zhenxing Wu
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- CarbonSilicon AI Technology Co. Ltd, Hangzhou, Zhejiang 310018, China
| | - Qing Ye
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- CarbonSilicon AI Technology Co. Ltd, Hangzhou, Zhejiang 310018, China
| | - Honglei Bao
- Dr. Neher’s Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine,
Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao 999078, China
| | - Yafeng Deng
- CarbonSilicon AI Technology Co. Ltd, Hangzhou, Zhejiang 310018, China
| | - Hongming Chen
- Center of Chemistry and Chemical Biology,
Guangzhou Regenerative Medicine and Health Guangdong Laboratory, Guangzhou 510530, China
| | - Pei Luo
- Dr. Neher’s Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine,
Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao 999078, China
| | - Huanxiang Liu
- Faculty of Applied Sciences,
Macao Polytechnic University, Macao 999078, China
| | - Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
| | - Xiaojun Yao
- Faculty of Applied Sciences,
Macao Polytechnic University, Macao 999078, China
25
Bi Z. Cognition of Time and Thinking Beyond. Advances in Experimental Medicine and Biology 2024; 1455:171-195. [PMID: 38918352 DOI: 10.1007/978-3-031-60183-5_10] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/27/2024]
Abstract
A common research protocol in cognitive neuroscience is to train subjects to perform deliberately designed experiments while recording brain activity, with the aim of understanding the brain mechanisms underlying cognition. However, how the results of this protocol of research can be applied in technology is seldom discussed. Here, I review the studies on time processing of the brain as examples of this research protocol, as well as two main application areas of neuroscience (neuroengineering and brain-inspired artificial intelligence). Time processing is a fundamental dimension of cognition, and time is also an indispensable dimension of any real-world signal to be processed in technology. Therefore, one may expect that the studies of time processing in cognition profoundly influence brain-related technology. Surprisingly, I found that the results from cognitive studies on time processing are hardly helpful in solving practical problems. This awkward situation may be due to the lack of generalizability of the results of cognitive studies, which are under well-controlled laboratory conditions, to real-life situations. This lack of generalizability may be rooted in the fundamental unknowability of the world (including cognition). Overall, this paper questions and criticizes the usefulness and prospect of the abovementioned research protocol of cognitive neuroscience. I then give three suggestions for future research. First, to improve the generalizability of research, it is better to study brain activity under real-life conditions instead of in well-controlled laboratory experiments. Second, to overcome the unknowability of the world, we can engineer an easily accessible surrogate of the object under investigation, so that we can predict the behavior of the object under investigation by experimenting on the surrogate. Third, the paper calls for technology-oriented research, with the aim of technology creation instead of knowledge discovery.
Affiliation(s)
- Zedong Bi
  - Lingang Laboratory, Shanghai, China.
  - Institute for Future, Qingdao University, Qingdao, China.
  - School of Automation, Shandong Key Laboratory of Industrial Control Technology, Qingdao University, Qingdao, China.
26
Day EC, Chittari SS, Bogen MP, Knight AS. Navigating the Expansive Landscapes of Soft Materials: A User Guide for High-Throughput Workflows. ACS Polymers Au 2023; 3:406-427. [PMID: 38107416 PMCID: PMC10722570 DOI: 10.1021/acspolymersau.3c00025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Revised: 11/02/2023] [Accepted: 11/07/2023] [Indexed: 12/19/2023]
Abstract
Synthetic polymers are highly customizable with tailored structures and functionality, yet this versatility generates challenges in the design of advanced materials due to the size and complexity of the design space. Thus, exploration and optimization of polymer properties using combinatorial libraries has become increasingly common, which requires careful selection of synthetic strategies, characterization techniques, and rapid processing workflows to obtain fundamental principles from these large data sets. Herein, we provide guidelines for strategic design of macromolecule libraries and workflows to efficiently navigate these high-dimensional design spaces. We describe synthetic methods for multiple library sizes and structures as well as characterization methods to rapidly generate data sets, including tools that can be adapted from biological workflows. We further highlight relevant insights from statistics and machine learning to aid in data featurization, representation, and analysis. This Perspective acts as a "user guide" for researchers interested in leveraging high-throughput screening toward the design of multifunctional polymers and predictive modeling of structure-property relationships in soft materials.
Affiliation(s)
- Matthew P. Bogen
  - Department of Chemistry, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
- Abigail S. Knight
  - Department of Chemistry, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
27
Suvarna M, Vaucher AC, Mitchell S, Laino T, Pérez-Ramírez J. Language models and protocol standardization guidelines for accelerating synthesis planning in heterogeneous catalysis. Nat Commun 2023; 14:7964. [PMID: 38042926 PMCID: PMC10693572 DOI: 10.1038/s41467-023-43836-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2023] [Accepted: 11/22/2023] [Indexed: 12/04/2023] Open
Abstract
Synthesis protocol exploration is paramount in catalyst discovery, yet keeping pace with rapid literature advances is increasingly time-intensive. Automated synthesis protocol analysis is attractive for swiftly identifying opportunities and informing predictive models; however, such applications in heterogeneous catalysis remain limited. In this proof-of-concept, we introduce a transformer model for this task, exemplified using single-atom heterogeneous catalysts (SACs), a rapidly expanding catalyst family. Our model adeptly converts SAC protocols into action sequences, and we use this output to facilitate statistical inference of their synthesis trends and applications, potentially expediting literature review and analysis. We demonstrate the model's adaptability across distinct heterogeneous catalyst families, underscoring its versatility. Finally, our study highlights a critical issue: the lack of standardization in reporting protocols hampers machine-reading capabilities. Embracing digital advances in catalysis demands a shift in data reporting norms, and to this end, we offer guidelines for writing protocols, significantly improving machine-readability. We release our model as an open-source web application, inviting a fresh approach to accelerate heterogeneous catalysis synthesis planning.
Affiliation(s)
- Manu Suvarna
  - Institute for Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zurich, Vladimir-Prelog-Weg 1, 8093, Zurich, Switzerland
- Sharon Mitchell
  - Institute for Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zurich, Vladimir-Prelog-Weg 1, 8093, Zurich, Switzerland
- Teodoro Laino
  - IBM Research Europe, Säumerstrasse 4, 8803, Rüschlikon, Switzerland.
- Javier Pérez-Ramírez
  - Institute for Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zurich, Vladimir-Prelog-Weg 1, 8093, Zurich, Switzerland.
28
Toniato A, Vaucher AC, Lehmann MM, Luksch T, Schwaller P, Stenta M, Laino T. Fast Customization of Chemical Language Models to Out-of-Distribution Data Sets. Chemistry of Materials 2023; 35:8806-8815. [PMID: 38027545 PMCID: PMC10653079 DOI: 10.1021/acs.chemmater.3c01406] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2023] [Revised: 10/09/2023] [Accepted: 10/09/2023] [Indexed: 12/01/2023]
Abstract
The world is on the verge of a new industrial revolution, and language models are poised to play a pivotal role in this transformative era. Their ability to offer intelligent insights and forecasts has made them a valuable asset for businesses seeking a competitive advantage. The chemical industry, in particular, can benefit significantly from harnessing their power. Since as early as 2016, language models have been applied to tasks such as predicting reaction outcomes or retrosynthetic routes. While such models have demonstrated impressive abilities, the lack of publicly available data sets with universal coverage is often the limiting factor for achieving even higher accuracies. This makes it imperative for organizations to incorporate proprietary data sets into their model training processes to improve their performance. So far, however, these data sets frequently remain untapped as there are no established criteria for model customization. In this work, we report a successful methodology for retraining language models on reaction outcome prediction and single-step retrosynthesis tasks, using proprietary, nonpublic data sets. We report a considerable boost in accuracy by combining patent and proprietary data in a multidomain learning formulation. This exercise, inspired by a real-world use case, enables us to formulate guidelines that can be adopted in different corporate settings to customize chemical language models easily.
Affiliation(s)
- Alessandra Toniato
  - IBM Research Europe, Rüschlikon 8803, Switzerland
  - National Center for Competence in Research-Catalysis (NCCR-Catalysis), 8093 Zürich, Switzerland
- Alain C. Vaucher
  - IBM Research Europe, Rüschlikon 8803, Switzerland
  - National Center for Competence in Research-Catalysis (NCCR-Catalysis), 8093 Zürich, Switzerland
- Philippe Schwaller
  - IBM Research Europe, Rüschlikon 8803, Switzerland
  - National Center for Competence in Research-Catalysis (NCCR-Catalysis), 8093 Zürich, Switzerland
- Marco Stenta
  - Syngenta Crop Protection AG, Stein 4332, Switzerland
- Teodoro Laino
  - IBM Research Europe, Rüschlikon 8803, Switzerland
  - National Center for Competence in Research-Catalysis (NCCR-Catalysis), 8093 Zürich, Switzerland
29
Shilpa S, Kashyap G, Sunoj RB. Recent Applications of Machine Learning in Molecular Property and Chemical Reaction Outcome Predictions. J Phys Chem A 2023; 127:8253-8271. [PMID: 37769193 DOI: 10.1021/acs.jpca.3c04779] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Indexed: 09/30/2023]
Abstract
Burgeoning developments in machine learning (ML) and its rapidly growing adaptations in chemistry are noteworthy. Motivated by the successful deployments of ML in the realm of molecular property prediction (MPP) and chemical reaction prediction (CRP), herein we highlight some of its most recent applications in predictive chemistry. We present a nonmathematical and concise overview of the progression of ML implementations, ranging from an ensemble-based random forest model to advanced graph neural network algorithms. Similarly, the prospects of various feature engineering and feature learning approaches that work in conjunction with ML models are described. Highly accurate predictions reported in MPP tasks (e.g., lipophilicity, solubility, distribution coefficient), using methods such as D-MPNN, MolCLR, SMILES-BERT, and MolBERT, offer promising avenues in molecular design and drug discovery. Whereas MPP pertains to a given molecule, ML applications in chemical reactions present a different level of challenge, primarily arising from the simultaneous involvement of multiple molecules and their diverse roles in a reaction setting. The reported RMSEs in MPP tasks range from 0.287 to 2.20, while those for yield predictions are well over 4.9 at the lower end, reaching thresholds of >10.0 in several examples. Our Review concludes with a set of persisting challenges in dealing with reaction data sets and an overall optimistic outlook on benefits of ML-driven workflows for various MPP as well as CRP tasks.
Affiliation(s)
- Shilpa Shilpa
  - Department of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
- Gargee Kashyap
  - Department of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
- Raghavan B Sunoj
  - Department of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
  - Centre for Machine Intelligence and Data Science, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
30
Cremer J, Medrano Sandonas L, Tkatchenko A, Clevert DA, De Fabritiis G. Equivariant Graph Neural Networks for Toxicity Prediction. Chem Res Toxicol 2023; 36. [PMID: 37690056 PMCID: PMC10583285 DOI: 10.1021/acs.chemrestox.3c00032] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/03/2023] [Indexed: 09/12/2023]
Abstract
Predictive modeling of toxicity is a crucial step in the drug discovery pipeline. It can help filter out molecules with a high probability of failing in the early stages of de novo drug design. Thus, several machine learning (ML) models have been developed to predict the toxicity of molecules by combining classical ML techniques or deep neural networks with well-known molecular representations such as fingerprints or 2D graphs. But the more natural, accurate representation of molecules is expected to be defined in physical 3D space, as in ab initio methods. Recent studies successfully used equivariant graph neural networks (EGNNs) for representation learning based on 3D structures to predict quantum-mechanical properties of molecules. Inspired by this, we investigated the performance of EGNNs to construct reliable ML models for toxicity prediction. We used the equivariant transformer (ET) model in TorchMD-NET for this. Eleven toxicity data sets taken from MoleculeNet, TDCommons, and ToxBenchmark have been considered to evaluate the capability of ET for toxicity prediction. Our results show that ET adequately learns 3D representations of molecules that can successfully correlate with toxicity activity, achieving good accuracies on most data sets comparable to state-of-the-art models. We also test a physicochemical property, namely, the total energy of a molecule, to inform the toxicity prediction with a physical prior. However, our work suggests that these two properties cannot be related. We also provide an attention weight analysis for helping to understand the toxicity prediction in 3D space and thus increase the explainability of the ML model. In summary, our findings offer promising insights considering 3D geometry information via EGNNs and provide a straightforward way to integrate molecular conformers into ML-based pipelines for predicting and investigating toxicity in physical space. We expect that in the future, especially for larger, more diverse data sets, EGNNs will be an essential tool in this domain.
Affiliation(s)
- Julian Cremer
  - Computational Science Laboratory, Universitat Pompeu Fabra, Barcelona Biomedical Research Park (PRBB), Carrer Dr. Aiguader 88, 08003 Barcelona, Spain
  - Machine Learning Research, Pfizer Worldwide Research Development and Medical, Linkstr. 10, 10785 Berlin, Germany
- Leonardo Medrano Sandonas
  - Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg
- Alexandre Tkatchenko
  - Department of Physics and Materials Science, University of Luxembourg, L-1511 Luxembourg City, Luxembourg
- Djork-Arné Clevert
  - Machine Learning Research, Pfizer Worldwide Research Development and Medical, Linkstr. 10, 10785 Berlin, Germany
- Gianni De Fabritiis
  - Computational Science Laboratory, Universitat Pompeu Fabra, Barcelona Biomedical Research Park (PRBB), Carrer Dr. Aiguader 88, 08003 Barcelona, Spain
  - ICREA, Passeig Lluis Companys 23, 08010 Barcelona, Spain
31
Li B, Su S, Zhu C, Lin J, Hu X, Su L, Yu Z, Liao K, Chen H. A deep learning framework for accurate reaction prediction and its application on high-throughput experimentation data. J Cheminform 2023; 15:72. [PMID: 37568183 PMCID: PMC10422736 DOI: 10.1186/s13321-023-00732-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2022] [Accepted: 06/30/2023] [Indexed: 08/13/2023] Open
Abstract
In recent years, artificial intelligence (AI) has started to bring revolutionary changes to chemical synthesis. However, the lack of suitable ways of representing chemical reactions and the scarcity of reaction data have limited the wider application of AI to reaction prediction. Here, we introduce a novel reaction representation, GraphRXN, for reaction prediction. It utilizes a universal graph-based neural network framework to encode chemical reactions by directly taking two-dimensional reaction structures as inputs. The GraphRXN model was evaluated on three publicly available chemical reaction datasets and gave on-par or superior results compared with other baseline models. To further evaluate the effectiveness of GraphRXN, wet-lab experiments were carried out to generate reaction data. A GraphRXN model was then built on the high-throughput experimentation data, and a decent accuracy (R² of 0.712) was obtained on our in-house data. This highlights that the GraphRXN model can be deployed in an integrated workflow which combines robotics and AI technologies for forward reaction prediction.
Affiliation(s)
- Baiqing Li
  - Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China
- Shimin Su
  - Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China
- Chan Zhu
  - Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China
- Jie Lin
  - Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China
- Xinyue Hu
  - Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China
- Lebin Su
  - Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China
- Zhunzhun Yu
  - Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China
- Kuangbiao Liao
  - Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China.
- Hongming Chen
  - Guangzhou Laboratory, Guangzhou, 510005, Guangdong, China.
32
Wang H, Fu T, Du Y, Gao W, Huang K, Liu Z, Chandak P, Liu S, Van Katwyk P, Deac A, Anandkumar A, Bergen K, Gomes CP, Ho S, Kohli P, Lasenby J, Leskovec J, Liu TY, Manrai A, Marks D, Ramsundar B, Song L, Sun J, Tang J, Veličković P, Welling M, Zhang L, Coley CW, Bengio Y, Zitnik M. Scientific discovery in the age of artificial intelligence. Nature 2023; 620:47-60. [PMID: 37532811 DOI: 10.1038/s41586-023-06221-2] [Citation(s) in RCA: 76] [Impact Index Per Article: 76.0] [Received: 03/30/2022] [Accepted: 05/16/2023] [Indexed: 08/04/2023]
Abstract
Artificial intelligence (AI) is being increasingly integrated into scientific discovery to augment and accelerate research, helping scientists to generate hypotheses, design experiments, collect and interpret large datasets, and gain insights that might not have been possible using traditional scientific methods alone. Here we examine breakthroughs over the past decade that include self-supervised learning, which allows models to be trained on vast amounts of unlabelled data, and geometric deep learning, which leverages knowledge about the structure of scientific data to enhance model accuracy and efficiency. Generative AI methods can create designs, such as small-molecule drugs and proteins, by analysing diverse data modalities, including images and sequences. We discuss how these methods can help scientists throughout the scientific process and the central issues that remain despite such advances. Both developers and users of AI tools need a better understanding of when such approaches need improvement, and challenges posed by poor data quality and stewardship remain. These issues cut across scientific disciplines and require developing foundational algorithmic approaches that can contribute to scientific understanding or acquire it autonomously, making them critical areas of focus for AI innovation.
Affiliation(s)
- Hanchen Wang
  - Department of Engineering, University of Cambridge, Cambridge, UK
  - Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA
  - Department of Research and Early Development, Genentech Inc, South San Francisco, CA, USA
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Tianfan Fu
  - Department of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA
- Yuanqi Du
  - Department of Computer Science, Cornell University, Ithaca, NY, USA
- Wenhao Gao
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Kexin Huang
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Ziming Liu
  - Department of Physics, Massachusetts Institute of Technology, Cambridge, MA, USA
- Payal Chandak
  - Harvard-MIT Program in Health Sciences and Technology, Cambridge, MA, USA
- Shengchao Liu
  - Mila - Quebec AI Institute, Montreal, Quebec, Canada
  - Université de Montréal, Montreal, Quebec, Canada
- Peter Van Katwyk
  - Department of Earth, Environmental and Planetary Sciences, Brown University, Providence, RI, USA
  - Data Science Institute, Brown University, Providence, RI, USA
- Andreea Deac
  - Mila - Quebec AI Institute, Montreal, Quebec, Canada
  - Université de Montréal, Montreal, Quebec, Canada
- Anima Anandkumar
  - Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA
  - NVIDIA, Santa Clara, CA, USA
- Karianne Bergen
  - Department of Earth, Environmental and Planetary Sciences, Brown University, Providence, RI, USA
  - Data Science Institute, Brown University, Providence, RI, USA
- Carla P Gomes
  - Department of Computer Science, Cornell University, Ithaca, NY, USA
- Shirley Ho
  - Center for Computational Astrophysics, Flatiron Institute, New York, NY, USA
  - Department of Astrophysical Sciences, Princeton University, Princeton, NJ, USA
  - Department of Physics, Carnegie Mellon University, Pittsburgh, PA, USA
  - Department of Physics and Center for Data Science, New York University, New York, NY, USA
- Joan Lasenby
  - Department of Engineering, University of Cambridge, Cambridge, UK
- Jure Leskovec
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Arjun Manrai
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Debora Marks
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Le Song
  - BioMap, Beijing, China
  - Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
- Jimeng Sun
  - University of Illinois at Urbana-Champaign, Champaign, IL, USA
- Jian Tang
  - Mila - Quebec AI Institute, Montreal, Quebec, Canada
  - HEC Montréal, Montreal, Quebec, Canada
  - CIFAR AI Chair, Toronto, Ontario, Canada
- Petar Veličković
  - Google DeepMind, London, UK
  - Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
- Max Welling
  - University of Amsterdam, Amsterdam, Netherlands
  - Microsoft Research Amsterdam, Amsterdam, Netherlands
- Linfeng Zhang
  - DP Technology, Beijing, China
  - AI for Science Institute, Beijing, China
- Connor W Coley
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Yoshua Bengio
  - Mila - Quebec AI Institute, Montreal, Quebec, Canada
  - Université de Montréal, Montreal, Quebec, Canada
- Marinka Zitnik
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA.
  - Harvard Data Science Initiative, Cambridge, MA, USA.
  - Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Cambridge, MA, USA.
33
Zhong W, Yang Z, Chen CYC. Retrosynthesis prediction using an end-to-end graph generative architecture for molecular graph editing. Nat Commun 2023; 14:3009. [PMID: 37230985 DOI: 10.1038/s41467-023-38851-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/16/2022] [Accepted: 05/17/2023] [Indexed: 05/27/2023] Open
Abstract
Retrosynthesis planning, the process of identifying a set of available reactions to synthesize the target molecules, remains a major challenge in organic synthesis. Recently, computer-aided synthesis planning has gained renewed interest and various retrosynthesis prediction algorithms based on deep learning have been proposed. However, most existing methods are limited in the applicability and interpretability of model predictions, and further improvement of predictive accuracy to a more practical level is still required. In this work, inspired by the arrow-pushing formalism in chemical reaction mechanisms, we present an end-to-end architecture for retrosynthesis prediction called Graph2Edits. Specifically, Graph2Edits is based on a graph neural network to predict the edits of the product graph in an auto-regressive manner, and sequentially generates transformation intermediates and final reactants according to the predicted edits sequence. This strategy combines the two-stage processes of semi-template-based methods into one-pot learning, improving the applicability in some complicated reactions, and also making its predictions more interpretable. Evaluated on the standard benchmark dataset USPTO-50k, our model achieves state-of-the-art performance for semi-template-based retrosynthesis with a promising 55.1% top-1 accuracy.
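For readers unfamiliar with the top-1 metric quoted above, the following is a minimal sketch of how top-k accuracy is commonly computed for ranked predictions. It is not code from Graph2Edits; the function name and the placeholder labels are hypothetical, and it assumes predictions and ground truth are compared as exact strings (in practice, reactant sets are usually canonicalized before comparison).

```python
def top_k_accuracy(ranked_predictions, ground_truth, k=1):
    """Fraction of examples whose true answer appears among the first k predictions."""
    if len(ranked_predictions) != len(ground_truth):
        raise ValueError("predictions and ground truth must have the same length")
    if not ground_truth:
        raise ValueError("at least one example is required")
    hits = sum(
        truth in preds[:k]
        for preds, truth in zip(ranked_predictions, ground_truth)
    )
    return hits / len(ground_truth)


# Hypothetical ranked outputs for four products (best guess first).
predictions = [["A", "B"], ["C", "D"], ["E", "F"], ["G", "H"]]
truth = ["A", "D", "X", "G"]

top1 = top_k_accuracy(predictions, truth, k=1)
top2 = top_k_accuracy(predictions, truth, k=2)
```

Here the first and fourth examples are correct at rank 1, and the second becomes correct once the second-ranked prediction is allowed.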
Affiliation(s)
- Weihe Zhong
  - Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, 518107, China
  - School of Biomedical Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, 518107, China
- Ziduo Yang
  - Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, 518107, China
- Calvin Yu-Chian Chen
  - Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, 518107, China.
  - Department of Medical Research, China Medical University Hospital, Taichung, 40447, Taiwan.
  - Department of Bioinformatics and Medical Engineering, Asia University, Taichung, 41354, Taiwan.
34
Chen K, Chen G, Li J, Huang Y, Wang E, Hou T, Heng PA. MetaRF: attention-based random forest for reaction yield prediction with a few trails. J Cheminform 2023; 15:43. [PMID: 37038222 PMCID: PMC10084704 DOI: 10.1186/s13321-023-00715-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 01/03/2023] [Accepted: 03/21/2023] [Indexed: 04/12/2023] Open
Abstract
Artificial intelligence has deeply revolutionized the field of medicinal chemistry with many impressive applications, but the success of these applications requires a massive amount of training samples with high-quality annotations, which seriously limits the wide usage of data-driven methods. In this paper, we focus on the reaction yield prediction problem, which assists chemists in selecting high-yield reactions in a new chemical space with only a few experimental trials. To address this challenge, we first put forth MetaRF, an attention-based random forest model specially designed for few-shot yield prediction, where the attention weight of a random forest is automatically optimized by the meta-learning framework and can be quickly adapted to predict the performance of new reagents when given a few additional samples. To improve the few-shot learning performance, we further introduce a dimension-reduction-based sampling method to determine valuable samples to be experimentally tested and then learned. Our methodology is evaluated on three different datasets and acquires satisfactory performance on few-shot prediction. In high-throughput experimentation (HTE) datasets, the average yield of our methodology's top 10 high-yield reactions is relatively close to the results of ideal yield selection.
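The comparison in the last sentence can be made concrete with a small sketch: take the n reactions a model ranks highest, average their measured yields, and compare that with the average of the n best measured yields (the ideal selection). This is an illustrative reading of the metric, not code from MetaRF; the function names and the numbers are hypothetical.

```python
def mean_yield_of_top_n(predicted, measured, n=10):
    """Average measured yield of the n reactions with the highest predicted yield."""
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured must have the same length")
    if not measured or n < 1:
        raise ValueError("need at least one reaction and n >= 1")
    order = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    chosen = order[:n]
    return sum(measured[i] for i in chosen) / len(chosen)


def ideal_mean_yield(measured, n=10):
    """Average of the n highest measured yields (the best possible selection)."""
    if not measured or n < 1:
        raise ValueError("need at least one reaction and n >= 1")
    best = sorted(measured, reverse=True)[:n]
    return sum(best) / len(best)


# Hypothetical model scores and measured yields (percent) for five reactions.
predicted = [0.9, 0.2, 0.8, 0.4, 0.7]
measured = [80.0, 10.0, 60.0, 90.0, 50.0]

selected = mean_yield_of_top_n(predicted, measured, n=2)
ideal = ideal_mean_yield(measured, n=2)
```

The gap between the two averages indicates how much yield is lost by trusting the model's ranking instead of an oracle.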
Affiliation(s)
- Kexin Chen
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, New Territories, Hong Kong SAR
- Yuansheng Huang
  - College of Pharmaceutical Sciences, Zhejiang University, Zhejiang, China
- Ercheng Wang
  - Zhejiang Lab, Zhejiang, China
  - College of Pharmaceutical Sciences, Zhejiang University, Zhejiang, China
- Tingjun Hou
  - College of Pharmaceutical Sciences, Zhejiang University, Zhejiang, China
- Pheng-Ann Heng
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, New Territories, Hong Kong SAR
  - Zhejiang Lab, Zhejiang, China
| |
Collapse
|
35
|
Brinkhaus HO, Rajan K, Schaub J, Zielesny A, Steinbeck C. Open data and algorithms for open science in AI-driven molecular informatics. Curr Opin Struct Biol 2023; 79:102542. [PMID: 36805192 DOI: 10.1016/j.sbi.2023.102542] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2022] [Revised: 01/10/2023] [Accepted: 01/13/2023] [Indexed: 02/19/2023]
Abstract
Recent years have seen a sharp increase in the development of deep learning and artificial intelligence-based molecular informatics. There has been a growing interest in applying deep learning to several subfields, including the digital transformation of synthetic chemistry, extraction of chemical information from the scientific literature, and AI in natural product-based drug discovery. The application of AI to molecular informatics is still constrained by the fact that most of the data used for training and testing deep learning models are not available as FAIR and open data. As open science practices continue to grow in popularity, initiatives which support FAIR and open data as well as open-source software have emerged. It is becoming increasingly important for researchers in the field of molecular informatics to embrace open science and to submit data and software in open repositories. With the advent of open-source deep learning frameworks and cloud computing platforms, academic researchers are now able to deploy and test their own deep learning models with ease. With the development of new and faster hardware for deep learning and the increasing number of initiatives towards digital research data management infrastructures, as well as a culture promoting open data, open source, and open science, AI-driven molecular informatics will continue to grow. This review examines the current state of open data and open algorithms in molecular informatics, as well as ways in which they could be improved in future.
Affiliation(s)
- Henning Otto Brinkhaus
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessingstr. 8, 07743 Jena, Germany
- Kohulan Rajan
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessingstr. 8, 07743 Jena, Germany
- Jonas Schaub
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessingstr. 8, 07743 Jena, Germany
- Achim Zielesny
  - Institute for Bioinformatics and Chemoinformatics, Westphalian University of Applied Sciences, August-Schmidt-Ring 10, 45665 Recklinghausen, Germany
- Christoph Steinbeck
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Lessingstr. 8, 07743 Jena, Germany
|
36
|
Jaume-Santero F, Bornet A, Valery A, Naderi N, Vicente Alvarez D, Proios D, Yazdani A, Bournez C, Fessard T, Teodoro D. Transformer Performance for Chemical Reactions: Analysis of Different Predictive and Evaluation Scenarios. J Chem Inf Model 2023; 63:1914-1924. [PMID: 36952584 PMCID: PMC10091402 DOI: 10.1021/acs.jcim.2c01407] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/25/2023]
Abstract
The prediction of chemical reaction pathways has been accelerated by the development of novel machine learning architectures based on the deep learning paradigm. In this context, deep neural networks initially designed for language translation have been used to accurately predict a wide range of chemical reactions. Among models suited for the task of language translation, the recently introduced molecular transformer reached impressive performance in terms of forward-synthesis and retrosynthesis predictions. In this study, we first present an analysis of the performance of transformer models for product, reactant, and reagent prediction tasks under different scenarios of data availability and data augmentation. We find that the impact of data augmentation depends on the prediction task and on the metric used to evaluate the model performance. Second, we probe the contribution of different combinations of input formats, tokenization schemes, and embedding strategies to model performance. We find that less stable input settings generally lead to better performance. Lastly, we validate the superiority of round-trip accuracy over simpler evaluation metrics, such as top-k accuracy, using a committee of human experts and show a strong agreement for predictions that pass the round-trip test. This demonstrates the usefulness of more elaborate metrics in complex predictive scenarios and highlights the limitations of direct comparisons to a predefined database, which may include a limited number of chemical reaction pathways.
Affiliation(s)
- Fernando Jaume-Santero
  - Department of Radiology and Medical Informatics, University of Geneva, 1205 Geneva, Switzerland
  - Geneva School of Business Administration, HES-SO University of Applied Sciences and Arts of Western Switzerland, 1227 Geneva, Switzerland
- Alban Bornet
  - Department of Radiology and Medical Informatics, University of Geneva, 1205 Geneva, Switzerland
  - Geneva School of Business Administration, HES-SO University of Applied Sciences and Arts of Western Switzerland, 1227 Geneva, Switzerland
- Nona Naderi
  - Geneva School of Business Administration, HES-SO University of Applied Sciences and Arts of Western Switzerland, 1227 Geneva, Switzerland
  - Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- David Vicente Alvarez
  - Department of Radiology and Medical Informatics, University of Geneva, 1205 Geneva, Switzerland
  - Geneva School of Business Administration, HES-SO University of Applied Sciences and Arts of Western Switzerland, 1227 Geneva, Switzerland
- Dimitrios Proios
  - Department of Radiology and Medical Informatics, University of Geneva, 1205 Geneva, Switzerland
- Anthony Yazdani
  - Department of Radiology and Medical Informatics, University of Geneva, 1205 Geneva, Switzerland
- Douglas Teodoro
  - Department of Radiology and Medical Informatics, University of Geneva, 1205 Geneva, Switzerland
  - Geneva School of Business Administration, HES-SO University of Applied Sciences and Arts of Western Switzerland, 1227 Geneva, Switzerland
  - Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
|
37
|
Chen Y, Ou Y, Zheng P, Huang Y, Ge F, Dral PO. Benchmark of general-purpose machine learning-based quantum mechanical method AIQM1 on reaction barrier heights. J Chem Phys 2023; 158:074103. [PMID: 36813722 DOI: 10.1063/5.0137101] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 02/01/2023]
Abstract
Artificial intelligence-enhanced quantum mechanical method 1 (AIQM1) is a general-purpose method that was shown to achieve high accuracy for many applications with a speed close to its baseline semiempirical quantum mechanical (SQM) method ODM2*. Here, we evaluate the hitherto unknown performance of out-of-the-box AIQM1 without any refitting for reaction barrier heights on eight datasets, including a total of ∼24 thousand reactions. This evaluation shows that AIQM1's accuracy strongly depends on the type of transition state and ranges from excellent for rotation barriers to poor for, e.g., pericyclic reactions. AIQM1 clearly outperforms its baseline ODM2* method and, even more so, a popular universal potential, ANI-1ccx. Overall, however, AIQM1 accuracy largely remains similar to SQM methods (and B3LYP/6-31G* for most reaction types) suggesting that it is desirable to focus on improving AIQM1 performance for barrier heights in the future. We also show that the built-in uncertainty quantification helps in identifying confident predictions. The accuracy of confident AIQM1 predictions is approaching the level of popular density functional theory methods for most reaction types. Encouragingly, AIQM1 is rather robust for transition state optimizations, even for the type of reactions it struggles with the most. Single-point calculations with high-level methods on AIQM1-optimized geometries can be used to significantly improve barrier heights, which cannot be said for its baseline ODM2* method.
Affiliation(s)
- Yuxinxin Chen
  - State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
- Yanchi Ou
  - State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
- Peikun Zheng
  - State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
- Yaohuang Huang
  - State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
- Fuchun Ge
  - State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
- Pavlo O Dral
  - State Key Laboratory of Physical Chemistry of Solid Surfaces, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, Department of Chemistry, and College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, China
|
38
|
Neves P, McClure K, Verhoeven J, Dyubankova N, Nugmanov R, Gedich A, Menon S, Shi Z, Wegner JK. Global reactivity models are impactful in industrial synthesis applications. J Cheminform 2023; 15:20. [PMID: 36774523 PMCID: PMC9921076 DOI: 10.1186/s13321-023-00685-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 11/16/2022] [Accepted: 01/22/2023] [Indexed: 02/13/2023]
Abstract
Artificial Intelligence is revolutionizing many aspects of the pharmaceutical industry. Deep learning models are now routinely applied to guide drug discovery projects leading to faster and improved findings, but there are still many tasks with enormous unrealized potential. One such task is the reaction yield prediction. Every year more than one fifth of all synthesis attempts result in product yields which are either zero or too low. This equates to chemical and human resources being spent on activities which ultimately do not progress the programs, leading to a triple loss when accounting for the cost of opportunity in time wasted. In this work we pre-train a BERT model on more than 16 million reactions from 4 different data sources, and fine tune it to achieve an uncertainty calibrated global yield prediction model. This model is an improvement upon state of the art not just from the increase in pre-train data but also by introducing a new embedding layer which solves a few limitations of SMILES and enables integration of additional information such as equivalents and molecule role into the reaction encoding, the model is called BERT Enriched Embedding (BEE). The model is benchmarked on an open-source dataset against a state-of-the-art synthesis focused BERT showing a near 20-point improvement in r2 score. The model is fine-tuned and tested on an internal company data benchmark, and a prospective study shows that the application of the model can reduce the total number of negative reactions (yield under 5%) ran in Janssen by at least 34%. Lastly, we corroborate the previous results through experimental validation, by directly deploying the model in an on-going drug discovery project and showing that it can also be used successfully as a reagent recommender due to its fast inference speed and reliable confidence estimation, a critical feature for industry application.
Affiliation(s)
- Paulo Neves
  - In-Silico Discovery and External Innovation (ISDEI), Janssen Research & Development, Janssen Pharmaceutica N.V, Beerse, Belgium
- Kelly McClure
  - Discovery Chemistry LJ, Janssen Research & Development, Janssen Pharmaceutica N.V, Philadelphia, United States of America
- Jonas Verhoeven
  - In-Silico Discovery and External Innovation (ISDEI), Janssen Research & Development, Janssen Pharmaceutica N.V, Beerse, Belgium
- Natalia Dyubankova
  - In-Silico Discovery and External Innovation (ISDEI), Janssen Research & Development, Janssen Pharmaceutica N.V, Beerse, Belgium
- Ramil Nugmanov
  - In-Silico Discovery and External Innovation (ISDEI), Janssen Research & Development, Janssen Pharmaceutica N.V, Beerse, Belgium
- Sairam Menon
  - Pharma R&D Information Tech, Janssen Research & Development, Janssen Pharmaceutica N.V, Beerse, Belgium
- Zhicai Shi
  - Discovery Chemistry LJ, Janssen Research & Development, Janssen Pharmaceutica N.V, Philadelphia, United States of America
- Jörg K. Wegner
  - In-Silico Discovery and External Innovation (ISDEI), Janssen Research & Development, Janssen Pharmaceutica N.V, Beerse, Belgium
|
39
|
Cao Z, Magar R, Wang Y, Barati Farimani A. MOFormer: Self-Supervised Transformer Model for Metal-Organic Framework Property Prediction. J Am Chem Soc 2023; 145:2958-2967. [PMID: 36706365 PMCID: PMC10041520 DOI: 10.1021/jacs.2c11420] [Citation(s) in RCA: 20] [Impact Index Per Article: 20.0] [Indexed: 01/28/2023]
Abstract
Metal-organic frameworks (MOFs) are materials with a high degree of porosity that can be used for many applications. However, the chemical space of MOFs is enormous due to the large variety of possible combinations of building blocks and topology. Discovering the optimal MOFs for specific applications requires an efficient and accurate search over countless potential candidates. Previous high-throughput screening methods using computational simulations like DFT can be time-consuming. Such methods also require the 3D atomic structures of MOFs, which adds one extra step when evaluating hypothetical MOFs. In this work, we propose a structure-agnostic deep learning method based on the Transformer model, named as MOFormer, for property predictions of MOFs. MOFormer takes a text string representation of MOF (MOFid) as input, thus circumventing the need of obtaining the 3D structure of a hypothetical MOF and accelerating the screening process. By comparing to other descriptors such as Stoichiometric-120 and revised autocorrelations, we demonstrate that MOFormer can achieve state-of-the-art structure-agnostic prediction accuracy on all benchmarks. Furthermore, we introduce a self-supervised learning framework that pretrains the MOFormer via maximizing the cross-correlation between its structure-agnostic representations and structure-based representations of the crystal graph convolutional neural network (CGCNN) on >400k publicly available MOF data. Benchmarks show that pretraining improves the prediction accuracy of both models on various downstream prediction tasks. Furthermore, we revealed that MOFormer can be more data-efficient on quantum-chemical property prediction than structure-based CGCNN when training data is limited. Overall, MOFormer provides a novel perspective on efficient MOF property prediction using deep learning.
Affiliation(s)
- Zhonglin Cao
  - Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
- Rishikesh Magar
  - Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
- Yuyang Wang
  - Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
- Amir Barati Farimani
  - Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
  - Department of Chemical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
  - Machine Learning Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
|
40
|
Zhang SQ, Xu LC, Li SW, Oliveira JCA, Li X, Ackermann L, Hong X. Bridging Chemical Knowledge and Machine Learning for Performance Prediction of Organic Synthesis. Chemistry 2023; 29:e202202834. [PMID: 36206170 PMCID: PMC10099903 DOI: 10.1002/chem.202202834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2022] [Indexed: 11/29/2022]
Abstract
Recent years have witnessed a boom of machine learning (ML) applications in chemistry, which reveals the potential of data-driven prediction of synthesis performance. Digitalization and ML modelling are the key strategies to fully exploit the unique potential within the synergistic interplay between experimental data and the robust prediction of performance and selectivity. A series of exciting studies have demonstrated the importance of chemical knowledge implementation in ML, which improves the model's capability for making predictions that are challenging and often go beyond the abilities of human beings. This Minireview summarizes the cutting-edge embedding techniques and model designs in synthetic performance prediction, elaborating how chemical knowledge can be incorporated into machine learning until June 2022. By merging organic synthesis tactics and chemical informatics, we hope this Review can provide a guide map and intrigue chemists to revisit the digitalization and computerization of organic chemistry principles.
Affiliation(s)
- Shuo-Qing Zhang
  - Center of Chemistry for Frontier Technologies, Department of Chemistry, State Key Laboratory of Clean Energy Utilization, Zhejiang University, 38 Zheda Road, Hangzhou, 310027, P. R. China
- Li-Cheng Xu
  - Center of Chemistry for Frontier Technologies, Department of Chemistry, State Key Laboratory of Clean Energy Utilization, Zhejiang University, 38 Zheda Road, Hangzhou, 310027, P. R. China
- Shu-Wen Li
  - Center of Chemistry for Frontier Technologies, Department of Chemistry, State Key Laboratory of Clean Energy Utilization, Zhejiang University, 38 Zheda Road, Hangzhou, 310027, P. R. China
- João C A Oliveira
  - Institut für Organische und Biomolekulare Chemie, Wöhler Research Institute for Sustainable Chemistry (WISCh), Georg-August-Universität, Tammannstraße 2, 37077, Göttingen, Germany
- Xin Li
  - Center of Chemistry for Frontier Technologies, Department of Chemistry, State Key Laboratory of Clean Energy Utilization, Zhejiang University, 38 Zheda Road, Hangzhou, 310027, P. R. China
- Lutz Ackermann
  - Institut für Organische und Biomolekulare Chemie, Wöhler Research Institute for Sustainable Chemistry (WISCh), Georg-August-Universität, Tammannstraße 2, 37077, Göttingen, Germany
- Xin Hong
  - Center of Chemistry for Frontier Technologies, Department of Chemistry, State Key Laboratory of Clean Energy Utilization, Zhejiang University, 38 Zheda Road, Hangzhou, 310027, P. R. China
  - Beijing National Laboratory for Molecular Sciences, Zhongguancun North First Street No. 2, Beijing, 100190, P. R. China
  - Key Laboratory of Precise Synthesis of Functional Molecules of Zhejiang Province, School of Science, Westlake University, 18 Shilongshan Road, Hangzhou, 310024, Zhejiang Province, P. R. China
|
41
|
Tu Z, Stuyver T, Coley CW. Predictive chemistry: machine learning for reaction deployment, reaction development, and reaction discovery. Chem Sci 2023; 14:226-244. [PMID: 36743887 PMCID: PMC9811563 DOI: 10.1039/d2sc05089g] [Citation(s) in RCA: 16] [Impact Index Per Article: 16.0] [Received: 09/12/2022] [Accepted: 11/25/2022] [Indexed: 11/29/2022]
Abstract
The field of predictive chemistry relates to the development of models able to describe how molecules interact and react. It encompasses the long-standing task of computer-aided retrosynthesis, but is far more reaching and ambitious in its goals. In this review, we summarize several areas where predictive chemistry models hold the potential to accelerate the deployment, development, and discovery of organic reactions and advance synthetic chemistry.
Affiliation(s)
- Zhengkai Tu
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
- Thijs Stuyver
  - Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
- Connor W Coley
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
  - Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
|
42
|
Su A, Zhang X, Zhang C, Ding D, Yang YF, Wang K, She YB. Deep transfer learning for predicting frontier orbital energies of organic materials using small data and its application to porphyrin photocatalysts. Phys Chem Chem Phys 2023; 25:10536-10549. [PMID: 36987933 DOI: 10.1039/d3cp00917c] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/30/2023]
Abstract
A deep transfer learning approach is used to predict HOMO/LUMO energies of organic materials with a small amount of training data.
Affiliation(s)
- An Su
  - College of Chemical Engineering, Zhejiang University of Technology, Hangzhou 310014, P. R. China
- Xin Zhang
  - College of Chemical Engineering, Zhejiang University of Technology, Hangzhou 310014, P. R. China
- Chengwei Zhang
  - College of Chemical Engineering, Zhejiang University of Technology, Hangzhou 310014, P. R. China
- Debo Ding
  - College of Chemical Engineering, Zhejiang University of Technology, Hangzhou 310014, P. R. China
- Yun-Fang Yang
  - College of Chemical Engineering, Zhejiang University of Technology, Hangzhou 310014, P. R. China
- Keke Wang
  - College of Chemical Engineering, Zhejiang University of Technology, Hangzhou 310014, P. R. China
- Yuan-Bin She
  - College of Chemical Engineering, Zhejiang University of Technology, Hangzhou 310014, P. R. China
|
43
|
Exploring Deep Learning for Metalloporphyrins: Databases, Molecular Representations, and Model Architectures. Catalysts 2022. [DOI: 10.3390/catal12111485] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Metalloporphyrins have been studied as biomimetic catalysts for more than 120 years and have accumulated a large amount of data, which provides a solid foundation for deep learning to discover chemical trends and structure–function relationships. In this study, key components of deep learning of metalloporphyrins, including databases, molecular representations, and model architectures, were systematically investigated. A protocol to construct canonical SMILES for metalloporphyrins was proposed, which was then used to represent the two-dimensional structures of over 10,000 metalloporphyrins in an existing computational database. Subsequently, several state-of-the-art chemical deep learning models, including graph neural network-based models and natural language processing-based models, were employed to predict the energy gaps of metalloporphyrins. Two models showed satisfactory predictive performance (R² 0.94) with canonical SMILES as the only source of structural information. In addition, an unsupervised visualization algorithm was used to interpret the molecular features learned by the deep learning models.
|
44
|
Chemistry-informed molecular graph as reaction descriptor for machine-learned retrosynthesis planning. Proc Natl Acad Sci U S A 2022; 119:e2212711119. [PMID: 36191228 PMCID: PMC9564830 DOI: 10.1073/pnas.2212711119] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 11/18/2022]
Abstract
Infusing "chemical wisdom" should improve the data-driven approaches that rely exclusively on historical synthetic data for automatic retrosynthesis planning. For this purpose, we designed a chemistry-informed molecular graph (CIMG) to describe chemical reactions. A collection of key information that is most relevant to chemical reactions is integrated in CIMG: NMR chemical shifts as vertex features, bond dissociation energies as edge features, and solvent/catalyst information as global features. For any given compound as a target, a product CIMG is generated and exploited by a graph neural network (GNN) model to choose reaction template(s) leading to this product. A reactant CIMG is then inferred and used in two GNN models to select appropriate catalyst and solvent, respectively. Finally, a fourth GNN model compares the two CIMG descriptors to check the plausibility of the proposed reaction. A reaction vector is obtained for every molecule in training these models. The chemical wisdom of reaction propensity contained in the pretrained reaction vectors is exploited to autocategorize molecules/reactions and to accelerate Monte Carlo tree search (MCTS) for multistep retrosynthesis planning. Full synthetic routes with recommended catalysts/solvents are predicted efficiently using this CIMG-based approach.
|
45
|
Ismail I, Chantreau Majerus R, Habershon S. Graph-Driven Reaction Discovery: Progress, Challenges, and Future Opportunities. J Phys Chem A 2022; 126:7051-7069. [PMID: 36190262 PMCID: PMC9574932 DOI: 10.1021/acs.jpca.2c06408] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Indexed: 11/29/2022]
Abstract
Graph-based descriptors, such as bond-order matrices and adjacency matrices, offer a simple and compact way of categorizing molecular structures; furthermore, such descriptors can be readily used to catalog chemical reactions (i.e., bond-making and -breaking). As such, a number of graph-based methodologies have been developed with the goal of automating the process of generating chemical reaction network models describing the possible mechanistic chemistry in a given set of reactant species. Here, we outline the evolution of these graph-based reaction discovery schemes, with particular emphasis on more recent methods incorporating graph-based methods with semiempirical and ab initio electronic structure calculations, minimum-energy path refinements, and transition state searches. Using representative examples from homogeneous catalysis and interstellar chemistry, we highlight how these schemes increasingly act as "virtual reaction vessels" for interrogating mechanistic questions. Finally, we highlight where challenges remain, including issues of chemical accuracy and calculation speeds, as well as the inherent challenge of dealing with the vast size of accessible chemical reaction space.
Affiliation(s)
- Idil Ismail
  - Department of Chemistry, University of Warwick, Coventry CV4 7AL, United Kingdom
- Scott Habershon
  - Department of Chemistry, University of Warwick, Coventry CV4 7AL, United Kingdom
|
46
|
Wang X, Yao C, Zhang Y, Yu J, Qiao H, Zhang C, Wu Y, Bai R, Duan H. From theory to experiment: transformer-based generation enables rapid discovery of novel reactions. J Cheminform 2022; 14:60. [PMID: 36056425 PMCID: PMC9438336 DOI: 10.1186/s13321-022-00638-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2021] [Accepted: 08/11/2022] [Indexed: 11/10/2022]
Abstract
Deep learning methods, such as reaction prediction and retrosynthesis analysis, have demonstrated their significance in the chemical field. However, the de novo generation of novel reactions using artificial intelligence technology requires further exploration. Inspired by molecular generation, we proposed a novel task of reaction generation. Herein, Heck reactions were applied to train the transformer model, a state-of-art natural language process model, to generate 4717 reactions after sampling and processing. Then, 2253 novel Heck reactions were confirmed by organizing chemists to judge the generated reactions. More importantly, further organic synthesis experiments were performed to verify the accuracy and feasibility of representative reactions. The total process, from Heck reaction generation to experimental verification, required only 15 days, demonstrating that our model has well-learned reaction rules in-depth and can contribute to novel reaction discovery and chemical space exploration.
Collapse
Affiliation(s)
- Xinqiao Wang
- Artificial Intelligence Aided Drug Discovery Institute, College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, 310014, People's Republic of China
- Chuansheng Yao
- College of Pharmacy, School of Medicine, Hangzhou Normal University, Hangzhou, People's Republic of China; Key Laboratory of Elemene Class Anti-Cancer Chinese Medicines, Engineering Laboratory of Development and Application of Traditional Chinese Medicines, Collaborative Innovation Center of Traditional Chinese Medicines of Zhejiang Province, Hangzhou Normal University, Hangzhou, People's Republic of China
- Yun Zhang
- Artificial Intelligence Aided Drug Discovery Institute, College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, 310014, People's Republic of China
- Jiahui Yu
- Artificial Intelligence Aided Drug Discovery Institute, College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, 310014, People's Republic of China
- Haoran Qiao
- College of Mathematics and Physics, Shanghai University of Electric Power, Shanghai, 201203, People's Republic of China
- Chengyun Zhang
- Artificial Intelligence Aided Drug Discovery Institute, College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, 310014, People's Republic of China
- Yejian Wu
- Artificial Intelligence Aided Drug Discovery Institute, College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, 310014, People's Republic of China
- Renren Bai
- College of Pharmacy, School of Medicine, Hangzhou Normal University, Hangzhou, People's Republic of China; Key Laboratory of Elemene Class Anti-Cancer Chinese Medicines, Engineering Laboratory of Development and Application of Traditional Chinese Medicines, Collaborative Innovation Center of Traditional Chinese Medicines of Zhejiang Province, Hangzhou Normal University, Hangzhou, People's Republic of China
- Hongliang Duan
- Artificial Intelligence Aided Drug Discovery Institute, College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, 310014, People's Republic of China; State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica (SIMM), Chinese Academy of Sciences, Shanghai, 201203, China
|
47
|
Schleinitz J, Langevin M, Smail Y, Wehnert B, Grimaud L, Vuilleumier R. Machine Learning Yield Prediction from NiCOlit, a Small-Size Literature Data Set of Nickel Catalyzed C-O Couplings. J Am Chem Soc 2022; 144:14722-14730. [PMID: 35939717 DOI: 10.1021/jacs.2c05302] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Indexed: 11/28/2022]
Abstract
Synthetic yield prediction using machine learning is intensively studied. Previous work has focused on two categories of data sets: high-throughput experimentation data, as an ideal case study, and data sets extracted from proprietary databases, which are known to have a strong reporting bias toward high yields. However, predicting yields using published reaction data remains elusive. To fill the gap, we built a data set on nickel-catalyzed cross-couplings extracted from organic reaction publications, including scope and optimization information. We demonstrate the importance of including optimization data as a source of failed experiments and emphasize how publication constraints shape the exploration of the chemical space by the synthetic community. While machine learning models still fail to perform out-of-sample predictions, this work shows that adding chemical knowledge enables fair predictions in a low-data regime. Ultimately, we hope that this unique public database will foster further improvements of machine learning methods for reaction yield prediction in a more realistic context.
Affiliation(s)
- Jules Schleinitz
- LBM, Département de Chimie, École Normale Supérieure, PSL University, Sorbonne Université, CNRS, 75005 Paris, France
- Maxime Langevin
- PASTEUR, Département de Chimie, École Normale Supérieure, PSL University, Sorbonne Université, CNRS, 75005 Paris, France; Molecular Design Sciences–Integrated Drug Discovery, Sanofi R&D, 94400 Vitry-Sur-Seine, France
- Yanis Smail
- UPMC, PSL University, Sorbonne Université, CNRS, 75005 Paris, France
- Benjamin Wehnert
- UPMC, PSL University, Sorbonne Université, CNRS, 75005 Paris, France
- Laurence Grimaud
- LBM, Département de Chimie, École Normale Supérieure, PSL University, Sorbonne Université, CNRS, 75005 Paris, France
- Rodolphe Vuilleumier
- PASTEUR, Département de Chimie, École Normale Supérieure, PSL University, Sorbonne Université, CNRS, 75005 Paris, France
|
48
|
Asahara R, Miyao T. Extended Connectivity Fingerprints as a Chemical Reaction Representation for Enantioselective Organophosphorus-Catalyzed Asymmetric Reaction Prediction. ACS Omega 2022; 7:26952-26964. [PMID: 35936487 PMCID: PMC9352214 DOI: 10.1021/acsomega.2c03812] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 06/18/2022] [Accepted: 07/07/2022] [Indexed: 06/15/2023]
Abstract
Predicting the outcomes of organic reactions using data-driven approaches helps accelerate research. In laboratory-scale experiments, only a small amount of reaction data is available for machine learning model construction, so reaction representations play a pivotal role in the success of model construction. Nevertheless, representations have not been adequately compared for small data sets. Herein, focusing on the enantioselectivity of phosphoric-acid-catalyzed reactions, various two-dimensional and three-dimensional reaction representations (descriptors) were compared. Overall, the concatenated form of the extended connectivity fingerprints showed the best predictive capability for the two types of data sets: high-throughput experimental data and manually collected literature data sets. Furthermore, highlighting the substructure contribution to the prediction outcome was shown to be informative for guiding catalyst development.
Affiliation(s)
- Ryosuke Asahara
- Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
- Tomoyuki Miyao
- Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
- Data Science Center, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
|
49
|
Strieth-Kalthoff F, Sandfort F, Kühnemund M, Schäfer FR, Kuchen H, Glorius F. Machine Learning for Chemical Reactivity: The Importance of Failed Experiments. Angew Chem Int Ed Engl 2022; 61:e202204647. [PMID: 35512117 DOI: 10.1002/anie.202204647] [Citation(s) in RCA: 37] [Impact Index Per Article: 18.5] [Received: 03/29/2022] [Indexed: 12/27/2022]
Abstract
Assessing the outcomes of chemical reactions in a quantitative fashion has been a cornerstone across all synthetic disciplines. While this has classically been approached through empirical optimization, data-driven modelling bears enormous potential to streamline the process. However, such predictive models require significant quantities of high-quality data, the availability of which is limited. The main reasons for this include experimental errors and, importantly, human biases regarding experiment selection and result reporting. In a series of case studies, we investigate the impact of these biases on drawing general conclusions from chemical reaction data, revealing the utmost importance of "negative" examples. Finally, case studies on data expansion approaches showcase directions to circumvent these limitations and demonstrate perspectives towards a long-term data quality enhancement in chemistry.
Affiliation(s)
- Felix Strieth-Kalthoff
- Westfälische Wilhelms-Universität Münster, Organisch-Chemisches Institut, Corrensstr. 40, 48149, Münster, Germany
- Frederik Sandfort
- Westfälische Wilhelms-Universität Münster, Organisch-Chemisches Institut, Corrensstr. 40, 48149, Münster, Germany
- Marius Kühnemund
- Westfälische Wilhelms-Universität Münster, Department for Information Systems, Leonardo-Campus 3, 48149, Münster, Germany
- Felix R Schäfer
- Westfälische Wilhelms-Universität Münster, Organisch-Chemisches Institut, Corrensstr. 40, 48149, Münster, Germany
- Herbert Kuchen
- Westfälische Wilhelms-Universität Münster, Department for Information Systems, Leonardo-Campus 3, 48149, Münster, Germany
- Frank Glorius
- Westfälische Wilhelms-Universität Münster, Organisch-Chemisches Institut, Corrensstr. 40, 48149, Münster, Germany
|
50
|
Lewis‐Atwell T, Townsend PA, Grayson MN. Machine learning activation energies of chemical reactions. WIREs Computational Molecular Science 2022. [DOI: 10.1002/wcms.1593] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Indexed: 12/11/2022]
Affiliation(s)
- Toby Lewis‐Atwell
- Department of Computer Science, Faculty of Science, University of Bath, Bath, UK
- Piers A. Townsend
- Department of Chemistry, Faculty of Science, University of Bath, Bath, UK
|