1
Liu Z, Moroz YS, Isayev O. The challenge of balancing model sensitivity and robustness in predicting yields: a benchmarking study of amide coupling reactions. Chem Sci 2023; 14:10835-10846. [PMID: 37829036 PMCID: PMC10566507 DOI: 10.1039/d3sc03902a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2023] [Accepted: 09/12/2023] [Indexed: 10/14/2023] Open
Abstract
Accurate prediction of reaction yield is the holy grail for computer-assisted synthesis prediction, but current models have failed to generalize to large literature datasets. To understand the causes and inspire future design, we systematically benchmarked the yield prediction task. We carefully curated and augmented a literature dataset of 41 239 amide coupling reactions, each with information on reactants, products, intermediates, yields, and reaction contexts, and provided 3D structures for the molecules. We calculated molecular features related to 2D and 3D structure information, as well as physical and electronic properties. These descriptors were paired with 4 categories of machine learning methods (linear, kernel, ensemble, and neural network), yielding valuable benchmarks about feature and model performance. Despite the excellent performance on a high-throughput experiment (HTE) dataset (R² around 0.9), no method gave satisfactory results on the literature data. The best performance was an R² of 0.395 ± 0.020 using the stack technique. Error analysis revealed that reactivity cliffs and yield uncertainty are among the main reasons for incorrect predictions. Removing reactivity cliffs and uncertain reactions boosted the R² to 0.457 ± 0.006. These results highlight that yield prediction models must be sensitive to the reactivity change due to subtle structure variance, as well as be robust to the uncertainty associated with yield measurements.
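The abstract reports model quality as a coefficient of determination. As a reading aid, the standard definition of that metric can be sketched as follows. This is a minimal illustration and not the authors' evaluation code; the function name `r2_score` and the toy yield values are assumptions made here for the example.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's prediction.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean yield.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical yields (percent), used only to exercise the function.
observed = [10.0, 50.0, 90.0]
perfect = [10.0, 50.0, 90.0]
mean_only = [50.0, 50.0, 50.0]

print(r2_score(observed, perfect))    # a perfect model scores 1.0
print(r2_score(observed, mean_only))  # predicting the mean scores 0.0
```

Under this definition, a score of about 0.4 means the model removes roughly 40% of the squared error that a constant mean-yield prediction would incur.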
Affiliation(s)
- Zhen Liu
  - Department of Chemistry, Mellon College of Science, Carnegie Mellon University Pittsburgh PA 15213 USA
- Yurii S Moroz
  - Enamine Ltd Kyïv 02660 Ukraine
  - Chemspace LLC Kyïv 02094 Ukraine
  - Taras Shevchenko National University of Kyïv Kyïv 01601 Ukraine
- Olexandr Isayev
  - Department of Chemistry, Mellon College of Science, Carnegie Mellon University Pittsburgh PA 15213 USA
2
Drug-Induced Immune Thrombocytopenia Toxicity Prediction Based on Machine Learning. Pharmaceutics 2022; 14:943. [PMID: 35631529 PMCID: PMC9143325 DOI: 10.3390/pharmaceutics14050943] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/30/2022] [Revised: 04/20/2022] [Accepted: 04/22/2022] [Indexed: 11/29/2022] Open
Abstract
Drug-induced immune thrombocytopenia (DITP) often occurs in patients receiving many drug treatments simultaneously. However, clinicians usually fail to accurately distinguish which drugs are the plausible culprits. Despite significant advances in laboratory-based DITP testing, in vitro experimental assays remain expensive and, in certain cases, cannot provide a timely diagnosis to patients. To address these shortcomings, this paper proposes an efficient machine learning-based method for DITP toxicity prediction. A small dataset consisting of 225 molecules was constructed. The molecules were represented by six fingerprints, three descriptors, and their combinations. Seven classical machine learning-based models were examined to determine an optimal model. The results show that the RDMD + PubChem-k-NN model provides the best prediction performance among all the models, achieving an area under the curve of 76.9% and overall accuracy of 75.6% on the external validation set. The application domain (AD) analysis demonstrates the prediction reliability of the RDMD + PubChem-k-NN model. Five structural fragments related to DITP toxicity are identified through the information gain (IG) method along with fragment frequency analysis. Overall, to the best of our knowledge, this is the first machine learning-based classification model for recognizing chemicals with DITP toxicity, and it can be used as an efficient tool in drug design and clinical therapy.
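The abstract names information gain as the criterion for picking out toxicity-related fragments. The textbook form of that criterion, for a binary "fragment present" flag against class labels, can be sketched as below. This is an illustration under stated assumptions, not the paper's implementation: the function names and the toy data are hypothetical.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(has_fragment, labels):
    """Entropy of the labels minus their entropy after splitting on the fragment flag."""
    with_frag = [y for f, y in zip(has_fragment, labels) if f]
    without_frag = [y for f, y in zip(has_fragment, labels) if not f]
    n = len(labels)
    conditional = (len(with_frag) / n) * entropy(with_frag) \
        + (len(without_frag) / n) * entropy(without_frag)
    return entropy(labels) - conditional


# Hypothetical data: 1 = toxic / fragment present, 0 = non-toxic / absent.
toxic = [1, 1, 0, 0]
informative_fragment = [1, 1, 0, 0]    # presence matches the label exactly
uninformative_fragment = [1, 0, 1, 0]  # presence is unrelated to the label

print(information_gain(informative_fragment, toxic))
print(information_gain(uninformative_fragment, toxic))
```

A fragment whose presence tracks the toxicity label closely receives a high gain, while a fragment distributed evenly across both classes receives a gain near zero.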
3
Machine learning & deep learning in data-driven decision making of drug discovery and challenges in high-quality data acquisition in the pharmaceutical industry. Future Med Chem 2021; 14:245-270. [PMID: 34939433 DOI: 10.4155/fmc-2021-0243] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 01/10/2023] Open
Abstract
Predicting novel small-molecule bioactivities for target deconvolution and hit-to-lead optimization in drug discovery research requires molecular representation. Previous reports have demonstrated that machine learning (ML) and deep learning (DL) have substantial implications in virtual screening, peptide synthesis, drug ADMET screening and biomarker discovery. Given high-quality data acquisition, these strategies can increase positive outcomes in the drug discovery process while avoiding false positives, and can do so cost-effectively and in minimal time. This review discusses recent updates in AI tools as cheminformatics applications in medicinal chemistry for data-driven decision making in drug discovery, and the challenges of high-quality data acquisition in the pharmaceutical industry while improving small-molecule bioactivities and properties.
4
Kim Y, Ryu JY, Kim HU, Jang WD, Lee SY. A deep learning approach to evaluate the feasibility of enzymatic reactions generated by retrobiosynthesis. Biotechnol J 2021; 16:e2000605. [PMID: 33386776 DOI: 10.1002/biot.202000605] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/27/2020] [Accepted: 12/30/2020] [Indexed: 12/29/2022]
Abstract
Retrobiosynthesis allows the designing of novel biosynthetic pathways for the production of chemicals and materials through metabolic engineering, but generates a large number of reactions beyond the experimental feasibility. Thus, an effective method that can reduce a large number of the initially predicted enzymatic reactions has been needed. Here, we present Deep learning-based Reaction Feasibility Checker (DeepRFC) to classify the feasibility of a given enzymatic reaction with high performance and speed. DeepRFC is designed to receive Simplified Molecular-Input Line-Entry System (SMILES) strings of a reactant pair, which is defined as a substrate and a product of a reaction, as an input, and evaluates whether the input reaction is feasible. A deep neural network is selected for DeepRFC as it leads to better classification performance than five other representative machine learning methods examined. For validation, the performance of DeepRFC is compared with another in-house reaction feasibility checker that uses the concept of reaction similarity. Finally, the use of DeepRFC is demonstrated for the retrobiosynthesis-based design of novel one-carbon assimilation pathways. DeepRFC will allow retrobiosynthesis to be more practical for metabolic engineering applications by efficiently screening a large number of retrobiosynthesis-derived enzymatic reactions. DeepRFC is freely available at https://bitbucket.org/kaistsystemsbiology/deeprfc.
Affiliation(s)
- Yeji Kim
  - Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering, KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea
  - Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, KAIST, Daejeon, Republic of Korea
  - KAIST Institute for Artificial Intelligence, BioProcess Engineering Research Center and Bioinformatics Research Center, KAIST, Daejeon, Republic of Korea
- Jae Yong Ryu
  - Data Convergence Drug Research Center, Korea Research Institute of Chemical Technology, Daejeon, Republic of Korea
- Hyun Uk Kim
  - Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, KAIST, Daejeon, Republic of Korea
  - KAIST Institute for Artificial Intelligence, BioProcess Engineering Research Center and Bioinformatics Research Center, KAIST, Daejeon, Republic of Korea
  - Systems Biology and Medicine Laboratory, Department of Chemical and Biomolecular Engineering, KAIST, Daejeon, Republic of Korea
- Woo Dae Jang
  - Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering, KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea
  - Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, KAIST, Daejeon, Republic of Korea
  - KAIST Institute for Artificial Intelligence, BioProcess Engineering Research Center and Bioinformatics Research Center, KAIST, Daejeon, Republic of Korea
- Sang Yup Lee
  - Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering, KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea
  - Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, KAIST, Daejeon, Republic of Korea
  - KAIST Institute for Artificial Intelligence, BioProcess Engineering Research Center and Bioinformatics Research Center, KAIST, Daejeon, Republic of Korea
5
Gan H, Peng L, Gu FL. Mechanistic understanding of the Cu(I)-catalyzed domino reaction constructing 1-aryl-1,2,3-triazole from electron-rich aryl bromide, alkyne, and sodium azide: a DFT study. Catal Sci Technol 2021. [DOI: 10.1039/d1cy00123j] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/21/2022]
Abstract
The mechanism of the Cu(I)-catalyzed domino reaction furnishing 1-aryl-1,2,3-triazole assisted by CuI and 1,8-diazabicyclo[5.4.0]undec-7-ene (DBU) is explored with density functional theory (DFT) calculations.
Affiliation(s)
- Hanlin Gan
  - Key Laboratory of Theoretical Chemistry of Environment, Ministry of Education, School of Chemistry, South China Normal University, Guangzhou 51006
- Liang Peng
  - Key Laboratory of Theoretical Chemistry of Environment, Ministry of Education, School of Chemistry, South China Normal University, Guangzhou 51006
- Feng Long Gu
  - Key Laboratory of Theoretical Chemistry of Environment, Ministry of Education, School of Chemistry, South China Normal University, Guangzhou 51006
6
Yang Y, Zheng S, Su S, Zhao C, Xu J, Chen H. SyntaLinker: automatic fragment linking with deep conditional transformer neural networks. Chem Sci 2020; 11:8312-8322. [PMID: 34123096 PMCID: PMC8163338 DOI: 10.1039/d0sc03126g] [Citation(s) in RCA: 45] [Impact Index Per Article: 11.3] [Received: 06/04/2020] [Accepted: 07/21/2020] [Indexed: 12/18/2022] Open
Abstract
Linking fragments to generate a focused compound library for a specific drug target is one of the challenges in fragment-based drug design (FBDD). Here, we propose a new program named SyntaLinker, which is based on a syntactic pattern recognition approach using deep conditional transformer neural networks. This state-of-the-art transformer can link molecular fragments automatically by learning from the structures in medicinal chemistry databases (e.g. the ChEMBL database). Conventionally, linking molecular fragments was viewed as connecting substructures that were predefined by empirical rules. In SyntaLinker, however, the rules of linking fragments can be learned implicitly from known chemical structures by recognizing syntactic patterns embedded in SMILES notations. With deep conditional transformer neural networks, SyntaLinker can generate molecular structures based on a given pair of fragments and additional restrictions. Case studies have demonstrated the advantages and usefulness of SyntaLinker in FBDD.
Affiliation(s)
- Yuyao Yang
  - Research Center for Drug Discovery, School of Pharmaceutical Sciences, Sun Yat-Sen University, 132 East Circle at University City Guangzhou 510006 China
  - Center of Chemistry and Chemical Biology, Guangzhou Regenerative Medicine and Health Guangdong Laboratory Guangzhou 510530 China
- Shuangjia Zheng
  - Research Center for Drug Discovery, School of Pharmaceutical Sciences, Sun Yat-Sen University, 132 East Circle at University City Guangzhou 510006 China
- Shimin Su
  - Research Center for Drug Discovery, School of Pharmaceutical Sciences, Sun Yat-Sen University, 132 East Circle at University City Guangzhou 510006 China
  - Center of Chemistry and Chemical Biology, Guangzhou Regenerative Medicine and Health Guangdong Laboratory Guangzhou 510530 China
- Chao Zhao
  - Research Center for Drug Discovery, School of Pharmaceutical Sciences, Sun Yat-Sen University, 132 East Circle at University City Guangzhou 510006 China
- Jun Xu
  - Research Center for Drug Discovery, School of Pharmaceutical Sciences, Sun Yat-Sen University, 132 East Circle at University City Guangzhou 510006 China
- Hongming Chen
  - Center of Chemistry and Chemical Biology, Guangzhou Regenerative Medicine and Health Guangdong Laboratory Guangzhou 510530 China