1
Ue T, Sato A, Miyao T. Analog Accessibility Score (AAscore) for Rational Compound Selection. J Chem Inf Model 2024; 64:9350-9360. [PMID: 39639743 DOI: 10.1021/acs.jcim.4c01691] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/07/2024]
Abstract
Various in silico scores have been proposed to objectively assess the characteristics and properties of a compound. However, there is still no score that represents the analog accessibility of a compound. Such a score would be valuable for selecting compounds proposed by virtual screening or for prioritizing hit compounds for the hit-to-lead phase. This study proposes an analog accessibility score (AAscore), where retrosynthesis prediction and forward product prediction models were utilized to generate virtual analogs. The AAscore is defined as the number of unique analogs and virtual synthetic routes. To evaluate the AAscore in terms of the number of actually synthesized analog compounds, analog compounds were prepared by using the compound-core relationship (CCR) method. It was found that the AAscore was only weakly correlated with the number of CCR-based analogs. Furthermore, AAscores were found to be significantly influenced by the number of extracted candidate reactants from a reactant database. A case study targeting compounds active against carbonic anhydrase 2 showed that the AAscore could identify compounds that were synthesized into analogs.
Affiliation(s)
- Takato Ue
  - Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
- Akinori Sato
  - Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
  - Data Science Center, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
- Tomoyuki Miyao
  - Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
  - Data Science Center, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
2
Yang X, Chen N, Yu H, Liu X, Feng Y, Xing D, Tian Y. Applying machine learning and genetic algorithms accelerated for optimizing ethanol production. Sci Total Environ 2024; 955:177027. [PMID: 39437908 DOI: 10.1016/j.scitotenv.2024.177027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2023] [Revised: 09/28/2024] [Accepted: 10/16/2024] [Indexed: 10/25/2024]
Abstract
Corn straws can produce bioethanol via simultaneous saccharification and co-fermentation (SSCF). However, identifying optimal combinations of operating parameters from numerous possibilities through a cost-effective strategy to improve SSCF efficiency and yield remains challenging. The eXtreme Gradient Boost (XGB) and deep neural network (DNN) models were constructed to accurately predict ethanol yield from only five input variables, achieving >83 % accuracy. Subsequently, the XGB and the DNN models were merged with the genetic algorithm (GA) as the new optimization strategies. Experimental validation showed that the new strategy optimizes the efficiency and yield of the SSCF ethanol production system quickly and accurately. Moreover, the potential optimization mechanism was investigated through the comprehensive interpretability analysis for XGB and the microbial ecology analysis. Among the five input variables, enzyme solution volume (61.7 %) dominated, followed by time (12.9 %), substrate concentration (10.4 %), temperature (7.7 %), and inoculum volume (7.3 %). This efficient and accurate algorithm design strategy can significantly reduce the time required to optimize biochemical systems.
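The coupling of a predictive model with a genetic algorithm described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `surrogate_yield` is a hypothetical stand-in for a trained XGB or DNN regressor, the five parameters are scaled to the unit interval, and the selection, crossover, and mutation settings are arbitrary choices rather than the authors' configuration.

```python
import random

# Hypothetical normalized bounds for five operating parameters
# (enzyme volume, time, substrate concentration, temperature, inoculum volume).
BOUNDS = [(0.0, 1.0)] * 5


def surrogate_yield(x):
    """Stand-in for a trained regressor; peaks at 0.6 in every dimension."""
    return -sum((xi - 0.6) ** 2 for xi in x)


def genetic_search(fitness, bounds, pop_size=30, generations=40,
                   mutation_rate=0.2, seed=0):
    """Maximize `fitness` with a simple elitist genetic algorithm."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        best_history.append(fitness(pop[0]))
        elite = pop[: pop_size // 5]  # keep the top 20% unchanged (elitism)
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            # Uniform crossover: each gene comes from one of the two parents.
            child = [rng.choice(pair) for pair in zip(p1, p2)]
            # Gaussian mutation, clipped to the allowed range.
            for i, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_rate:
                    step = rng.gauss(0.0, 0.1 * (hi - lo))
                    child[i] = min(hi, max(lo, child[i] + step))
            children.append(child)
        pop = elite + children
    pop.sort(key=fitness, reverse=True)
    best_history.append(fitness(pop[0]))
    return pop[0], best_history


best, history = genetic_search(surrogate_yield, BOUNDS)
```

Because the elite individuals are carried over unchanged, the best predicted value never decreases between generations; in practice the top candidates would still need experimental validation, as the abstract notes.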
Affiliation(s)
- Xu Yang
  - School of Resource and Environment, Northeast Agriculture University, Harbin 150030, PR China
- Nianhua Chen
  - School of Resource and Environment, Northeast Agriculture University, Harbin 150030, PR China
- Hui Yu
  - School of Resource and Environment, Northeast Agriculture University, Harbin 150030, PR China
- Xinyue Liu
  - School of Resource and Environment, Northeast Agriculture University, Harbin 150030, PR China
- Yujie Feng
  - State Key Laboratory of Urban Water Resource and Environment, Harbin Institute of Technology, No.73 Huanghe Road, Nangang District, Harbin 150090, PR China
- Defeng Xing
  - State Key Laboratory of Urban Water Resource and Environment, Harbin Institute of Technology, No.73 Huanghe Road, Nangang District, Harbin 150090, PR China
- Yushi Tian
  - School of Resource and Environment, Northeast Agriculture University, Harbin 150030, PR China
3
Gao Y, Zhang X, Sun Z, Chandak P, Bu J, Wang H. Precision Adverse Drug Reactions Prediction with Heterogeneous Graph Neural Network. Adv Sci (Weinh) 2024:e2404671. [PMID: 39630592 DOI: 10.1002/advs.202404671] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2024] [Revised: 06/11/2024] [Indexed: 12/07/2024]
Abstract
Accurate prediction of Adverse Drug Reactions (ADRs) at the patient level is essential for ensuring patient safety and optimizing healthcare outcomes. Traditional machine learning-based methods primarily focus on predicting potential ADRs for drugs, but they often fall short of capturing the complexity of individual demographics and the variations in ADRs experienced by different people. In this study, a novel framework called Precise Adverse Drug Reaction (PreciseADR) for patient-level ADR prediction is proposed. The approach effectively integrates relations between patients and ADRs, and harnesses the power of heterogeneous Graph Neural Networks (GNNs) to address the limitations of traditional methods. Specifically, a heterogeneous graph representation of patients is constructed, encompassing nodes that represent patients, diseases, drugs, and ADRs. By leveraging edges in the graph, crucial connections are captured such as a patient being affected by diseases, taking specific drugs, and experiencing ADRs. Next, a GNN-based model is utilized to learn latent representations of the patient nodes and facilitate the propagation of information throughout the graph structure. By employing patient embeddings that consider their diseases and drugs, potential ADRs can be accurately predicted. The PreciseADR is dedicated to effectively capturing both local and global dependencies within the heterogeneous graph, allowing for the identification of subtle patterns and interactions that play a significant role in ADRs. To evaluate the performance of the approach, extensive experiments are conducted on a large-scale real-world healthcare dataset with adverse reports from the FDA Adverse Event Reporting System (FAERS). Experimental results demonstrate that the PreciseADR achieves superior predictive performance in identifying patient-level ADRs, surpassing the strongest baseline by 3.2% in AUC score and by 4.9% in Hit@10.
Affiliation(s)
- Yang Gao
  - Department of Hepatobiliary and Pancreatic Surgery, The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, 310009, China
  - College of Computer Science, Zhejiang University, Hangzhou, 310058, China
- Xiang Zhang
  - Department of Computer Science, The University of North Carolina at Charlotte, Charlotte, NC, 28223-0001, USA
- Zhongquan Sun
  - Department of Hepatobiliary and Pancreatic Surgery, The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, 310009, China
- Payal Chandak
  - Harvard-MIT Health Sciences and Technology, Cambridge, MA, 02139, USA
- Jiajun Bu
  - College of Computer Science, Zhejiang University, Hangzhou, 310058, China
- Haishuai Wang
  - Department of Hepatobiliary and Pancreatic Surgery, The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, 310009, China
  - College of Computer Science, Zhejiang University, Hangzhou, 310058, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
4
Hu X, Chen Z, Peng B, Adu-Ampratwum D, Ning X. log-RRIM: Yield Prediction via Local-to-global Reaction Representation Learning and Interaction Modeling. arXiv 2024:arXiv:2411.03320v3. [PMID: 39606718 PMCID: PMC11601803] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2024]
Abstract
Accurate prediction of chemical reaction yields is crucial for optimizing organic synthesis, potentially reducing time and resources spent on experimentation. With the rise of artificial intelligence (AI), there is growing interest in leveraging AI-based methods to accelerate yield predictions without conducting in vitro experiments. We present log-RRIM, an innovative graph transformer-based framework designed for predicting chemical reaction yields. Our approach implements a unique local-to-global reaction representation learning strategy. This approach initially captures detailed molecule-level information and then models and aggregates intermolecular interactions, ensuring that the impact of molecular fragments of varying sizes on yield is accurately accounted for. Another key feature of log-RRIM is its integration of a cross-attention mechanism that focuses on the interplay between reagents and reaction centers. This design reflects a fundamental principle in chemical reactions: the crucial role of reagents in influencing bond-breaking and formation processes, which ultimately affect reaction yields. log-RRIM outperforms existing methods in our experiments, especially for medium to high-yielding reactions, proving its reliability as a predictor. Its advanced modeling of reactant-reagent interactions and sensitivity to small molecular fragments make it a valuable tool for reaction planning and optimization in chemical synthesis. The data and codes of log-RRIM are accessible through https://github.com/ninglab/YieldlogRRIM.
Affiliation(s)
- Xiao Hu
  - Computer Science and Engineering, The Ohio State University, Columbus, OH 43210
- Ziqi Chen
  - Computer Science and Engineering, The Ohio State University, Columbus, OH 43210
- Bo Peng
  - Computer Science and Engineering, The Ohio State University, Columbus, OH 43210
- Daniel Adu-Ampratwum
  - Division of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, The Ohio State University, Columbus, Ohio 43210
- Xia Ning
  - Computer Science and Engineering, The Ohio State University, Columbus, OH 43210
  - Division of Medicinal Chemistry and Pharmacognosy, College of Pharmacy, The Ohio State University, Columbus, Ohio 43210
  - Biomedical Informatics, The Ohio State University, Columbus, OH 43210
  - Translational Data Analytics Institute, The Ohio State University, Columbus, OH, 43210
5
Joung JF, Fong MH, Roh J, Tu Z, Bradshaw J, Coley CW. Reproducing Reaction Mechanisms with Machine-Learning Models Trained on a Large-Scale Mechanistic Dataset. Angew Chem Int Ed Engl 2024; 63:e202411296. [PMID: 38995205 DOI: 10.1002/anie.202411296] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2024] [Revised: 07/11/2024] [Accepted: 07/12/2024] [Indexed: 07/13/2024]
Abstract
Mechanistic understanding of organic reactions can facilitate reaction development, impurity prediction, and in principle, reaction discovery. While several machine learning models have sought to address the task of predicting reaction products, their extension to predicting reaction mechanisms has been impeded by the lack of a corresponding mechanistic dataset. In this study, we construct such a dataset by imputing intermediates between experimentally reported reactants and products using expert reaction templates and train several machine learning models on the resulting dataset of 5,184,184 elementary steps. We explore the performance and capabilities of these models, focusing on their ability to predict reaction pathways and recapitulate the roles of catalysts and reagents. Additionally, we demonstrate the potential of mechanistic models in predicting impurities, often overlooked by conventional models. We conclude by evaluating the generalizability of mechanistic models to new reaction types, revealing challenges related to dataset diversity, consecutive predictions, and violations of atom conservation.
Affiliation(s)
- Joonyoung F Joung
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, 02139, United States
- Mun Hong Fong
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, 02139, United States
- Jihye Roh
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, 02139, United States
- Zhengkai Tu
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, 02139, United States
- John Bradshaw
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, 02139, United States
- Connor W Coley
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, 02139, United States
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, 02139, United States
6
Chen LY, Li YP. Machine learning-guided strategies for reaction conditions design and optimization. Beilstein J Org Chem 2024; 20:2476-2492. [PMID: 39376489 PMCID: PMC11457048 DOI: 10.3762/bjoc.20.212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2024] [Accepted: 09/19/2024] [Indexed: 10/09/2024] [Open Access]
Abstract
This review surveys the recent advances and challenges in predicting and optimizing reaction conditions using machine learning techniques. The paper emphasizes the importance of acquiring and processing large and diverse datasets of chemical reactions, and the use of both global and local models to guide the design of synthetic processes. Global models exploit the information from comprehensive databases to suggest general reaction conditions for new reactions, while local models fine-tune the specific parameters for a given reaction family to improve yield and selectivity. The paper also identifies the current limitations and opportunities in this field, such as the data quality and availability, and the integration of high-throughput experimentation. The paper demonstrates how the combination of chemical engineering, data science, and ML algorithms can enhance the efficiency and effectiveness of reaction conditions design, and enable novel discoveries in synthetic chemistry.
Affiliation(s)
- Lung-Yi Chen
  - Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan
- Yi-Pei Li
  - Department of Chemical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan
  - Taiwan International Graduate Program on Sustainable Chemical Science and Technology (TIGP-SCST), No. 128, Sec. 2, Academia Road, Taipei 11529, Taiwan
7
Hoque A, Surve M, Kalyanakrishnan S, Sunoj RB. Reinforcement Learning for Improving Chemical Reaction Performance. J Am Chem Soc 2024. [PMID: 39356950 DOI: 10.1021/jacs.4c08866] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/04/2024]
Abstract
Deep learning (DL) methods have gained notable prominence in predictive and generative tasks in molecular space. However, their application in chemical reactions remains grossly underutilized. Chemical reactions are intrinsically complex: typically involving multiple molecules besides bond-breaking/forming events. In reaction discovery, one aims to maximize yield and/or selectivity that depends on a number of factors, mostly centered on reacting partners and reaction conditions. Herein, we introduce RE-EXPLORE, a novel approach that integrates deep reinforcement learning (RL) with an RNN-based deep generative model to identify prospective new reactants/catalysts, whose yield/selectivity is estimated using a pretrained regressor. Three chemical databases (ChEMBL, ZINC, and COCONUT containing half a million to one million unlabeled molecules) are independently used for pretraining the generators to enrich them with valuable information from diverse chemical space. Standard RL methods are found to be insufficient, as learners tend to prioritize exploitation for immediate gains, resulting in repetitive generation of same/similar molecules. Our engineered reward function includes a Tanimoto-based uniqueness factor within the RL loop that improved the exploration of the environment and has helped accrue larger returns. Integration of a user-defined core fragment into the generated molecules facilitated learning of specific reaction types. Together, RE-EXPLORE can navigate the reaction space toward practically meaningful regions and offers notable improvements across the three distinct reaction types considered in this study. It identifies high-yielding substrates and highly enantioselective chiral catalysts. This RL-based approach has the potential to expedite reaction discovery and aid in the synthesis planning of important compounds, including drugs and pharmaceuticals.
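To make the idea of a Tanimoto-based uniqueness factor concrete, here is a minimal sketch. It is not the reward function from the paper, whose exact form the abstract does not give. Fingerprints are represented as plain sets of "on" bit indices, and `uniqueness_factor` and `shaped_reward` are hypothetical names and formulas chosen only to show how similarity to previously generated molecules could scale down a predicted score.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0  # two empty fingerprints are treated as identical
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def uniqueness_factor(candidate, history):
    """Hypothetical factor: 1.0 for a novel molecule, 0.0 for an exact repeat."""
    if not history:
        return 1.0
    return 1.0 - max(tanimoto(candidate, seen) for seen in history)


def shaped_reward(predicted_score, candidate, history):
    """Scale a regressor's predicted yield/selectivity by the uniqueness factor."""
    return predicted_score * uniqueness_factor(candidate, history)


# Example with toy fingerprints (bit indices are arbitrary).
seen = [frozenset({1, 2, 3})]
repeat = frozenset({1, 2, 3})
variant = frozenset({2, 3, 4})
reward_repeat = shaped_reward(0.9, repeat, seen)    # exact repeat earns nothing
reward_variant = shaped_reward(0.9, variant, seen)  # partial overlap earns part of the score
```

Under this assumed formula a generator that keeps emitting the same or near-identical molecules sees its reward shrink, which is the qualitative behaviour the abstract attributes to the uniqueness term.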
Affiliation(s)
- Ajnabiul Hoque
  - Department of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
- Mihir Surve
  - Department of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
- Shivaram Kalyanakrishnan
  - Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
- Raghavan B Sunoj
  - Department of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
  - Center for Machine Intelligence and Data Science (CMInDS), Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
8
Sato A, Asahara R, Miyao T. Chemical Graph-Based Transformer Models for Yield Prediction of High-Throughput Cross-Coupling Reaction Datasets. ACS Omega 2024; 9:40907-40919. [PMID: 39372005 PMCID: PMC11447720 DOI: 10.1021/acsomega.4c06113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2024] [Revised: 08/28/2024] [Accepted: 09/03/2024] [Indexed: 10/08/2024]
Abstract
The chemical reaction yield is an important factor to determine the reaction conditions. Recently, many data-driven models for yield prediction using high-throughput experimentation datasets have been reported. In this study, we propose a neural network architecture based on the chemical graphs of the reaction components to predict the reaction yield. The proposed model is the sequential combination of a message-passing neural network and a transformer encoder (MPNN-Transformer). The reaction components are converted to molecular matrices by the first network, followed by the interplay of the reaction components in the second network after adding the embeddings of the compound roles in the chemical reaction. The predictive ability of the proposed models was compared with state-of-the-art yield prediction models using two high-throughput experimental datasets: the Buchwald-Hartwig cross-coupling (BHC) and Suzuki-Miyaura cross-coupling (SMC) reaction datasets. Overall, the MPNN-Transformer models showed high prediction accuracy for the BHC reaction datasets and some of the extrapolation-oriented SMC reaction datasets. These models also performed well when the training dataset size was relatively large. Furthermore, analyzing the poorly predicted reactions for the BHC reaction dataset revealed a limitation of the data-driven yield prediction approach based on the chemical structural similarity.
Affiliation(s)
- Akinori Sato
  - Data Science Center, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
  - Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
- Ryosuke Asahara
  - Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
- Tomoyuki Miyao
  - Data Science Center, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
  - Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
9
Li A, Casiraghi E, Rousu J. Chemical reaction enhanced graph learning for molecule representation. Bioinformatics 2024; 40:btae558. [PMID: 39271156 PMCID: PMC11639130 DOI: 10.1093/bioinformatics/btae558] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2024] [Revised: 08/28/2024] [Accepted: 09/11/2024] [Indexed: 09/15/2024] [Open Access]
Abstract
MOTIVATION Molecular representation learning (MRL) models molecules with low-dimensional vectors to support biological and chemical applications. Current methods primarily rely on intrinsic molecular information to learn molecular representations, but they often overlook effectively integrating domain knowledge into MRL. RESULTS In this article, we develop a reaction-enhanced graph learning (RXGL) framework for MRL, utilizing chemical reactions as domain knowledge. RXGL introduces dual graph learning modules to model molecule representation. One module employs graph convolutions on molecular graphs to capture molecule structures. The other module constructs a reaction-aware graph from chemical reactions and designs a novel graph attention network on this graph to integrate reaction-level relations into molecular modeling. To refine molecule representations, we design a reaction-based relation learning task, which considers the relations between the reactant and product sides in reactions. In addition, we introduce a cross-view contrastive task to strengthen the cooperative associations between molecular and reaction-aware graph learning. Experiment results show that our RXGL achieves strong performance in various downstream tasks, including product prediction, reaction classification, and molecular property prediction. AVAILABILITY AND IMPLEMENTATION The code is publicly available at https://github.com/coder-ACAC/RLM.
Affiliation(s)
- Anchen Li
  - Department of Computer Science, Aalto University, Espoo, 02150, Finland
- Elena Casiraghi
  - Department of Computer Science, Aalto University, Espoo, 02150, Finland
  - AnacletoLab, Dipartimento di Informatica "Giovanni degli Antoni", University of Milan, Milan, 20133, Italy
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, United States
  - ELLIS, European Laboratory for Learning and Intelligent Systems, Milan Unit (University of Milan), Milan, 20133, Italy
- Juho Rousu
  - Department of Computer Science, Aalto University, Espoo, 02150, Finland
10
Pang J, Vulić I. Specialising and analysing instruction-tuned and byte-level language models for organic reaction prediction. Faraday Discuss 2024. [PMID: 39308330 DOI: 10.1039/d4fd00104d] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/11/2024]
Abstract
Transformer-based encoder-decoder models have demonstrated impressive results in chemical reaction prediction tasks. However, these models typically rely on pretraining using tens of millions of unlabelled molecules, which can be time-consuming and GPU-intensive. One of the central questions we aim to answer in this work is: can FlanT5 and ByT5, the encoder-decoder models pretrained solely on language data, be effectively specialised for organic reaction prediction through task-specific fine-tuning? We conduct a systematic empirical study on several key issues of the process, including tokenisation, the impact of (SMILES-oriented) pretraining, fine-tuning sample efficiency, and decoding algorithms at inference. Our key findings indicate that although being pretrained only on language tasks, FlanT5 and ByT5 provide a solid foundation to fine-tune for reaction prediction, and thus become 'chemistry domain compatible' in the process. This suggests that GPU-intensive and expensive pretraining on a large dataset of unlabelled molecules may be useful yet not essential, to leverage the power of language models for chemistry. All our models achieve comparable Top-1 and Top-5 accuracy although some variation across different models does exist. Notably, tokenisation and vocabulary trimming slightly affect final performance but can speed up training and inference; the most efficient greedy decoding strategy is very competitive while only marginal gains can be achieved from more sophisticated decoding algorithms. In summary, we evaluate FlanT5 and ByT5 across several dimensions and benchmark their impact on organic reaction prediction, which may guide more effective use of these state-of-the-art language models for chemistry-related tasks in the future.
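The Top-1 and Top-5 figures mentioned in this abstract are top-k accuracies over ranked predictions. A minimal sketch of that metric follows; the function name and the toy SMILES strings are illustrative assumptions, and the exact-match comparison presumes that predictions and targets were canonicalized the same way beforehand, which this snippet does not do.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of reactions whose target appears among the first k predictions.

    ranked_predictions: one list of candidate product strings per reaction,
                        ordered from most to least likely.
    targets:            the reference product string for each reaction.
    """
    if len(ranked_predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    hits = sum(1 for preds, target in zip(ranked_predictions, targets)
               if target in preds[:k])
    return hits / len(targets)


# Toy example: three reactions, two ranked candidates each.
predictions = [["CCO", "CC=O"], ["CCN", "CCC"], ["C", "O"]]
references = ["CCO", "CCC", "N"]
top1 = top_k_accuracy(predictions, references, k=1)
top2 = top_k_accuracy(predictions, references, k=2)
```

Greedy decoding yields a single candidate per reaction, so only Top-1 can be computed from it; beam search or sampling is needed to obtain the longer ranked lists used for Top-5.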
Affiliation(s)
- Jiayun Pang
  - School of Science, Faculty of Engineering and Science, University of Greenwich, Medway Campus, Central Avenue, Chatham Maritime, ME4 3RL, UK
- Ivan Vulić
  - Language Technology Lab, University of Cambridge, 9 West Road, Cambridge CB3 9DA, UK
11
Ekins S, Lane TR, Urbina F, Puhl AC. In silico ADME/tox comes of age: twenty years later. Xenobiotica 2024; 54:352-358. [PMID: 37539466 PMCID: PMC10850432 DOI: 10.1080/00498254.2023.2245049] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/22/2023] [Revised: 08/01/2023] [Accepted: 08/02/2023] [Indexed: 08/05/2023]
Abstract
In the early 2000s pharmaceutical drug discovery was beginning to use computational approaches for absorption, distribution, metabolism, excretion and toxicity (ADME/Tox, also known as ADMET) prediction. This emphasis on prediction was an effort to reduce the risk of later stage failures from ADME/Tox. Much has been written in the intervening twenty plus years and significant expenditure has occurred in companies developing these in silico capabilities which can be gleaned from publications. It is therefore an appropriate time to briefly reflect on what was proposed then and what the reality is today. Twenty years ago, we tended to optimise bioactivity and perhaps one ADME/Tox property at a time. Previously pharmaceutical companies needed a whole infrastructure for models: in silico and in vitro experts, IT, champions on a project team, educators and management support. Now we are in the age of generative de novo design where bioactivity and many ADME/Tox properties can be optimised and large language model technologies are available. There are also some challenges such as the focus on very large molecules which may be outside of current ADME/Tox models. We provide an opportunity to look forward with the increasing public data for ADME/Tox as well as expanded types of algorithms available.
Affiliation(s)
- Sean Ekins
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC 27606, USA
- Thomas R. Lane
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC 27606, USA
- Fabio Urbina
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC 27606, USA
- Ana C. Puhl
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC 27606, USA
12
Keto A, Guo T, Underdue M, Stuyver T, Coley CW, Zhang X, Krenske EH, Wiest O. Data-Efficient, Chemistry-Aware Machine Learning Predictions of Diels-Alder Reaction Outcomes. J Am Chem Soc 2024; 146:16052-16061. [PMID: 38822795 DOI: 10.1021/jacs.4c03131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/03/2024]
Abstract
The application of machine learning models to the prediction of reaction outcomes currently needs large and/or highly featurized data sets. We show that a chemistry-aware model, NERF, which mimics the bonding changes that occur during reactions, allows for highly accurate predictions of the outcomes of Diels-Alder reactions using a relatively small training set, with no pretraining and no additional features. We establish a diverse data set of 9537 intramolecular, hetero-, aromatic, and inverse electron demand Diels-Alder reactions. This data set is used to train a NERF model, and the performance is compared against state-of-the-art classification and generative machine learning models across low- and high-data regimes, with and without pretraining. The predictive accuracy (regio- and site selectivity in the major product) achieved by NERF exceeds 90% when as little as 40% of the data set is used for training. Another high-performing model, Chemformer, requires a larger training data set (>45%) and pretraining to reach 90% Top-1 accuracy. Accurate predictions of less-represented reaction subclasses, such as those involving heteroatomic or aromatic substrates, require higher percentages of training data. We also show how NERF can use small amounts of additional training data to quickly learn new systems and improve its overall understanding of reactivity. Synthetic chemists stand to benefit as this model can be rapidly expanded and tailored to areas of chemistry corresponding to the low-data regime.
Affiliation(s)
- Angus Keto
  - School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD 4072, Australia
- Taicheng Guo
  - Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, Indiana 46556, United States
- Morgan Underdue
  - Department of Chemistry and Biochemistry, University of Notre Dame, Notre Dame, Indiana 46556, United States
- Thijs Stuyver
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Connor W Coley
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Xiangliang Zhang
  - Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, Indiana 46556, United States
- Elizabeth H Krenske
  - School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD 4072, Australia
- Olaf Wiest
  - Department of Chemistry and Biochemistry, University of Notre Dame, Notre Dame, Indiana 46556, United States
13
Telenti A, Auli M, Hie BL, Maher C, Saria S, Ioannidis JPA. Large language models for science and medicine. Eur J Clin Invest 2024; 54:e14183. [PMID: 38381530 DOI: 10.1111/eci.14183] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2023] [Revised: 02/06/2024] [Accepted: 02/10/2024] [Indexed: 02/23/2024]
Abstract
Large language models (LLMs) are a type of machine learning model that learn statistical patterns over text, such as predicting the next words in a sequence of text. Both general purpose and task-specific LLMs have demonstrated potential across diverse applications. Science and medicine have many data types that are highly suitable for LLMs, such as scientific texts (publications, patents and textbooks), electronic medical records, large databases of DNA and protein sequences and chemical compounds. Carefully validated systems that can understand and reason across all these modalities may maximize benefits. Despite the inevitable limitations and caveats of any new technology and some uncertainties specific to LLMs, LLMs have the potential to be transformative in science and medicine.
Affiliation(s)
- Amalio Telenti
  - Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, California, USA
  - Vir Biotechnology, Inc., San Francisco, California, USA
- Brian L Hie
  - FAIR, Meta, Menlo Park, California, USA
  - Department of Chemical Engineering, Stanford University, Stanford, California, USA
- Cyrus Maher
  - Vir Biotechnology, Inc., San Francisco, California, USA
- Suchi Saria
  - Malone Center for Engineering and Healthcare, Johns Hopkins University, Baltimore, Maryland, USA
- John P A Ioannidis
  - Department of Medicine, Stanford University, Stanford, California, USA
  - Department of Epidemiology and Population Health, Stanford University, Stanford, California, USA
  - Department of Biomedical Data Science, Stanford University, Stanford, California, USA
  - Department of Statistics, Stanford University, Stanford, California, USA
  - Meta-Research Innovation Center at Stanford (METRICS), Stanford University, Stanford, California, USA
14
Das M, Ghosh A, Sunoj RB. Advances in machine learning with chemical language models in molecular property and reaction outcome predictions. J Comput Chem 2024; 45:1160-1176. [PMID: 38299229 DOI: 10.1002/jcc.27315] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2023] [Revised: 01/06/2024] [Accepted: 01/09/2024] [Indexed: 02/02/2024]
Abstract
Molecular properties and reactions form the foundation of chemical space. Over the years, innumerable molecules have been synthesized; a smaller fraction of them found immediate applications, while a larger proportion served as a testimony to the creative and empirical nature of the domain of chemical science. With increasing emphasis on sustainable practices, it is desirable that a target set of molecules be synthesized through fewer empirical attempts, rather than through a larger library, to realize an active candidate. On this front, predictive endeavors using machine learning (ML) models built on available data acquire timely significance. Prediction of molecular properties and reaction outcomes remains one of the burgeoning applications of ML in chemical science. Among several methods of encoding molecular samples for ML models, the ones that employ language-like representations are gaining steady popularity. Such representations would additionally help adopt well-developed natural language processing (NLP) models for chemical applications. Given this advantageous background, herein we describe several successful chemical applications of NLP focusing on molecular property and reaction outcome predictions. From relatively simple recurrent neural networks (RNNs) to complex models like transformers, different network architectures have been leveraged for tasks such as de novo drug design, catalyst generation, and forward and retrosynthesis predictions. The chemical language model (CLM) provides promising avenues toward a broad range of applications in a time- and cost-effective manner. While we showcase an optimistic outlook for CLMs, attention is also paid to the persisting challenges in the reaction domain, which could be addressed by advanced algorithms tailored to chemical language and by the increased availability of high-quality datasets.
Affiliation(s)
- Manajit Das
  - Department of Chemistry, Indian Institute of Technology Bombay, Mumbai, India
- Ankit Ghosh
  - Department of Chemistry, Indian Institute of Technology Bombay, Mumbai, India
- Raghavan B Sunoj
  - Department of Chemistry, Indian Institute of Technology Bombay, Mumbai, India
  - Centre for Machine Intelligence and Data Science, Indian Institute of Technology Bombay, Mumbai, India
15
Kotlyarov R, Papachristos K, Wood GPF, Goodman JM. Leveraging Language Model Multitasking To Predict C-H Borylation Selectivity. J Chem Inf Model 2024; 64:4286-4297. [PMID: 38708520 PMCID: PMC11134489 DOI: 10.1021/acs.jcim.4c00137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2024] [Revised: 04/05/2024] [Accepted: 04/23/2024] [Indexed: 05/07/2024]
Abstract
C-H borylation is a high-value transformation in the synthesis of lead candidates for the pharmaceutical industry because a wide array of downstream coupling reactions is available. However, predicting its regioselectivity, especially in drug-like molecules that may contain multiple heterocycles, is not a trivial task. Using a data set of borylation reactions from Reaxys, we explored how a language model originally trained on USPTO_500_MT, a broad-scope set of patent data, can be used to predict the C-H borylation reaction product in different modes: product generation and site reactivity classification. Our fine-tuned T5Chem multitask language model can generate the correct product in 79% of cases. It can also classify the reactive aromatic C-H bonds with 95% accuracy and 88% positive predictive value, exceeding purpose-developed graph-based neural networks.
Affiliation(s)
- Ruslan Kotlyarov
  - Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K.
- Geoffrey P. F. Wood
  - Exscientia Plc, The Schrödinger Building, Oxford Science Park, Oxford OX4 4GE, U.K.
- Jonathan M. Goodman
  - Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K.
16
Oniani D, Hilsman J, Zang C, Wang J, Cai L, Zawala J, Wang Y. Emerging opportunities of using large language models for translation between drug molecules and indications. Sci Rep 2024; 14:10738. [PMID: 38730226 PMCID: PMC11087469 DOI: 10.1038/s41598-024-61124-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2024] [Accepted: 05/02/2024] [Indexed: 05/12/2024] Open
Abstract
A drug molecule is a substance that changes an organism's mental or physical state. Every approved drug has an indication, which refers to the therapeutic use of that drug for treating a particular medical condition. While the Large Language Model (LLM), a generative Artificial Intelligence (AI) technique, has recently demonstrated effectiveness in translating between molecules and their textual descriptions, there remains a gap in research regarding their application in facilitating the translation between drug molecules and indications (which describes the disease, condition or symptoms for which the drug is used), or vice versa. Addressing this challenge could greatly benefit the drug discovery process. The capability of generating a drug from a given indication would allow for the discovery of drugs targeting specific diseases or targets and ultimately provide patients with better treatments. In this paper, we first propose a new task, the translation between drug molecules and corresponding indications, and then test existing LLMs on this new task. Specifically, we consider nine variations of the T5 LLM and evaluate them on two public datasets obtained from ChEMBL and DrugBank. Our experiments show the early results of using LLMs for this task and provide a perspective on the state-of-the-art. We also emphasize the current limitations and discuss future work that has the potential to improve the performance on this task. The creation of molecules from indications, or vice versa, will allow for more efficient targeting of diseases and significantly reduce the cost of drug discovery, with the potential to revolutionize the field of drug discovery in the era of generative AI.
Affiliation(s)
- David Oniani
  - Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, USA
- Jordan Hilsman
  - Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, USA
- Chengxi Zang
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
  - Institute of Artificial Intelligence for Digital Health, Weill Cornell Medicine, New York, NY, USA
- Junmei Wang
  - Department of Pharmaceutical Sciences, University of Pittsburgh, Pittsburgh, PA, USA
- Lianjin Cai
  - Department of Pharmaceutical Sciences, University of Pittsburgh, Pittsburgh, PA, USA
- Jan Zawala
  - Jerzy Haber Institute of Catalysis and Surface Chemistry, Polish Academy of Sciences, Kraków, Poland
- Yanshan Wang
  - Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, USA
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
  - Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA
  - Clinical and Translational Science Institute, University of Pittsburgh, Pittsburgh, PA, USA
17
Shi R, Yu G, Huo X, Yang Y. Prediction of chemical reaction yields with large-scale multi-view pre-training. J Cheminform 2024; 16:22. [PMID: 38403627 PMCID: PMC10895839 DOI: 10.1186/s13321-024-00815-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2023] [Accepted: 02/14/2024] [Indexed: 02/27/2024] Open
Abstract
Developing machine learning models with high generalization capability for predicting chemical reaction yields is of significant interest and importance. The efficacy of such models depends heavily on the representation of chemical reactions, which has commonly been learned from SMILES or graphs of molecules using deep neural networks. However, the progression of chemical reactions is inherently determined by the molecular 3D geometric properties, which have been recently highlighted as crucial features in accurately predicting molecular properties and chemical reactions. Additionally, large-scale pre-training has been shown to be essential in enhancing the generalization capability of complex deep learning models. Based on these considerations, we propose the Reaction Multi-View Pre-training (ReaMVP) framework, which leverages self-supervised learning techniques and a two-stage pre-training strategy to predict chemical reaction yields. By incorporating multi-view learning with 3D geometric information, ReaMVP achieves state-of-the-art performance on two benchmark datasets. Notably, the experimental results indicate that ReaMVP has a significant advantage in predicting out-of-sample data, suggesting an enhanced generalization ability to predict new reactions. Scientific Contribution: This study presents the ReaMVP framework, which improves the generalization capability of machine learning models for predicting chemical reaction yields. By integrating sequential and geometric views and leveraging self-supervised learning techniques with a two-stage pre-training strategy, ReaMVP achieves state-of-the-art performance on benchmark datasets. The framework demonstrates superior predictive ability for out-of-sample data and enhances the prediction of new reactions.
Affiliation(s)
- Runhan Shi
  - Department of Computer Science and Engineering, and Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
- Gufeng Yu
  - Department of Computer Science and Engineering, and Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
- Xiaohong Huo
  - Shanghai Key Laboratory for Molecular Engineering of Chiral Drugs, Frontiers Science Center for Transformative Molecules, School of Chemistry and Chemical Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
- Yang Yang
  - Department of Computer Science and Engineering, and Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
18
Qiu H, Liu L, Qiu X, Dai X, Ji X, Sun ZY. PolyNC: a natural and chemical language model for the prediction of unified polymer properties. Chem Sci 2024; 15:534-544. [PMID: 38179518 PMCID: PMC10763023 DOI: 10.1039/d3sc05079c] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2023] [Accepted: 12/04/2023] [Indexed: 01/06/2024] Open
Abstract
Language models exhibit a profound aptitude for addressing multimodal and multidomain challenges, a competency that eludes the majority of off-the-shelf machine learning models. Consequently, language models hold great potential for comprehending the intricate interplay between material compositions and diverse properties, thereby accelerating material design, particularly in the realm of polymers. While past limitations in polymer data hindered the use of data-intensive language models, the growing availability of standardized polymer data and effective data augmentation techniques now opens doors to previously uncharted territories. Here, we present a revolutionary model to enable rapid and precise prediction of Polymer properties via the power of Natural language and Chemical language (PolyNC). To showcase the efficacy of PolyNC, we have meticulously curated a labeled prompt-structure-property corpus encompassing 22 970 polymer data points on a series of essential polymer properties. Through the use of natural language prompts, PolyNC gains a comprehensive understanding of polymer properties, while employing chemical language (SMILES) to describe polymer structures. In a unified text-to-text manner, PolyNC consistently demonstrates exceptional performance on both regression tasks (such as property prediction) and the classification task (polymer classification). Simultaneous and interactive multitask learning enables PolyNC to holistically grasp the structure-property relationships of polymers. Through a combination of experiments and characterizations, the generalization ability of PolyNC has been demonstrated, with attention analysis further indicating that PolyNC effectively learns structural information about polymers from multimodal inputs. 
This work provides compelling evidence of the potential for deploying end-to-end language models in polymer research, representing a significant advancement in the AI community's dedicated pursuit of advancing polymer science.
Affiliation(s)
- Haoke Qiu
  - State Key Laboratory of Polymer Physics and Chemistry, Changchun Institute of Applied Chemistry, Chinese Academy of Sciences, Changchun 130022, China
  - School of Applied Chemistry and Engineering, University of Science and Technology of China, Hefei 230026, China
- Lunyang Liu
  - State Key Laboratory of Polymer Physics and Chemistry, Changchun Institute of Applied Chemistry, Chinese Academy of Sciences, Changchun 130022, China
- Xuepeng Qiu
  - School of Applied Chemistry and Engineering, University of Science and Technology of China, Hefei 230026, China
  - CAS Key Laboratory of High-Performance Synthetic Rubber and its Composite Materials, Changchun Institute of Applied Chemistry, Chinese Academy of Sciences, Changchun 130022, China
- Xuemin Dai
  - CAS Key Laboratory of High-Performance Synthetic Rubber and its Composite Materials, Changchun Institute of Applied Chemistry, Chinese Academy of Sciences, Changchun 130022, China
- Xiangling Ji
  - State Key Laboratory of Polymer Physics and Chemistry, Changchun Institute of Applied Chemistry, Chinese Academy of Sciences, Changchun 130022, China
  - School of Applied Chemistry and Engineering, University of Science and Technology of China, Hefei 230026, China
- Zhao-Yan Sun
  - State Key Laboratory of Polymer Physics and Chemistry, Changchun Institute of Applied Chemistry, Chinese Academy of Sciences, Changchun 130022, China
  - School of Applied Chemistry and Engineering, University of Science and Technology of China, Hefei 230026, China
19
Liu T, Cao Z, Huang Y, Wan Y, Wu J, Hsieh CY, Hou T, Kang Y. SynCluster: Reaction Type Clustering and Recommendation Framework for Synthesis Planning. JACS Au 2023; 3:3446-3461. [PMID: 38155655 PMCID: PMC10751778 DOI: 10.1021/jacsau.3c00607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2023] [Revised: 11/07/2023] [Accepted: 11/08/2023] [Indexed: 12/30/2023]
Abstract
AI-assisted synthesis planning has emerged as a valuable tool in accelerating synthetic chemistry for the discovery of new drugs and materials. The template-free approach, which showcases superior generalization capabilities, is seen as the mainstream direction in this field. However, it remains unclear whether such an end-to-end approach can achieve problem-solving performance on par with experienced chemists without fully revealing insights into the chemical mechanisms involved. Moreover, there is a lack of unified and chemically inspired frameworks for improving multitask reaction predictions in this area. In this study, we have addressed these challenges by investigating the impact of fine-grained reaction-type labels on multiple downstream tasks and propose a novel framework named SynCluster. This framework incorporates unsupervised clustering cues into the baseline models and identifies plausible chemical subspaces which is compatible with multitask extensions and can serve as model-independent indicators to effectively enhance the performance of multiple downstream tasks. In retrosynthesis prediction, SynCluster achieves significant improvements of 4.1 and 11.0% in top-1 and top-10 prediction accuracy, respectively, compared to the baseline Molecular Transformer, and achieves a notable enhancement of 13.9% in top-10 accuracy when combined with Retroformer. By incorporating simplified molecular-input line-entry system augmentation, our framework achieves higher top-10 accuracy compared to state-of-the-art sequence-based retrosynthesis models and improves over the baseline on the diversity and validity of reactants. SynCluster also achieves 94.9% top-10 accuracy in forward synthesis prediction and 51.5% top-10 Maxfrag accuracy in reagent prediction. Overall, SynCluster provides a fresh perspective with chemical interpretability and reinforcement of domain knowledge in the synthesis design. 
It offers a promising solution for improving the accuracy and efficiency of AI-assisted synthesis planning and bridges the gap between template-free approaches and the problem-solving abilities of experienced chemists.
Affiliation(s)
- Tiantao Liu
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences and Cancer Center, Zhejiang University, Hangzhou 310058, Zhejiang, China
- Zheng Cao
  - College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, Zhejiang, China
- Yuansheng Huang
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences and Cancer Center, Zhejiang University, Hangzhou 310058, Zhejiang, China
- Yue Wan
  - Tencent Quantum Laboratory, Shenzhen 518057, Guangdong, China
- Jian Wu
  - Second Affiliated Hospital School of Medicine, and School of Public Health, Zhejiang University, Hangzhou 310058, Zhejiang, China
- Chang-Yu Hsieh
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences and Cancer Center, Zhejiang University, Hangzhou 310058, Zhejiang, China
- Tingjun Hou
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences and Cancer Center, Zhejiang University, Hangzhou 310058, Zhejiang, China
- Yu Kang
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences and Cancer Center, Zhejiang University, Hangzhou 310058, Zhejiang, China
20
Xia S, Chen E, Zhang Y. Integrated Molecular Modeling and Machine Learning for Drug Design. J Chem Theory Comput 2023; 19:7478-7495. [PMID: 37883810 PMCID: PMC10653122 DOI: 10.1021/acs.jctc.3c00814] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 07/26/2023] [Revised: 10/10/2023] [Accepted: 10/11/2023] [Indexed: 10/28/2023]
Abstract
Modern therapeutic development often involves several stages that are interconnected, and multiple iterations are usually required to bring a new drug to the market. Computational approaches have increasingly become an indispensable part of helping reduce the time and cost of the research and development of new drugs. In this Perspective, we summarize our recent efforts on integrating molecular modeling and machine learning to develop computational tools for modulator design, including a pocket-guided rational design approach based on AlphaSpace to target protein-protein interactions, delta machine learning scoring functions for protein-ligand docking as well as virtual screening, and state-of-the-art deep learning models to predict calculated and experimental molecular properties based on molecular mechanics optimized geometries. Meanwhile, we discuss remaining challenges and promising directions for further development and use a retrospective example of FDA approved kinase inhibitor Erlotinib to demonstrate the use of these newly developed computational tools.
Affiliation(s)
- Song Xia
  - Department of Chemistry, New York University, New York, New York 10003, United States
- Eric Chen
  - Department of Chemistry, New York University, New York, New York 10003, United States
- Yingkai Zhang
  - Department of Chemistry, New York University, New York, New York 10003, United States
  - Simons Center for Computational Physical Chemistry at New York University, New York, New York 10003, United States
  - NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
21
Ryu G, Kim GB, Yu T, Lee SY. Deep learning for metabolic pathway design. Metab Eng 2023; 80:130-141. [PMID: 37734652 DOI: 10.1016/j.ymben.2023.09.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2023] [Revised: 09/17/2023] [Accepted: 09/19/2023] [Indexed: 09/23/2023]
Abstract
The establishment of a bio-based circular economy is imperative in tackling the climate crisis and advancing sustainable development. In this realm, the creation of microbial cell factories is central to generating a variety of chemicals and materials. The design of metabolic pathways is crucial in shaping these microbial cell factories, especially when it comes to producing chemicals with yet-to-be-discovered biosynthetic routes. To aid in navigating the complexities of chemical and metabolic domains, computer-supported tools for metabolic pathway design have emerged. In this paper, we evaluate how digital strategies can be employed for pathway prediction and enzyme discovery. Additionally, we touch upon the recent strides made in using deep learning techniques for metabolic pathway prediction. These computational tools and strategies streamline the design of metabolic pathways, facilitating the development of microbial cell factories. Leveraging the capabilities of deep learning in metabolic pathway design is profoundly promising, potentially hastening the advent of a bio-based circular economy.
Affiliation(s)
- Gahyeon Ryu
  - Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 Four), KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 34141, Republic of Korea
  - Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, KAIST, Daejeon, 34141, Republic of Korea
- Gi Bae Kim
  - Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 Four), KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 34141, Republic of Korea
  - Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, KAIST, Daejeon, 34141, Republic of Korea
- Taeho Yu
  - Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 Four), KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 34141, Republic of Korea
  - Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, KAIST, Daejeon, 34141, Republic of Korea
- Sang Yup Lee
  - Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 Four), KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 34141, Republic of Korea
  - Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, KAIST, Daejeon, 34141, Republic of Korea
  - BioProcess Engineering Research Center and BioInformatics Research Center, KAIST, Daejeon, 34141, Republic of Korea
  - Graduate School of Engineering Biology, KAIST, Daejeon, 34141, Republic of Korea
22
Luo J, Qian C, Wang X, Glass L, Ma F. pADR: Towards Personalized Adverse Drug Reaction Prediction by Modeling Multi-sourced Data. Proceedings of the ... ACM International Conference on Information & Knowledge Management. ACM International Conference on Information and Knowledge Management 2023; 2023:4724-4730. [PMID: 38601743 PMCID: PMC11005853 DOI: 10.1145/3583780.3615490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/12/2024]
Abstract
Predicting adverse drug reactions (ADRs) of drugs is one of the most critical steps in drug development. By pre-estimating the adverse reactions, researchers and drug development companies can greatly reduce potential ADR risks and tragedies. However, the current ADR prediction methods suffer from several limitations. First, the prediction results are based purely on drug-related information, which makes them impossible to apply directly to the personalized ADR prediction task. The lack of personalization in these models also makes rare adverse events hard to predict. Therefore, it is of great interest to develop a new personalized ADR prediction method by introducing additional sources, e.g., patient health records. However, few methods have tried to use additional sources, and the variety of source formats and structures makes this task more challenging. To address the above challenges, we propose a novel personalized multi-sourced drug adverse reaction prediction model named pADR. pADR first processes each source individually to transform it into a proper representation. Next, a hierarchical multi-sourced Transformer is designed to automatically model the interactions between different sources and fuse them together for the final adverse event prediction. Experimental results on a new multi-sourced ADR prediction dataset show that pADR outperforms state-of-the-art drug-based baselines. Moreover, the case and ablation studies also illustrate the effectiveness of our proposed fusion strategies and the reasonableness of each module design.
Affiliation(s)
- Junyu Luo
  - The Pennsylvania State University, University Park, USA
- Xiaochen Wang
  - The Pennsylvania State University, University Park, USA
- Fenglong Ma
  - The Pennsylvania State University, University Park, USA
23
Mazuz E, Shtar G, Kutsky N, Rokach L, Shapira B. Pretrained transformer models for predicting the withdrawal of drugs from the market. Bioinformatics 2023; 39:btad519. [PMID: 37610328 PMCID: PMC10469107 DOI: 10.1093/bioinformatics/btad519] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2023] [Revised: 07/24/2023] [Accepted: 08/22/2023] [Indexed: 08/24/2023] Open
Abstract
MOTIVATION: The process of drug discovery is notoriously complex, costing an average of 2.6 billion dollars and taking ∼13 years to bring a new drug to the market. The success rate for new drugs is alarmingly low (around 0.0001%), and severe adverse drug reactions (ADRs) frequently occur, some of which may even result in death. Early identification of potential ADRs is critical to improve the efficiency and safety of the drug development process. RESULTS: In this study, we employed pretrained large language models (LLMs) to predict the likelihood of a drug being withdrawn from the market due to safety concerns. Our method achieved an area under the curve (AUC) of over 0.75 through cross-database validation, outperforming classical machine learning models and graph-based models. Notably, our pretrained LLMs successfully identified over 50% of the drugs that were subsequently withdrawn, when predictions were made on a subset of drugs with inconsistent labeling between the training and test sets. AVAILABILITY AND IMPLEMENTATION: The code and datasets are available at https://github.com/eyalmazuz/DrugWithdrawn.
Affiliation(s)
- Eyal Mazuz
  - Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, P.O.B. 653, Beer-Sheva, 8410501, Israel
- Guy Shtar
  - Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, P.O.B. 653, Beer-Sheva, 8410501, Israel
- Nir Kutsky
  - Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, P.O.B. 653, Beer-Sheva, 8410501, Israel
- Lior Rokach
  - Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, P.O.B. 653, Beer-Sheva, 8410501, Israel
- Bracha Shapira
  - Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, P.O.B. 653, Beer-Sheva, 8410501, Israel
24
Dou B, Zhu Z, Merkurjev E, Ke L, Chen L, Jiang J, Zhu Y, Liu J, Zhang B, Wei GW. Machine Learning Methods for Small Data Challenges in Molecular Science. Chem Rev 2023; 123:8736-8780. [PMID: 37384816 PMCID: PMC10999174 DOI: 10.1021/acs.chemrev.3c00189] [Citation(s) in RCA: 44] [Impact Index Per Article: 22.0] [Indexed: 07/01/2023]
Abstract
Small data are often used in scientific and engineering research due to the presence of various constraints, such as time, cost, ethics, privacy, security, and technical limitations in data acquisition. However, because big data have been the focus for the past decade, small data and their challenges have received little attention, even though they are technically more severe in machine learning (ML) and deep learning (DL) studies. Overall, the small data challenge is often compounded by issues such as data diversity, imputation, noise, imbalance, and high dimensionality. Fortunately, the current big data era is characterized by technological breakthroughs in ML, DL, and artificial intelligence (AI), which enable data-driven scientific discovery, and many advanced ML and DL technologies developed for big data have inadvertently provided solutions for small data problems. As a result, significant progress has been made in ML and DL for small data challenges in the past decade. In this review, we summarize and analyze several emerging potential solutions to small data challenges in molecular science, including chemical and biological sciences. We review both basic machine learning algorithms, such as linear regression, logistic regression (LR), k-nearest neighbor (KNN), support vector machine (SVM), kernel learning (KL), random forest (RF), and gradient boosting trees (GBT), and more advanced techniques, including artificial neural network (ANN), convolutional neural network (CNN), U-Net, graph neural network (GNN), generative adversarial network (GAN), long short-term memory (LSTM), autoencoder, transformer, transfer learning, active learning, graph-based semi-supervised learning, combining deep learning with traditional machine learning, and physical model-based data augmentation. We also briefly discuss the latest advances in these methods. Finally, we conclude the survey with a discussion of promising trends in small data challenges in molecular science.
Affiliation(s)
- Bozheng Dou
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
| | - Zailiang Zhu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
| | - Ekaterina Merkurjev
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
| | - Lu Ke
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
| | - Long Chen
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
| | - Jian Jiang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
| | - Yueying Zhu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
| | - Jie Liu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
| | - Bengong Zhang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
| |
|
25
|
Bhattacharya S, Sahoo A, Baitalik S. Human brain-inspired chemical artificial intelligence tools for the analysis and prediction of the anion-sensing characteristics of an imidazole-based luminescent Os(II)-bipyridine complex. Dalton Trans 2023; 52:6749-6762. [PMID: 37129261 DOI: 10.1039/d3dt00327b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/03/2023]
Abstract
Neural network and decision tree-based soft computing techniques are implemented in this work for the thorough analysis of the multichannel anion-sensing characteristics of an Os(II)-bipyridine complex derived from an imidazole-4,5-bis(benzimidazole) ligand. With the aid of three imidazole NH protons in its outer coordination sphere, a substantial change in the spectral response as well as in the Os(II)/Os(III) potential is made possible upon treatment with anions of varying basicity. Initial hydrogen bonding between the NH protons and the anions, followed by complete proton transfer from the complex backbone, probably takes place in the process. The deprotonation of the complex by specific anions and its restoration to the original form by acid are reversible. The responsiveness of the new compound is complex enough to imitate multiple sophisticated binary and ternary Boolean logic (BL) functions (NOT logic, combinational logic, traffic signal, set-reset flip-flop logic, and ternary NOR logic) by employing its spectral and redox outputs upon the action of suitable anions and acid in a proper sequence. Carrying out sensing investigations while varying the amount of the anions over a wide range is often time-consuming and tedious. To overcome this limitation, we implemented multiple soft computing techniques, viz., fuzzy logic (FL), artificial neural networks (ANNs), adaptive neuro-fuzzy inference system (ANFIS), and decision tree (DT) regression, for the thorough analysis and prediction of the experimentally observed results. The outcomes obtained from the different techniques were compared among themselves as well as with the experimental data and utilized for the proper modeling of the anion-sensing behaviors of the complex.
Affiliation(s)
- Sohini Bhattacharya
- Department of Chemistry, Inorganic Chemistry Section, Jadavpur University, Kolkata-700032, India.
| | - Anik Sahoo
- Department of Chemistry, Inorganic Chemistry Section, Jadavpur University, Kolkata-700032, India.
| | - Sujoy Baitalik
- Department of Chemistry, Inorganic Chemistry Section, Jadavpur University, Kolkata-700032, India.
| |
|
26
|
Andronov M, Voinarovska V, Andronova N, Wand M, Clevert DA, Schmidhuber J. Reagent prediction with a molecular transformer improves reaction data quality. Chem Sci 2023; 14:3235-3246. [PMID: 36970100 PMCID: PMC10034139 DOI: 10.1039/d2sc06798f] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 12/09/2022] [Accepted: 02/12/2023] [Indexed: 03/05/2023] Open
Abstract
Automated synthesis planning is key for efficient generative chemistry. Since reactions of given reactants may yield different products depending on conditions such as the chemical context imposed by specific reagents, computer-aided synthesis planning should benefit from recommendations of reaction conditions. Traditional synthesis planning software, however, typically proposes reactions without specifying such conditions, relying on human organic chemists who know the conditions to carry out suggested reactions. In particular, reagent prediction for arbitrary reactions, a crucial aspect of condition recommendation, has been largely overlooked in cheminformatics until recently. Here we employ the Molecular Transformer, a state-of-the-art model for reaction prediction and single-step retrosynthesis, to tackle this problem. We train the model on the US patents dataset (USPTO) and test it on Reaxys to demonstrate its out-of-distribution generalization capabilities. Our reagent prediction model also improves the quality of product prediction: the Molecular Transformer is able to substitute the reagents in the noisy USPTO data with reagents that enable product prediction models to outperform those trained on plain USPTO. This makes it possible to improve upon the state-of-the-art in reaction product prediction on the USPTO MIT benchmark.
Affiliation(s)
- Mikhail Andronov
- IDSIA, USI, SUPSI 6900 Lugano Switzerland
- Machine Learning Research, Pfizer Worldwide Research Development and Medical Linkstr.10 Berlin Germany
| | - Varvara Voinarovska
- Institute of Structural Biology, Molecular Targets and Therapeutics Center, Helmholtz Munich - Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH) 85764 Neuherberg Germany
| | | | - Michael Wand
- IDSIA, USI, SUPSI 6900 Lugano Switzerland
- Institute for Digital Technologies for Personalized Healthcare, SUPSI 6900 Lugano Switzerland
| | - Djork-Arné Clevert
- Machine Learning Research, Pfizer Worldwide Research Development and Medical Linkstr.10 Berlin Germany
| | | |
|
27
|
Tu Z, Stuyver T, Coley CW. Predictive chemistry: machine learning for reaction deployment, reaction development, and reaction discovery. Chem Sci 2023; 14:226-244. [PMID: 36743887 PMCID: PMC9811563 DOI: 10.1039/d2sc05089g] [Citation(s) in RCA: 29] [Impact Index Per Article: 14.5] [Received: 09/12/2022] [Accepted: 11/25/2022] [Indexed: 11/29/2022] Open
Abstract
The field of predictive chemistry relates to the development of models able to describe how molecules interact and react. It encompasses the long-standing task of computer-aided retrosynthesis, but is far more reaching and ambitious in its goals. In this review, we summarize several areas where predictive chemistry models hold the potential to accelerate the deployment, development, and discovery of organic reactions and advance synthetic chemistry.
Affiliation(s)
- Zhengkai Tu
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology 77 Massachusetts Avenue Cambridge MA 02139 USA
| | - Thijs Stuyver
- Department of Chemical Engineering, Massachusetts Institute of Technology 77 Massachusetts Avenue Cambridge MA 02139 USA
| | - Connor W Coley
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology 77 Massachusetts Avenue Cambridge MA 02139 USA
- Department of Chemical Engineering, Massachusetts Institute of Technology 77 Massachusetts Avenue Cambridge MA 02139 USA
| |
|